Vallevand, Mark K | 18 Aug 16:59 2015

Quick question about node offline

I have a report from a user about both nodes in a cluster being offline.  The user explicitly issued a ‘crm node standby’ for one node.  (Part of our testing.)  There was some error with our resource so it didn’t stop correctly on that node.  Then, the user noticed that both nodes were offline.  I don’t have good logs from this incident.

 

My quick question:  Will pacemaker/cman/corosync take a node offline without the user requesting it?

 

Obviously, I need to get good logs and dig deeper.  But, a quick answer is greatly appreciated.
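When better logs are available, a few standard places to look for the answer (a rough sketch, assuming the usual Pacemaker command-line tools and the default RHEL 6 cluster log location):

    # current membership and resource state as pacemaker sees it
    crm_mon -1
    # fencing operations recorded by stonithd ('*' = all nodes)
    stonith_admin --history '*'
    # membership changes and fence events in the cluster log
    grep -Ei 'fence|membership|left' /var/log/cluster/corosync.log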

 

Regards.
Mark K Vallevand   Mark.Vallevand <at> Unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Megan . | 1 Jul 04:51 2015

new cluster setup error

Good Evening!


Has anyone seen this before?  I just set up these boxes and I'm trying to create a new cluster.  I set the ricci password on all of the nodes and started ricci.  When I try to create the cluster I get the error below.

Thanks!


CentOS 6.6
kernel 2.6.32-504.23.4.el6.x86_64
ccs-0.16.2-75.el6_6.2.x86_64
ricci-0.16.2-75.el6_6.1.x86_64
cman-3.0.12.1-68.el6.x86_64

[root <at> admin1-dit cluster]# ccs --createcluster test
Traceback (most recent call last):
  File "/usr/sbin/ccs", line 2450, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/ccs", line 286, in main
    if (createcluster): create_cluster(clustername)
  File "/usr/sbin/ccs", line 939, in create_cluster
    elif get_cluster_conf_xml() != f.read():
  File "/usr/sbin/ccs", line 884, in get_cluster_conf_xml
    xml = send_ricci_command("cluster", "get_cluster.conf")
  File "/usr/sbin/ccs", line 2340, in send_ricci_command
    dom = minidom.parseString(res[1].replace('\t',''))
  File "/usr/lib64/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: no element found: line 1, column 0
[root <at> admin1-dit cluster]#
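The traceback appears to show ccs getting an empty reply back from ricci. A hedged first check (assuming ricci's stock TCP port 11111) is to confirm the daemon really is up and listening on each node:

    service ricci status
    # ricci listens on TCP port 11111 by default
    netstat -tlnp | grep 11111
    # authentication/certificate problems usually end up in syslog
    tail -n 50 /var/log/messages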

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Daniel Dehennin | 30 Jun 21:37 2015

Finding the bottleneck between SAN and GFS2

Hello,

We are experiencing slow VMs on our OpenNebula architecture:

- two Dell PowerEdge M620
  + Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  + 96GB RAM
  + 2x 146GB SAS drives

- a 2TB SAN LUN storing the qcow2 images, with GFS2 on top of cLVM

We ran some tests, installing Linux in parallel, and did not find any
performance issues.

For the past three weeks, 17 users have been running about 60 VMs, and
everything has become slow.

The SAN administrator complains about very high IO/s, so we limited each
VM to 80 IO/s in the libvirt configuration:

#+begin_src xml
<total_iops_sec>80</total_iops_sec>
#+end_src

But it did not get better.
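For reference, the per-device limit normally lives in the <iotune> element of each <disk> in the libvirt domain XML; a minimal sketch (the source file and target device are placeholders):

#+begin_src xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/one/datastores/0/placeholder.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <iotune>
    <!-- cap this disk at 80 I/O operations per second -->
    <total_iops_sec>80</total_iops_sec>
  </iotune>
</disk>
#+end_src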

Today I ran some benchmarks to try to find out what is happening.

Checking plocks/s
=================

I started with ping_pong[1] to see how many locks per second the GFS2
file system can sustain.

I used it as described on the Samba wiki[2]; here are the results:

- starting ”ping_pong /var/lib/one/datastores/test_plock 3” on the first
  node shows around 4k plocks/s

- then starting ”ping_pong /var/lib/one/datastores/test_plock 3” on the
  second node shows around 2k plocks/s on each node

For the single-node case I was expecting a much higher rate; the wiki
mentions 500k to 1M locks/s.

Do my numbers look strange?

Checking fileio
===============

I used “sysbench --test=fileio” to test inside a VM and outside (on the
bare-metal node), with the test files either cached or with the caches dropped.

The short result is that bare-metal access to GFS2 without any cache
is terribly slow, around 2Mb/s and 90 requests/s.

Is there a way to find out if the problem comes from my
GFS2/corosync/pacemaker configuration or from the SAN?
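One way to separate the layers (a rough sketch; the device path below is a placeholder for the shared LV) is to benchmark the block device directly with O_DIRECT, bypassing GFS2 and the page cache, and compare with the numbers above:

    # sequential read straight from the shared LV
    dd if=/dev/mapper/PLACEHOLDER-lv of=/dev/null bs=1M count=1000 iflag=direct
    # 16k reads, closer to the sysbench block size
    dd if=/dev/mapper/PLACEHOLDER-lv of=/dev/null bs=16k count=10000 iflag=direct

If the raw device is also slow, the bottleneck is the SAN or the path to it; if it is fast, GFS2/DLM becomes the main suspect.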

Regards.

Following are the full sysbench results

In the VM, qemu disk cache disabled, total_iops_sec = 0
-------------------------------------------------------

I also tried with the IO limit enabled, but the difference is minimal:

- the requests/s drop to around 80
- the throughput is around 1.2Mb/s

    root <at> vm:~# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw prepare
    sysbench 0.4.12:  multi-threaded system evaluation benchmark

    128 files, 73728Kb each, 9216Mb total
    Creating files for the test...

    root <at> vm:~# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
    sysbench 0.4.12:  multi-threaded system evaluation benchmark

    Running the test with following options:
    Number of threads: 16

    Extra file open flags: 0
    128 files, 72Mb each
    9Gb total file size
    Block size 16Kb
    Number of random requests for random IO: 10000
    Read/Write ratio for combined random IO test: 1.50
    Periodic FSYNC enabled, calling fsync() each 100 requests.
    Calling fsync() at the end of test, Enabled.
    Using synchronous I/O mode
    Doing random r/w test
    Threads started!
    Done.

    Operations performed:  6034 Read, 4019 Write, 12808 Other = 22861 Total
    Read 94.281Mb  Written 62.797Mb  Total transferred 157.08Mb  (1.4318Mb/sec)
       91.64 Requests/sec executed

    Test execution summary:
        total time:                          109.7050s
        total number of events:              10053
        total time taken by event execution: 464.7600
        per-request statistics:
             min:                                  0.01ms
             avg:                                 46.23ms
             max:                              11488.59ms
             approx.  95 percentile:             125.81ms

    Threads fairness:
        events (avg/stddev):           628.3125/59.81
        execution time (avg/stddev):   29.0475/6.34

On the bare metal node, with the caches dropped
-----------------------------------------------

After creating the 128 files, I drop the caches to get “from SAN” results.

    root <at> nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw prepare
    sysbench 0.4.12:  multi-threaded system evaluation benchmark

    128 files, 73728Kb each, 9216Mb total
    Creating files for the test...

    # DROP CACHES
    root <at> nebula1: echo 3 > /proc/sys/vm/drop_caches

    root <at> nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
    sysbench 0.4.12:  multi-threaded system evaluation benchmark

    Running the test with following options:
    Number of threads: 16

    Extra file open flags: 0
    128 files, 72Mb each
    9Gb total file size
    Block size 16Kb
    Number of random requests for random IO: 10000
    Read/Write ratio for combined random IO test: 1.50
    Periodic FSYNC enabled, calling fsync() each 100 requests.
    Calling fsync() at the end of test, Enabled.
    Using synchronous I/O mode
    Doing random r/w test
    Threads started!
    Done.

    Operations performed:  6013 Read, 3999 Write, 12800 Other = 22812 Total
    Read 93.953Mb  Written 62.484Mb  Total transferred 156.44Mb  (1.5465Mb/sec)
       98.98 Requests/sec executed

    Test execution summary:
        total time:                          101.1559s
        total number of events:              10012
        total time taken by event execution: 1109.0862
        per-request statistics:
             min:                                  0.01ms
             avg:                                110.78ms
             max:                              13098.27ms
             approx.  95 percentile:             164.52ms

    Threads fairness:
        events (avg/stddev):           625.7500/114.50
        execution time (avg/stddev):   69.3179/6.54

On the bare metal node, with the test files filled in the cache
---------------------------------------------------------------

I run md5sum on all the files to let the kernel cache them.

    # Load files in cache
    root <at> nebula1:/var/lib/one/datastores/bench# md5sum test*

    root <at> nebula1:/var/lib/one/datastores/bench# sysbench --num-threads=16 --test=fileio --file-total-size=9G --file-test-mode=rndrw run
    sysbench 0.4.12:  multi-threaded system evaluation benchmark

    Running the test with following options:
    Number of threads: 16

    Extra file open flags: 0
    128 files, 72Mb each
    9Gb total file size
    Block size 16Kb
    Number of random requests for random IO: 10000
    Read/Write ratio for combined random IO test: 1.50
    Periodic FSYNC enabled, calling fsync() each 100 requests.
    Calling fsync() at the end of test, Enabled.
    Using synchronous I/O mode
    Doing random r/w test
    Threads started!
    Done.

    Operations performed:  6069 Read, 4061 Write, 12813 Other = 22943 Total
    Read 94.828Mb  Written 63.453Mb  Total transferred 158.28Mb  (54.896Mb/sec)
     3513.36 Requests/sec executed

    Test execution summary:
        total time:                          2.8833s
        total number of events:              10130
        total time taken by event execution: 16.3824
        per-request statistics:
             min:                                  0.01ms
             avg:                                  1.62ms
             max:                                760.53ms
             approx.  95 percentile:               5.51ms

    Threads fairness:
        events (avg/stddev):           633.1250/146.90
        execution time (avg/stddev):   1.0239/0.33

Footnotes: 
[1]  https://git.samba.org/?p=ctdb.git;a=blob;f=utils/ping_pong/ping_pong.c

[2]  https://wiki.samba.org/index.php/Ping_pong

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Megan . | 3 Jun 16:31 2015

Error: ClientSocket(String): connect() failed: No such file or directory

Has anybody ever seen "Error: ClientSocket(String): connect() failed: No such file or directory" when doing a start all?  Something seems to have broken with our cluster.  Our UAT setup works as expected.  I looked at tcpdumps as best I could (I'm not a network person, though) and I didn't see anything obvious.  I shut down iptables on all nodes.

We are running CentOS 6.6, with ccs-0.16.2-75.el6_6.1.x86_64 and cman-3.0.12.1-68.el6.x86_64.  We have a 12-node cluster in production that allows us to share GFS2 iSCSI mounts; no other services are used.  clvmd -R runs fine at this time.  ccs -h node --sync --activate also runs fine.

[root <at> admin1 ~]# ccs -h admin1-ops --startall
Unable to start map1-ops, possibly due to lack of quorum, try --startall
Error: ClientSocket(String): connect() failed: No such file or directory
Started cache2-ops
Unable to start data1-ops, possibly due to lack of quorum, try --startall
Error: ClientSocket(String): connect() failed: No such file or directory
Started map2-ops
Unable to start archive1-ops, possibly due to lack of quorum, try --startall
Error: ClientSocket(String): connect() failed: No such file or directory
Started data3-ops
Started mgmt1-ops
Unable to start admin1-ops, possibly due to lack of quorum, try --startall
Error: ClientSocket(String): connect() failed: No such file or directory
Started data2-ops
Started cache1-ops
[root <at> admin1 ~]#


I have quorum:

[root <at> admin1 ~]# clustat
Cluster Status for bitsops <at> Wed Jun  3 02:13:08 2015
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 admin1-ops                                                          1 Online, Local
 mgmt1-ops                                                           2 Online
 archive1-ops                                                        3 Online
 map1-ops                                                            4 Online
 map2-ops                                                            5 Online
 cache1-ops                                                          6 Online
 cache2-ops                                                          7 Online
 data1-ops                                                           8 Online
 data2-ops                                                           9 Online
 data3-ops                                                          10 Online

Here is what I expect, and what UAT gives me:

[root <at> admin1-uat ~]# ccs -h admin1-uat --startall
Started mgmt1-uat
Started data1-uat
Started data2-uat
Started admin1-uat
Started tools-uat
Started map1-uat
Started archive1-uat
Started cache2-uat
Started cache1-uat
Started map2-uat
[root <at> admin1-uat ~]#
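For completeness, a few hedged checks that sometimes narrow this kind of startall failure down (the service name and port are the stock defaults, so treat them as assumptions):

    # is ricci up and listening on every node that reports the error?
    for n in map1-ops data1-ops archive1-ops admin1-ops; do
        ssh $n 'service ricci status; netstat -tln | grep 11111'
    done
    # cman's own view of membership on one of the failing nodes
    cman_tool status
    cman_tool nodes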


--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Umar Draz | 21 May 13:01 2015

iscsi-stonith-device stopped

Hi

I have created a 2-node clvm cluster.  Everything is apparently running fine, but when I run

pcs status

it always displays this:

 Clone Set: dlm-clone [dlm]
     Started: [ clvm-1 clvm-2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ clvm-1 clvm-2 ]
 iscsi-stonith-device   (stonith:fence_scsi):   Stopped

Failed actions:
    iscsi-stonith-device_start_0 on clvm-1 'unknown error' (1): call=40, status=Error, exit-reason='none', last-rc-change='Thu May 21 05:52:23 2015', queued=0ms, exec=1154ms
    iscsi-stonith-device_start_0 on clvm-2 'unknown error' (1): call=38, status=Error, exit-reason='none', last-rc-change='Thu May 21 05:52:26 2015', queued=0ms, exec=1161ms


PCSD Status:
  clvm-1: Online
  clvm-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Would you please help me understand why the iscsi-stonith-device is stopped, and how I can solve this issue?
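For comparison, a typical fence_scsi definition (the device path is a placeholder; fence_scsi needs unfencing) looks roughly like the sketch below, so it may be worth checking the existing resource against it with "pcs stonith show iscsi-stonith-device" and the full error in /var/log/messages:

    # shared-storage fencing based on SCSI-3 persistent reservations
    pcs stonith create iscsi-stonith-device fence_scsi \
        pcmk_host_list="clvm-1 clvm-2" \
        devices="/dev/disk/by-id/PLACEHOLDER" \
        meta provides=unfencing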

Br.

Umar
--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
gianpietro.sella | 10 May 11:28 2015

nfs cluster, problem with delete file in the failover case

Hi, sorry for my bad English.
I am testing an active/passive NFS cluster (2 nodes).
I followed these instructions for NFS:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Administration/s1-resourcegroupcreatenfs-HAAA.html

I use CentOS 7.1 on the nodes.
The two nodes of the cluster share the same iSCSI volume.
The NFS cluster works very well.
I have only one problem.
I mount the folder exported by the NFS cluster on my client node (NFSv3 protocol).
I write a big data file (70GB) to the NFS folder:
dd if=/dev/zero bs=1M count=70000 > /Instances/output.dat
Before the write finishes, I put the active node into standby.
The resources then migrate to the other node.
When the dd write finishes, the file is OK.
I delete the file output.dat.
The file output.dat is no longer present in the NFS folder; it is correctly erased.
But the space on the NFS volume is not freed.
If I run df on the client (and on the new active node), I see 70GB of used space on the exported volume.
If I then put the new active node into standby (migrating the resources back to the first node, where the write started), so that the other node becomes active again, the space of the deleted output.dat file is finally freed.
It is very strange.
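A hedged way to confirm what is holding the space (the mount point below is a placeholder) is to compare df and du on the currently active node and look for files that are deleted but still open, which is the classic cause of this symptom:

    df -h /nfsshare      # space the filesystem reports as used
    du -sh /nfsshare     # space reachable through the directory tree
    # unlinked files still held open by some process (link count < 1)
    lsof +L1 /nfsshare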

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Robert Jacobson | 29 Apr 14:48 2015

cluster not fencing after filesystem failure


Hi,

I'm having a problem on CentOS 6.5 with a two-node cluster for HA NFS. 
Here's the cluster.conf:  http://pastebin.com/aVAuUDtc

The cluster nodes are VMware guests.  Occasionally the node providing
the NFS service has a problem accessing the disk device (I'm working
with VMware on that...), but long story short -- the kernel shuts down
the XFS filesystem:

Apr 25 02:29:51 sdo-dds-nfsnode2 kernel: XFS (dm-10): metadata I/O
error: block 0x170013a900 ("xlog_iodone") error 5 buf count 65536
Apr 25 02:29:51 sdo-dds-nfsnode2 kernel: XFS (dm-10):
xfs_do_force_shutdown(0x2) called from line 1062 of file
fs/xfs/xfs_log.c.  Return address = 0xffffffffa027f131
Apr 25 02:29:51 sdo-dds-nfsnode2 kernel: XFS (dm-10): Log I/O Error
Detected.  Shutting down filesystem
Apr 25 02:29:51 sdo-dds-nfsnode2 kernel: nfsd: non-standard errno: 5
Apr 25 02:29:51 sdo-dds-nfsnode2 kernel: XFS (dm-10): Please umount the
filesystem and rectify the problem(s)

rgmanager noticed the filesystem problem (see the log at
http://pastebin.com/mPPBP2HY ) and marked the "HA_nfs" service as failed.

What I'm confused about is why fencing is not taking place in the
above scenario.  I'm guessing I have either a misunderstanding or a
misconfiguration.
At this point I'd like the other node to fence the failed one and take
over, or the failed node to fence itself.
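If the goal is for the failed node to reboot itself when its filesystem cannot be stopped cleanly, rgmanager's fs resource agent has a self_fence attribute; a hedged sketch of how that could look in cluster.conf (device and mountpoint are placeholders, not taken from your pastebin):

    <fs name="ha_nfs_fs" device="/dev/mapper/PLACEHOLDER" mountpoint="/export/ha_nfs"
        fstype="xfs" force_unmount="1" self_fence="1"/>

With self_fence="1", a failed unmount during recovery makes the node reboot itself instead of leaving the service sitting in the failed state.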

I've tested fencing from the command line and it works:
fence_vmware_soap --ip 192.168.50.9 --username ddsfence --password
secret -z --action reboot -U  "423d288c-03ff-74bf-9a4f-bf661f8ed87b"

I'd appreciate any help with this.

package versions, if it matters:

rgmanager-3.0.12.1-19.el6.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Robert Jacobson               Robert.C.Jacobson <at> nasa.gov
Lead System Admin       Solar Dynamics Observatory (SDO)
Bldg 14, E222                             (301) 286-1591 

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Jatin Davey | 24 Apr 14:12 2015

Working of a two-node cluster

Hi

I am running a two-node cluster on RHEL 6.5, and I have a very fundamental question.

For the two-node cluster to work, is it mandatory that both nodes are "online" and communicating with each other?

What I can see is that if there is a communication failure between them, then either both nodes are fenced or the cluster gets into a "stopped" state (as seen in the output of the clustat command).
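For reference, a two-node RHEL 6 cluster normally relies on the special two-node quorum mode in cluster.conf; a minimal sketch (names are placeholders, fence devices omitted):

    <cluster name="example" config_version="1">
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1"/>
        <clusternode name="node2" nodeid="2"/>
      </clusternodes>
    </cluster>

With two_node="1" either node can keep quorum on its own when the other goes away, which is exactly why working fencing is essential to stop both sides from running at once.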

Apologies if my questions are naive; I am just starting to work with the RHEL cluster add-on.

Thanks
Jatin
--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Thorvald Hallvardsson | 23 Apr 12:11 2015

GFS2 over NFS4

Hi guys,

I need some help and answers related to sharing a GFS2 file system over NFS. I have read the RH documentation, but some things are still a bit unclear to me.

First of all, I need to build a POC for a shared-storage cluster which will initially contain 3 nodes. This all runs as a VM environment on Hyper-V. The general idea is to share a virtual VHDX across the 3 nodes, put LVM and GFS2 on top of it, and then share it via NFS to the clients. I have the initial cluster built on CentOS 7 using Pacemaker. I generally followed the RH docs to build it, so I ended up with a simple GFS2 cluster and Pacemaker managing fencing and a floating VIP resource.

Now I'm wondering about the NFS side. The Red Hat documentation is a bit conflicting, or rather unclear, in some places, and I found quite a few guides on the internet about similar configurations. Some of them suggest mounting the NFS share on the clients with the nolock option, while the RH docs mention localflock, and I got confused about what is supposed to go where. I don't know if my understanding is correct, but the reason to "disable" NFS locking is that GFS2 is already doing it via DLM, so there is no need for NFS to do the same thing; otherwise I would end up with some sort of double locking mechanism. So the first question is: where am I supposed to set up locks (or rather no locks), and what should the export look like?
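For what it is worth, the pattern in the RH docs is to mount GFS2 with localflock on the server nodes and export that mount; a rough sketch under those assumptions (paths and network are placeholders):

    # /etc/fstab on the NFS server nodes: GFS2 mounted with localflock
    /dev/vg_shared/lv_gfs2  /export/gfs2  gfs2  defaults,localflock  0 0

    # /etc/exports on the server
    /export/gfs2  192.168.0.0/24(rw,sync,no_root_squash)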

Secondly, I was thinking about going a step further and using NFSv4 for the exports. However, from what I have read, NFSv4 does locking by default and there is no way to disable it. Does that mean NFSv4 is not suitable in this case at all?

That's all for now.

I appreciate your help.

Thank you.
TH
--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Daniel Dehennin | 1 Apr 14:47 2015

[ClusterLabs] dlm_controld and fencing issue

Hello,

On a 4 nodes OpenNebula cluster, running Ubuntu Trusty 14.04.2, with:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.3
- dlm 4.0.1-0ubuntu1

Here is the node list with their IDs, to follow the logs:

- 1084811137 nebula1
- 1084811138 nebula2
- 1084811139 nebula3
- 1084811140 nebula4 (the actual DC)

I have an issue where fencing is working but dlm always waits for
fencing; I needed to manually run “dlm_tool fence_ack 1084811138” this
morning.  Here are the logs:

Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time
1427844569 fence_all dlm_stonith
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor
[...]
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing

The stonith actually worked:

Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: handle_request: Client crmd.6490.2707e557 wants to
fence (reboot) 'nebula2' with device '(any)'
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: initiate_remote_stonith_op: Initiating remote
operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0)
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula1-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula2-IPMILAN can fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-one-frontend
can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula3-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:32 nebula4 stonith-ng[6486]:   notice: remote_op_done: Operation reboot of nebula2 by
nebula3 for crmd.6490 <at> nebula4.39eaf3a2: OK

I attach the logs of the DC nebula4 from around 01:29:03, where
everything worked fine ("Got 4 replies, expecting: 4"), to a little bit
after.

To me, it looks like:

- dlm asks for fencing directly at 01:29:29; the node really was fenced,
  since it has garbage in its /var/log/syslog at exactly 01:29:29 (and
  its uptime confirms it), but dlm did not get a good response

- pacemaker fences nebula2 at 01:29:30 because it is not part of the
  cluster anymore (since 01:29:26, "[TOTEM ] ... Members left: 1084811138").
  This fencing works.

Do you have any idea?
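For reference, the dlm state can be inspected, and the fencing manually acknowledged as a last resort, like this (the node ID is the one from the logs above):

    # show lockspaces and whether they are still waiting for fencing
    dlm_tool ls
    dlm_tool status
    # manual acknowledgement of the fencing of nebula2 (what I ran this morning)
    dlm_tool fence_ack 1084811138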

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

Apr  1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command
Apr  1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: Send local reply
Apr  1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 31
Apr  1 01:29:03 nebula4 lvm[6759]: check_all_clvmds_running
Apr  1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3
Apr  1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3
Apr  1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811138, state = 3
Apr  1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3
Apr  1 01:29:03 nebula4 lvm[6759]: Got pre command condition...
Apr  1 01:29:03 nebula4 lvm[6759]: Writing status 0 down pipe 13
Apr  1 01:29:03 nebula4 lvm[6759]: Waiting to do post command - state = 0
Apr  1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: distribute command: XID = 43973, flags=0x0 ()
Apr  1 01:29:03 nebula4 lvm[6759]: num_nodes = 4
Apr  1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218f100. client=0x218eab0, msg=0x218ebc0,
len=31, csid=(nil), xid=43973
Apr  1 01:29:03 nebula4 lvm[6759]: Sending message to all cluster nodes
Apr  1 01:29:03 nebula4 lvm[6759]: process_work_item: local
Apr  1 01:29:03 nebula4 lvm[6759]: process_local_command: SYNC_NAMES (0x2d) msg=0x218ed00, msglen
=31, client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes
Apr  1 01:29:03 nebula4 lvm[6759]: Got 1 replies, expecting: 4
Apr  1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 0. len 31
Apr  1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811140. len 18
Apr  1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e782: 0 bytes
Apr  1 01:29:03 nebula4 lvm[6759]: Got 2 replies, expecting: 4
Apr  1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811140. len 18
Apr  1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e783: 0 bytes
Apr  1 01:29:03 nebula4 lvm[6759]: Got 3 replies, expecting: 4
Apr  1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811140. len 18
Apr  1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e781: 0 bytes
Apr  1 01:29:03 nebula4 lvm[6759]: Got 4 replies, expecting: 4
Apr  1 01:29:03 nebula4 lvm[6759]: Got post command condition...
Apr  1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command
Apr  1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: Send local reply
Apr  1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 30
Apr  1 01:29:03 nebula4 lvm[6759]: Got pre command condition...
Apr  1 01:29:03 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-2' at 6 (client=0x218eab0)
Apr  1 01:29:03 nebula4 lvm[6759]: unlock_resource: V_vg-one-2 lockid: 1
Apr  1 01:29:03 nebula4 lvm[6759]: Writing status 0 down pipe 13
Apr  1 01:29:03 nebula4 lvm[6759]: Waiting to do post command - state = 0
Apr  1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: distribute command: XID = 43974, flags=0x1 (LOCAL)
Apr  1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. client=0x218eab0, msg=0x218ebc0,
len=30, csid=(nil), xid=43974
Apr  1 01:29:03 nebula4 lvm[6759]: process_work_item: local
Apr  1 01:29:03 nebula4 lvm[6759]: process_local_command: LOCK_VG (0x33) msg=0x218ed40, msglen =30, client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: do_lock_vg: resource 'V_vg-one-2', cmd = 0x6 LCK_VG (UNLOCK|VG),
flags = 0x0 ( ), critical_section = 0
Apr  1 01:29:03 nebula4 lvm[6759]: Invalidating cached metadata for VG vg-one-2
Apr  1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes
Apr  1 01:29:03 nebula4 lvm[6759]: Got 1 replies, expecting: 1
Apr  1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:03 nebula4 lvm[6759]: Got post command condition...
Apr  1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command
Apr  1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:03 nebula4 lvm[6759]: Send local reply
Apr  1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 0
Apr  1 01:29:03 nebula4 lvm[6759]: EOF on local socket: inprogress=0
Apr  1 01:29:03 nebula4 lvm[6759]: Waiting for child thread
Apr  1 01:29:03 nebula4 lvm[6759]: Got pre command condition...
Apr  1 01:29:03 nebula4 lvm[6759]: Subthread finished
Apr  1 01:29:03 nebula4 lvm[6759]: Joined child thread
Apr  1 01:29:03 nebula4 lvm[6759]: ret == 0, errno = 0. removing client
Apr  1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ebc0. client=0x218eab0, msg=(nil),
len=0, csid=(nil), xid=43974
Apr  1 01:29:03 nebula4 lvm[6759]: process_work_item: free fd -1
Apr  1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
39602 on node 40a8e782
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
44354 on node 40a8e781
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
39605 on node 40a8e782
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
44357 on node 40a8e781
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
39608 on node 40a8e782
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
44360 on node 40a8e781
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31
Apr  1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:16 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
44363 on node 40a8e781
Apr  1 01:29:16 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18
Apr  1 01:29:23 nebula4 lvm[6759]: Got new connection on fd 5
Apr  1 01:29:23 nebula4 lvm[6759]: Read on local socket 5, len = 30
Apr  1 01:29:23 nebula4 lvm[6759]: creating pipe, [12, 13]
Apr  1 01:29:23 nebula4 lvm[6759]: Creating pre&post thread
Apr  1 01:29:23 nebula4 lvm[6759]: Created pre&post thread, state = 0
Apr  1 01:29:23 nebula4 lvm[6759]: in sub thread: client = 0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 1 (client=0x218eab0)
Apr  1 01:29:23 nebula4 lvm[6759]: lock_resource 'V_vg-one-0', flags=0, mode=3
Apr  1 01:29:23 nebula4 lvm[6759]: lock_resource returning 0, lock_id=1
Apr  1 01:29:23 nebula4 lvm[6759]: Writing status 0 down pipe 13
Apr  1 01:29:23 nebula4 lvm[6759]: Waiting to do post command - state = 0
Apr  1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: distribute command: XID = 43975, flags=0x1 (LOCAL)
Apr  1 01:29:23 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. client=0x218eab0, msg=0x218ebc0,
len=30, csid=(nil), xid=43975
Apr  1 01:29:23 nebula4 lvm[6759]: process_work_item: local
Apr  1 01:29:23 nebula4 lvm[6759]: process_local_command: LOCK_VG (0x33) msg=0x218ed40, msglen =30, client=0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: do_lock_vg: resource 'V_vg-one-0', cmd = 0x1 LCK_VG (READ|VG), flags =
0x0 ( ), critical_section = 0
Apr  1 01:29:23 nebula4 lvm[6759]: Invalidating cached metadata for VG vg-one-0
Apr  1 01:29:23 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes
Apr  1 01:29:23 nebula4 lvm[6759]: Got 1 replies, expecting: 1
Apr  1 01:29:23 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:23 nebula4 lvm[6759]: Got post command condition...
Apr  1 01:29:23 nebula4 lvm[6759]: Waiting for next pre command
Apr  1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: Send local reply
Apr  1 01:29:23 nebula4 lvm[6759]: Read on local socket 5, len = 31
Apr  1 01:29:23 nebula4 lvm[6759]: check_all_clvmds_running
Apr  1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3
Apr  1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3
Apr  1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811138, state = 3
Apr  1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3
Apr  1 01:29:23 nebula4 lvm[6759]: Got pre command condition...
Apr  1 01:29:23 nebula4 lvm[6759]: Writing status 0 down pipe 13
Apr  1 01:29:23 nebula4 lvm[6759]: Waiting to do post command - state = 0
Apr  1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: distribute command: XID = 43976, flags=0x0 ()
Apr  1 01:29:23 nebula4 lvm[6759]: num_nodes = 4
Apr  1 01:29:23 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218f100. client=0x218eab0, msg=0x218ebc0,
len=31, csid=(nil), xid=43976
Apr  1 01:29:23 nebula4 lvm[6759]: Sending message to all cluster nodes
Apr  1 01:29:23 nebula4 lvm[6759]: process_work_item: local
Apr  1 01:29:23 nebula4 lvm[6759]: process_local_command: SYNC_NAMES (0x2d) msg=0x218ed00, msglen
=31, client=0x218eab0
Apr  1 01:29:23 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:23 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes
Apr  1 01:29:23 nebula4 lvm[6759]: Got 1 replies, expecting: 4
Apr  1 01:29:23 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:26 nebula4 corosync[6411]:   [TOTEM ] A processor failed, forming new configuration.
Apr  1 01:29:29 nebula4 corosync[6411]:   [TOTEM ] A new membership (192.168.231.129:1204) was formed.
Members left: 1084811138
Apr  1 01:29:29 nebula4 lvm[6759]: confchg callback. 0 joined, 1 left, 3 members
Apr  1 01:29:29 nebula4 crmd[6490]:  warning: match_down_event: No match for shutdown action on 1084811138
Apr  1 01:29:29 nebula4 crmd[6490]:   notice: peer_update_callback: Stonith/shutdown of nebula2 not matched
Apr  1 01:29:29 nebula4 crmd[6490]:   notice: do_state_transition: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr  1 01:29:29 nebula4 corosync[6411]:   [QUORUM] Members[3]: 1084811137 1084811139 1084811140
Apr  1 01:29:29 nebula4 corosync[6411]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr  1 01:29:29 nebula4 crmd[6490]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node
nebula2[1084811138] - state is now lost (was member)
Apr  1 01:29:29 nebula4 crmd[6490]:  warning: match_down_event: No match for shutdown action on 1084811138
Apr  1 01:29:29 nebula4 pacemakerd[6483]:   notice: crm_update_peer_state: pcmk_quorum_notification:
Node nebula2[1084811138] - state is now lost (was member)
Apr  1 01:29:29 nebula4 crmd[6490]:   notice: peer_update_callback: Stonith/shutdown of nebula2 not matched
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 0. len 31
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 0. len 31
Apr  1 01:29:29 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. client=0x6a1d60,
msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0
Apr  1 01:29:29 nebula4 lvm[6759]: process_work_item: remote
Apr  1 01:29:29 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
43802 on node 40a8e783
Apr  1 01:29:29 nebula4 lvm[6759]: Syncing device names
Apr  1 01:29:29 nebula4 lvm[6759]: LVM thread waiting for work
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811140. len 18
Apr  1 01:29:29 nebula4 lvm[6759]: Reply from node 40a8e783: 0 bytes
Apr  1 01:29:29 nebula4 lvm[6759]: Got 2 replies, expecting: 4
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811140. len 18
Apr  1 01:29:29 nebula4 lvm[6759]: Reply from node 40a8e781: 0 bytes
Apr  1 01:29:29 nebula4 lvm[6759]: Got 3 replies, expecting: 4
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811139. len 18
Apr  1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811139. len 18
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time
1427844569 fence_all dlm_stonith
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140
walltime 1427844569 local 50759
Apr  1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: pe_fence_node: Node nebula2 will be fenced because the
node is no longer part of the cluster
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: determine_online_status: Node nebula2 is unclean
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula3: unknown error (1)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula1: unknown error (1)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula2: unknown error (1)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula2 after 1000000 failures (max=1000000)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_dlm:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_dlm:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_clvm:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_clvm:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_vg_one:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_vg_one:3_stop_0 on nebula2 is
unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_fs_one-datastores:3_stop_0
on nebula2 is unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action p_fs_one-datastores:3_stop_0
on nebula2 is unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: custom_action: Action
stonith-nebula1-IPMILAN_stop_0 on nebula2 is unrunnable (offline)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: stage6: Scheduling Node nebula2 for STONITH
Apr  1 01:29:30 nebula4 pengine[6489]:   notice: LogActions: Stop    p_dlm:3#011(nebula2)
Apr  1 01:29:30 nebula4 pengine[6489]:   notice: LogActions: Stop    p_clvm:3#011(nebula2)
Apr  1 01:29:30 nebula4 pengine[6489]:   notice: LogActions: Stop    p_vg_one:3#011(nebula2)
Apr  1 01:29:30 nebula4 pengine[6489]:   notice: LogActions: Stop    p_fs_one-datastores:3#011(nebula2)
Apr  1 01:29:30 nebula4 pengine[6489]:   notice: LogActions: Move    stonith-nebula1-IPMILAN#011(Started
nebula2 -> nebula3)
Apr  1 01:29:30 nebula4 pengine[6489]:  warning: process_pe_message: Calculated Transition 101: /var/lib/pacemaker/pengine/pe-warn-22.bz2
Apr  1 01:29:30 nebula4 crmd[6490]:   notice: te_fence_node: Executing reboot fencing operation (98) on
nebula2 (timeout=30000)
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: handle_request: Client crmd.6490.2707e557 wants to
fence (reboot) 'nebula2' with device '(any)'
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: initiate_remote_stonith_op: Initiating remote
operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0)
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula1-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula2-IPMILAN can fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device: stonith-one-frontend
can not fence nebula2: static-list
Apr  1 01:29:30 nebula4 stonith-ng[6486]:   notice: can_fence_host_with_device:
stonith-nebula3-IPMILAN can not fence nebula2: static-list
Apr  1 01:29:32 nebula4 stonith-ng[6486]:   notice: remote_op_done: Operation reboot of nebula2 by
nebula3 for crmd.6490 <at> nebula4.39eaf3a2: OK
Apr  1 01:29:32 nebula4 crmd[6490]:   notice: tengine_stonith_callback: Stonith operation
2/98:101:0:28913388-04df-49cb-9927-362b21a74014: OK (0)
Apr  1 01:29:32 nebula4 crmd[6490]:   notice: tengine_stonith_notify: Peer nebula2 was terminated
(reboot) by nebula3 for nebula4: OK (ref=39eaf3a2-d7e0-417d-8a01-d2f373973d6b) by client crmd.6490
Apr  1 01:29:32 nebula4 crmd[6490]:   notice: te_rsc_command: Initiating action 91: start
stonith-nebula1-IPMILAN_start_0 on nebula3
Apr  1 01:29:33 nebula4 crmd[6490]:   notice: run_graph: Transition 101 (Complete=13, Pending=0,
Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-22.bz2): Stopped
Apr  1 01:29:33 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula3: unknown error (1)
Apr  1 01:29:33 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula1: unknown error (1)
Apr  1 01:29:33 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000)
Apr  1 01:29:33 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000)
Apr  1 01:29:33 nebula4 pengine[6489]:   notice: process_pe_message: Calculated Transition 102: /var/lib/pacemaker/pengine/pe-input-129.bz2
Apr  1 01:29:33 nebula4 crmd[6490]:   notice: te_rsc_command: Initiating action 88: monitor
stonith-nebula1-IPMILAN_monitor_3600000 on nebula3
Apr  1 01:29:34 nebula4 crmd[6490]:   notice: run_graph: Transition 102 (Complete=1, Pending=0, Fired=0,
Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-129.bz2): Complete
Apr  1 01:29:34 nebula4 crmd[6490]:   notice: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr  1 01:30:01 nebula4 CRON[44640]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run
apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12
>/dev/null; fi)
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing
Apr  1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing
Apr  1 01:30:29 nebula4 lvm[6759]: Request timed-out (send: 1427844563, now: 1427844629)
Apr  1 01:30:29 nebula4 lvm[6759]: Request timed-out. padding
Apr  1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3
Apr  1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e781
Apr  1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3
Apr  1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e783
Apr  1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811138, state = 1
Apr  1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3
Apr  1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e784
Apr  1 01:30:29 nebula4 lvm[6759]: Got post command condition...
Apr  1 01:30:29 nebula4 lvm[6759]: Waiting for next pre command
Apr  1 01:30:29 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0
Apr  1 01:30:29 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0
Apr  1 01:30:29 nebula4 lvm[6759]: Send local reply
Apr  1 01:30:29 nebula4 lvm[6759]: Read on local socket 5, len = 30
Apr  1 01:30:29 nebula4 lvm[6759]: Got pre command condition...
Apr  1 01:30:29 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 6 (client=0x218eab0)
Apr  1 01:30:29 nebula4 lvm[6759]: unlock_resource: V_vg-one-0 lockid: 1
Apr  1 01:40:01 nebula4 CRON[47640]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run
apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12
>/dev/null; fi)
Apr  1 01:44:34 nebula4 crmd[6490]:   notice: do_state_transition: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr  1 01:44:34 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula3: unknown error (1)
Apr  1 01:44:34 nebula4 pengine[6489]:  warning: unpack_rsc_op: Processing failed op start for
stonith-nebula4-IPMILAN on nebula1: unknown error (1)
Apr  1 01:44:34 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000)
Apr  1 01:44:34 nebula4 pengine[6489]:  warning: common_apply_stickiness: Forcing
stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000)
Apr  1 01:44:34 nebula4 pengine[6489]:   notice: process_pe_message: Calculated Transition 103: /var/lib/pacemaker/pengine/pe-input-130.bz2
Apr  1 01:44:34 nebula4 crmd[6490]:   notice: run_graph: Transition 103 (Complete=0, Pending=0, Fired=0,
Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-130.bz2): Complete
Apr  1 01:44:34 nebula4 crmd[6490]:   notice: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr  1 01:45:01 nebula4 CRON[49089]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run
apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12
>/dev/null; fi)
Apr  1 01:46:01 nebula4 CRON[570]: (root) CMD (if test -x /usr/sbin/apticron; then /usr/sbin/apticron
--cron; else true; fi)
Apr  1 01:49:20 nebula4 lvm[6759]: Got new connection on fd 17
Apr  1 01:49:20 nebula4 lvm[6759]: Read on local socket 17, len = 30
Apr  1 01:49:20 nebula4 lvm[6759]: creating pipe, [18, 19]
Apr  1 01:49:20 nebula4 lvm[6759]: Creating pre&post thread
Apr  1 01:49:20 nebula4 lvm[6759]: Created pre&post thread, state = 0
Apr  1 01:49:20 nebula4 lvm[6759]: in sub thread: client = 0x218f1f0
Apr  1 01:49:20 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 1 (client=0x218f1f0)
Apr  1 01:49:20 nebula4 lvm[6759]: lock_resource 'V_vg-one-0', flags=0, mode=3
_______________________________________________
Users mailing list: Users <at> clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Frederik Ferner | 11 Mar 19:07 2015

Re: shutdown work (plan)

All,

just a reminder and update on the planned work for this shutdown (with
the obvious beamline typo fixed). The main new detail really is that
IIRC i02 will want us to power off everything before Saturday (i.e. some
time on Friday) and it will remain powered off for a while (a week or so).

IMHO it would be good if we could start considering the merge to UserMode
tomorrow, to make sure we are ready to check it out on Friday.

The Lustre I/O errors are hopefully going to go away with the Lustre 
upgrade on the servers, so no need to reproduce this until after the 
Lustre maintenance.

Lustre and GPFS client and kernel modules have been built, cfengine has 
been updated with the correct versions (I hope).

Rsyncing /dls/i18 to Lustre might have to be delayed slightly as 
lustre03 is currently ~88% full and I'd like to clear some/most of this 
first...

Richard, do you have more details on when the various switch upgrades etc.
have been scheduled?

Cheers,
Frederik

On 04/03/15 14:47, Frederik Ferner wrote:
> All,
>
> as the shutdown is only really 2.5 weeks (+easter weekend) and we're
> relatively short of people (+ HEPIX conference...) and I've been left in
> charge of planning the shutdown, I thought I'd suggest a draft plan for
> your consideration. As usual we'll have tomorrows meeting to discuss
> details...
>
> Richard, this doesn't yet include anything you may planned, I think I
> lost track...
>
> Basic dates:
> first day of shutdown: Friday March 13th.
> First machine startup day: Tuesday April 7th
> beamline startup: Thursday April 9th.
>
> Bank Holidays: April 3rd, April 6th.
> HEPIX (aka Greg, Tina and me away in Oxford, available in emergency):
> March 23rd-27th
>
> Beamline updates have been scheduled for 17th+18th March
> (Tuesday+Wednesday).
>
> I'm hoping that we'll manage to allocate (and setup) any requested IP
> addresses on the primary network early in the shutdown...
>
> So, initial plan:
>
> Friday 13th:
> * allocation IP addresses on Primary network (various FPs)
> * merge cfengine trunk to usermode early in the morning (and check out)
> * update stable repositories in the afternoon
> * start rsync for i18 to lustre
>
> Monday 16th:
> * generate final list of beamline machines to update
> * allocate who starts on which beamline
> * verify that usermode checkout has worked, check package installs on
> and cfengine on selected central servers, at least one Lustre
> client/GPFS client, maybe even one or two cluster nodes
> * switch /dls/i18 to lustre (needs to be arranged with beamline)
>
> Tuesday 17th and Wednesday 18th:
> * beamline updates
>
> Thursday 19th:
> * GPFS and Lustre at risk periods, server upgrades etc,[1]
> * start rolling upgrade of clusters?
> * central servers...
>
> Friday 20th:
> * mop up, general stuff, I'm sure there'll be loads...
> * more central servers...
>
> Week 23rd-27th:
> * primary archiver RAM+localhome expansion
> * ws in CIA23
> * investigate/fix b18-ws* powersave issue
> * DMZ work (mount production FS if they aren't already...)
> * install beamline servers[2]
> * sr06i-di-serv-01 re-install
> * beamline workstations installs/replacements: I'm sure there are some
>
> Week 30th-2nd:
> * anything that's left/come in during the shutdown...
>
>
> Additional stuff which could be started early:
> * attempt to reproduce cluster node slowdown (Intel want more data...)
> * attempt to reproduce NFS I/O errors on Lustre (again: Intel want more
> data)
> * archiving
> * data purging
>
> Preparation that still needs to happen:
> * compile Lustre client modules for target kernel
> * compile GPFS client modules for target kernel
> * update cfengine with kernel version to be installed
> * anything I've forgotten...
>
> [1] or do people feel this should be done before we forcefully reboot
> beamline machines?
> [2] if anyone has a list of all beamline servers we promised to
> install/replace, let me know, otherwise I'll try to generate the list
> before tomorrows meeting
>

-- 
Frederik Ferner
Senior Computer Systems Administrator   phone: +44 1235 77 8624
Diamond Light Source Ltd.               mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)

-- 
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are
for the use of the intended addressee only. If you are not the intended addressee or an authorised
recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy,
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light
Source Ltd. 
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we
cannot accept liability for any damage which you may sustain as a result of software viruses which may be
transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered
office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

--

-- 
Linux-cluster mailing list
Linux-cluster <at> redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

