Neale Ferguson | 28 Aug 21:11 2014

Delaying fencing during shutdown

Hi,
 In a two-node cluster, I shut down one of the nodes and the other node notices the shutdown, but on rare occasions
that node will then fence the node that is shutting down. Is this a situation where setting
post_fail_delay would be useful, or where the totem token timeout should be raised above its default?
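
For reference, both knobs live in cluster.conf on a cman-based stack. A minimal sketch with placeholder values, not recommendations (post_fail_delay is in seconds, token in milliseconds; I believe the cman default token timeout is 10000 ms):

<cluster config_version="2" name="example">
        <!-- seconds fenced waits after a member is declared failed before fencing it -->
        <fence_daemon post_join_delay="120" post_fail_delay="30"/>
        <!-- totem token timeout in milliseconds -->
        <totem token="20000"/>
        <!-- clusternodes, fencedevices and rm sections omitted -->
</cluster>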

Neale


Ferenc Wagner | 26 Aug 22:42 2014

locating a starting resource

Hi,

crm_resource --locate finds the hosting node of a running (successfully
started) resource just fine.  Is there a way to similarly find out the
location of a resource that is *being* started, i.e. one whose resource
agent is already running the start action but has not finished it yet?
-- 
Thanks,
Feri.


Vasil Valchev | 26 Aug 08:56 2014

totem token & post_fail_delay question

Hello,

I have a cluster that sometimes has intermittent network issues on the heartbeat network.
Unfortunately improving the network is not an option, so I am looking for a way to tolerate longer interruptions.

Previously it seemed to me that the post_fail_delay option was suitable, but after some research it might not be what I am looking for.

If I understand correctly, when a member leaves (due to token timeout) the cluster will wait post_fail_delay seconds before fencing it. If the member rejoins before that, will it still be fenced because it has previous state?
From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?

And lastly is increasing the totem token timeout the way to go?
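
In case it is useful, the token timeout is set in milliseconds via the totem element of cluster.conf, next to the fence_daemon delays; a sketch with illustrative values only, not tuned numbers:

<totem token="30000"/>
<fence_daemon post_join_delay="120" post_fail_delay="20"/>

As far as I understand it, post_fail_delay only postpones the fencing action after a member has already been declared lost; raising the token timeout is what makes the cluster tolerate longer interruptions before declaring the loss in the first place.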


Thanks,
Vasil Valchev
Ferenc Wagner | 22 Aug 02:37 2014

on exiting maintenance mode

Hi,

While my Pacemaker cluster was in maintenance mode, resources were moved
(by hand) between the nodes as I rebooted each node in turn.  In the end
the crm status output became perfectly empty, as the reboot of a given
node removed from the output the resources which were located on the
rebooted node at the time of entering maintenance mode.  I expected full
resource discovery on exiting maintenance mode, but it probably did not
happen, as the cluster started up resources already running on other
nodes, which is generally forbidden.  Given that all resources were
running (though possibly migrated during the maintenance), what would
have been the correct way of bringing the cluster out of maintenance
mode?  This should have required no resource actions at all.  Would
cleanup of all resources have helped?  Or is there a better way?
-- 
Thanks,
Feri.


Neale Ferguson | 20 Aug 06:45 2014

clvmd not terminating

We have a sporadic situation when attempting to shut down/restart both nodes of a two-node cluster.
One shuts down completely, but the other sometimes hangs with:

[root@aude2mq036nabzi ~]# service cman stop
Stopping cluster:
   Leaving fence domain... found dlm lockspace /sys/kernel/dlm/clvmd
fence_tool: cannot leave due to active systems
[FAILED]

When the other node is brought back up it has problems with clvmd:

# pvscan
  connect() failed on local socket: Connection refused
  Internal cluster locking initialisation failed.
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.

Sometimes it works fine, but very occasionally we get the above situation. I've encountered the fence
message before, usually when the fence devices were incorrectly configured, but in that case it would
always fail rather than fail only occasionally. Before I get too far into investigation mode, I wondered
whether the above symptoms ring any bells for anyone.

Neale


Laszlo Budai | 29 Jul 18:58 2014

RHEL 6.3 - qdisk lost

Dear All,

I have a two-node cluster (rgmanager-3.0.12.1-17.el6.x86_64) with
shared storage. The storage also contains the quorum disk.
There are some services, and there are some dependencies set between them.
We are testing what happens if the storage is disconnected from one
node (to see the cluster's response to such a failure).
So we start from a healthy cluster (everything OK) and disconnect the
storage from the first node.
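
(For readers not familiar with rgmanager: dependencies between services are usually declared with the depend/depend_mode attributes on the service element. The fragment below only illustrates that syntax with made-up service names; it is not the poster's actual configuration.)

<rm>
        <service name="A" autostart="1" recovery="restart">
                <!-- resources of A -->
        </service>
        <!-- B may only run while service A is running -->
        <service name="B" autostart="1" depend="service:A" depend_mode="hard" recovery="restart">
                <!-- resources of B -->
        </service>
</rm>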

What I have observed:
1. The cluster fences node 1.
2. Node 2 tries to start the services, but even though we have three
services (let's say B, C, D) that depend on a fourth one (say A),
the cluster tries to start them in this order: B, C, D, A.
Obviously it fails for B, C, D and gives us the following messages:

Jul 29 15:49:54 node1 rgmanager[5135]: service:B is not runnable; 
dependency not met
Jul 29 15:49:54 node2 rgmanager[5135]: Not stopping service:B: service 
is failed
Jul 29 15:49:54 node2 rgmanager[5135]: Unable to stop RG service:B in 
recoverable state

It leaves them in the "recoverable" state even after service A starts
successfully (so the dependency would be met by then). Why is this happening?

I would expect rgmanager to start the services in an order that
satisfies the dependency relationships, or, if it does not do that,
at least to react to the service state change event (service A has
started, so the dependencies should be evaluated again).

What can be done about it?

Thank you in advance,
Laszlo


C. Handel | 23 Jul 17:53 2014

corosync ring failure

hi,

I run a cluster with two corosync rings. One of the rings is marked
faulty every forty seconds, only to recover automatically a second
later; the other ring is stable.

I have no idea how I should debug this.

We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12, corosync 1.4.1.
The cluster consists of three machines. Ring1 runs on 10-gigabit
interfaces, Ring0 on 1-gigabit interfaces. Neither ring leaves its
respective switch.

Corosync communication is udpu; rrp_mode is passive.

cluster.conf:

<cluster config_version="30" name="aslfile">

<cman transport="udpu">
</cman>

<fence_daemon post_join_delay="120" post_fail_delay="30"/>

<fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk" action="off"/>
</fencedevices>

<quorumd
   cman_label="qdisk"
   device="/dev/mapper/mpath-091quorump1"
   min_score="1"
   votes="2"
   >
</quorumd>

<clusternodes>
<clusternode name="asl430m90" nodeid="430">
        <altname name="asl430"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl430m90"/>
                </method>
        </fence>
</clusternode>
<clusternode name="asl431m90" nodeid="431">
        <altname name="asl431"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl431m90"/>
                </method>
        </fence>
</clusternode>
<clusternode name="asl432m90" nodeid="432">
        <altname name="asl432"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl432m90"/>
                </method>
        </fence>
</clusternode>
</clusternodes>
</cluster>

syslog

Jul 23 17:48:34 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1
interface 140.181.134.212 FAULTY
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:14 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1
interface 140.181.134.212 FAULTY
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
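
One detail worth noting: the cluster.conf above carries no <totem> element, so the corosync rrp defaults apply. If tuning were attempted, the rrp-related totem options would go there; the option names below are from corosync 1.4's corosync.conf(5), the values are purely illustrative, and I am not certain that cman passes every totem attribute through from cluster.conf:

<totem rrp_mode="passive"
       rrp_problem_count_threshold="20"
       rrp_problem_count_timeout="2000"/>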

Greetings
   Christoph


Devin A. Bougie | 15 Jul 17:36 2014

mixed 6.4 and 6.5 cluster - delays accessing mpath devices and clustered lvm's

We have a cluster of EL6.4 servers, with one server now at fully updated EL6.5.  After upgrading to 6.5, we see
unreasonably long delays accessing some mpath devices and clustered LVMs on the 6.5 member.  There are no
problems with the 6.4 members.

This can be seen by strace'ing lvscan.  In the following example, the syscall time is at the end of each line;
reads showing ASCII text are against mpath devices, the rest are against volumes:

------
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.467385>
16241 read(5, "\17u\21^ LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 4096) = 4096 <1.760943>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.164032>
16241 read(5, "gment1 {\nstart_extent = 0\nextent"..., 4096) = 4096 <2.859972>
16241 read(5,
"\353H\220\20\216\320\274\0\260\270\0\0\216\330\216\300\373\276\0|\277\0\6\271\0\2\363\244\352!\6\0"...,
4096) = 4096 <1.717222>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.476014>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.800225>
16241 read(5,
"3\300\216\320\274\0|\216\300\216\330\276\0|\277\0\6\271\0\2\374\363\244Ph\34\6\313\373\271\4\0"...,
4096) = 4096 <2.008620>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <2.021734>
16241 read(5,
"3\300\216\320\274\0|\216\300\216\330\276\0|\277\0\6\271\0\2\374\363\244Ph\34\6\313\373\271\4\0"...,
4096) = 4096 <2.126359>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <2.036027>
16241 read(5, "\1\4\0\0\21\4\0\0!\4\0\0\331[\362\37\2\0\4\0\0\0\0\0\0\0\0\0\356\37U\23"...,
4096) = 4096 <1.330302>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.381982>
16241 read(5, "vgift3 {\nid = \"spdYGc-5hqc-ejzd-"..., 8192) = 8192 <0.922098>
16241 read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <2.440282>
16241 read(6, "vgift3 {\nid = \"spdYGc-5hqc-ejzd-"..., 8192) = 8192 <1.158817>
16241 read(5, "gment1 {\nstart_extent = 0\nextent"..., 4096) = 4096 <0.941814>
16241 read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <1.518448>
16241 read(6, "gment1 {\nstart_extent = 0\nextent"..., 20480) = 20480 <2.006777>
------

The delay can also be seen in the syslog messages we receive after restarting clvmd with debugging enabled.

------
Jul 14 11:47:58 lnx05 lvm[13423]: Got new connection on fd 5
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 28
Jul 14 11:48:03 lnx05 lvm[13423]: creating pipe, [11, 12]
Jul 14 11:48:03 lnx05 lvm[13423]: Creating pre&post thread
Jul 14 11:48:03 lnx05 lvm[13423]: Created pre&post thread, state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: in sub thread: client = 0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: doing PRE command LOCK_VG 'V_vgift5' at 1 (client=0x13e7460)
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: 'V_vgift5' mode:3 flags=0
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: returning lkid 24c0008
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3443, flags=0x1 (LOCAL)
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=28, csid=(nil), xid=3443
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: LOCK_VG (0x33) msg=0x13e7110, msglen =28, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: do_lock_vg: resource 'V_vgift5', cmd = 0x1 LCK_VG (READ|VG), flags =
0x4 ( DMEVENTD_MONITOR ), critical_section = 0
Jul 14 11:48:03 lnx05 lvm[13423]: Invalidating cached metadata for VG vgift5
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 1
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 31
Jul 14 11:48:03 lnx05 lvm[13423]: check_all_clvmds_running
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3444, flags=0x0 ()
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=31, csid=(nil), xid=3444
Jul 14 11:48:03 lnx05 lvm[13423]: Sending message to all cluster nodes
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: SYNC_NAMES (0x2d) msg=0x13e7110, msglen
=31, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx01-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 2 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx02-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 3 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx04-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 4 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx07-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 5 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx06-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 6 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx08-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 7 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx09-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 8 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx03-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 9 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 28
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: doing PRE command LOCK_VG 'V_vgift5' at 6 (client=0x13e7460)
Jul 14 11:48:03 lnx05 lvm[13423]: sync_unlock: 'V_vgift5' lkid:24c0008
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3445, flags=0x1 (LOCAL)
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=28, csid=(nil), xid=3445
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: LOCK_VG (0x33) msg=0x13e7110, msglen =28, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: do_lock_vg: resource 'V_vgift5', cmd = 0x6 LCK_VG (UNLOCK|VG), flags =
0x4 ( DMEVENTD_MONITOR ), critical_section = 0
Jul 14 11:48:03 lnx05 lvm[13423]: Invalidating cached metadata for VG vgift5
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 1
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 28
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: doing PRE command LOCK_VG 'V_vgift3' at 1 (client=0x13e7460)
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: 'V_vgift3' mode:3 flags=0
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: returning lkid 166000b
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3446, flags=0x1 (LOCAL)
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=28, csid=(nil), xid=3446
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: LOCK_VG (0x33) msg=0x13e7110, msglen =28, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: do_lock_vg: resource 'V_vgift3', cmd = 0x1 LCK_VG (READ|VG), flags =
0x4 ( DMEVENTD_MONITOR ), critical_section = 0
Jul 14 11:48:03 lnx05 lvm[13423]: Invalidating cached metadata for VG vgift3
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 1
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 31
Jul 14 11:48:03 lnx05 lvm[13423]: check_all_clvmds_running
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3447, flags=0x0 ()
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=31, csid=(nil), xid=3447
Jul 14 11:48:03 lnx05 lvm[13423]: Sending message to all cluster nodes
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: SYNC_NAMES (0x2d) msg=0x13e7110, msglen
=31, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx01-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 2 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx02-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 3 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx04-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 4 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx07-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 5 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx06-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 6 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx08-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 7 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx09-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 8 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx03-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 9 replies, expecting: 9
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 28
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: doing PRE command LOCK_VG 'V_vgift3' at 6 (client=0x13e7460)
Jul 14 11:48:03 lnx05 lvm[13423]: sync_unlock: 'V_vgift3' lkid:166000b
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3448, flags=0x1 (LOCAL)
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=28, csid=(nil), xid=3448
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: LOCK_VG (0x33) msg=0x13e7110, msglen =28, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: do_lock_vg: resource 'V_vgift3', cmd = 0x6 LCK_VG (UNLOCK|VG), flags =
0x4 ( DMEVENTD_MONITOR ), critical_section = 0
Jul 14 11:48:03 lnx05 lvm[13423]: Invalidating cached metadata for VG vgift3
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 1
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:03 lnx05 lvm[13423]: Read on local socket 5, len = 28
Jul 14 11:48:03 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: doing PRE command LOCK_VG 'V_vgift2' at 1 (client=0x13e7460)
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: 'V_vgift2' mode:3 flags=0
Jul 14 11:48:03 lnx05 lvm[13423]: sync_lock: returning lkid 3b20007
Jul 14 11:48:03 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: distribute command: XID = 3449, flags=0x1 (LOCAL)
Jul 14 11:48:03 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=28, csid=(nil), xid=3449
Jul 14 11:48:03 lnx05 lvm[13423]: process_work_item: local
Jul 14 11:48:03 lnx05 lvm[13423]: process_local_command: LOCK_VG (0x33) msg=0x13e7110, msglen =28, client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: do_lock_vg: resource 'V_vgift2', cmd = 0x1 LCK_VG (READ|VG), flags =
0x4 ( DMEVENTD_MONITOR ), critical_section = 0
Jul 14 11:48:03 lnx05 lvm[13423]: Invalidating cached metadata for VG vgift2
Jul 14 11:48:03 lnx05 lvm[13423]: Reply from node lnx05-p12: 0 bytes
Jul 14 11:48:03 lnx05 lvm[13423]: Got 1 replies, expecting: 1
Jul 14 11:48:03 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:03 lnx05 lvm[13423]: Got post command condition...
Jul 14 11:48:03 lnx05 lvm[13423]: Waiting for next pre command
Jul 14 11:48:03 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:03 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:03 lnx05 lvm[13423]: Send local reply
Jul 14 11:48:04 lnx05 lvm[13423]: Read on local socket 5, len = 31
Jul 14 11:48:04 lnx05 lvm[13423]: check_all_clvmds_running
Jul 14 11:48:04 lnx05 lvm[13423]: Got pre command condition...
Jul 14 11:48:04 lnx05 lvm[13423]: Writing status 0 down pipe 12
Jul 14 11:48:04 lnx05 lvm[13423]: Waiting to do post command - state = 0
Jul 14 11:48:04 lnx05 lvm[13423]: read on PIPE 11: 4 bytes: status: 0
Jul 14 11:48:04 lnx05 lvm[13423]: background routine status was 0, sock_client=0x13e7460
Jul 14 11:48:04 lnx05 lvm[13423]: distribute command: XID = 3450, flags=0x0 ()
Jul 14 11:48:04 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e2820. client=0x13e7460, msg=0x13e27f0,
len=31, csid=(nil), xid=3450
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25821 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25832 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25844 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25857 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25905 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
Jul 14 11:48:14 lnx05 lvm[13423]: add_to_lvmqueue: cmd=0x13e27f0. client=0x6c60c0,
msg=0x7fffd749a7dc, len=31, csid=0x7fffd749a75c, xid=0
Jul 14 11:48:14 lnx05 lvm[13423]: process_work_item: remote
Jul 14 11:48:14 lnx05 lvm[13423]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID
25914 on node lnx04-p12
Jul 14 11:48:14 lnx05 lvm[13423]: Syncing device names
Jul 14 11:48:14 lnx05 lvm[13423]: LVM thread waiting for work
------

Before we upgrade all cluster members to 6.5, we'd like to be reasonably certain that it will fix the problem
rather than spread it to the entire cluster.  Any help would be greatly appreciated.

Many thanks,
Devin


abdul mujeeb Siddiqui | 15 Jul 16:18 2014

Basename mismatch

Hello, I have implemented the Red Hat Linux 6.4 cluster suite and am trying to use Oracle 11gR2 on it, but the Oracle service is unable to start.
The listener is not starting.
If anyone has implemented Oracle 11gR2, please send me your cluster.conf and oracledb.sh, and also the listener.ora and tnsnames.ora files.
Thanks in advance

Laszlo Budai | 10 Jul 13:49 2014

Cman does not start when the quorum disk is not available

Dear all,

We have a RHEL 6.3 cluster of two nodes and a quorum disk.
We are testing the cluster against different failures. We have a problem
when the shared storage is disconnected from one of the nodes. The node
that has lost contact with the storage is fenced, but when the machine
restarts, cman will not start up (it tries to start but then stops):

Jul  9 17:55:54 clnode1p kdump: started up
Jul  9 17:55:54 clnode1p kernel: bond0: no IPv6 routers present
Jul  9 17:55:54 clnode1p kernel: DLM (built Jun 13 2012 18:26:45) installed
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Corosync Cluster 
Engine ('1.4.1'): started and ready to provide service.
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Corosync built-in 
features: nss dbus rdma snmp
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Successfully read 
config from /etc/cluster/cluster.conf
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Successfully parsed 
cman config
Jul  9 17:55:55 clnode1p corosync[2514]:   [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Jul  9 17:55:55 clnode1p corosync[2514]:   [TOTEM ] Initializing 
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul  9 17:55:55 clnode1p corosync[2514]:   [TOTEM ] The network 
interface [172.16.255.1] is now up.
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Using quorum 
provider quorum_cman
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync cluster quorum service v0.1
Jul  9 17:55:55 clnode1p corosync[2514]:   [CMAN  ] CMAN 3.0.12.1 (built 
May  8 2012 12:22:26) started
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync CMAN membership service 2.90
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: openais checkpoint service B.01.01
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync extended virtual synchrony service
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync configuration service
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync cluster closed process group service v1.01
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync cluster config database access v1.01
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync profile loading service
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Using quorum 
provider quorum_cman
Jul  9 17:55:55 clnode1p corosync[2514]:   [SERV  ] Service engine 
loaded: corosync cluster quorum service v0.1
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Compatibility mode 
set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul  9 17:55:55 clnode1p corosync[2514]:   [TOTEM ] A processor joined 
or left the membership and a new membership was formed.
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Members[1]: 1
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Members[1]: 1
Jul  9 17:55:55 clnode1p corosync[2514]:   [CPG   ] chosen downlist: 
sender r(0) ip(172.16.255.1) ; members(old:0 left:0)
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul  9 17:55:55 clnode1p corosync[2514]:   [TOTEM ] A processor joined 
or left the membership and a new membership was formed.
Jul  9 17:55:55 clnode1p corosync[2514]:   [CMAN  ] quorum regained, 
resuming activity
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] This node is within 
the primary component and will provide service.
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Members[2]: 1 2
Jul  9 17:55:55 clnode1p corosync[2514]:   [QUORUM] Members[2]: 1 2
Jul  9 17:55:55 clnode1p corosync[2514]:   [CPG   ] chosen downlist: 
sender r(0) ip(172.16.255.1) ; members(old:1 left:0)
Jul  9 17:55:55 clnode1p corosync[2514]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul  9 17:55:59 clnode1p kernel: bond1: no IPv6 routers present
Jul  9 17:55:59 clnode1p qdiskd[2564]: Loading dynamic configuration
Jul  9 17:55:59 clnode1p qdiskd[2564]: Setting votes to 1
Jul  9 17:55:59 clnode1p qdiskd[2564]: Loading static configuration
Jul  9 17:55:59 clnode1p qdiskd[2564]: Timings: 8 tko, 1 interval
Jul  9 17:55:59 clnode1p qdiskd[2564]: Timings: 2 tko_up, 4 master_wait, 
2 upgrade_wait
Jul  9 17:55:59 clnode1p qdiskd[2564]: Heuristic: '/bin/ping -c1 -w1 
clswitch1m' score=1 interval=2 tko=4
Jul  9 17:55:59 clnode1p qdiskd[2564]: Heuristic: '/bin/ping -c1 -w1 
clswitch2m' score=1 interval=2 tko=4
Jul  9 17:55:59 clnode1p qdiskd[2564]: 2 heuristics loaded
Jul  9 17:55:59 clnode1p qdiskd[2564]: Quorum Daemon: 2 heuristics, 1 
interval, 8 tko, 1 votes
Jul  9 17:55:59 clnode1p qdiskd[2564]: Run Flags: 00000271
Jul  9 17:55:59 clnode1p qdiskd[2564]: stat
Jul  9 17:55:59 clnode1p qdiskd[2564]: qdisk_validate: No such file or 
directory
Jul  9 17:55:59 clnode1p qdiskd[2564]: Specified partition 
/dev/mapper/apsto1-vd01-v001 does not have a qdisk label
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Unloading all 
Corosync service engines.
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync extended virtual synchrony service
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync configuration service
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync cluster closed process group service v1.01
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync cluster config database access v1.01
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync profile loading service
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: openais checkpoint service B.01.01
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync CMAN membership service 2.90
Jul  9 17:56:01 clnode1p corosync[2514]:   [SERV  ] Service engine 
unloaded: corosync cluster quorum service v0.1
Jul  9 17:56:01 clnode1p corosync[2514]:   [MAIN  ] Corosync Cluster 
Engine exiting with status 0 at main.c:1864.

And it will remain in this state even if the storage is reattached later
on, so now I have only one functioning node.
What can be done to fix this (to get the cluster framework started)?
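
For reference, the quorumd stanza that qdiskd is complaining about can point at the quorum disk either by device path or by the label written with mkqdisk; a sketch of the label form, with a made-up label name, in case the device-path lookup is part of the problem:

<quorumd label="myqdisk" min_score="1" votes="1">
        <heuristic program="/bin/ping -c1 -w1 clswitch1m" score="1" interval="2" tko="4"/>
</quorumd>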

Thank you,
Laszlo

-- 
Acceleris System Integration | and IT works

Laszlo Budai | Technical Consultant
Bvd. Barbu Vacarescu 80 | RO-020282 Bucuresti
t +40 21 23 11 538
laszlo.budai@acceleris.ro | www.acceleris.ro

Acceleris Offices are in:
Basel | Bucharest | Zollikofen | Renens | Kloten


Amjad Syed | 1 Jul 10:27 2014

Virtual IP service

Hello

I am trying to start a virtual IP service on my two-node cluster.

Here are the details of the network setup and configuration.

1. Bond (heartbeat). This is a private network with no switch involved; not reachable from the public network.
   node1: 192.168.10.11
   node2: 192.168.10.10

2. Fencing (iLO). This one goes through a switch.
   node1: 10.10.63.92
   node2: 10.10.63.93

3. Public IP addresses
   node1: 10.10.5.100
   node2: 10.10.5.20

I have set the virtual IP to 10.10.5.23 in cluster.conf:
  <service autostart="1" exclusive="0" name="IP" recovery="relocate">
                <ip address="10.10.5.23" monitor_link="on" sleeptime="10"/>
  </service>

However, this virtual IP does not work, since the cman communication is on the 192.168.10.x network. When I try to move cman to the 10.10.5.x network, the nodes go into a fence loop, i.e. they fence each other.

So I am asking: is there a "network preference" option or similar in cluster.conf that can map the virtual IP to the private network addresses?
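
As far as I know there is no such per-address network preference in cluster.conf. cman/corosync bind to whatever network the clusternode names resolve to, while rgmanager's ip resource simply adds the virtual IP to the local interface that already has an address in the same subnet, so the two can live on different networks. A sketch of that separation (the -hb hostnames are hypothetical and would have to resolve to the 192.168.10.x addresses, e.g. via /etc/hosts; fencing omitted):

<clusternodes>
        <!-- names resolve to the private heartbeat addresses (192.168.10.x) -->
        <clusternode name="node1-hb" nodeid="1"/>
        <clusternode name="node2-hb" nodeid="2"/>
</clusternodes>
<rm>
        <service autostart="1" exclusive="0" name="IP" recovery="relocate">
                <!-- the VIP is added to whichever local interface already has a 10.10.5.x address -->
                <ip address="10.10.5.23" monitor_link="on" sleeptime="10"/>
        </service>
</rm>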

Thank you