Fabio Massimo Di Nitto | 1 Nov 2009 16:34

Re: mount.gfs2 hangs on cluster-3.0.3

Dan Candea wrote:
> hi all
> 
> I really need some help.
> 
> I have set up a cluster 3.0.3 with 2.6.31 kernel
> All went well until I tried a gfs2 mount. The mount hangs without an error
> gfs_control dump reports nothing:
> 
> gfs_control dump
> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
> 1256941054 gfs_controld 3.0.3 started
> 1256941054 /cluster/gfs_controld/@plock_ownership is 1
> 1256941054 /cluster/gfs_controld/@plock_rate_limit is 0
> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
> 1256941054 group_mode 3 compat 0
> 
> 

Can you please provide your cluster.conf and setup information?

If you add:

<cluster..>
  <logging debug="on"/>
[other bits of the config here]

can you please tar /var/log/cluster/* and send it to us?
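
For the log collection requested above, one way to bundle the files is sketched here in Python. It is only an illustration and not part of the cluster tools: the function name archive_cluster_logs and the output path in the usage note are made up, and it assumes the daemons write plain files directly under /var/log/cluster.

```python
import tarfile
from pathlib import Path


def archive_cluster_logs(log_dir, out_path):
    """Pack every regular file directly under log_dir into a gzip tarball.

    Returns the sorted list of file names that were added.
    Subdirectories are skipped.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for entry in sorted(Path(log_dir).iterdir()):
            if entry.is_file():
                # Store only the base name so the archive unpacks flat.
                tar.add(str(entry), arcname=entry.name)
                added.append(entry.name)
    return added
```

Called as root with archive_cluster_logs("/var/log/cluster", "/tmp/cluster-logs.tar.gz"), this should produce a single file that can be attached to a reply. Write the archive outside the log directory so it does not try to include itself.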

Dan Candea | 1 Nov 2009 18:32

Re: mount.gfs2 hangs on cluster-3.0.3

Fabio Massimo Di Nitto wrote:
> Dan Candea wrote:
>> hi all
>>
>> I really need some help.
>>
>> I have set up a cluster 3.0.3 with 2.6.31 kernel
>> All went well until I tried a gfs2 mount. The mount hangs without an 
>> error
>> gfs_control dump reports nothing:
>>
>> gfs_control dump
>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
>> 1256941054 gfs_controld 3.0.3 started
>> 1256941054 /cluster/gfs_controld/@plock_ownership is 1
>> 1256941054 /cluster/gfs_controld/@plock_rate_limit is 0
>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
>> 1256941054 group_mode 3 compat 0
>>
>>
>
> Can you please provide your cluster.conf and setup information please?
>
> If you add:
>
> <cluster..>
>  <logging debug="on"/>
> [other bits of the config here]

Guido Günther | 1 Nov 2009 18:42

Re: ccs_config_validate in cluster 3.0.X

Hi Fabio,
On Sat, Oct 31, 2009 at 08:18:06AM +0100, Fabio Massimo Di Nitto wrote:
> Guido Günther wrote:
> >On Wed, Oct 28, 2009 at 11:36:30AM +0100, Fabio M. Di Nitto wrote:
> >>Hi everybody,
> >>
> >>as briefly mentioned in 3.0.4 release note, a new system to validate the
> >>configuration has been enabled in the code.
> >>
> >>What it does
> >>------------
> >>
> >>The general idea is to be able to perform as many sanity checks on the
> >>configuration as possible. This check allows us to spot the most common
> >>mistakes, such as typos or possibly invalid values, in cluster.conf.
> >This is great. For what it's worth: I've pushed Cluster 3.0.4 into
> >Debian experimental a couple of days ago.
> >Cheers,
> > -- Guido
> >
> 
> Hi Guido,
> 
> thanks for pushing the packages to Debian.
> 
> Please make sure to forward bugs related to this check so we can
> address them quickly.
Sure. Thanks.

> Lon updated the FAQ on our wiki to help debug issues related to RelaxNG.

Fabio Massimo Di Nitto | 2 Nov 2009 08:37

Re: mount.gfs2 hangs on cluster-3.0.3

Dan Candea wrote:
> Fabio Massimo Di Nitto wrote:
>> Dan Candea wrote:
>>> hi all
>>>
>>> I really need some help.
>>>
>>> I have set up a cluster 3.0.3 with 2.6.31 kernel
>>> All went well until I tried a gfs2 mount. The mount hangs without an 
>>> error
>>> gfs_control dump reports nothing:
>>>
>>> gfs_control dump
>>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
>>> 1256941054 gfs_controld 3.0.3 started
>>> 1256941054 /cluster/gfs_controld/@plock_ownership is 1
>>> 1256941054 /cluster/gfs_controld/@plock_rate_limit is 0
>>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
>>> 1256941054 group_mode 3 compat 0
>>>
>>>
>>
>> Can you please provide your cluster.conf and setup information please?
>>
>> If you add:
>>
>> <cluster..>
>>  <logging debug="on"/>

Dan Candea | 2 Nov 2009 10:59

Re: mount.gfs2 hangs on cluster-3.0.3

On Monday 02 November 2009 09:37:47 Fabio Massimo Di Nitto wrote:
> Dan Candea wrote:
> > Fabio Massimo Di Nitto wrote:
> >> Dan Candea wrote:
> >>> hi all
> >>>
> >>> I really need some help.
> >>>
> >>> I have set up a cluster 3.0.3 with 2.6.31 kernel
> >>> All went well until I tried a gfs2 mount. The mount hangs without an
> >>> error
> >>> gfs_control dump reports nothing:
> >>>
> >>> gfs_control dump
> >>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
> >>> 1256941054 gfs_controld 3.0.3 started
> >>> 1256941054 /cluster/gfs_controld/@plock_ownership is 1
> >>> 1256941054 /cluster/gfs_controld/@plock_rate_limit is 0
> >>> 1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
> >>> 1256941054 group_mode 3 compat 0
> >>
> >> Can you please provide your cluster.conf and setup information please?
> >>
> >> If you add:
> >>
> >> <cluster..>
> >>  <logging debug="on"/>
> >> [other bits of the config here]

Steven Whitehouse | 2 Nov 2009 12:42

Re: GFS2 processes getting stuck in WCHAN=dlm_posix_lock

Hi,

On Fri, 2009-10-30 at 19:27 -0400, Allen Belletti wrote:
> Hi All,
> 
> As I've mentioned before, I'm running a two-node clustered mail server 
> on GFS2 (with RHEL 5.4).  Nearly all of the time, everything works 
> great.  However, going all the way back to GFS1 on RHEL 5.1 (I think it 
> was), I've had occasional locking problems that force a reboot of one or 
> both cluster nodes.  Lately I've paid closer attention since it's been 
> happening more often.
> 
> I'll notice the problem when the load average starts rising.  It's 
> always tied to "stuck" processes, and I believe always tied to IMAP 
> clients (I'm running Dovecot.)  It seems like a file belonging to user 
> "x" (in this case, "jforrest") will become locked in some way, such that 
> every IMAP process tied to that user will get stuck on the same thing.  
> Over time, as the user keeps trying to read that file, more & more 
> processes accumulate.  They're always in state "D" (uninterruptible 
> sleep), and always on "dlm_posix_lock" according to WCHAN.  The only way 
> I'm able to get out of this state is to reboot.  If I let it persist for 
> too long, I/O generally stops entirely.
> 
> This certainly seems like it ought to have a definite solution, but I've 
> no idea what it is.  I've tried a variety of things using "find" to 
> pinpoint a particular file, but everything belonging to the affected 
> user seems just fine.  At least, I can read and copy all of the files, 
> and do a stat via ls -l.
> 
> Is it possible that this is a bug, not within GFS at all, but within 
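
The symptom described in the quoted message (processes in state "D" whose WCHAN is dlm_posix_lock) can be listed with a short filter over ps output. The following is a minimal sketch, assuming the text comes from `ps -eo pid,stat,wchan:32,user,args` on a procps-based system; the function name stuck_on is hypothetical.

```python
def stuck_on(ps_output, wchan="dlm_posix_lock"):
    """Return (pid, user, command) for processes in state D sleeping in wchan.

    ps_output is expected to have a header line followed by rows with the
    columns PID, STAT, WCHAN, USER and the full command line.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 4)
        if len(fields) < 5:
            continue  # blank or malformed row
        pid, stat, chan, user, cmd = fields
        # STAT may carry extra flags after the state letter, e.g. "Ds".
        if stat.startswith("D") and chan == wchan:
            stuck.append((int(pid), user, cmd))
    return stuck
```

Grouping the result by user would show whether all stuck processes belong to one mailbox owner, as reported above.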

Gianluca Cecchi | 2 Nov 2009 15:09

share experience migrating cluster suite from centos 5.3 to centos 5.4

Hello,
sorry in advance for the long e-mail.
I am trying to do what the subject says in a test environment, and I think the experience could be useful for others too, on both RHEL and CentOS.
I have configured two ip+fs services and HA-LVM.

The starting point is CentOS 5.3, updated to these component versions:
cman-2.0.98-1.el5_3.1
openais-0.80.3-22.el5_3.4
rgmanager-2.0.46-1.el5.centos.3
luci-0.12.1-7.3.el5.centos.1
ricci-0.12.1-7.3.el5.centos.1
lvm2-2.02.40-6.el5
device-mapper-multipath-0.4.7-23.el5_3.4

Target would be:
cman-2.0.115-1.el5_4.3
openais-0.80.6-8.el5_4.1
rgmanager-2.0.52-1.el5.centos.2
luci-0.12.2-6.el5.centos
ricci-0.12.2-6.el5.centos
lvm2-2.02.46-8.el5_4.1
device-mapper-multipath-0.4.7-30.el5_4.2

The nodes are guests in a Qemu-KVM environment and I have a backup of the starting situation, so I can replay the upgrade and, if needed, change the order of operations.

node1 is mork, node2 is mindy.
Approach attempted:
- services are on node2 (mindy)
- shut down and restart node1 in single user mode
- activate network and update node1 with:
  yum clean all
  yum update glibc\*
  yum update yum\* rpm\* python\*
  yum clean all
  yum update
  shutdown -r now (then start in single user mode to check that the node boots correctly)
- init 3 on node1 and join the cluster

QUESTION1: are there any incompatibilities in this first join of the cluster, given the different component versions?
In your opinion, would it be better to shut down node2, have node1 start alone and take over the services, then upgrade node2, so that the first simultaneous two-node join happens with aligned versions of the cluster software?

Now, following my approach, after the init 3 on node1 the cluster join went fine, but I forgot to touch the initrd file of the updated kernel.
This is needed because of a suboptimal check in the HA-LVM service that compares the timestamps of the running kernel's initrd and lvm.conf.
So clurgmgrd complains, because
-rw-r--r-- 1 root root 16433 Nov  2 12:28 /etc/lvm/lvm.conf
is newer than the initrd, which is dated end of September (see below):

Nov  2 12:41:00 mork kernel: DLM (built Sep 30 2009 12:53:28) installed
Nov  2 12:41:00 mork kernel: GFS2 (built Sep 30 2009 12:54:10) installed
Nov  2 12:41:00 mork kernel: Lock_DLM (built Sep 30 2009 12:54:16) installed
Nov  2 12:41:00 mork ccsd[2290]: Starting ccsd 2.0.115:
Nov  2 12:41:00 mork ccsd[2290]:  Built: Oct 26 2009 22:01:34
Nov  2 12:41:00 mork ccsd[2290]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Nov  2 12:41:00 mork ccsd[2290]: cluster.conf (cluster name = clumm, version = 5) found.
Nov  2 12:41:00 mork ccsd[2290]: Remote copy of cluster.conf is from quorate node.
Nov  2 12:41:00 mork ccsd[2290]:  Local version # : 5
Nov  2 12:41:00 mork ccsd[2290]:  Remote version #: 5
Nov  2 12:41:00 mork ccsd[2290]: Remote copy of cluster.conf is from quorate node.
Nov  2 12:41:00 mork ccsd[2290]:  Local version # : 5
Nov  2 12:41:00 mork ccsd[2290]:  Remote version #: 5
Nov  2 12:41:00 mork ccsd[2290]: Remote copy of cluster.conf is from quorate node.
Nov  2 12:41:00 mork ccsd[2290]:  Local version # : 5
Nov  2 12:41:00 mork ccsd[2290]:  Remote version #: 5
Nov  2 12:41:00 mork ccsd[2290]: Remote copy of cluster.conf is from quorate node.
Nov  2 12:41:00 mork ccsd[2290]:  Local version # : 5
Nov  2 12:41:00 mork ccsd[2290]:  Remote version #: 5
Nov  2 12:41:00 mork openais[2302]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Nov  2 12:41:00 mork openais[2302]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Nov  2 12:41:00 mork openais[2302]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Nov  2 12:41:00 mork openais[2302]: [MAIN ] AIS Executive Service: started and ready to provide service.
Nov  2 12:41:00 mork openais[2302]: [MAIN ] Using default multicast address of 239.192.12.183
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Token Timeout (162000 ms) retransmit timeout (8019 ms)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] token hold (6405 ms) retransmits before loss (20 retrans)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500
Nov  2 12:41:00 mork openais[2302]: [TOTEM] send threads (0 threads)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] RRP token expired timeout (8019 ms)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] RRP token problem counter (2000 ms)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] RRP threshold (10 problem count)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] RRP mode set to none.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] heartbeat_failures_allowed (0)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] max_network_delay (50 ms)
Nov  2 12:41:00 mork openais[2302]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov  2 12:41:00 mork openais[2302]: [TOTEM] The network interface [172.16.0.11] is now up.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Created or loaded sequence id 336.172.16.0.11 for this ring.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering GATHER state from 15.
Nov  2 12:41:00 mork openais[2302]: [CMAN ] CMAN 2.0.115 (built Oct 26 2009 22:01:42) started
Nov  2 12:41:00 mork openais[2302]: [MAIN ] Service initialized 'openais CMAN membership service 2.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais extended virtual synchrony service'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster membership service B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais availability management framework B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais checkpoint service B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais event service B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais distributed locking service B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais message service B.01.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais configuration service'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster closed process group service v1.01'
Nov  2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster config database access v1.01'
Nov  2 12:41:00 mork openais[2302]: [SYNC ] Not using a virtual synchrony filter.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Creating commit token because I am the rep.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Saving state aru 0 high seq received 0
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Storing new sequence id for ring 154
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering COMMIT state.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering RECOVERY state.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] position [0] member 172.16.0.11:
Nov  2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 336 rep 172.16.0.11
Nov  2 12:41:00 mork openais[2302]: [TOTEM] aru 0 high delivered 0 received flag 1
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Did not need to originate any messages in recovery.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Sending initial ORF token
Nov  2 12:41:00 mork openais[2302]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:41:00 mork openais[2302]: [CLM  ] New Configuration:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Left:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Joined:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:41:00 mork openais[2302]: [CLM  ] New Configuration:
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.11) 
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Left:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Joined:
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.11) 
Nov  2 12:41:00 mork openais[2302]: [SYNC ] This node is within the primary component and will provide service.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering OPERATIONAL state.
Nov  2 12:41:00 mork openais[2302]: [CLM  ] got nodejoin message 172.16.0.11
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering GATHER state from 11.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Creating commit token because I am the rep.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Saving state aru a high seq received a
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Storing new sequence id for ring 158
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering COMMIT state.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering RECOVERY state.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] position [0] member 172.16.0.11:
Nov  2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 340 rep 172.16.0.11
Nov  2 12:41:00 mork openais[2302]: [TOTEM] aru a high delivered a received flag 1
Nov  2 12:41:00 mork openais[2302]: [TOTEM] position [1] member 172.16.0.12:
Nov  2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 340 rep 172.16.0.12
Nov  2 12:41:00 mork openais[2302]: [TOTEM] aru d high delivered d received flag 1
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Did not need to originate any messages in recovery.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] Sending initial ORF token
Nov  2 12:41:00 mork openais[2302]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:41:00 mork openais[2302]: [CLM  ] New Configuration:
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.11) 
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Left:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Joined:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:41:00 mork openais[2302]: [CLM  ] New Configuration:
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.11) 
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.12) 
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Left:
Nov  2 12:41:00 mork openais[2302]: [CLM  ] Members Joined:
Nov  2 12:41:00 mork openais[2302]: [CLM  ]     r(0) ip(172.16.0.12) 
Nov  2 12:41:00 mork openais[2302]: [SYNC ] This node is within the primary component and will provide service.
Nov  2 12:41:00 mork openais[2302]: [TOTEM] entering OPERATIONAL state.
Nov  2 12:41:00 mork openais[2302]: [CMAN ] quorum regained, resuming activity
Nov  2 12:41:00 mork openais[2302]: [CLM  ] got nodejoin message 172.16.0.11
Nov  2 12:41:00 mork openais[2302]: [CLM  ] got nodejoin message 172.16.0.12
Nov  2 12:41:00 mork openais[2302]: [CPG  ] got joinlist message from node 2
Nov  2 12:41:01 mork ccsd[2290]: Initial status:: Quorate
Nov  2 12:41:01 mork qdiskd[2331]: <info> Quorum Daemon Initializing
Nov  2 12:41:02 mork qdiskd[2331]: <info> Heuristic: 'ping -c1 -w1 192.168.122.1' UP
Nov  2 12:41:12 mork modclusterd: startup succeeded
Nov  2 12:41:12 mork kernel: dlm: Using TCP for communications
Nov  2 12:41:12 mork kernel: dlm: connecting to 2
Nov  2 12:41:12 mork kernel: dlm: got connection from 2
Nov  2 12:41:12 mork clurgmgrd[2886]: <notice> Resource Group Manager Starting
Nov  2 12:41:13 mork oddjobd: oddjobd startup succeeded
Nov  2 12:41:13 mork saslauthd[3338]: detach_tty      : master pid is: 3338
Nov  2 12:41:13 mork saslauthd[3338]: ipc_init        : listening on socket: /var/run/saslauthd/mux
Nov  2 12:41:14 mork ricci: startup succeeded
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> HA LVM:  Improper setup detected
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> HA LVM:  Improper setup detected
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> - initrd image needs to be newer than lvm.conf
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> - initrd image needs to be newer than lvm.conf
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> WARNING: An improper setup can cause data corruption!
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err> WARNING: An improper setup can cause data corruption!
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err>   node2   owns vg_cl1/lv_cl1 unable to stop
Nov  2 12:41:14 mork clurgmgrd: [2886]: <err>   node2   owns vg_cl2/lv_cl2 unable to stop
Nov  2 12:41:14 mork clurgmgrd[2886]: <notice> stop on lvm "CL2" returned 1 (generic error)
Nov  2 12:41:14 mork clurgmgrd[2886]: <notice> stop on lvm "CL1" returned 1 (generic error)
Nov  2 12:41:31 mork qdiskd[2331]: <info> Node 2 is the master
Nov  2 12:42:21 mork qdiskd[2331]: <info> Initial score 1/1
Nov  2 12:42:21 mork qdiskd[2331]: <info> Initialization complete
Nov  2 12:42:21 mork openais[2302]: [CMAN ] quorum device registered
Nov  2 12:42:21 mork qdiskd[2331]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
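
The "initrd image needs to be newer than lvm.conf" errors in the log above come from a timestamp comparison. The following is a minimal Python sketch of that kind of comparison, not the actual rgmanager code: the function names are hypothetical, and the initrd location (on RHEL/CentOS 5 typically /boot/initrd-<kernel version>.img) is an assumption to verify on the node.

```python
import os


def initrd_is_newer(initrd_path, lvm_conf_path="/etc/lvm/lvm.conf"):
    """True if the initrd was modified more recently than lvm.conf."""
    return os.stat(initrd_path).st_mtime > os.stat(lvm_conf_path).st_mtime


def touch(path):
    """Set the access and modification times of path to the current time."""
    os.utime(path, None)
```

Running the comparison before starting rgmanager would flag the situation described here. Note that touch() only updates the timestamp; it does not change what the initrd contains.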

Note that clustat on both nodes gives correct results (in the sense that both nodes are cluster members, rgmanager is active on both, and the quorum disk is present).

At this point, after touching the initrd file, I decided to run shutdown -r on mork again to see if all goes well.
It seems so, as I again get:
...
Nov  2 12:46:23 mork openais[2278]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:46:23 mork openais[2278]: [CLM  ] New Configuration:
Nov  2 12:46:23 mork openais[2278]: [CLM  ]     r(0) ip(172.16.0.11) 
Nov  2 12:46:23 mork openais[2278]: [CLM  ]     r(0) ip(172.16.0.12) 
Nov  2 12:46:23 mork openais[2278]: [CLM  ] Members Left:
Nov  2 12:46:23 mork openais[2278]: [CLM  ] Members Joined:
Nov  2 12:46:23 mork openais[2278]: [CLM  ]     r(0) ip(172.16.0.12) 
Nov  2 12:46:23 mork openais[2278]: [SYNC ] This node is within the primary component and will provide service.
Nov  2 12:46:23 mork openais[2278]: [TOTEM] entering OPERATIONAL state.
Nov  2 12:46:23 mork openais[2278]: [CMAN ] quorum regained, resuming activity
Nov  2 12:46:23 mork openais[2278]: [CLM  ] got nodejoin message 172.16.0.11
Nov  2 12:46:23 mork openais[2278]: [CLM  ] got nodejoin message 172.16.0.12
Nov  2 12:46:23 mork openais[2278]: [CPG  ] got joinlist message from node 2
Nov  2 12:46:24 mork ccsd[2267]: Initial status:: Quorate
Nov  2 12:46:25 mork qdiskd[2310]: <info> Quorum Daemon Initializing
Nov  2 12:46:26 mork qdiskd[2310]: <info> Heuristic: 'ping -c1 -w1 192.168.122.1' UP
...
Nov  2 12:46:35 mork modclusterd: startup succeeded
Nov  2 12:46:35 mork kernel: dlm: Using TCP for communications
Nov  2 12:46:35 mork kernel: dlm: connecting to 2
Nov  2 12:46:36 mork oddjobd: oddjobd startup succeeded
Nov  2 12:46:36 mork saslauthd[2990]: detach_tty      : master pid is: 2990
Nov  2 12:46:36 mork saslauthd[2990]: ipc_init        : listening on socket: /var/run/saslauthd/mux
Nov  2 12:46:36 mork ricci: startup succeeded
Nov  2 12:46:55 mork qdiskd[2310]: <info> Node 2 is the master
Nov  2 12:47:45 mork qdiskd[2310]: <info> Initial score 1/1
Nov  2 12:47:45 mork qdiskd[2310]: <info> Initialization complete
Nov  2 12:47:45 mork openais[2278]: [CMAN ] quorum device registered
Nov  2 12:47:45 mork qdiskd[2310]: <notice> Score sufficient for master operation (1/1; required=1); upgrading

But on mindy I get this error instead; the node runs out of memory and I have to power it off:
Nov  2 12:47:54 mindy kernel: dlm: connect from non cluster node

I don't know whether the cluster problem is the cause or the effect of the out-of-memory condition.

In particular, these are the messages on mindy during the first join of the cluster and the reboot of mork:
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] entering GATHER state from 11.
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] Saving state aru d high seq received d
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] Storing new sequence id for ring 158
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] entering COMMIT state.
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] entering RECOVERY state.
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] position [0] member 172.16.0.11:
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] previous ring seq 340 rep 172.16.0.11
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] aru a high delivered a received flag 1
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] position [1] member 172.16.0.12:
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] previous ring seq 340 rep 172.16.0.12
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] aru d high delivered d received flag 1
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] Did not need to originate any messages in recovery.
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] New Configuration:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.12) 
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] Members Left:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] Members Joined:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] New Configuration:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.11) 
Nov  2 12:42:20 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.12) 
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] Members Left:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] Members Joined:
Nov  2 12:42:20 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.11) 
Nov  2 12:42:20 mindy openais[2465]: [SYNC ] This node is within the primary component and will provide service.
Nov  2 12:42:20 mindy openais[2465]: [TOTEM] entering OPERATIONAL state.
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] got nodejoin message 172.16.0.11
Nov  2 12:42:20 mindy openais[2465]: [CLM  ] got nodejoin message 172.16.0.12
Nov  2 12:42:20 mindy openais[2465]: [CPG  ] got joinlist message from node 2
Nov  2 12:42:32 mindy kernel: dlm: connecting to 1
Nov  2 12:42:32 mindy kernel: dlm: got connection from 1
Nov  2 12:46:16 mindy clurgmgrd[3101]: <notice> Member 1 shutting down
Nov  2 12:46:26 mindy qdiskd[2508]: <info> Node 1 shutdown
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] entering GATHER state from 12.
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] Saving state aru 3e high seq received 3e
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] Storing new sequence id for ring 160
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] entering COMMIT state.
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] entering RECOVERY state.
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] position [0] member 172.16.0.11:
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] previous ring seq 348 rep 172.16.0.11
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] aru a high delivered a received flag 1
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] position [1] member 172.16.0.12:
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] previous ring seq 344 rep 172.16.0.11
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] aru 3e high delivered 3e received flag 1
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] Did not need to originate any messages in recovery.
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] New Configuration:
Nov  2 12:47:43 mindy kernel: dlm: closing connection to node 1
Nov  2 12:47:43 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.11) 
Nov  2 12:47:43 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.12) 
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] Members Left:
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] Members Joined:
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] New Configuration:
Nov  2 12:47:43 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.11) 
Nov  2 12:47:43 mindy openais[2465]: [CLM  ]    r(0) ip(172.16.0.12) 
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] Members Left:
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] Members Joined:
Nov  2 12:47:43 mindy openais[2465]: [SYNC ] This node is within the primary component and will provide service.
Nov  2 12:47:43 mindy openais[2465]: [TOTEM] entering OPERATIONAL state.
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] got nodejoin message 172.16.0.11
Nov  2 12:47:43 mindy openais[2465]: [CLM  ] got nodejoin message 172.16.0.12
Nov  2 12:47:43 mindy openais[2465]: [CPG  ] got joinlist message from node 2
Nov  2 12:47:54 mindy kernel: dlm: connect from non cluster node
Nov  2 12:59:48 mindy kernel: dlm_send invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
Nov  2 12:59:48 mindy kernel:
Nov  2 12:59:48 mindy kernel: Call Trace:
Nov  2 12:59:48 mindy kernel:  [<ffffffff800c3a6a>] out_of_memory+0x8e/0x2f5
Nov  2 12:59:48 mindy kernel:  [<ffffffff8009dba4>] autoremove_wake_function+0x0/0x2e
Nov  2 12:59:48 mindy kernel:  [<ffffffff8000f2eb>] __alloc_pages+0x245/0x2ce
Nov  2 12:59:48 mindy kernel:  [<ffffffff8000f10b>] __alloc_pages+0x65/0x2ce
Nov  2 12:59:48 mindy kernel:  [<ffffffff80017493>] cache_grow+0x137/0x395
Nov  2 12:59:48 mindy kernel:  [<ffffffff8005bbf7>] cache_alloc_refill+0x136/0x186
Nov  2 12:59:48 mindy kernel:  [<ffffffff8000a96e>] kmem_cache_alloc+0x6c/0x76
Nov  2 12:59:48 mindy kernel:  [<ffffffff80043ae3>] sk_alloc+0x2e/0xf3
Nov  2 12:59:48 mindy kernel:  [<ffffffff80059676>] inet_create+0x137/0x267
Nov  2 12:59:49 mindy kernel:  [<ffffffff8004c9af>] __sock_create+0x170/0x27c
Nov  2 12:59:49 mindy kernel:  [<ffffffff8839086e>] :dlm:process_send_sockets+0x0/0x179
Nov  2 12:59:49 mindy kernel:  [<ffffffff883902f4>] :dlm:tcp_connect_to_sock+0x70/0x1de
Nov  2 12:59:49 mindy kernel:  [<ffffffff80063097>] thread_return+0x62/0xfe
Nov  2 12:59:49 mindy kernel:  [<ffffffff8839088e>] :dlm:process_send_sockets+0x20/0x179
Nov  2 12:59:49 mindy kernel:  [<ffffffff8839086e>] :dlm:process_send_sockets+0x0/0x179
Nov  2 12:59:49 mindy kernel:  [<ffffffff8004d159>] run_workqueue+0x94/0xe4
Nov  2 12:59:49 mindy kernel:  [<ffffffff800499da>] worker_thread+0x0/0x122
Nov  2 12:59:49 mindy kernel:  [<ffffffff8009d98c>] keventd_create_kthread+0x0/0xc4
Nov  2 12:59:49 mindy kernel:  [<ffffffff80049aca>] worker_thread+0xf0/0x122
Nov  2 12:59:49 mindy kernel:  [<ffffffff8008a4b3>] default_wake_function+0x0/0xe
Nov  2 12:59:49 mindy kernel:  [<ffffffff8009d98c>] keventd_create_kthread+0x0/0xc4
Nov  2 12:59:49 mindy kernel:  [<ffffffff8009d98c>] keventd_create_kthread+0x0/0xc4
Nov  2 12:59:49 mindy kernel:  [<ffffffff80032380>] kthread+0xfe/0x132
Nov  2 12:59:49 mindy kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  2 12:59:49 mindy kernel:  [<ffffffff8009d98c>] keventd_create_kthread+0x0/0xc4
Nov  2 12:59:49 mindy kernel:  [<ffffffff8804e024>] :ext3:ext3_journal_dirty_data+0x0/0x34
Nov  2 12:59:49 mindy kernel:  [<ffffffff80032282>] kthread+0x0/0x132
Nov  2 12:59:49 mindy kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov  2 12:59:49 mindy kernel:
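
To pull the key events out of a long syslog excerpt like the one above, a small filter can help. This is a sketch under assumptions: it expects the classic "Mon DD HH:MM:SS host tag: message" syslog layout seen in these logs, and the names LINE_RE and find_events are made up for illustration.

```python
import re

# Timestamp, host, tag (e.g. "kernel" or "openais[2465]"), then the message.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) ([^:]+?): (.*)$")


def find_events(log_text, host, patterns):
    """Return (timestamp, tag, message) for lines from host that contain
    any of the given substrings."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines or empty kernel messages
        stamp, line_host, tag, message = match.groups()
        if line_host != host:
            continue
        if any(p in message for p in patterns):
            events.append((stamp, tag, message))
    return events
```

Applied to the mindy log with the patterns "non cluster node" and "oom-killer", it would show the dlm connect error at 12:47:54 and the first oom-killer invocation at 12:59:48, roughly twelve minutes apart.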


Both nodes are Qemu-KVM x86_64 guests, each assigned 1 GB of RAM and 2 CPUs.
I can send a copy of cluster.conf if needed.

Thanks in advance for your comments.
Gianluca

<div><p>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] send threads (0 threads) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] RRP token expired timeout (8019 ms) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] RRP token problem counter (2000 ms) <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] RRP threshold (10 problem count) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] RRP mode set to none. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] heartbeat_failures_allowed (0) <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] max_network_delay (50 ms) <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed &gt; 0 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] The network interface [172.16.0.11] is now up. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Created or loaded sequence id 336.172.16.0.11 for this ring. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering GATHER state from 15. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CMAN ] CMAN 2.0.115 (built Oct 26 2009 22:01:42) started <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais extended virtual synchrony service' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster membership service B.01.01' <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais availability management framework B.01.01' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais checkpoint service B.01.01' <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais event service B.01.01' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais distributed locking service B.01.01' <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais message service B.01.01' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais configuration service' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster closed process group service v1.01' <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [SERV ] Service initialized 'openais cluster config database access v1.01' <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SYNC ] Not using a virtual synchrony filter. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Creating commit token because I am the rep. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Saving state aru 0 high seq received 0 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Storing new sequence id for ring 154 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering COMMIT state. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering RECOVERY state. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] position [0] member <a href="http://172.16.0.11">172.16.0.11</a>: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 336 rep 172.16.0.11 <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] aru 0 high delivered 0 received flag 1 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Did not need to originate any messages in recovery. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Sending initial ORF token <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Joined: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Left: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SYNC ] This node is within the primary component and will provide service. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering OPERATIONAL state. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] got nodejoin message 172.16.0.11 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering GATHER state from 11. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Creating commit token because I am the rep. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Saving state aru a high seq received a <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Storing new sequence id for ring 158 <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering COMMIT state. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering RECOVERY state. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] position [0] member <a href="http://172.16.0.11">172.16.0.11</a>: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 340 rep 172.16.0.11 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] aru a high delivered a received flag 1 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] position [1] member <a href="http://172.16.0.12">172.16.0.12</a>: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] previous ring seq 340 rep 172.16.0.12 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] aru d high delivered d received flag 1 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Did not need to originate any messages in recovery. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] Sending initial ORF token <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] New Configuration: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] Members Joined: <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [SYNC ] This node is within the primary component and will provide service. <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [TOTEM] entering OPERATIONAL state. <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CMAN ] quorum regained, resuming activity <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] got nodejoin message 172.16.0.11 <br>Nov&nbsp; 2 12:41:00 mork openais[2302]: [CLM&nbsp; ] got nodejoin message 172.16.0.12 <br>
Nov&nbsp; 2 12:41:00 mork openais[2302]: [CPG&nbsp; ] got joinlist message from node 2 <br>Nov&nbsp; 2 12:41:01 mork ccsd[2290]: Initial status:: Quorate <br>Nov&nbsp; 2 12:41:01 mork qdiskd[2331]: &lt;info&gt; Quorum Daemon Initializing <br>
Nov&nbsp; 2 12:41:02 mork qdiskd[2331]: &lt;info&gt; Heuristic: 'ping -c1 -w1 192.168.122.1' UP <br>Nov&nbsp; 2 12:41:12 mork modclusterd: startup succeeded<br>Nov&nbsp; 2 12:41:12 mork kernel: dlm: Using TCP for communications<br>
Nov&nbsp; 2 12:41:12 mork kernel: dlm: connecting to 2<br>Nov&nbsp; 2 12:41:12 mork kernel: dlm: got connection from 2<br>Nov&nbsp; 2 12:41:12 mork clurgmgrd[2886]: &lt;notice&gt; Resource Group Manager Starting <br>Nov&nbsp; 2 12:41:13 mork oddjobd: oddjobd startup succeeded<br>
Nov&nbsp; 2 12:41:13 mork saslauthd[3338]: detach_tty&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : master pid is: 3338<br>Nov&nbsp; 2 12:41:13 mork saslauthd[3338]: ipc_init&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : listening on socket: /var/run/saslauthd/mux<br>Nov&nbsp; 2 12:41:14 mork ricci: startup succeeded<br>
Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; HA LVM:&nbsp; Improper setup detected <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; HA LVM:&nbsp; Improper setup detected <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; - initrd image needs to be newer than lvm.conf <br>
Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; - initrd image needs to be newer than lvm.conf <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; WARNING: An improper setup can cause data corruption! <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt; WARNING: An improper setup can cause data corruption! <br>
Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt;&nbsp;&nbsp; node2&nbsp;&nbsp; owns vg_cl1/lv_cl1 unable to stop <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd: [2886]: &lt;err&gt;&nbsp;&nbsp; node2&nbsp;&nbsp; owns vg_cl2/lv_cl2 unable to stop <br>Nov&nbsp; 2 12:41:14 mork clurgmgrd[2886]: &lt;notice&gt; stop on lvm "CL2" returned 1 (generic error) <br>
Nov&nbsp; 2 12:41:14 mork clurgmgrd[2886]: &lt;notice&gt; stop on lvm "CL1" returned 1 (generic error) <br>Nov&nbsp; 2 12:41:31 mork qdiskd[2331]: &lt;info&gt; Node 2 is the master <br>Nov&nbsp; 2 12:42:21 mork qdiskd[2331]: &lt;info&gt; Initial score 1/1 <br>
Nov&nbsp; 2 12:42:21 mork qdiskd[2331]: &lt;info&gt; Initialization complete <br>Nov&nbsp; 2 12:42:21 mork openais[2302]: [CMAN ] quorum device registered <br>Nov&nbsp; 2 12:42:21 mork qdiskd[2331]: &lt;notice&gt; Score sufficient for master operation (1/1; required=1); upgrading <br><br>Note that clustat on both nodes gives correct results (in the sense that both nodes take part in the cluster, rgmanager is active on both, and the quorum disk is present).<br><br>At this point, after touching the initrd file, I decided to do a shutdown -r of mork again to see if all goes well.<br>
It seems so, as I get again:<br>...<br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>
Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>
Nov&nbsp; 2 12:46:23 mork openais[2278]: [SYNC ] This node is within the primary component and will provide service. <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [TOTEM] entering OPERATIONAL state. <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CMAN ] quorum regained, resuming activity <br>
Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] got nodejoin message 172.16.0.11 <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CLM&nbsp; ] got nodejoin message 172.16.0.12 <br>Nov&nbsp; 2 12:46:23 mork openais[2278]: [CPG&nbsp; ] got joinlist message from node 2 <br>
Nov&nbsp; 2 12:46:24 mork ccsd[2267]: Initial status:: Quorate <br>Nov&nbsp; 2 12:46:25 mork qdiskd[2310]: &lt;info&gt; Quorum Daemon Initializing <br>Nov&nbsp; 2 12:46:26 mork qdiskd[2310]: &lt;info&gt; Heuristic: 'ping -c1 -w1 192.168.122.1' UP <br>
...<br>Nov&nbsp; 2 12:46:35 mork modclusterd: startup succeeded<br>Nov&nbsp; 2 12:46:35 mork kernel: dlm: Using TCP for communications<br>Nov&nbsp; 2 12:46:35 mork kernel: dlm: connecting to 2<br>Nov&nbsp; 2 12:46:36 mork oddjobd: oddjobd startup succeeded<br>
Nov&nbsp; 2 12:46:36 mork saslauthd[2990]: detach_tty&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : master pid is: 2990<br>Nov&nbsp; 2 12:46:36 mork saslauthd[2990]: ipc_init&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; : listening on socket: /var/run/saslauthd/mux<br>Nov&nbsp; 2 12:46:36 mork ricci: startup succeeded<br>
Nov&nbsp; 2 12:46:55 mork qdiskd[2310]: &lt;info&gt; Node 2 is the master <br>Nov&nbsp; 2 12:47:45 mork qdiskd[2310]: &lt;info&gt; Initial score 1/1 <br>Nov&nbsp; 2 12:47:45 mork qdiskd[2310]: &lt;info&gt; Initialization complete <br>Nov&nbsp; 2 12:47:45 mork openais[2278]: [CMAN ] quorum device registered <br>
Nov&nbsp; 2 12:47:45 mork qdiskd[2310]: &lt;notice&gt; Score sufficient for master operation (1/1; required=1); upgrading <br><br>On mindy, however, I get this error, the node runs out of memory, and I have to power it off:<br>
Nov&nbsp; 2 12:47:54 mindy kernel: dlm: connect from non cluster node<br><br>I don't know whether the cluster problem is the cause or the effect of the out-of-memory condition.<br><br>In particular, these are the messages on mindy during the first join of the cluster and the reboot of mork:<br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] entering GATHER state from 11. <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] Saving state aru d high seq received d <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] Storing new sequence id for ring 158 <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] entering COMMIT state. <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] entering RECOVERY state. <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] position [0] member <a href="http://172.16.0.11">172.16.0.11</a>: <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] previous ring seq 340 rep 172.16.0.11 <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] aru a high delivered a received flag 1 <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] position [1] member <a href="http://172.16.0.12">172.16.0.12</a>: <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] previous ring seq 340 rep 172.16.0.12 <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] aru d high delivered d received flag 1 <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] Did not need to originate any messages in recovery. <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] Members Left: <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [SYNC ] This node is within the primary component and will provide service. <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [TOTEM] entering OPERATIONAL state. <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] got nodejoin message 172.16.0.11 <br>
Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CLM&nbsp; ] got nodejoin message 172.16.0.12 <br>Nov&nbsp; 2 12:42:20 mindy openais[2465]: [CPG&nbsp; ] got joinlist message from node 2 <br>Nov&nbsp; 2 12:42:32 mindy kernel: dlm: connecting to 1<br>Nov&nbsp; 2 12:42:32 mindy kernel: dlm: got connection from 1<br>
Nov&nbsp; 2 12:46:16 mindy clurgmgrd[3101]: &lt;notice&gt; Member 1 shutting down <br>Nov&nbsp; 2 12:46:26 mindy qdiskd[2508]: &lt;info&gt; Node 1 shutdown <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] entering GATHER state from 12. <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] Saving state aru 3e high seq received 3e <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] Storing new sequence id for ring 160 <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] entering COMMIT state. <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] entering RECOVERY state. <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] position [0] member <a href="http://172.16.0.11">172.16.0.11</a>: <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] previous ring seq 348 rep 172.16.0.11 <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] aru a high delivered a received flag 1 <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] position [1] member <a href="http://172.16.0.12">172.16.0.12</a>: <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] previous ring seq 344 rep 172.16.0.11 <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] aru 3e high delivered 3e received flag 1 <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] Did not need to originate any messages in recovery. <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] New Configuration: <br>Nov&nbsp; 2 12:47:43 mindy kernel: dlm: closing connection to node 1<br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] Members Joined: <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] CLM CONFIGURATION CHANGE <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] New Configuration: <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.11)&nbsp; <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ]&nbsp;&nbsp;&nbsp; r(0) ip(172.16.0.12)&nbsp; <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] Members Left: <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] Members Joined: <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [SYNC ] This node is within the primary component and will provide service. <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [TOTEM] entering OPERATIONAL state. <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] got nodejoin message 172.16.0.11 <br>
Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CLM&nbsp; ] got nodejoin message 172.16.0.12 <br>Nov&nbsp; 2 12:47:43 mindy openais[2465]: [CPG&nbsp; ] got joinlist message from node 2 <br>Nov&nbsp; 2 12:47:54 mindy kernel: dlm: connect from non cluster node<br>
Nov&nbsp; 2 12:59:48 mindy kernel: dlm_send invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0<br>Nov&nbsp; 2 12:59:48 mindy kernel: <br>Nov&nbsp; 2 12:59:48 mindy kernel: Call Trace:<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff800c3a6a&gt;] out_of_memory+0x8e/0x2f5<br>
Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff8009dba4&gt;] autoremove_wake_function+0x0/0x2e<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff8000f2eb&gt;] __alloc_pages+0x245/0x2ce<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff8000f10b&gt;] __alloc_pages+0x65/0x2ce<br>
Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff80017493&gt;] cache_grow+0x137/0x395<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff8005bbf7&gt;] cache_alloc_refill+0x136/0x186<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff8000a96e&gt;] kmem_cache_alloc+0x6c/0x76<br>
Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff80043ae3&gt;] sk_alloc+0x2e/0xf3<br>Nov&nbsp; 2 12:59:48 mindy kernel:&nbsp; [&lt;ffffffff80059676&gt;] inet_create+0x137/0x267<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8004c9af&gt;] __sock_create+0x170/0x27c<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8839086e&gt;] :dlm:process_send_sockets+0x0/0x179<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff883902f4&gt;] :dlm:tcp_connect_to_sock+0x70/0x1de<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff80063097&gt;] thread_return+0x62/0xfe<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8839088e&gt;] :dlm:process_send_sockets+0x20/0x179<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8839086e&gt;] :dlm:process_send_sockets+0x0/0x179<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8004d159&gt;] run_workqueue+0x94/0xe4<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff800499da&gt;] worker_thread+0x0/0x122<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8009d98c&gt;] keventd_create_kthread+0x0/0xc4<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff80049aca&gt;] worker_thread+0xf0/0x122<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8008a4b3&gt;] default_wake_function+0x0/0xe<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8009d98c&gt;] keventd_create_kthread+0x0/0xc4<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8009d98c&gt;] keventd_create_kthread+0x0/0xc4<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff80032380&gt;] kthread+0xfe/0x132<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8005dfb1&gt;] child_rip+0xa/0x11<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8009d98c&gt;] keventd_create_kthread+0x0/0xc4<br>
Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8804e024&gt;] :ext3:ext3_journal_dirty_data+0x0/0x34<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff80032282&gt;] kthread+0x0/0x132<br>Nov&nbsp; 2 12:59:49 mindy kernel:&nbsp; [&lt;ffffffff8005dfa7&gt;] child_rip+0x0/0x11<br>
Nov&nbsp; 2 12:59:49 mindy kernel: <br><br><br>Both nodes are Qemu-KVM x86_64 guests, each assigned 1 GB of RAM and 2 CPUs.<br>I can send a copy of cluster.conf if needed.<br><br>Thanks in advance for your comments.<br>Gianluca<br><br></p></div>
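The "initrd image needs to be newer than lvm.conf" error in the log above comes down to a comparison of two file modification times. A minimal sketch of such a comparison follows; the function name and the default paths are assumptions for a RHEL/CentOS 5 layout, and this is not the code rgmanager itself runs.

```shell
# Hypothetical helper: succeeds only when the initrd is newer than lvm.conf.
# Both paths can be passed as arguments; the defaults are assumptions.
initrd_newer_than_lvmconf() {
    local initrd="${1:-/boot/initrd-$(uname -r).img}"
    local lvmconf="${2:-/etc/lvm/lvm.conf}"
    # -nt is true when the first file has a later modification time
    [ "$initrd" -nt "$lvmconf" ]
}
```

Running `initrd_newer_than_lvmconf || echo "initrd is older than lvm.conf"` on a node before starting rgmanager would flag the situation described in the message, where lvm.conf was dated Nov 2 and the initrd end of September.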
Alain RICHARD | 2 Nov 2009 16:44

rgmanager vm.sh using virsh under RHEL5.4

I have had a look at the current vm.sh script and found the following:

a) use_virsh = 1 by default
b) if your resource has a path attribute, vm.sh automatically reverts to use_virsh=0, even if you hard-code use_virsh=1!
c) there is no option to specify the XML file that virsh uses to create the VM. It always tries "virsh create name", where name is the VM name.

Point a) is a little bit silly, because if you have a RHEL 5.3 cluster that uses xm configuration files, your VMs will no longer launch after the upgrade: vm.sh tries "virsh create name" instead of "xm create name". It would probably have been cleaner to default to "use_virsh = 0" to keep compatibility.

Point b) preserves compatibility for people who use the path attribute to store VM configuration files in a place shared by all members of the cluster (a gfs2 or nfs directory, for example). This behaviour should have been documented, because it is a little bit magical to see a resource with use_virsh=1 actually use xm and not virsh!

Point c) is very silly, because it restricts the configuration to being loaded from /etc/xen, even for kvm! Also, this directory is not shared among the members of the cluster, and the configuration file must have the same name as the VM (we prefer to call it name.xml).

Also, there is no problem using "virsh create /path/to/file.xml" under RHEL 5.4, and I have found that the cluster 3.0 stable branch has a new vm.sh that uses an xmlpath attribute to solve this problem. Why was this version not backported to RHEL 5.4? Is there any plan to do it?
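The selection behaviour described in points a) to c) can be summarised as a small sketch. It only mirrors what the message reports; it is not the actual vm.sh source, the function name is hypothetical, and how the path attribute is passed on to xm is deliberately left out.

```shell
# Illustrative only: reproduces the behaviour described in points a) to c).
# Arguments: VM name, use_virsh setting (may be empty), path attribute (may be empty).
vm_create_command() {
    local name="$1" use_virsh="${2:-1}" path="$3"   # a) use_virsh defaults to 1
    # b) a non-empty path attribute forces xm, whatever use_virsh says
    if [ -n "$path" ]; then
        use_virsh=0
    fi
    if [ "$use_virsh" = "1" ]; then
        # c) virsh is always given the bare VM name, never an XML file path
        echo "virsh create $name"
    else
        echo "xm create $name"
    fi
}
```

For example, `vm_create_command guest1 1 /shared/xen` yields an xm command even though use_virsh=1 was requested, which is the surprising case raised in point b).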

Regards,

-- 

Alain RICHARD <mailto:alain.richard <at> equation.fr>

EQUATION SA <http://www.equation.fr/>

Tel : +33 477 79 48 00     Fax : +33 477 79 48 01

E-Liance, Opérateur des entreprises et collectivités,

Liaisons Fibre optique, SDSL et ADSL <http://www.e-liance.fr>



Alain RICHARD | 2 Nov 2009 17:59

qdiskd master election and loss of quorum

I am currently using an n-node configuration with a qdiskd process to survive the failure of n-1 nodes.

The simplest case is a two-node cluster:

<cluster config_version="79" name="xxx">
        <totem token="42000"/>
        <clusternodes>
        <cman expected_votes="3" two_node="0"/>
                <clusternode name="n1" nodeid="1" votes="1">
                        <fence>
...
                        </fence>
                </clusternode>
                <clusternode name="n2" nodeid="2" votes="1">
                        <fence>
...
                        </fence>
                </clusternode>
        </clusternodes>
        <quorumd cman_label="qdisk1" device="/dev/yyy" interval="2" tko="10" votes="1" reboot="0" allow_kill="0" status_file="/qdiskstat">
        </quorumd>
        <rm>
...
        </rm>
</cluster>

I sometimes experience a loss of quorum on the other node when I gracefully shut down a node using the following:

# service rgmanager stop
# service gfs2 stop
# service clvmd stop
# service qdiskd stop
# service cman stop


After looking at the problem more closely, I discovered that the node I shut down is the qdisk master. When I stop qdiskd and cman on the first node, the second node loses the qdisk vote (because it sees that the qdisk master is no longer available and starts the election of a new master) and, almost simultaneously, loses the first node's vote because that node has left the cluster.

The effect is that the second node loses quorum for about 20 seconds, the time it needs to elect itself qdisk master. The problem is that rgmanager sees the loss of quorum and shuts down all the virtual machines under its control!
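The vote arithmetic behind this can be checked with a short sketch. It assumes the usual cman rule that quorum is reached at floor(expected_votes / 2) + 1; the function names are hypothetical.

```shell
# Quorum threshold for a given expected_votes, assuming floor(n/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when the votes currently present reach the threshold.
# Arguments: expected_votes, votes present.
is_quorate() {
    local expected="$1" present="$2"
    [ "$present" -ge "$(quorum_needed "$expected")" ]
}
```

With expected_votes="3" as in the configuration above, the threshold is 2. Losing only the peer node (2 votes left) or only the qdisk vote (2 votes left) keeps the survivor quorate, but losing both at once leaves 1 vote, so `is_quorate 3 1` fails until the qdisk vote comes back.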

If I wait 20 seconds between "service qdiskd stop" and "service cman stop", I don't get the problem, because the second node has time to elect itself master.
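That workaround can be written as a small wrapper around the same service commands. This is only a sketch: the function name and the RUN and QDISK_WAIT variables are hypothetical, and the 20-second default is simply the delay reported above (it happens to equal interval × tko from the quorumd line, but treating that as the real bound is an assumption).

```shell
# Sketch of the graceful leave with a pause before cman stops.
# RUN=echo gives a dry run; QDISK_WAIT overrides the pause in seconds.
leave_cluster() {
    local run="${RUN:-}" wait_s="${QDISK_WAIT:-20}"
    local svc
    for svc in rgmanager gfs2 clvmd qdiskd; do
        $run service "$svc" stop || return 1
    done
    # Give the surviving node time to elect itself qdisk master
    $run sleep "$wait_s"
    $run service cman stop
}
```

Calling `RUN=echo leave_cluster` prints the commands without executing them, which is a safe way to check the order before using it on a real node.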

I thought qdiskd was supposed to maintain quorum independently of cman communication.

Either I am making a mistake or misusing qdiskd, or something needs to change in the handling of qdiskd votes.

One solution might be for a node that was not the qdiskd master, and was contributing the qdisk vote to cman, to keep that vote until a new master election succeeds, instead of withdrawing it until the re-election completes?

Regards,

-- 

Alain RICHARD <mailto:alain.richard <at> equation.fr>

EQUATION SA <http://www.equation.fr/>

Tel : +33 477 79 48 00     Fax : +33 477 79 48 01

E-Liance, Opérateur des entreprises et collectivités,

Liaisons Fibre optique, SDSL et ADSL <http://www.e-liance.fr>



David Teigland | 2 Nov 2009 18:11

Re: GFS2 processes getting stuck in WCHAN=dlm_posix_lock

On Fri, Oct 30, 2009 at 07:27:23PM -0400, Allen Belletti wrote:
> I'll notice the problem when the load average starts rising.  It's 
> always tied to "stuck" processes, and I believe always tied to IMAP 
> clients (I'm running Dovecot.)  It seems like a file belonging to user 
> "x" (in this case, "jforrest") will become locked in some way, such that 
> every IMAP process tied to that user will get stuck on the same thing.  
> Over time, as the user keeps trying to read that file, more & more 
> processes accumulate.  They're always in state "D" (uninterruptible 
> sleep), and always on "dlm_posix_lock" according to WCHAN.  The only way 
> I'm able to get out of this state is to reboot.  If I let it persist for 
> too long, I/O generally stops entirely.

Next time, try to collect all the following information as soon as you can
after the first process gets stuck:

- ps showing pid of stuck/"D" process(es) and WCHAN
- which file they are stuck trying to lock
  (and the inode number of it, you may need to wait until after the
   reboot to use ls -li on the file to get the inode number)
- group_tool dump plocks <fsname> from all the nodes
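
The list above could be gathered with something like the following sketch. The function names and the output file name are made up for illustration. The ps output format (pid, stat, wchan, args) is standard procps, and group_tool dump plocks is the command named above. Run it on every node.

```shell
#!/bin/sh
# List processes in uninterruptible sleep ("D") together with the
# kernel function they are waiting in (WCHAN). The header line is kept.
stuck_procs() {
    ps -eo pid,stat,wchan:32,args | awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical wrapper: write everything to one file per node.
#   $1 = filesystem name, as passed to group_tool
#   $2 = optional path of the file the processes are stuck on
collect_plock_info() {
    fsname=$1
    file=$2
    out="plock-debug-$(hostname)-$(date +%s).txt"
    {
        echo "== stuck processes =="
        stuck_procs
        if [ -n "$file" ]; then
            # If this blocks, leave $2 out and get the inode after the reboot.
            echo "== inode of $file =="
            ls -li "$file"
        fi
        echo "== group_tool dump plocks $fsname =="
        group_tool dump plocks "$fsname"
    } > "$out" 2>&1
    echo "$out"
}
```

For example, collect_plock_info <fsname> /path/to/suspect/file prints the name of the file it wrote, which can then be attached to a reply.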

I'm guessing that dovecot does some "unusual" combinations of locking,
closing, renaming, unlinking files.  Those combinations are especially
prone to races and bugs that cause posix lock state to get off.

Dave

