Tom Parker | 22 Apr 20:21 2014

SBD flipping between Pacemaker: UNHEALTHY and OK

Has anyone seen this?  Do you know what might be causing the flapping?

Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled.
Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device
/dev/mapper/qa-xen-sbd
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health
Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd
uuid: ae835596-3d26-4681-ba40-206b4d51149b
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS
quorum check enabled
Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with
cluster ...
Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device:
/dev/watchdog
Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45
seconds.
Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with
cluster ...
Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now.
Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK
Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY
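
A quick way to correlate this with cluster membership (a sketch, assuming
the standard pacemaker/corosync CLI tools are on qaxen6 and that corosync
logs to /var/log/messages):

  # does the node currently believe it has quorum?
  crm_mon -1 | grep -i quorum

  # are the corosync rings healthy, or are they flapping as well?
  corosync-cfgtool -s

  # line the sbd warnings up against corosync membership changes
  grep -E 'Quorum outdated|membership' /var/log/messages
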
(Continue reading)

Tom Parker | 22 Apr 20:15 2014

Resource blocked

Good morning

I am trying to restart resources on one of my clusters and I am getting
this message:

pengine[13397]:   notice: LogActions: Start   domtcot1-qa        (qaxen1
- blocked)

How can I find out why this resource is blocked?
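
A few places to look (a sketch, assuming crmsh and the usual pacemaker
tools; the resource name is the one from the log line above):

  # fail counts - a start often shows up as blocked after a failed stop
  # or when on-fail=block is set on an operation
  crm_mon -1 -f

  # placement scores and pending actions as the policy engine sees them
  crm_simulate -sL | grep -i domtcot1-qa

  # constraints or a target-role that might be pinning the resource
  crm configure show | grep -i domtcot1-qa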

Thanks
_______________________________________________
Linux-HA mailing list
Linux-HA <at> lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

David Vossel | 22 Apr 00:56 2014

Active/Active nfs server lock recovery?

Hey,

Has anyone had any success with deploying an Active/Active NFS server?  I'm curious how lock recovery is performed.

In a typical Active/Passive scenario we have an nfs-server instance coupled with the exportfs. The nfs
lock info is stored on some shared storage that follows the nfs server and the exportfs instances around
the cluster.  This allows us to alert the nfs clients after the failover that the server rebooted and that
they need to re-establish their locks.

With an Active/Active setup, we'd have multiple nfs servers and exportfs instances, none of which are tied
to one another, meaning that the exportfs resources could run on any of the nfs server instances within the
cluster.  On failover, we'd want the exportfs resources from a failed node to be taken over by an already
running nfs server on another node. In that case, does anyone know of a good way to alert the nfs
clients previously connected to the old (failed) node that they need to re-establish their locks with the
new node?  It seems like the statd info from both the failed node's nfs server and the new node's nfs server
would have to be merged or something.
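
For what it's worth, that "merge" might look roughly like the sketch below:
copy the failed node's statd monitor list into the surviving server's state
directory and re-notify the clients from the floating IP that moved (the
paths and the IP are hypothetical, and this is untested):

  # pull in the sm entries recorded by the failed node's statd
  cp -n /shared/statd-failed-node/sm/* /var/lib/nfs/statd/sm/

  # send reboot notifications sourced from the floating IP that just moved,
  # so clients reclaim their locks against the new server
  sm-notify -f -v 10.0.0.50 -P /var/lib/nfs/statd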

any thoughts?

-- Vossel
_______________________________________________
Linux-HA mailing list
Linux-HA <at> lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Alessandro Baggi | 16 Apr 09:59 2014

Linux-HA General Problem

Hi list,
this is my first post on the linux-ha ml and I'm very new to the linux-ha 
environment.

I'm trying to install and run Linux-HA suite on Slackware64 14.1.

I've installed these versions of the software:

clusterglue-1.0.9-x86_64-2_SBo.tgz
resource-agents-v3.9.5-x86_64-1_SBo.tgz (tried also 3.1.1 and 3.9.2)
corosync-1.4.6-x86_64-1_SBo.tgz
crmsh-2.0.0-x86_64-1_SBo.tgz 		(tried also with crmsh-1.2.6)
drbd-tools-8.4.4-x86_64-1_SBo.tgz
libesmtp-1.0.4-x86_64-1_SBo.tgz
libnet-1.1.6-x86_64-1_SBo.tgz
libqb-0.17.0-x86_64-1_SBo.tgz
pacemaker-1.1.10-x86_64-2_SBo.tgz

(many packages are tagged with SBo, but only 2/3 packages are on 
SlackBuilds for Slackware 14.1).

I've configured corosync for 2 nodes and configured the pacemaker service 
with ver: 1.
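
The corosync 1.x stanza for running pacemaker as a separate service
(ver: 1) typically looks like this, e.g. in /etc/corosync/service.d/pcmk
or at the end of corosync.conf:

	service {
		# load the pacemaker plugin, but start pacemakerd separately
		name: pacemaker
		ver:  1
	}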

After this, I ran:

	/etc/rc.d/rc.corosync start
	/etc/rc.d/pacemaker start

and all goes well.
(Continue reading)

Ulrich Windl | 16 Apr 09:18 2014

Q: NTP Problem after Xen live migration

Hello!

I managed to migrate my Xen virtual machines live from one node to another. Some of the VMs have several GB of
RAM with databases, so live migration takes a while. However, the typical user does not notice that the node
migrated (network connections stay alive). Unfortunately, NTP seems to notice the time the migration took
(77s in this case):

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l    3   64  377    0.000    0.000   0.001
 132.199.176.18  192.53.103.108   2 u  180  512  377    0.239  -77753. 41560.8
 132.199.176.153 192.53.103.104   2 u  185  512  377    0.275  -77751. 41560.1
 172.20.16.1     132.199.176.153  3 u  191  512  377    0.378  -77753. 41560.5
 172.20.16.5     132.199.176.18   3 u  251  512  377    0.107  -77754. 41560.2

So it thinks the other nodes are off by 77.7 seconds and syncs to its own clock.
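
The offset column is in milliseconds, so -77753 matches the ~77 s the
migration took. At ntpd's maximum slew rate of 500 ppm, correcting 77 s
would take almost two days, so one workaround is to step the clock once
after the migration finishes - a sketch, with the init script name varying
by distribution:

  /etc/init.d/ntp stop
  ntpd -gq      # -g allows one large correction, -q exits after setting it
  /etc/init.d/ntp start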

Other VMs show similar results:
# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l    6   64  377    0.000    0.000   0.001
 132.199.176.18  192.53.103.108   2 u  582 1024  377    0.242  -78361. 78359.9
 132.199.176.153 192.53.103.104   2 u  546 1024  377    0.521  -78360. 78360.1

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l   64   64  377    0.000    0.000   0.001
(Continue reading)

Maloja01 | 8 Apr 11:33 2014

Re: Master Became Slave - Cluster unstable $$$

On 04/08/2014 12:18 AM, Ammar Sheikh Saleh wrote:
> yes i have the command ... its CentOS

Then please review the man page of crm_master and try to adjust the 
scores for where you want to start the master and where you want to start 
the slave. Before you follow my general steps you could also ask again
on the list about using crm_master from the command line on CentOS - I am
not really sure that it is exactly the same.

1. Check the current promotion scores using the pengine:
ptest -Ls | grep promo
-> You should get a list of scores per master/slave resources and node

2. Check the set crm_master score using crm_master:
crm_master -q -G -N <node> -l reboot -r <resource-INSIDE-masterslave>

3. Adjust the master/promotion scores (this is the most tricky part)
crm_master -v <NEW_MASTER_VALUE> -l reboot -r <resource-INSIDE-masterslave>

Provided you do not have constraints added by earlier bad operations,
that should help the cluster to promote the preferred site.
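
To check for such leftovers, something like this should show them (assuming
crmsh; constraints with a cli- prefix were added by manual move/migrate
commands and can pin the master to one node):

  crm configure show | grep -E 'location|cli-'
  # a stray cli-prefer-* constraint can be removed with:
  #   crm resource unmigrate <resource>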

But my procedure comes without any warranty or further support, sorry.

Maloja01

>
>
> On Mon, Apr 7, 2014 at 4:16 PM, Maloja01 <maloja01 <at> arcor.de> wrote:
>
(Continue reading)

aalishe | 7 Apr 02:02 2014

Master Became Slave - Cluster unstable $$$

Hi all,

I am new to corosync/pacemaker and I have a 2-node "production" cluster
(corosync+pacemaker+drbd).

Node1 = lws1h1.mydomain.com
Node2 = lws1h2.mydomain.com

They are in an online/online failover setup: services only run on the node
where DRBD resides, and the other node stays online to take over if Node1
fails.

These are the SW versions:
corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS:  CentOS 6.4 x64bit

The cluster is configured with quorum (not sure what that is).

A few days ago I placed one of the nodes in maintenance mode "after" services
were going bad due to a problem. I don't remember the details of how I
moved/migrated the resources, but I usually use the LCMC GUI tool. I also
did some restarts of corosync / pacemaker in a random way :$

After that, Node1 became slave and Node2 became master!

Services are now stuck on Node2 and I can't migrate them to Node1 even by
force (tried command-line tools and the LCMC tool).
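
For reference, a quick way to see the current state before forcing anything
further - a sketch, assuming the standard drbd and pacemaker tools on both
nodes:

  # which node is DRBD Primary right now?
  cat /proc/drbd

  # resource state and fail counts as pacemaker sees them
  crm_mon -1 -f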

(Continue reading)

Anna Hegedus | 5 Apr 04:55 2014

"Info" messages in syslog: Are they normal?

Hi Everyone,
I have to apologize in advance. I have used other solutions before for monitoring servers and
providing redundancy, but this is my first experience with heartbeat.
I've inherited a set of servers from someone who is using drbd to provide a redundant installation of mysql
and another set of disks for lucene. After installing some monitoring software I started receiving these
messages in emails:
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-disk:0_last_failure_0 found resource lucene-disk:0 active on staircase.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation db-master-mysql_last_failure_0 found resource db-master-mysql active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-server_last_failure_0 found resource lucene-server active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-disk:1_last_failure_0 found resource lucene-disk:1 active in master mode on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-fs_last_failure_0 found resource lucene-fs active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-ip_last_failure_0 found resource lucene-ip active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-master-ip#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-master-mysql#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-slave-ip#011(Started staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-slave-mysql#011(Started staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-disk:0#011(Slave staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-disk:1#011(Master drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-fs#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-ip#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-server#011(Started
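
These are notice-level messages: unpack_rsc_op is recording where the
policy engine found each resource running, and "LogActions: Leave" means
it decided to change nothing. A quick way to confirm nothing is actually
failing (a sketch, assuming the pacemaker CLI tools are installed):

  # resource state plus fail counts
  crm_mon -1 -f

  # sanity-check the live configuration
  crm_verify -L -V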
(Continue reading)

logfacility

Hello,

If I don't specify logfacility in ha.cf, then which logfacility is used?
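
If you would rather be explicit than rely on the built-in default, the
usual sample ha.cf entries look like this (local0 is just the common
example value, not necessarily the compiled-in default):

  logfacility local0
  # or, to log to a file instead of syslog:
  # logfile /var/log/ha-log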

Thanks,
David

_______________________________________________
Linux-HA mailing list
Linux-HA <at> lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Razvan Oncioiu | 19 Mar 21:16 2014

NFS failover fails with stale file handles while migrating resources

Hello, 

Running into a bit of a problem here. I set up two servers (CentOS 6)
with GlusterFS and a shared directory between them; I have moved the nfs
directory to the shared Gluster folder and created a symlink on both
boxes. The machines can talk to each other via hostnames, and Gluster
replication is handled via another ethernet card between the servers.

The problem I am having is that even though the resources fail over
correctly (though it seems to come up and down a few times while failing
over), I get stale nfs handles on the client. Below is my crm config;
what am I doing wrong?
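
One thing worth ruling out first: stale handles usually mean the file
handles changed across the failover, so it is worth confirming that both
nodes export with the same fsid (a quick check, assuming nfs-utils):

  # run on whichever node currently holds ClusterIP; the fsid=0 option
  # should appear identically after the other node takes over
  exportfs -v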

node GlusterFS01
node GlusterFS02
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.167" cidr_netmask="24" clusterip_hash="sourceip"
\
        op monitor interval="5s"
primitive exportfs ocf:heartbeat:exportfs \
        params fsid="0" directory="/GlusterFS/Files"
options="rw,sync,no_subtree_check,no_root_squash" clientspec="10.10.10.0/24"
wait_for_leasetime_on_stop="false" \
        op monitor interval="5s" \
        op start interval="0s" timeout="240s" \
        op stop interval="0s" timeout="100s" \
        meta is-managed="true" target-role="Started"
primitive nfs lsb:nfs \
        meta target-role="Started" \
        op monitor interval="5s" timeout="5s"
(Continue reading)

Ulrich Windl | 19 Mar 08:28 2014

clvmd default options: why debug?

Hi!

I wonder why writing debug logs to syslog is enabled by default for clvmd in SLES11 (SP3):
---
crm(live)# ra info clvmd
clvmd resource agent (ocf:lvm2:clvmd)

This is a Resource Agent for both clvmd and cmirrord.
It starts clvmd and cmirrord as anonymous clones.

Parameters (* denotes required, [] the default):

daemon_timeout (string, [80]): Daemon Timeout
    Number of seconds to allow the control daemon to come up and down

daemon_options (string, [-d2]): Daemon Options
    Options to clvmd. Refer to clvmd.8 for detailed descriptions.

Operations' defaults (advisory minimum):

    start         timeout=90
    stop          timeout=100
    monitor       timeout=20
---
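
If the syslog noise is unwanted, the parameter can be overridden per
resource - a sketch, assuming crmsh and that the clone's primitive is
named clvmd (per clvmd(8), -d0 disables debug logging):

  # check the current value, then override the default -d2
  crm resource param clvmd show daemon_options
  crm resource param clvmd set daemon_options "-d0"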

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
(Continue reading)

