Alessandro Baggi | 16 Apr 09:59 2014

Linux-HA General Problem

Hi list,
this is my first post on the linux-ha mailing list and I'm very new to the
Linux-HA environment.

I'm trying to install and run Linux-HA suite on Slackware64 14.1.

I've installed these versions of the software:

clusterglue-1.0.9-x86_64-2_SBo.tgz
resource-agents-v3.9.5-x86_64-1_SBo.tgz (tried also 3.1.1 and 3.9.2)
corosync-1.4.6-x86_64-1_SBo.tgz
crmsh-2.0.0-x86_64-1_SBo.tgz 		(tried also with crmsh-1.2.6)
drbd-tools-8.4.4-x86_64-1_SBo.tgz
libesmtp-1.0.4-x86_64-1_SBo.tgz
libnet-1.1.6-x86_64-1_SBo.tgz
libqb-0.17.0-x86_64-1_SBo.tgz
pacemaker-1.1.10-x86_64-2_SBo.tgz

(many packages are tagged with SBo, but only 2/3 of the packages are on
SlackBuilds for Slackware 14.1).

I've configured corosync for 2 nodes and configured the pacemaker service
with ver: 1.
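
For reference, running with ver: 1 means pacemaker is started by its own
init script rather than by the corosync plugin; the corresponding stanza in
corosync.conf looks roughly like this (a sketch, not my exact config):

	service {
		# load the Pacemaker plugin, but let the separate
		# pacemaker init script start the daemons (ver: 1)
		name: pacemaker
		ver: 1
	}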

After this, I ran:

	/etc/rc.d/rc.corosync start
	/etc/rc.d/pacemaker start

and everything went fine.
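
For reference, the cluster state can then be checked with e.g.:

	crm_mon -1

which prints a one-shot summary of nodes and resources.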
(Continue reading)

Ulrich Windl | 16 Apr 09:18 2014

Q: NTP Problem after Xen live migration

Hello!

I managed to migrate my Xen virtual machines live from one node to another. Some of the VMs have several GB of
RAM and run databases, so live migration takes a while. However, the typical user does not notice that the VM
migrated (network connections stay alive). Unfortunately NTP seems to notice the time the migration took
(77 s in this case):

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l    3   64  377    0.000    0.000   0.001
 132.199.176.18  192.53.103.108   2 u  180  512  377    0.239  -77753. 41560.8
 132.199.176.153 192.53.103.104   2 u  185  512  377    0.275  -77751. 41560.1
 172.20.16.1     132.199.176.153  3 u  191  512  377    0.378  -77753. 41560.5
 172.20.16.5     132.199.176.18   3 u  251  512  377    0.107  -77754. 41560.2

So it thinks the other servers are off by 77.7 seconds and syncs to its own clock.
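
For reference, a common way to let ntpd recover from such a one-off jump on
VMs is to relax its step/panic limits in ntp.conf, e.g.:

	tinker panic 0

or to restart ntpd with the -g option so that it may apply one large
initial correction.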

Other VMs show similar results:
# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l    6   64  377    0.000    0.000   0.001
 132.199.176.18  192.53.103.108   2 u  582 1024  377    0.242  -78361. 78359.9
 132.199.176.153 192.53.103.104   2 u  546 1024  377    0.521  -78360. 78360.1

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l   64   64  377    0.000    0.000   0.001
(Continue reading)

Maloja01 | 8 Apr 11:33 2014

Re: Master Became Slave - Cluster unstable $$$

On 04/08/2014 12:18 AM, Ammar Sheikh Saleh wrote:
> yes i have the command ... its CentOS

Then please review the man page of crm_master and try to adjust the
scores on the node where you want to start the master and on the node
where you want to start the slave. Before you follow my general steps
you could also ask again on the list about using crm_master from the
command line on CentOS - I am not really sure it is really the same there.

1. Check the current promotion scores using the pengine:
ptest -Ls | grep promo
-> You should get a list of scores per master/slave resource and node

2. Check the set crm_master score using crm_master:
crm_master -q -G -N <node> -l reboot -r <resource-INSIDE-masterslave>

3. Adjust the master/promotion scores (this is the most tricky part)
crm_master -v <NEW_MASTER_VALUE> -l reboot -r <resource-INSIDE-masterslave>
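
As a purely illustrative example (resource name, node names and score
values are made up), for a master/slave set containing the primitive
res_drbd on nodes node1 and node2, preferring node1 as master could look
like:

crm_master -q -G -N node1 -l reboot -r res_drbd   # read current score
crm_master -v 100 -N node1 -l reboot -r res_drbd  # raise score on node1
crm_master -v 5 -N node2 -l reboot -r res_drbd    # lower score on node2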

If you do not have constraints that were added earlier by bad operations,
this might help the cluster to promote the preferred site.

But my procedure comes without any warranty or further support, sorry.

Maloja01

>
>
> On Mon, Apr 7, 2014 at 4:16 PM, Maloja01 <maloja01 <at> arcor.de> wrote:
>
(Continue reading)

aalishe | 7 Apr 02:02 2014

Master Became Slave - Cluster unstable $$$

Hi all,

I am new to corosync/pacemaker and I have a 2-node "production" cluster
(corosync + pacemaker + DRBD):

Node1 = lws1h1.mydomain.com
Node2 = lws1h2.mydomain.com

They are in an online/online failover setup: services only run on the node
where DRBD is active; the other node stays online to take over if Node1
fails.

These are the software versions:
corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS: CentOS 6.4 x86_64

The cluster is configured with quorum (not sure what that is).
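
For reference, a two-node cluster normally cannot keep quorum when one node
fails, so such setups usually set

	crm configure property no-quorum-policy="ignore"

(not sure whether that is set here).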

A few days ago I placed one of the nodes in maintenance mode "after"
services were going bad due to a problem. I don't remember the details of
how I moved/migrated the resources, but I usually use the LCMC GUI tool. I
also restarted corosync/pacemaker a few times in a rather random way :$

After that, Node1 became slave and Node2 became master!

Services are now sticking to Node2 and I can't migrate them to Node1, even
by force (tried command-line tools and the LCMC tool).

(Continue reading)

Anna Hegedus | 5 Apr 04:55 2014

"Info" messages in syslog: Are they normal?

Hi Everyone,
I have to apologize in advance: I have used other solutions before for
monitoring servers and providing redundancy, but this is my first
experience with heartbeat.
I've inherited a set of servers from someone who is using DRBD to provide a
redundant installation of MySQL and another set of disks for Lucene. After
installing some monitoring software I started receiving these messages in
emails:
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-disk:0_last_failure_0 found resource lucene-disk:0 active on staircase.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation db-master-mysql_last_failure_0 found resource db-master-mysql active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-server_last_failure_0 found resource lucene-server active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-disk:1_last_failure_0 found resource lucene-disk:1 active in master mode on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-fs_last_failure_0 found resource lucene-fs active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: unpack_rsc_op: Operation lucene-ip_last_failure_0 found resource lucene-ip active on drawers.bup.prod.local
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-master-ip#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-master-mysql#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-slave-ip#011(Started staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   db-slave-mysql#011(Started staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-disk:0#011(Slave staircase.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-disk:1#011(Master drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-fs#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-ip#011(Started drawers.bup.prod.local)
Apr  4 22:14:15 staircase pengine: [31090]: notice: LogActions: Leave   lucene-server#011(Started
(Continue reading)

logfacility

Hello,

If I don't specify logfacility in ha.cf, then which logfacility is used?
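
For reference, the directive can be set explicitly in ha.cf, e.g.:

	logfacility local0

but I would like to know what heartbeat falls back to when the directive is
missing.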

Thanks,
David

_______________________________________________
Linux-HA mailing list
Linux-HA <at> lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Razvan Oncioiu | 19 Mar 21:16 2014

NFS failover fails with stale file handles while migrating resources

Hello, 

Running into a bit of a problem here. I set up two servers (CentOS 6)
with GlusterFS and a shared directory between them. I have moved the NFS
directory to the shared Gluster folder and created a symlink on both
boxes. The machines can talk to each other via hostnames, and Gluster
replication is handled via another ethernet card between the servers.

The problem I am having is that even though the resources fail over
correctly (though they seem to come up and down a few times while failing
over), I get stale NFS file handles on the client. Below is my crm config;
what am I doing wrong?

node GlusterFS01
node GlusterFS02
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.167" cidr_netmask="24" clusterip_hash="sourceip" \
        op monitor interval="5s"
primitive exportfs ocf:heartbeat:exportfs \
        params fsid="0" directory="/GlusterFS/Files" \
               options="rw,sync,no_subtree_check,no_root_squash" \
               clientspec="10.10.10.0/24" wait_for_leasetime_on_stop="false" \
        op monitor interval="5s" \
        op start interval="0s" timeout="240s" \
        op stop interval="0s" timeout="100s" \
        meta is-managed="true" target-role="Started"
primitive nfs lsb:nfs \
        meta target-role="Started" \
        op monitor interval="5s" timeout="5s"
(Continue reading)

Ulrich Windl | 19 Mar 08:28 2014

clvmd default options: why debug?

Hi!

I wonder why writing debug logs to syslog is enabled by default for clvmd in SLES11 (SP3):
---
crm(live)# ra info clvmd
clvmd resource agent (ocf:lvm2:clvmd)

This is a Resource Agent for both clvmd and cmirrord.
It starts clvmd and cmirrord as anonymous clones.

Parameters (* denotes required, [] the default):

daemon_timeout (string, [80]): Daemon Timeout
    Number of seconds to allow the control daemon to come up and down

daemon_options (string, [-d2]): Daemon Options
    Options to clvmd. Refer to clvmd.8 for detailed descriptions.

Operations' defaults (advisory minimum):

    start         timeout=90
    stop          timeout=100
    monitor       timeout=20
---
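
For what it's worth, the option can be overridden per resource, e.g. (the
resource name here is purely illustrative):

primitive clvmd ocf:lvm2:clvmd \
        params daemon_options="-d0"

(-d0 disables clvmd debug logging), but I would still like to know why -d2
is the shipped default.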

Regards,
Ulrich

(Continue reading)

Maloja01 | 14 Mar 11:32 2014

How to tell pacemaker to process a new event during a long-running resource operation

Hi all,

I have a resource which could in special cases have a very long-running 
start operation.

If I have a new event (like switching a standby node back to online)
during the already running transition (the cluster is still in
S_TRANSITION_ENGINE), I would like the cluster to process it as soon
as possible and not only after the other resource has come up.

Is that possible? I already tried batch-limit, but I guess that only
controls how many actions may run in parallel within a single transition,
right?
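
For reference, I set it as a cluster property, e.g.:

	crm configure property batch-limit="30"

(the value here is only an example).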

Thanks in advance

Kristoffer Grönlund | 11 Mar 11:37 2014

resource-agents: exportfs: Unlocking filesystems on stop by default

Hi,

I have created a submit request for the NFS exportfs agent to change
the default for the parameter unlock_on_stop from 0 to 1.

The resource agent should really release all locks which the NFS server
holds when stopping the resource. When this option was introduced, the
default was kept as 0 for compatibility reasons with legacy kernels,
existing users and non-Linux systems.
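
Users who want a fixed behaviour regardless of the default can of course
set the parameter explicitly; a minimal, purely illustrative example:

primitive p_exportfs ocf:heartbeat:exportfs \
        params directory="/srv/nfs" fsid="1" clientspec="10.0.0.0/24" \
               unlock_on_stop="1" \
        op monitor interval="30s"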

There has been a bit of discussion in the submit request about whether
changing the default makes sense, and right now things are leaning
towards making the change. [1]

Are there any objections from anyone else?

[1]: https://github.com/ClusterLabs/resource-agents/pull/394

Thanks,

--
// Kristoffer Grönlund
// kgronlund <at> suse.com

(Continue reading)

Thomas Schulte | 10 Mar 15:44 2014

Clear target-role attributes for sub-resources

Hi @all,

I received a question from my customer which I couldn't answer yet.
Maybe someone here could give me an explanation or a hint on this:

Imagine a simple resource like

-----
primitive pri_svc_apache lsb:apache2 \
        operations $id="pri_svc_apache-operations" \
        op monitor interval="30" on-fail="restart" timeout="20s"
-----

As the system runs, this resource might get stopped and started again.
Once the target-role changes for the first time, the target-role meta
attribute is added:

-----
primitive pri_svc_apache lsb:apache2 \
        operations $id="pri_svc_apache-operations" \
        op monitor interval="30" on-fail="restart" timeout="20s" \
        meta target-role="Started"
-----

Is there a way to automatically remove this kind of attribute if
it is equal to the cluster default?
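(Removing it by hand is easy enough, e.g. with
"crm resource meta pri_svc_apache delete target-role", but I am after
something automatic.)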
The problem here is that this resource might belong to a simple group:

-----
group grp_apache pri_svc_apache
-----
(Continue reading)

