Ulrich Windl | 18 Dec 09:48 2014

Q: device-mapper: dm-log-userspace: [35cRCORE] Request timed out: [5/4] - retrying

Hello!

I have a four-node test cluster running SLES11 SP3 with HAE. One of the nodes has a problem I cannot solve:
the node repeatedly reboots, and the problem seems to be cLVM. Soon after cLVM starts, some requests seem to
time out, and the cluster fences the node. Here is what I have:

cluster-dlm starts and seems to be running OK
The OCFS2 filesystems are mounted:
kernel: [62593.937391] ocfs2: Mounting device (253,9) on (node 5874044, slot 2) with ordered data mode.
kernel: [62593.969439] (mount.ocfs2,27848,3):ocfs2_global_read_info:403 ERROR: status = 24
kernel: [62593.983920] (mount.ocfs2,27848,3):ocfs2_global_read_info:403 ERROR: status = 24
kernel: [62594.017919] ocfs2: Mounting device (253,11) on (node 5874044, slot 2) with ordered data mode.
kernel: [62594.027811] (mount.ocfs2,27847,3):ocfs2_global_read_info:403 ERROR: status = 24
kernel: [62594.036353] (mount.ocfs2,27847,3):ocfs2_global_read_info:403 ERROR: status = 24
kernel: [62594.040442] ocfs2: Mounting device (253,4) on (node 5874044, slot 1) with ordered data mode.
kernel: [62594.044470] (mount.ocfs2,27916,2):ocfs2_global_read_info:403 ERROR: status = 24
kernel: [62594.083156] (mount.ocfs2,27916,3):ocfs2_global_read_info:403 ERROR: status = 24
The RAs all report the mount succeeded
cmirrord[28116]: Starting cmirrord:
cmirrord[28116]:  Built: May 29 2013 15:04:35
LVM(prm_LVM_cVG)[28196]: INFO: Activating volume group cVG
(I guess the cVG should be activated before the OCFS2 filesystems are mounted; maybe that's the problem. See the ordering sketch at the end of this post.)
kernel: [62597.721116] device-mapper: dm-log-userspace: version 1.1.0 loaded
kernel: [62612.720078] device-mapper: dm-log-userspace: [35cRCORE] Request timed out: [5/2] - retrying
kernel: [62627.720027] device-mapper: dm-log-userspace: [35cRCORE] Request timed out: [5/4] - retrying
kernel: [62642.720022] device-mapper: dm-log-userspace: [35cRCORE] Request timed out: [5/5] - retrying
kernel: [62657.721517] device-mapper: dm-log-userspace: [35cRCORE] Request timed out: [5/6] - retrying

A short time later the node is fenced. Before some updates this node worked fine; I don't know where to
start searching. Ideas?
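
If the ordering guess above is right, something like the following is what I would try. This is an untested sketch; the clone IDs cl-prm_LVM_cVG and cl-fs-ocfs2 are only placeholders for the real resource names in my CIB:

# Untested sketch; the clone IDs below are placeholders for the real ones.
# Activate the clustered VG before mounting the OCFS2 filesystems,
# and keep the filesystems together with the active VG.
crm configure order o_cvg_before_fs inf: cl-prm_LVM_cVG cl-fs-ocfs2
crm configure colocation c_fs_with_cvg inf: cl-fs-ocfs2 cl-prm_LVM_cVG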
[...]

Ulrich Windl | 17 Dec 08:20 2014

Re: clvmd default options: why debug?

Hi!

After having fixed the problem, I really think the default is wrong:

The clvmd RA uses "-d2" as its default daemon options. This setting floods your syslog with messages you cannot
make much sense of.

Unfortunately you cannot change the configuration unless you accept that clvmd will be restarted on all your nodes.
Given that you can reconfigure a running clvmd with "clvmd -d0 -C" to turn debugging OFF, I really think
debugging should be OFF by default. You can turn it on if you think you need it (using the aforementioned command).
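
For reference, a sketch of the kind of workaround involved; the resource ID "clvm" and the parameter name daemon_options are assumptions on my part, so check "crm ra info clvmd" for the real parameter name before relying on this:

# Turn debugging off on the already running daemons (effective immediately,
# lasts until the next clvmd restart):
clvmd -d0 -C
# Assumed parameter name; verify with "crm ra info clvmd". Changing it in the
# CIB normally restarts the clone, so only do this in a maintenance window:
crm configure edit clvm     # set: params daemon_options="-d0"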

(resource-agents-3.9.5-0.34.57 of SLES11 SP3)

Regards,
Ulrich

>>> Ulrich Windl wrote on 19.03.2014 at 08:28 in message <53294716.989 : 161 : 60728>:
> Hi!
> 
> I wonder why writing debug logs to syslog is enabled by default for clvmd in 
> SLES11 (SP3):
> ---
> crm(live)# ra info clvmd
> clvmd resource agent (ocf:lvm2:clvmd)
> 
> This is a Resource Agent for both clvmd and cmirrord.
> It starts clvmd and cmirrord as anonymous clones.
> 
> Parameters (* denotes required, [] the default):
[...]

Dave Botsch | 16 Dec 23:19 2014

crmsh and monitor for lsb primitives not existing

Hi.

crmsh 2.1 on RHEL 6.6/x86_64.

Looks like crmsh no longer allows one to set up a monitor on an "lsb"
primitive?

Eg:

crm(cups)# configure primitive lsb:cups \
   > op monitor interval=120
ERROR: syntax in primitive: Unknown arguments: monitor interval=120 near
<monitor> parsing 'primitive lsb:cups op monitor interval=120'

which is bad, since, according to pacemaker docs:

	By default, the cluster will not ensure your resources are still
	healthy. To instruct the cluster to do this, you need to add a
	monitor operation to the resource's definition

Does a development version or something else fix this?

Thanks.

P.S. This is clearly a change, since I have previously successfully set up
monitor intervals on lsb resources, e.g.:

	primitive Coral lsb:opencoral \
		op monitor interval=120 \
		meta target-role=Started
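
For comparison, the only syntactic difference I can spot is that the working definition above has a resource ID before the class:type. So my untested guess is that the parser now insists on something like:

# Untested guess: give the primitive an explicit ID ("cups") before lsb:cups.
crm(cups)# configure primitive cups lsb:cups \
   > op monitor interval=120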
[...]

Ulrich Windl | 4 Dec 08:12 2014

SLES11 SP3 compatibility with HP Data Protector 7's Automatic Disaster Recovery Module

Hi!

We discovered an interesting problem with SLES11 (currently at SP3 with updates) and HP Data Protector 7's
Automatic Disaster Recovery Module:
It seems that a cluster node updated from SLES11 SP1 (via SP2) to SP3 uses a different directory layout than a
cluster node installed directly from the SP3 media:
The node updated from SP1 uses this directory to store the CIB: /var/lib/heartbeat/crm/
The node installed as SP3 uses this directory: /var/lib/pacemaker/cib/

HP's Disaster Recovery Module says (/var/opt/omni/tmp/AUTODR.log):
[...]
20141203T164701 ERROR cluster  Failed to parse cluster configuration.
20141203T164701 FATAL cluster  EXCEPTION at src/core/linux_cluster.cpp(798):
20141203T164701 FATAL cluster  N3drm12system_errorE(2): [ENOENT;No such file or directory]: Failed to
load RedHat xml cluster configuration file: /var/lib/heartbeat/crm/cib.xml.
20141203T164701 FATAL storage  Running on cluster and failed to query the cluster.
[...]

Of course HP's software isn't quite flexible here, but maybe a symlink from the old location to the new one
wouldn't be bad (for the lifetime of SLES11, maybe)...
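
Something along these lines is what I have in mind; an untested sketch, assuming the fresh-SP3 layout shown above and that HP's tool only needs to read the file:

# Untested sketch: expose the old CIB path for tools that still expect it.
mkdir -p /var/lib/heartbeat/crm
ln -s /var/lib/pacemaker/cib/cib.xml /var/lib/heartbeat/crm/cib.xml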

Opinions? Any HP guy listening?

Regards,
Ulrich


Ulrich Windl | 1 Dec 09:46 2014

Q: Avoid resource restart after configuration change

Hi!

I'd like to change a resource's configuration, but I don't want the resource to be restarted. That is, I want the
configuration to become effective the next time the resource is started.

I know that in general this doesn't make sense, but for cLVM it does: by default debug logging is on, and it floods the
syslog with junk.
If I want to turn off debug logging, I have to pass an extra parameter to clvmd, or I can run "clvmd -d0 -C"
to fix the problem until the next resource start.

As clvmd is cloned to all nodes in the cluster, a configuration change would effectively halt the whole
cluster (all users of cLVM). I want to avoid that. As you can see, I can change the running resources, and I
can configure the resources so that they are fine the next time they restart.

So all I need is a way to prevent a restart of clvmd when changing its resource configuration.
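
The closest idea I have so far is to make the clone unmanaged while editing it. This is an untested sketch, the clone ID cl-clvmd is a placeholder, and I am not sure the parameter-digest check won't still trigger a restart once the resource is managed again:

# Untested idea; cl-clvmd is a placeholder for the real clone ID.
crm resource unmanage cl-clvmd
crm configure edit cl-clvmd      # change the daemon options here
crm resource cleanup cl-clvmd    # drop the recorded operation history
crm resource manage cl-clvmd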

Regards,
Ulrich


renayama19661014 | 28 Nov 05:48 2014

[crmsh] Question about latest crmsh.

Hi All,

We use the latest crmsh (commit b932826fb4f924137eab9efd13b7bddf8afde3f0) with Pacemaker 1.1.12.
We loaded the following CLI file:

-------------------------------------------
### Cluster Option ###
property no-quorum-policy="ignore" \
        stonith-enabled="false"

### Resource Defaults ###
rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="1"

primitive prmDummy1 ocf:pacemaker:Dummy \
        op start interval="0s" timeout="300s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="300s" on-fail="block"

primitive prmDummy2 ocf:heartbeat:Dummy \
        op start interval="0s" timeout="300s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="300s" on-fail="block"

### Resource Location ###
location rsc_location-1 prmDummy1 \
        rule $role=master 300: #uname eq snmp1 \
        rule $role=master 200: #uname eq snmp2
location rsc_location-2 prmDummy2 \
        rule $role=master 300: #uname eq snmp1 \
[...]

Ranjan Gajare | 24 Nov 10:48 2014

Monitor a Pacemaker Cluster with ocf:pacemaker:ClusterMon and/or external-agent

I want to configure event notification with monitoring resources using an
external agent. I want to set up notification on node failover from an HA
perspective.
I followed the links below:

1)https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Configuring_the_Red_Hat_High_Availability_Add-On_with_Pacemaker/s1-eventnotification-HAAR.html

2)http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/

3)http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-notification-external.html

I configured the ClusterMon resource as follows:

# pcs resource create ClusterMon-External ClusterMon --clone user=root \
    update=30 extra_options="-E /var/lib/pgsql/9.3/data/test.sh -e 172.26.126.100"

vim test.sh

if [[ ${CRM_notify_rc} != 0 && ${CRM_notify_task} == "monitor" ]] || \
   [[ ${CRM_notify_task} != "monitor" ]] ; then
    # This trap is compliant with the PACEMAKER MIB:
    # https://github.com/ClusterLabs/pacemaker/blob/master/extra/PCMK-MIB.txt
    /usr/bin/snmptrap -v 2c -c public ${CRM_notify_recipient} "" \
        PACEMAKER-MIB::pacemakerNotification \
        PACEMAKER-MIB::pacemakerNotificationNode s "${CRM_notify_node}" \
        PACEMAKER-MIB::pacemakerNotificationResource s "${CRM_notify_rsc}" \
        PACEMAKER-MIB::pacemakerNotificationOperation s "${CRM_notify_task}" \
        PACEMAKER-MIB::pacemakerNotificationDescription s "${CRM_notify_desc}" \
[...]

ranjan | 19 Nov 11:54 2014

RHEL Server 6.6 HA Configuration

I was trying to install Corosync and Cman using
yum install -y pacemaker cman pcs ccs resource-agents

This works fine on CentOS 6.3. I tried the same on Red Hat Enterprise
Linux Server 6.6 and ran into issues. It gives an error like:

Loaded plugins: product-id, refresh-packagekit, rhnplugin, security,
subscription-manager
There was an error communicating with RHN.
RHN Satellite or RHN Classic support will be disabled.

Error Message:
        Please run rhn_register as root on this client
Error Class Code: 9
Error Class Info: Invalid System Credentials.
Explanation:
     An error has occurred while processing your request. If this problem
     persists please enter a bug report at bugzilla.redhat.com.
     If you choose to submit the bug report, please be sure to include
     details of what you were trying to do when this error occurred and
     details on how to reproduce this problem.

Setting up Install Process
No package pacemaker available.
No package cman available.
No package pcs available.
No package ccs available.
Nothing to do
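
My guess, untested and assuming our subscription actually includes the High Availability add-on, is that the HA repository simply isn't enabled on this box, i.e. something like this would be needed:

# Untested guess; assumes a subscription that includes the HA add-on
# (the repository ID may differ on your system).
subscription-manager register
subscription-manager repos --enable=rhel-ha-for-rhel-6-server-rpms
yum install -y pacemaker cman pcs ccs resource-agents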

centos.repo is as follows...
[...]

Vladislav Bogdanov | 17 Nov 15:20 2014

crmsh and 'no such resource agent' error

Hi Kristoffer, all,

It seems that with the introduction of 'resource-discovery',
'symmetric-cluster=true' becomes less strict about the set of resource
agents available across nodes.

Maybe it is possible to add a config option to disable error messages
like:

got no meta-data, does this RA exist?
no such resource agent

Best,
Vladislav

Vladislav Bogdanov | 17 Nov 08:05 2014

crm configure show to a pipe

Hi Kristoffer, all,

running 'crm configure show > file' appends non-printable chars at the
end (at least if op_defaults is used):

...
property cib-bootstrap-options: \
    dc-version=1.1.12-c191bf3 \
    cluster-infrastructure=corosync \
    cluster-recheck-interval=10m \
    stonith-enabled=false \
    no-quorum-policy=freeze \
    last-lrm-refresh=1415955398 \
    maintenance-mode=false \
    stop-all-resources=false \
    stop-orphan-resources=true \
    have-watchdog=false
rsc_defaults rsc_options: \
    allow-migrate=false \
    failure-timeout=10m \
    migration-threshold=INFINITY \
    multiple-active=stop_start \
    priority=0
op_defaults op-options: \
    record-pending=true.[?1034h
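
For the record, the trailing bytes look like a terminal initialization sequence (ESC [?1034h). An untested workaround might be to force a dumb terminal type for non-interactive output:

# Untested workaround idea: avoid terminal initialization sequences
# when redirecting the configuration to a file.
TERM=dumb crm configure show > file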

Best,
Vladislav

Randy S | 16 Nov 19:17 2014

time_longclock illumos

Hi all,

new user here. 
We have been testing an older version of the Heartbeat/Pacemaker combination compiled for illumos (an
OpenSolaris successor).
Versions:
Heartbeat-3-0-STABLE-3.0.5
Pacemaker-1-0-Pacemaker-1.0.11

It all works ok while testing (several months now) but I have noticed that every so often (and sometimes
quite frequently) I see the following console message appear:

crmd: [ID 996084 daemon.crit] [12637]: CRIT: time_longclock: old value was 298671305, new value is
298671304, diff is 1, callcount 141814

From what I have been able to find, this type of occurrence should have been fixed in Heartbeat versions after
2.1.4. Back then this occurrence could make a cluster start behaving erratically.
We have two test implementations of a cluster, one in VMware and one on standard hardware, all just for testing.
We have made sure that time synchronization is done via NTP against Internet servers. The hardware implementation doesn't show
this message as often as the VMware implementation, but it still appears (sometimes about three
times per 24 hours).

We haven't had any strange behaviour in the cluster yet, but my questions about this are as follows:

Should we worry about this 'time_longclock' CRIT error even though it should have been fixed in
Heartbeat 3 and later?

Is there something (simple) that can be done to prevent this type of error, or should we expect normal
cluster behaviour since NTP is used?

[...]

