renayama19661014 | 1 Oct 2010 04:00
Picon
Gravatar

Re: [Problem or Enhancement]When attrd reboots, a fail count is initialized.

Hi Andrew,

Thank you for comment.

> During crmd startup, one could read all the values from attrd into the
> hashtable.
> So the hashtable would only do something if only attrd went down.

If attrd communicates with crmd at the time of start and reads the data of the hash table, the problem
seems to be able to be settled.

Is the change of this attrd and crmd difficult?

> I mean: did you see this behavior in a production system, or only
> during testing when you manually killed attrd?

We carry out kill-command by manual operation as one of the tests of the trouble of the processes.
Our user minds behavior of the process trouble very much.

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof <andrew@...> wrote:

> On Wed, Sep 29, 2010 at 3:59 AM,  <renayama19661014@...> wrote:
> > Hi Andrew,
> >
> > Thank you for comment.
> >
> >> The problem here is that attrd is supposed to be the authoritative
(Continue reading)

renayama19661014 | 1 Oct 2010 04:07
Picon
Gravatar

[GUI]Compatibility issues of Python.

Hi Yan,

I operated latest GUI for Japanization of GUI. 
However, on RHEL5.5, GUI causes an error by the operation of the resource.

There seems to be a cause in the difference of the version of Python.

(snip)
	def on_rsc_action(self, action) :
		(cur_type, cur_name) = self.manage_tree.get_selected_node()
		if action.get_name() == "startrsc" :
			target_role = "Started"
		elif action.get_name() == "stoprsc" :
			target_role = "Stopped"
		else :
			target_role = "#default"

		if validate_type == "dtd" :
			metaattr_name = "target_role"
		else :
			metaattr_name = "target-role"

		real_rsc_id = cur_name.partition(":")[0]

(snip)

The function of partition seems not to be supported with 2.4 of Python.

On RHEL5.5, the version of Python is 2.4. 
For compatibility, will you change this processing?
(Continue reading)

Pavlos Parissis | 1 Oct 2010 09:05
Picon

resources are restarted without obvious reasons

Hi,

I have 3 node cluster where Pacemaker decides sometimes , happend twice today, to restart the resource(pbx_01 and pbx_02) without any obvious reasons. The log simple doesn't say why. These 2 resounces use different test LSB scripts which basically do nothing, I just have them to test my cluster conf. I read the log and I simple can't explain the behavior.
When I used the ocf:heartbeat:Dummy I don't remember seeing that behavior.

All the related info are below



On DC where node resources running
08:30:50 cib: [3469]: info: cib_stats: Processed 50 operations (2800.00us average, 0% utilization) in the last 10min
08:36:09 heartbeat: [3384]: WARN: Late heartbeat: Node node-02: interval 6470 ms
08:38:27 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
08:38:27 crmd: [3473]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
08:38:27 crmd: [3473]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
08:38:27 crmd: [3473]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
08:38:27 crmd: [3473]: info: do_pe_invoke: Query 192: Requesting the current CIB: S_POLICY_ENGINE
08:38:27 pengine: [3979]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
08:38:27 pengine: [3979]: info: determine_online_status: Node node-02 is online
08:38:27 crmd: [3473]: info: do_pe_invoke_callback: Invoking the PE: query=192, ref=pe_calc-dc-1285915107-259, seq=3, quorate=1
08:38:27 pengine: [3979]: info: determine_online_status: Node node-03 is online
08:38:27 crmd: [3473]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
08:38:27 pengine: [3979]: info: determine_online_status: Node node-01 is online
08:38:27 crmd: [3473]: info: unpack_graph: Unpacked transition 21: 19 actions in 19 synapses
08:38:27 pengine: [3979]: info: find_clone: Internally renamed drbd_02:0 on node-01 to drbd_02:2 (ORPHAN)
08:38:27 crmd: [3473]: info: do_te_invoke: Processing graph 21 (ref=pe_calc-dc-1285915107-259) derived from /var/lib/pengine/pe-input-1453.bz2
08:38:27 pengine: [3979]: notice: group_print:  Resource Group: pbx_service_02
08:38:27 crmd: [3473]: info: te_pseudo_action: Pseudo action 27 fired and confirmed
08:38:27 pengine: [3979]: notice: native_print:      ip_02      (ocf::heartbeat:IPaddr2):       Started node-02
08:38:27 crmd: [3473]: info: te_rsc_command: Initiating action 10: stop pbx_02_stop_0 on node-02
08:38:27 pengine: [3979]: notice: native_print:      fs_02      (ocf::heartbeat:Filesystem):    Started node-02
08:38:27 crmd: [3473]: info: te_pseudo_action: Pseudo action 67 fired and confirmed
08:38:27 pengine: [3979]: notice: native_print:      pbx_02     (lsb:test-02):  Started node-02
08:38:27 crmd: [3473]: info: te_rsc_command: Initiating action 13: stop pbx_01_stop_0 on node-01
08:38:27 pengine: [3979]: notice: clone_print:  Master/Slave Set: ms-drbd_02
08:38:27 pengine: [3979]: notice: short_print:      Masters: [ node-02 ]
08:38:27 pengine: [3979]: notice: short_print:      Slaves: [ node-03 ]
08:38:27 pengine: [3979]: notice: group_print:  Resource Group: pbx_service_01
08:38:27 pengine: [3979]: notice: native_print:      ip_01      (ocf::heartbeat:IPaddr2):       Started node-01
08:38:27 pengine: [3979]: notice: native_print:      fs_01      (ocf::heartbeat:Filesystem):    Started node-01
08:38:27 pengine: [3979]: notice: native_print:      pbx_01     (lsb:test-01):  Started node-01
08:38:27 pengine: [3979]: notice: clone_print:  Master/Slave Set: ms-drbd_01
08:38:27 pengine: [3979]: notice: short_print:      Masters: [ node-01 ]
08:38:27 pengine: [3979]: notice: short_print:      Slaves: [ node-03 ]
08:38:27 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat -> <null>
08:38:27 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
08:38:27 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat -> <null>
08:38:27 pengine: [3979]: notice: DeleteRsc: Removing pbx_01 from node-01
08:38:27 pengine: [3979]: info: master_color: Promoting drbd_02:0 (Master node-02)
08:38:27 pengine: [3979]: info: master_color: ms-drbd_02: Promoted 1 instances of a possible 1 to master
08:38:27 pengine: [3979]: info: master_color: Promoting drbd_02:0 (Master node-02)
08:38:27 pengine: [3979]: info: master_color: ms-drbd_02: Promoted 1 instances of a possible 1 to master
08:38:27 pengine: [3979]: info: master_color: Promoting drbd_01:0 (Master node-01)
08:38:27 pengine: [3979]: info: master_color: ms-drbd_01: Promoted 1 instances of a possible 1 to master
08:38:27 pengine: [3979]: info: master_color: Promoting drbd_01:0 (Master node-01)
08:38:27 pengine: [3979]: info: master_color: ms-drbd_01: Promoted 1 instances of a possible 1 to master
08:38:27 pengine: [3979]: notice: RecurringOp:  Start recurring monitor (20s) for pbx_02 on node-02
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:38:27 pengine: [3979]: notice: RecurringOp:  Start recurring monitor (20s) for pbx_01 on node-01
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:38:27 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:38:27 pengine: [3979]: notice: LogActions: Leave resource ip_02      (Started node-02)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource fs_02      (Started node-02)
08:38:27 pengine: [3979]: notice: LogActions: Restart resource pbx_02   (Started node-02)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource drbd_02:0  (Master node-02)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource drbd_02:1  (Slave node-03)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource ip_01      (Started node-01)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource fs_01      (Started node-01)
08:38:27 pengine: [3979]: notice: LogActions: Restart resource pbx_01   (Started node-01)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource drbd_01:0  (Master node-01)
08:38:27 pengine: [3979]: notice: LogActions: Leave resource drbd_01:1  (Slave node-03)
08:38:27 pengine: [3979]: info: process_pe_message: Transition 21: PEngine Input stored in: /var/lib/pengine/pe-input-1453.bz2
08:38:28 crmd: [3473]: info: match_graph_event: Action pbx_01_stop_0 (13) confirmed on node-01 (rc=0)
08:38:28 crmd: [3473]: info: te_pseudo_action: Pseudo action 68 fired and confirmed
08:38:28 crmd: [3473]: info: te_rsc_command: Initiating action 14: delete pbx_01_delete_0 on node-01
08:38:28 crmd: [3473]: info: te_pseudo_action: Pseudo action 65 fired and confirmed
08:38:29 crmd: [3473]: info: match_graph_event: Action pbx_02_stop_0 (10) confirmed on node-02 (rc=0)
08:38:29 crmd: [3473]: info: te_pseudo_action: Pseudo action 28 fired and confirmed
08:38:29 crmd: [3473]: info: te_rsc_command: Initiating action 11: delete pbx_02_delete_0 on node-02
08:38:29 crmd: [3473]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
08:38:29 crmd: [3473]: info: te_pseudo_action: Pseudo action 25 fired and confirmed
08:38:31 crmd: [3473]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285915110-15 from node-01
08:38:31 crmd: [3473]: info: match_graph_event: Action pbx_01_delete_0 (14) confirmed on node-01 (rc=0)
08:38:31 crmd: [3473]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285915111-16 from node-02
08:38:31 crmd: [3473]: info: match_graph_event: Action pbx_02_delete_0 (11) confirmed on node-02 (rc=0)
08:38:31 crmd: [3473]: info: te_rsc_command: Initiating action 24: start pbx_02_start_0 on node-02
08:38:31 crmd: [3473]: info: te_rsc_command: Initiating action 64: start pbx_01_start_0 on node-01
08:38:31 crmd: [3473]: info: te_crm_command: Executing crm-event (12): lrm_refresh on node-02
08:38:31 crmd: [3473]: info: te_crm_command: Executing crm-event (15): lrm_refresh on node-01
08:38:34 crmd: [3473]: info: match_graph_event: Action pbx_02_start_0 (24) confirmed on node-02 (rc=0)
08:38:34 crmd: [3473]: info: te_pseudo_action: Pseudo action 26 fired and confirmed
08:38:34 crmd: [3473]: info: te_rsc_command: Initiating action 3: monitor pbx_02_monitor_20000 on node-02
08:38:34 crmd: [3473]: info: match_graph_event: Action pbx_01_start_0 (64) confirmed on node-01 (rc=0)
08:38:34 crmd: [3473]: info: te_pseudo_action: Pseudo action 66 fired and confirmed
08:38:34 crmd: [3473]: info: te_rsc_command: Initiating action 8: monitor pbx_01_monitor_20000 on node-01
08:38:37 crmd: [3473]: info: match_graph_event: Action pbx_01_monitor_20000 (8) confirmed on node-01 (rc=0)
08:38:37 crmd: [3473]: info: match_graph_event: Action pbx_02_monitor_20000 (3) confirmed on node-02 (rc=0)
08:38:37 crmd: [3473]: info: run_graph: ====================================================
08:38:37 crmd: [3473]: notice: run_graph: Transition 21 (Complete=19, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1453.bz2): Complete
08:38:37 crmd: [3473]: info: te_graph_trigger: Transition 21 is now complete
08:38:37 crmd: [3473]: info: notify_crmd: Transition 21 status: done - <null>
08:38:37 crmd: [3473]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
08:38:37 crmd: [3473]: info: do_state_transition: Starting PEngine Recheck Timer
08:40:51 cib: [3469]: info: cib_stats: Processed 30 operations (1333.00us average, 0% utilization) in the last 10min

08:50:52 cib: [3469]: info: cib_stats: Processed 22 operations (0.00us average, 0% utilization) in the last 10min
08:53:38 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
08:53:38 crmd: [3473]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
08:53:38 crmd: [3473]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
08:53:38 crmd: [3473]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
08:53:38 crmd: [3473]: info: do_pe_invoke: Query 193: Requesting the current CIB: S_POLICY_ENGINE
08:53:38 crmd: [3473]: info: do_pe_invoke_callback: Invoking the PE: query=193, ref=pe_calc-dc-1285916018-270, seq=3, quorate=1
08:53:38 pengine: [3979]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
08:53:38 pengine: [3979]: info: determine_online_status: Node node-02 is online
08:53:38 pengine: [3979]: info: determine_online_status: Node node-03 is online
08:53:38 pengine: [3979]: info: determine_online_status: Node node-01 is online
08:53:38 pengine: [3979]: info: find_clone: Internally renamed drbd_02:0 on node-01 to drbd_02:2 (ORPHAN)
08:53:38 pengine: [3979]: notice: group_print:  Resource Group: pbx_service_02
08:53:38 pengine: [3979]: notice: native_print:      ip_02      (ocf::heartbeat:IPaddr2):       Started node-02
08:53:38 pengine: [3979]: notice: native_print:      fs_02      (ocf::heartbeat:Filesystem):    Started node-02
08:53:38 pengine: [3979]: notice: native_print:      pbx_02     (lsb:test-02):  Started node-02
08:53:38 pengine: [3979]: notice: clone_print:  Master/Slave Set: ms-drbd_02
08:53:38 pengine: [3979]: notice: short_print:      Masters: [ node-02 ]
08:53:38 pengine: [3979]: notice: short_print:      Slaves: [ node-03 ]
08:53:38 pengine: [3979]: notice: group_print:  Resource Group: pbx_service_01
08:53:38 pengine: [3979]: notice: native_print:      ip_01      (ocf::heartbeat:IPaddr2):       Started node-01
08:53:38 pengine: [3979]: notice: native_print:      fs_01      (ocf::heartbeat:Filesystem):    Started node-01
08:53:38 pengine: [3979]: notice: native_print:      pbx_01     (lsb:test-01):  Started node-01
08:53:38 pengine: [3979]: notice: clone_print:  Master/Slave Set: ms-drbd_01
08:53:38 pengine: [3979]: notice: short_print:      Masters: [ node-01 ]
08:53:38 pengine: [3979]: notice: short_print:      Slaves: [ node-03 ]
08:53:38 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat -> <null>
08:53:38 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
08:53:38 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat -> <null>
08:53:38 pengine: [3979]: notice: DeleteRsc: Removing pbx_01 from node-01
08:53:38 pengine: [3979]: info: master_color: Promoting drbd_02:0 (Master node-02)
08:53:38 pengine: [3979]: info: master_color: ms-drbd_02: Promoted 1 instances of a possible 1 to master
08:53:38 pengine: [3979]: info: master_color: Promoting drbd_02:0 (Master node-02)
08:53:38 pengine: [3979]: info: master_color: ms-drbd_02: Promoted 1 instances of a possible 1 to master
08:53:38 pengine: [3979]: info: master_color: Promoting drbd_01:0 (Master node-01)
08:53:38 pengine: [3979]: info: master_color: ms-drbd_01: Promoted 1 instances of a possible 1 to master
08:53:38 pengine: [3979]: info: master_color: Promoting drbd_01:0 (Master node-01)
08:53:38 pengine: [3979]: info: master_color: ms-drbd_01: Promoted 1 instances of a possible 1 to master
08:53:38 pengine: [3979]: notice: RecurringOp:  Start recurring monitor (20s) for pbx_02 on node-02
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_02
08:53:38 pengine: [3979]: notice: RecurringOp:  Start recurring monitor (20s) for pbx_01 on node-01
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:53:38 pengine: [3979]: ERROR: create_notification_boundaries: Creating boundaries for ms-drbd_01
08:53:38 pengine: [3979]: notice: LogActions: Leave resource ip_02      (Started node-02)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource fs_02      (Started node-02)
08:53:38 pengine: [3979]: notice: LogActions: Restart resource pbx_02   (Started node-02)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource drbd_02:0  (Master node-02)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource drbd_02:1  (Slave node-03)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource ip_01      (Started node-01)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource fs_01      (Started node-01)
08:53:38 pengine: [3979]: notice: LogActions: Restart resource pbx_01   (Started node-01)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource drbd_01:0  (Master node-01)
08:53:38 pengine: [3979]: notice: LogActions: Leave resource drbd_01:1  (Slave node-03)
08:53:38 crmd: [3473]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
08:53:38 crmd: [3473]: info: unpack_graph: Unpacked transition 22: 19 actions in 19 synapses
08:53:38 crmd: [3473]: info: do_te_invoke: Processing graph 22 (ref=pe_calc-dc-1285916018-270) derived from /var/lib/pengine/pe-input-1454.bz2
08:53:38 crmd: [3473]: info: te_pseudo_action: Pseudo action 27 fired and confirmed
08:53:38 crmd: [3473]: info: te_rsc_command: Initiating action 10: stop pbx_02_stop_0 on node-02
08:53:38 crmd: [3473]: info: te_pseudo_action: Pseudo action 67 fired and confirmed
08:53:38 crmd: [3473]: info: te_rsc_command: Initiating action 13: stop pbx_01_stop_0 on node-01
08:53:38 pengine: [3979]: info: process_pe_message: Transition 22: PEngine Input stored in: /var/lib/pengine/pe-input-1454.bz2
08:53:40 crmd: [3473]: info: match_graph_event: Action pbx_02_stop_0 (10) confirmed on node-02 (rc=0)
08:53:40 crmd: [3473]: info: te_pseudo_action: Pseudo action 28 fired and confirmed
08:53:40 crmd: [3473]: info: te_rsc_command: Initiating action 11: delete pbx_02_delete_0 on node-02
08:53:40 crmd: [3473]: info: te_pseudo_action: Pseudo action 25 fired and confirmed
08:53:40 crmd: [3473]: info: match_graph_event: Action pbx_01_stop_0 (13) confirmed on node-01 (rc=0)
08:53:40 crmd: [3473]: info: te_pseudo_action: Pseudo action 68 fired and confirmed
08:53:40 crmd: [3473]: info: te_rsc_command: Initiating action 14: delete pbx_01_delete_0 on node-01
08:53:40 crmd: [3473]: info: te_pseudo_action: Pseudo action 9 fired and confirmed
08:53:40 crmd: [3473]: info: te_pseudo_action: Pseudo action 65 fired and confirmed
08:53:42 crmd: [3473]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285916022-17 from node-02
08:53:42 crmd: [3473]: info: match_graph_event: Action pbx_02_delete_0 (11) confirmed on node-02 (rc=0)
08:53:42 crmd: [3473]: info: te_rsc_command: Initiating action 24: start pbx_02_start_0 on node-02
08:53:42 crmd: [3473]: info: te_crm_command: Executing crm-event (12): lrm_refresh on node-02
08:53:44 crmd: [3473]: info: process_te_message: Processing (N)ACK lrm_invoke-lrmd-1285916022-16 from node-01
08:53:44 crmd: [3473]: info: match_graph_event: Action pbx_01_delete_0 (14) confirmed on node-01 (rc=0)
08:53:44 crmd: [3473]: info: te_rsc_command: Initiating action 64: start pbx_01_start_0 on node-01
08:53:44 crmd: [3473]: info: te_crm_command: Executing crm-event (15): lrm_refresh on node-01
08:53:47 crmd: [3473]: info: match_graph_event: Action pbx_02_start_0 (24) confirmed on node-02 (rc=0)
08:53:47 crmd: [3473]: info: te_pseudo_action: Pseudo action 26 fired and confirmed
08:53:47 crmd: [3473]: info: te_rsc_command: Initiating action 3: monitor pbx_02_monitor_20000 on node-02
08:53:48 crmd: [3473]: info: match_graph_event: Action pbx_01_start_0 (64) confirmed on node-01 (rc=0)
08:53:48 crmd: [3473]: info: te_pseudo_action: Pseudo action 66 fired and confirmed
08:53:48 crmd: [3473]: info: te_rsc_command: Initiating action 8: monitor pbx_01_monitor_20000 on node-01
08:53:48 crmd: [3473]: info: match_graph_event: Action pbx_02_monitor_20000 (3) confirmed on node-02 (rc=0)
08:53:50 crmd: [3473]: info: match_graph_event: Action pbx_01_monitor_20000 (8) confirmed on node-01 (rc=0)
08:53:50 crmd: [3473]: info: run_graph: ====================================================
08:53:50 crmd: [3473]: notice: run_graph: Transition 22 (Complete=19, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1454.bz2): Complete
08:53:50 crmd: [3473]: info: te_graph_trigger: Transition 22 is now complete
08:53:50 crmd: [3473]: info: notify_crmd: Transition 22 status: done - <null>
08:53:50 crmd: [3473]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
08:53:50 crmd: [3473]: info: do_state_transition: Starting PEngine Recheck Timer


On node-01 where pbx_service_01 resource group where running
Oct 01 08:38:28 node-01 lrmd: [3558]: info: cancel_op: operation monitor[45] on lsb::test-01::pbx_01 for client 3561, its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Oct 01 08:38:28 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=13:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_stop_0 )
Oct 01 08:38:28 node-01 lrmd: [3558]: info: rsc:pbx_01:46: stop
Oct 01 08:38:28 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_monitor_20000 (call=45, status=1, cib-update=0, confirmed=true) Cancelled
Oct 01 08:38:28 node-01 lrmd: [19763]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:38:28 node-01 lrmd: [3558]: info: RA output: (pbx_01:stop:stdout) Stopping test-01: [  OK  ]
Oct 01 08:38:28 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_stop_0 (call=46, rc=0, cib-update=61, confirmed=true) ok
Oct 01 08:38:30 node-01 crmd: [3561]: info: do_lrm_invoke: Removing resource pbx_01 from the LRM
Oct 01 08:38:30 node-01 crmd: [3561]: info: do_lrm_invoke: Resource 'pbx_01' deleted for tengine on (null)
Oct 01 08:38:30 node-01 crmd: [3561]: info: notify_deleted: Notifying tengine on node-03 that pbx_01 was deleted
Oct 01 08:38:30 node-01 crmd: [3561]: info: send_direct_ack: ACK'ing resource op pbx_01_delete_0 from 14:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046: lrm_invoke-lrmd-1285915110-15
Oct 01 08:38:33 node-01 lrmd: [3558]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Oct 01 08:38:33 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=64:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_start_0 )
Oct 01 08:38:33 node-01 lrmd: [3558]: info: rsc:pbx_01:47: start
Oct 01 08:38:33 node-01 lrmd: [19836]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:38:33 node-01 crmd: [3561]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
Oct 01 08:38:33 node-01 crmd: [3561]: info: msg_to_op: Message follows:
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG: Dumping message with 16 fields
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[0] : [lrm_t=op]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[1] : [lrm_rid=pbx_01]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[2] : [lrm_op=start]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[3] : [lrm_timeout=60000]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[4] : [lrm_interval=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[5] : [lrm_delay=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[6] : [lrm_copyparams=1]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[7] : [lrm_t_run=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[8] : [lrm_t_rcchange=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[9] : [lrm_exec_time=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[10] : [lrm_queue_time=0]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[11] : [lrm_targetrc=-1]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[12] : [lrm_app=crmd]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[13] : [lrm_userdata=64:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[14] : [(2)lrm_param=0x9759508(88 104)]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG: Dumping message with 3 fields
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[0] : [CRM_meta_timeout=60000]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[1] : [crm_feature_set=3.0.1]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[2] : [CRM_meta_name=start]
Oct 01 08:38:33 node-01 crmd: [3561]: info: MSG[15] : [lrm_callid=47]
Oct 01 08:38:33 node-01 crmd: [3561]: info: do_lrm_invoke: Forcing a local LRM refresh
Oct 01 08:38:33 node-01 lrmd: [3558]: info: RA output: (pbx_01:start:stdout) Starting test-01:
[  OK  ]8:38:33 node-01 lrmd: [3558]: info: RA output: (pbx_01:start:stdout) [  OK  ]
Oct 01 08:38:33 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_start_0 (call=47, rc=0, cib-update=63, confirmed=true) ok
Oct 01 08:38:36 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=8:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_monitor_20000 )
Oct 01 08:38:36 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_monitor_20000 (call=48, rc=0, cib-update=64, confirmed=false) ok
[root <at> node-01 ~]# grep 08:53: /var/log/ha-log
Sep 29 08:53:02 node-01 lrmd: [3523]: info: rsc:drbd_01:0:112: notify
Sep 29 08:53:02 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=135:45:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:03 node-01 lrmd: [3523]: info: RA output: (drbd_01:0:notify:stdout)
Sep 29 08:53:03 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=112, rc=0, cib-update=147, confirmed=true) ok
Sep 29 08:53:05 node-01 lrmd: [3523]: info: rsc:drbd_01:0:113: notify
Sep 29 08:53:05 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=130:45:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:05 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=113, rc=0, cib-update=148, confirmed=true) ok
Sep 29 08:53:07 node-01 lrmd: [3523]: info: rsc:drbd_01:0:114: notify
Sep 29 08:53:07 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=131:45:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:07 node-01 lrmd: [3523]: info: RA output: (drbd_01:0:notify:stdout)
Sep 29 08:53:07 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=114, rc=0, cib-update=149, confirmed=true) ok
Sep 29 08:53:09 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=68:45:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_monitor_30000 )
Sep 29 08:53:09 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_monitor_30000 (call=115, rc=0, cib-update=150, confirmed=false) ok
Sep 29 08:53:43 node-01 cib: [26073]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-18.raw
Sep 29 08:53:43 node-01 cib: [26073]: info: write_cib_contents: Wrote version 0.581.0 of the CIB to disk (digest: dee87447308900b0a802faf1af419f1d)
Sep 29 08:53:43 node-01 cib: [26073]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.FjyHAJ (digest: /var/lib/heartbeat/crm/cib.xzbgU3)
Sep 29 08:53:52 node-01 lrmd: [3523]: info: cancel_op: operation monitor[115] on ocf::drbd::drbd_01:0 for client 3526, its parameters: CRM_meta_interval=[30000] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_start_resource=[drbd_01:0 drbd_01:1 ] CRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ] CRM_meta_notify_stop_uname=[ ] drbd_resource=[drbd_pbx_service_1] CRM_meta_notify_master_uname=[ ] CRM_meta_master_max=[1] CRM_meta_notify_demote_uname=[ ] CRM_meta_notify_mast cancelled
Sep 29 08:53:52 node-01 crmd: [3526]: info: send_direct_ack: ACK'ing resource op drbd_01:0_monitor_30000 from 1:47:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0: lrm_invoke-lrmd-1285743232-34
Sep 29 08:53:52 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=134:47:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:52 node-01 lrmd: [3523]: info: rsc:drbd_01:0:116: notify
Sep 29 08:53:52 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_monitor_30000 (call=115, status=1, cib-update=0, confirmed=true) Cancelled
Sep 29 08:53:52 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=116, rc=0, cib-update=152, confirmed=true) ok
Sep 29 08:53:53 node-01 cib: [26171]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-19.raw
Sep 29 08:53:53 node-01 cib: [26171]: info: write_cib_contents: Wrote version 0.582.0 of the CIB to disk (digest: 963e0701d5049950c8625121228ae7dc)
Sep 29 08:53:53 node-01 cib: [26171]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.AP6vYW (digest: /var/lib/heartbeat/crm/cib.2JZNgI)
Sep 29 08:53:54 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=135:47:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:54 node-01 lrmd: [3523]: info: rsc:drbd_01:0:117: notify
Sep 29 08:53:54 node-01 lrmd: [3523]: info: RA output: (drbd_01:0:notify:stdout)
Sep 29 08:53:54 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=117, rc=0, cib-update=153, confirmed=true) ok
Sep 29 08:53:56 node-01 lrmd: [3523]: info: rsc:drbd_01:0:118: notify
Sep 29 08:53:56 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=130:47:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_notify_0 )
Sep 29 08:53:56 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_notify_0 (call=118, rc=0, cib-update=154, confirmed=true) ok
Sep 29 08:53:57 node-01 lrmd: [3523]: info: rsc:drbd_01:0:119: promote
Sep 29 08:53:57 node-01 crmd: [3526]: info: do_lrm_rsc_op: Performing key=68:47:0:a09c8e82-b4b8-4d01-988a-ff0cb46ea9a0 op=drbd_01:0_promote_0 )
Sep 29 08:53:58 node-01 lrmd: [3523]: info: RA output: (drbd_01:0:promote:stdout)
Sep 29 08:53:58 node-01 crmd: [3526]: info: process_lrm_event: LRM operation drbd_01:0_promote_0 (call=119, rc=0, cib-update=155, confirmed=true) ok
Oct 01 08:53:39 node-01 lrmd: [3558]: info: cancel_op: operation monitor[48] on lsb::test-01::pbx_01 for client 3561, its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Oct 01 08:53:39 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=13:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_stop_0 )
Oct 01 08:53:39 node-01 lrmd: [3558]: info: rsc:pbx_01:49: stop
Oct 01 08:53:39 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_monitor_20000 (call=48, status=1, cib-update=0, confirmed=true) Cancelled
Oct 01 08:53:39 node-01 lrmd: [32258]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:53:39 node-01 lrmd: [3558]: info: RA output: (pbx_01:stop:stdout) Stopping test-01: [  OK  ]
Oct 01 08:53:39 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_stop_0 (call=49, rc=0, cib-update=65, confirmed=true) ok
Oct 01 08:53:42 node-01 crmd: [3561]: info: do_lrm_invoke: Removing resource pbx_01 from the LRM
Oct 01 08:53:42 node-01 crmd: [3561]: info: do_lrm_invoke: Resource 'pbx_01' deleted for tengine on (null)
Oct 01 08:53:42 node-01 crmd: [3561]: info: notify_deleted: Notifying tengine on node-03 that pbx_01 was deleted
Oct 01 08:53:42 node-01 crmd: [3561]: info: send_direct_ack: ACK'ing resource op pbx_01_delete_0 from 14:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046: lrm_invoke-lrmd-1285916022-16
Oct 01 08:53:46 node-01 lrmd: [3558]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Oct 01 08:53:46 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=64:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_start_0 )
Oct 01 08:53:46 node-01 lrmd: [3558]: info: rsc:pbx_01:50: start
Oct 01 08:53:46 node-01 lrmd: [32322]: WARN: For LSB init script, no additional parameters are needed.
[  OK  ]8:53:46 node-01 lrmd: [3558]: info: RA output: (pbx_01:start:stdout) Starting test-01: [  OK  ]
Oct 01 08:53:46 node-01 crmd: [3561]: info: do_lrm_invoke: Forcing a local LRM refresh
Oct 01 08:53:46 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_start_0 (call=50, rc=0, cib-update=67, confirmed=true) ok
Oct 01 08:53:50 node-01 crmd: [3561]: info: do_lrm_rsc_op: Performing key=8:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_01_monitor_20000 )
Oct 01 08:53:50 node-01 crmd: [3561]: info: process_lrm_event: LRM operation pbx_01_monitor_20000 (call=51, rc=0, cib-update=68, confirmed=false) ok


On node-02 where pbx_service_02 resource group was running

Oct 01 08:38:29 node-02 lrmd: [3560]: info: cancel_op: operation monitor[51] on lsb::test-02::pbx_02 for client 3563, its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Oct 01 08:38:29 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=10:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_stop_0 )
Oct 01 08:38:29 node-02 lrmd: [3560]: info: rsc:pbx_02:52: stop
Oct 01 08:38:29 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_monitor_20000 (call=51, status=1, cib-update=0, confirmed=true) Cancelled
Oct 01 08:38:29 node-02 lrmd: [9167]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:38:29 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_stop_0 (call=52, rc=0, cib-update=69, confirmed=true) ok
Oct 01 08:38:29 node-02 lrmd: [3560]: info: RA output: (pbx_02:stop:stdout) Stopping test-02: [  OK  ]
Oct 01 08:38:31 node-02 crmd: [3563]: info: do_lrm_invoke: Removing resource pbx_02 from the LRM
Oct 01 08:38:31 node-02 crmd: [3563]: info: do_lrm_invoke: Resource 'pbx_02' deleted for tengine on (null)
Oct 01 08:38:31 node-02 crmd: [3563]: info: notify_deleted: Notifying tengine on node-03 that pbx_02 was deleted
Oct 01 08:38:31 node-02 crmd: [3563]: info: send_direct_ack: ACK'ing resource op pbx_02_delete_0 from 11:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046: lrm_invoke-lrmd-1285915111-16
Oct 01 08:38:32 node-02 lrmd: [3560]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Oct 01 08:38:32 node-02 lrmd: [3560]: info: rsc:pbx_02:53: start
Oct 01 08:38:32 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=24:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_start_0 )
Oct 01 08:38:32 node-02 lrmd: [9238]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:38:32 node-02 crmd: [3563]: WARN: msg_to_op(1326): failed to get the value of field lrm_opstatus from a ha_msg
Oct 01 08:38:32 node-02 crmd: [3563]: info: msg_to_op: Message follows:
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG: Dumping message with 16 fields
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[0] : [lrm_t=op]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[1] : [lrm_rid=pbx_02]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[2] : [lrm_op=start]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[3] : [lrm_timeout=60000]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[4] : [lrm_interval=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[5] : [lrm_delay=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[6] : [lrm_copyparams=1]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[7] : [lrm_t_run=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[8] : [lrm_t_rcchange=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[9] : [lrm_exec_time=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[10] : [lrm_queue_time=0]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[11] : [lrm_targetrc=-1]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[12] : [lrm_app=crmd]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[13] : [lrm_userdata=24:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[14] : [(2)lrm_param=0x98eacc0(88 104)]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG: Dumping message with 3 fields
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[0] : [CRM_meta_timeout=60000]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[1] : [crm_feature_set=3.0.1]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[2] : [CRM_meta_name=start]
Oct 01 08:38:32 node-02 crmd: [3563]: info: MSG[15] : [lrm_callid=53]
Oct 01 08:38:33 node-02 crmd: [3563]: info: do_lrm_invoke: Forcing a local LRM refresh
[  OK  ]8:38:33 node-02 lrmd: [3560]: info: RA output: (pbx_02:start:stdout) Starting test-02: [  OK  ]
Oct 01 08:38:33 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_start_0 (call=53, rc=0, cib-update=71, confirmed=true) ok
Oct 01 08:38:37 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=3:21:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_monitor_20000 )
Oct 01 08:38:37 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_monitor_20000 (call=54, rc=0, cib-update=72, confirmed=false) ok
[root <at> node-02 ~]#  grep 08:53: /var/log/ha-log
Sep 29 08:53:43 node-02 cib: [9080]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-16.raw
Sep 29 08:53:43 node-02 cib: [9080]: info: write_cib_contents: Wrote version 0.581.0 of the CIB to disk (digest: dee87447308900b0a802faf1af419f1d)
Sep 29 08:53:43 node-02 cib: [9080]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.mh8MfP (digest: /var/lib/heartbeat/crm/cib.XlqJ9q)
Sep 29 08:53:52 node-02 cib: [9153]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-17.raw
Sep 29 08:53:52 node-02 cib: [9153]: info: write_cib_contents: Wrote version 0.582.0 of the CIB to disk (digest: bf4dff2853265fdf4b5ef64eb8ba6d53)
Sep 29 08:53:52 node-02 cib: [9153]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.Z0jdaH (digest: /var/lib/heartbeat/crm/cib.y7xSYI)
Oct 01 08:53:39 node-02 lrmd: [3560]: info: cancel_op: operation monitor[54] on lsb::test-02::pbx_02 for client 3563, its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Oct 01 08:53:39 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=10:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_stop_0 )
Oct 01 08:53:39 node-02 lrmd: [3560]: info: rsc:pbx_02:55: stop
Oct 01 08:53:39 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_monitor_20000 (call=54, status=1, cib-update=0, confirmed=true) Cancelled
Oct 01 08:53:39 node-02 lrmd: [18173]: WARN: For LSB init script, no additional parameters are needed.
Oct 01 08:53:39 node-02 lrmd: [3560]: info: RA output: (pbx_02:stop:stdout) Stopping test-02: [  OK  ]
Oct 01 08:53:39 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_stop_0 (call=55, rc=0, cib-update=73, confirmed=true) ok
Oct 01 08:53:42 node-02 crmd: [3563]: info: do_lrm_invoke: Removing resource pbx_02 from the LRM
Oct 01 08:53:42 node-02 crmd: [3563]: info: do_lrm_invoke: Resource 'pbx_02' deleted for tengine on (null)
Oct 01 08:53:42 node-02 crmd: [3563]: info: notify_deleted: Notifying tengine on node-03 that pbx_02 was deleted
Oct 01 08:53:42 node-02 crmd: [3563]: info: send_direct_ack: ACK'ing resource op pbx_02_delete_0 from 11:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046: lrm_invoke-lrmd-1285916022-17
Oct 01 08:53:45 node-02 lrmd: [3560]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Oct 01 08:53:45 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=24:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_start_0 )
Oct 01 08:53:45 node-02 lrmd: [3560]: info: rsc:pbx_02:56: start
Oct 01 08:53:45 node-02 lrmd: [18201]: WARN: For LSB init script, no additional parameters are needed.
[  OK  ]8:53:45 node-02 lrmd: [3560]: info: RA output: (pbx_02:start:stdout) Starting test-02: [  OK  ]
Oct 01 08:53:45 node-02 crmd: [3563]: info: do_lrm_invoke: Forcing a local LRM refresh
Oct 01 08:53:45 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_start_0 (call=56, rc=0, cib-update=75, confirmed=true) ok
Oct 01 08:53:48 node-02 crmd: [3563]: info: do_lrm_rsc_op: Performing key=3:22:0:f80b1e20-63f0-4b00-bd04-f10c2bd10046 op=pbx_02_monitor_20000 )
Oct 01 08:53:48 node-02 crmd: [3563]: info: process_lrm_event: LRM operation pbx_02_monitor_20000 (call=57, rc=0, cib-update=76, confirmed=false) ok



The LSB test script, it is common for both pbx resource, expect few changes like filename and etc
[root <at> node-02 ~]# cat /etc/init.d/test-02
#!/bin/bash
#
# Init file for test-02 daemon
#
# chkconfig: 2345 55 25
# description: test server daemon to be used by cluster for
# testing purposes and nothing more
#
# processname: test-02

# source function library
. /etc/rc.d/init.d/functions


RETVAL=0
prog="test-02"
file="/pbx_service_02/$prog.out"

runlevel=$(set -- $(runlevel); eval "echo \$$#" )


start()
{
        # Create keys if necessary

        echo -n $"Starting $prog: "
        touch $file && success || failure
        success
        RETVAL=$?
        echo
}

stop()
{
        echo -n $"Stopping $prog: "
        rm -f $file && success || failure
        RETVAL=$?
}


case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        status)
                if [ -f $file ] ; then
                   touch $file && success || failure
                   RETVAL=$?
                else
                   failure
                   RETVAL=$?
                fi
                ;;
        *)
                echo $"Usage: $0 {start|stop|status}"
                RETVAL=1
esac
exit $RETVAL

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Pavlos Parissis | 1 Oct 2010 09:13
Picon

Re: resources are restarted without obvious reasons

Hi
Could be related to a possible bug mentioned here[1]?

BTW here is the conf of pacemaker
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
        op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pbx_02 lsb:test-02 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false" \
        last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \

Cheers,
Pavlos




[1] http://oss.clusterlabs.org/pipermail/pacemaker/2010-September/007624.html
 

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Koch, Sebastian | 1 Oct 2010 09:41
Picon
Favicon

Re: syslog-ng as resource / how to make sure it gets restarted

Hi Andrew,

 

thanks for your answer. I still need syslog-ng to restart on all nodes after the ClusterIp moved. I tried it like this:

 

 

 

Resource:

primitive res_SyslogNG lsb:syslog-ng \

        op monitor interval="15s" timeout="20s" start-delay="15s"

 

Clone:

clone cl-SyslogNG res_SyslogNG \

        meta globally-unique="false"

 

Order:

order syslogng_after_clusterip inf: res_ClusterIP cl-SyslogNG

 

 

Full Config:

 

 

node pilot01-node1 \

        attributes standby="off"

node pilot01-node2 \

        attributes standby="off"

primitive Monitor-Cluster ocf:pacemaker:ClusterMon \

        params htmlfile="/mnt/cluster/var/www/cluster-monitor.html" \

        params pidfile="/var/run/rlb-cluster-monitor.pid" \

        op start interval="0" timeout="90s" \

        op stop interval="0" timeout="100s"

primitive drbd_pilot0 ocf:linbit:drbd \

        params drbd_resource="pilot0" drbdconf="/usr/local/etc/drbd.conf" \

        operations $id="drbd_pilot0-operations" \

        op monitor interval="15s"

primitive pinggw ocf:pacemaker:pingd \

        params host_list="10.1.1.162" multiplier="200" \

        op monitor interval="10s"

primitive res_Apache lsb:apache2 \

        operations $id="res_Apache-operations" \

        op monitor interval="15s" timeout="20s" start-delay="15s" \

        meta is-managed="true" target-role="Started"

primitive res_ClusterIP ocf:heartbeat:IPaddr2 \

        params iflabel="ClusterIP" ip="10.1.1.12" nic="eth0" cidr_netmask="24" \

        operations $id="res_ClusterIP_1-operations" \

        op monitor start-delay="0" interval="10s" \

        meta target-role="Started"

primitive res_ClusterMonitor ocf:pacemaker:ClusterMon \

        params htmlfile="/mnt/cluster/var/www/cluster-monitor.html" \

        params pidfile="/var/run/rlb-cluster-monitor.pid"

primitive res_Filesystem ocf:heartbeat:Filesystem \

        params fstype="xfs" directory="/mnt/cluster" device="/dev/drbd0" options="noatime,nodiratime,barrier=0"

primitive res_JasperServer lsb:jasperserver \

        op monitor interval="15s" timeout="20s" start-delay="15s"

primitive res_MySQL lsb:mysql \

        op start interval="0" timeout="120" \

        op stop interval="0" timeout="120" \

        op monitor interval="10" timeout="30" depth="0"

primitive res_Nagios lsb:nagios3 \

        operations $id="res_Nagios-operations" \

        op monitor interval="15s" timeout="20s"

primitive res_ProLog2Db lsb:prolog2db \

        op monitor interval="15s" timeout="20s" start-delay="15s"

primitive res_ProLog2File lsb:prolog2file \

        op monitor interval="15s" timeout="20s" start-delay="15s" \

        meta target-role="Started"

primitive res_ProLogWeb lsb:prolog-web \

        op monitor interval="15s" timeout="20s" start-delay="15s"

primitive res_SyslogNG lsb:syslog-ng \

        op monitor interval="15s" timeout="20s" start-delay="15s"

group grp_MySQL res_Filesystem res_ClusterIP res_ProLog2File res_MySQL res_ProLog2Db res_Apache res_ClusterMonitor res_Nagios res_ProLogWeb res_JasperServer \

        meta target-role="Started"

ms ms_drbd_mysql0 drbd_pilot0 \

        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"

clone cl-SyslogNG res_SyslogNG \

        meta globally-unique="false"

clone cl-pinggw pinggw \

        meta globally-unique="false"

location grp_MySQL-with-pinggw grp_MySQL \

        rule $id="grp_MySQL-with-pinggw-rule-1" -inf: not_defined pingd or pingd lte 0

colocation col_drbd_on_mysql inf: grp_MySQL ms_drbd_mysql0:Master

order mysql_after_drbd inf: ms_drbd_mysql0:promote grp_MySQL:start

order syslogng_after_clusterip inf: res_ClusterIP cl-SyslogNG

property $id="cib-bootstrap-options" \

        expected-quorum-votes="2" \

        stonith-enabled="false" \

        no-quorum-policy="ignore" \

        dc-version="1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75" \

        cluster-infrastructure="openais" \

        last-lrm-refresh="1285872257" \

        symmetric-cluster="true" \

        default-action-timeout="240s"

 

 

Thanks in advance.

 

Sebastian Koch
                                                         

NETZWERK GmbH

Fon:  +49.711.220 5498 81
Achtung neue Mobilfunknummer: +49.1522.299 6524
Fax:  +49.711.220 5499 77

Email: sebastian.koch-qiQafilqtGwb1SvskN2V4Q@public.gmane.org
Web: 
www.netzwerk.de

NETZWERK GmbH, Kurze Str. 40, 70794 Filderstadt-Bonlanden
Geschäftsführer: Siegfried Herner, Hans-Baldung Luley, Olaf Müller-Haberland
Sitz der Gesellschaft: Filderstadt-Bonlanden, Amtsgericht Stuttgart HRB 225547, WEEE-Reg Nr. DE 185 622 492

 

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Pavlos Parissis | 1 Oct 2010 09:53
Picon

Re: resources are restarted without obvious reasons

Hi,
It seams that it happens every time PE wants to check the conf
09:23:55 crmd: [3473]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!

and then check_rsc_parameters() wants to reset my resources

09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_02 on node-02, provider changed: heartbeat -> <null>
09:23:55 pengine: [3979]: notice: DeleteRsc: Removing pbx_02 from node-02
09:23:55 pengine: [3979]: notice: check_rsc_parameters: Forcing restart of pbx_01 on node-01, provider changed: heartbeat -> <null>

looking at the code I can't conclude where the issue could  be, in the actual conf or  I am hitting a bug
static gboolean
check_rsc_parameters(resource_t *rsc, node_t *node, xmlNode *rsc_entry,
             pe_working_set_t *data_set)
{
    int attr_lpc = 0;
    gboolean force_restart = FALSE;
    gboolean delete_resource = FALSE;

    const char *value = NULL;
    const char *old_value = NULL;
    const char *attr_list[] = {
        XML_ATTR_TYPE,
        XML_AGENT_ATTR_CLASS,
        XML_AGENT_ATTR_PROVIDER
    };

    for(; attr_lpc < DIMOF(attr_list); attr_lpc++) {
        value = crm_element_value(rsc->xml, attr_list[attr_lpc]);
        old_value = crm_element_value(rsc_entry, attr_list[attr_lpc]);
        if(value == old_value /* ie. NULL */
           || crm_str_eq(value, old_value, TRUE)) {
            continue;
        }

        force_restart = TRUE;
        crm_notice("Forcing restart of %s on %s, %s changed: %s -> %s",
               rsc->id, node->details->uname, attr_list[attr_lpc],
               crm_str(old_value), crm_str(value));
    }
    if(force_restart) {
        /* make sure the restart happens */
        stop_action(rsc, node, FALSE);
        set_bit(rsc->flags, pe_rsc_start_pending);
        delete_resource = TRUE;
    }
    return delete_resource;
}


On 1 October 2010 09:13, Pavlos Parissis <pavlos.parissis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
Hi
Could be related to a possible bug mentioned here[1]?

BTW here is the conf of pacemaker
node $id="b8ad13a6-8a6e-4304-a4a1-8f69fa735100" node-02
node $id="d5557037-cf8f-49b7-95f5-c264927a0c76" node-01
node $id="e5195d6b-ed14-4bb3-92d3-9105543f9251" node-03
primitive drbd_01 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_1" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive drbd_02 ocf:linbit:drbd \
        params drbd_resource="drbd_pbx_service_2" \
        op monitor interval="30s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="120s"
primitive fs_01 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/pbx_service_01" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive fs_02 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/pbx_service_02" fstype="ext3" \
        meta migration-threshold="3" failure-timeout="60" \
        op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive ip_01 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.10" cidr_netmask="24" broadcast="192.168.78.255" \
        meta failure-timeout="120" migration-threshold="3" \
        op monitor interval="5s"
primitive ip_02 ocf:heartbeat:IPaddr2 \
        params ip="192.168.78.20" cidr_netmask="24" broadcast="192.168.78.255" \
        op monitor interval="5s"
primitive pbx_01 lsb:test-01 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive pbx_02 lsb:test-02 \
        meta failure-timeout="60" migration-threshold="3" target-role="Started" \
        op monitor interval="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
group pbx_service_01 ip_01 fs_01 pbx_01 \
        meta target-role="Started"
group pbx_service_02 ip_02 fs_02 pbx_02 \
        meta target-role="Started"
ms ms-drbd_01 drbd_01 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
ms ms-drbd_02 drbd_02 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location PrimaryNode-drbd_01 ms-drbd_01 100: node-01
location PrimaryNode-drbd_02 ms-drbd_02 100: node-02
location PrimaryNode-pbx_service_01 pbx_service_01 200: node-01
location PrimaryNode-pbx_service_02 pbx_service_02 200: node-02
location SecondaryNode-drbd_01 ms-drbd_01 0: node-03
location SecondaryNode-drbd_02 ms-drbd_02 0: node-03
location SecondaryNode-pbx_service_01 pbx_service_01 10: node-03
location SecondaryNode-pbx_service_02 pbx_service_02 10: node-03
colocation fs_01-on-drbd_01 inf: fs_01 ms-drbd_01:Master
colocation fs_02-on-drbd_02 inf: fs_02 ms-drbd_02:Master
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start
order pbx_service_02-after-drbd_02 inf: ms-drbd_02:promote pbx_service_02:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        symmetric-cluster="false" \
        last-lrm-refresh="1285323745"
rsc_defaults $id="rsc-options" \

Cheers,
Pavlos




[1] http://oss.clusterlabs.org/pipermail/pacemaker/2010-September/007624.html
 

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Andrew Beekhof | 1 Oct 2010 13:13
Favicon
Gravatar

Re: [Problem or Enhancement]When attrd reboots, a fail count is initialized.

On Fri, Oct 1, 2010 at 4:00 AM,  <renayama19661014@...> wrote:
> Hi Andrew,
>
> Thank you for comment.
>
>> During crmd startup, one could read all the values from attrd into the
>> hashtable.
>> So the hashtable would only do something if only attrd went down.
>
> If attrd communicates with crmd at the time of start and reads the data of the hash table, the problem
> seems to be able to be settled.
>
> Is the change of this attrd and crmd difficult?

I dont think so.
But its not a huge priority because I've never heard of attrd actually crashing.

So while I agree that its theoretically a problem, in practice no-one
is going to hit this in production.
Even if they were unlucky enough to see it, at worst the resource is
able to run on the node again - which doesn't seem that bad for a HA
cluster :-)

>
>
>> I mean: did you see this behavior in a production system, or only
>> during testing when you manually killed attrd?
>
> We carry out kill-command by manual operation as one of the tests of the trouble of the processes.
> Our user minds behavior of the process trouble very much.
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof <andrew@...> wrote:
>
>> On Wed, Sep 29, 2010 at 3:59 AM,  <renayama19661014@...> wrote:
>> > Hi Andrew,
>> >
>> > Thank you for comment.
>> >
>> >> The problem here is that attrd is supposed to be the authoritative
>> >> source for this sort of data.
>> >
>> > Yes. I understand.
>> >
>> >> Additionally, you don't always want attrd reading from the status
>> >> section - like after the cluster restarts.
>> >
>> > The problem seems to be able to solve even that it retrieves a status section from cib after
>> attrd
>> > rebooted.
>> > "method2" which I suggested is such a meaning.
>> >> > method 2)When attrd started, Attrd communicates with cib and receives fail-count.
>> >
>> >> For failcount, the crmd could keep a hashtable of the current values
>> >> which it could re-send to attrd if it detects a disconnection.
>> >> But that might not be a generic-enough solution.
>> >
>> > If a Hash table of crmd can maintain it, it may be a good thought.
>> > However, I have a feeling that the same problem happens when crmd causes trouble and rebooted.
>>
>> During crmd startup, one could read all the values from attrd into the
>> hashtable.
>> So the hashtable would only do something if only attrd went down.
>>
>> >
>> >> The chance that attrd dies _and_ there were relevant values for
>> >> fail-count is pretty remote though... is this a real problem you've
>> >> experienced or a theoretical one?
>> >
>> > I did not understand meanings well.
>> > Does this mean that there is fail-count of attrd in the other node?
>>
>> I mean: did you see this behavior in a production system, or only
>> during testing when you manually killed attrd?
>>
>> >
>> > Best Regards,
>> > Hideo Yamauchi.
>> >
>> > --- Andrew Beekhof <andrew@...> wrote:
>> >
>> >> On Mon, Sep 27, 2010 at 7:26 AM,
&#65533;<renayama19661014@...> wrote:
>> >> > Hi,
>> >> >
>> >> > When I investigated another problem, I discovered this phenomenon.
>> >> > If attrd causes process trouble and does not restart, the problem does not occur.
>> >> >
>> >> > Step1) After start, it causes a monitor error in UmIPaddr twice.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> > &#65533;Resource Group: UMgroup01
>> >> > &#65533; &#65533; UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
>> >> > &#65533; &#65533; UmIPaddr &#65533; (ocf::heartbeat:Dummy2): &#65533; &#65533; &#65533;
>> > &#65533;Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> > &#65533; UmIPaddr: migration-threshold=10 fail-count=2
>> >> >
>> >> > Step2) Kill Attrd and Attrd reboots.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> > &#65533;Resource Group: UMgroup01
>> >> > &#65533; &#65533; UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
>> >> > &#65533; &#65533; UmIPaddr &#65533; (ocf::heartbeat:Dummy2): &#65533; &#65533; &#65533;
>> > &#65533;Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> > &#65533; UmIPaddr: migration-threshold=10 fail-count=2
>> >> >
>> >> > Step3) It causes a monitor error in UmIPaddr.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> > &#65533;Resource Group: UMgroup01
>> >> > &#65533; &#65533; UmVIPcheck (ocf::heartbeat:Dummy): Started srv01
>> >> > &#65533; &#65533; UmIPaddr &#65533; (ocf::heartbeat:Dummy2): &#65533; &#65533; &#65533;
>> > &#65533;Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> > &#65533; UmIPaddr: migration-threshold=10 fail-count=1 -----> Fail-count return to the
>> first.
>> >> >
>> >> > The problem is so that attrd disappears fail-count by reboot.(Hash-tables is Lost.)
>> >> > It is a problem very much that the trouble number of times is initialized.
>> >> >
>> >> > I think that there is the following method.
>> >> >
>> >> > method 1)Attrd maintain fail-count as a file in "/var/run" directories and refer.
>> >> >
>> >> > method 2)When attrd started, Attrd communicates with cib and receives fail-count.
>> >> >
>> >> > Is there a better method?
>> >> >
>> >> > Please think about the solution of this problem.
>> >>
>> >> Hmmmm... a tricky one.
>> >>
>> >> The problem here is that attrd is supposed to be the authoritative
>> >> source for this sort of data.
>> >> Additionally, you don't always want attrd reading from the status
>> >> section - like after the cluster restarts.
>> >>
>> >> For failcount, the crmd could keep a hashtable of the current values
>> >> which it could re-send to attrd if it detects a disconnection.
>> >> But that might not be a generic-enough solution.
>> >>
>> >> The chance that attrd dies _and_ there were relevant values for
>> >> fail-count is pretty remote though... is this a real problem you've
>> >> experienced or a theoretical one?
>> >>
>> >> _______________________________________________
>> >> Pacemaker mailing list: Pacemaker@...
>> >> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>
>> >> Project Home: http://www.clusterlabs.org
>> >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >>
>> >
>> >
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker@...
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> >
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@...
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@...
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Holger.Teutsch | 1 Oct 2010 15:40

Fail-count and failure timeout

Hi,
I observed the following in pacemaker Versions 1.1.3 and tip up to patch 10258.

In a small test environment to study fail-count behavior I have one resource

anything
doing sleep 600 with monitoring interval 10 secs.

The failure-timeout is 300.

I would expect to never see a failcount higher than 1.

I observed some sporadic clears but mostly the count is increasing by 1 each 10 minutes.

Am I mistaken or is this a bug ?

Regards
Holger

-- complete cib for reference ---

<cib epoch="32" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.4" have-quorum="0" cib-last-written="Fri Oct  1 14:17:31 2010" dc-uuid="hotlx">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.3-09640bd6069e677d5eed65203a6056d9bf562e67"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>
        <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
        <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1285926879"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="hotlx" uname="hotlx" type="normal"/>
    </nodes>
    <resources>
      <primitive class="ocf" id="test" provider="heartbeat" type="anything">
        <meta_attributes id="test-meta_attributes">
          <nvpair id="test-meta_attributes-target-role" name="target-role" value="started"/>
          <nvpair id="test-meta_attributes-failure-timeout" name="failure-timeout" value="300"/>
        </meta_attributes>
        <operations id="test-operations">
          <op id="test-op-monitor-10" interval="10" name="monitor" on-fail="restart" timeout="20s"/>
          <op id="test-op-start-0" interval="0" name="start" on-fail="restart" timeout="20s"/>
        </operations>
        <instance_attributes id="test-instance_attributes">
          <nvpair id="test-instance_attributes-binfile" name="binfile" value="sleep 600"/>
        </instance_attributes>
      </primitive>
    </resources>
    <constraints/>
  </configuration>
</cib>
_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Shravan Mishra | 1 Oct 2010 15:45
Picon

Re: cib

Hi,

Just a quick question, who generates the very first cib.xml when
pacemaker processes are initialized?

Thanks
Shravan

On Thu, Sep 30, 2010 at 4:22 AM, Andrew Beekhof <andrew@...> wrote:
> On Tue, Sep 28, 2010 at 11:47 AM, Andrew Beekhof <andrew@...> wrote:
>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>> <shravan.mishra@...> wrote:
>>> Thanks Raoul for the response.
>>>
>>> Changing the permission to hacluster:haclient did stop that error.
>>>
>>> Now I'm hitting another problem whereby cib is failing to start
>>
>> Very strange logs.
>> Which distribution is this?
>
>   ^^^^^^^^^^^^^^^^^^^^^^^^
>
>> What does your corosync.conf look like?
>>
>>
>>> =====
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has process list:
>>> 00000000000000000000000000110012 (1114130)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>> ha2.itactics.com now has 1 quorum votes (was 0)
>>> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
>>> Sending membership update 100 to 0 children
>>> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
>>> ready to provide service.
>>> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14889, rc=127)
>>> Sep 27 00:16:30 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:30 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14896 for process cib
>>> crmd[14893]: 2010/09/27_00:16:30 WARN: do_cib_control: Couldn't
>>> complete CIB registration 1 times... pause and retry
>>> Sep 27 00:16:31 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14896, rc=127)
>>> Sep 27 00:16:31 corosync [pcmk  ] notice: pcmk_wait_dispatch:
>>> Respawning failed child process: cib
>>> Sep 27 00:16:31 corosync [pcmk  ] info: spawn_child: Forked child
>>> 14901 for process cib
>>> Sep 27 00:16:32 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib exited (pid=14901, rc=1
>>> ======
>>>
>>>
>>> I have attached the full logs.
>>>
>>> We are using  corosync 1.2.8 and pacemaker 1.1.3.
>>>
>>>
>>>  Thanks.
>>> Shravan
>>>
>>>
>>>
>>> On Sat, Sep 25, 2010 at 4:36 AM, Raoul Bhatia [IPAX] <r.bhatia@...> wrote:
>>>> On 24.09.2010 21:41, Shravan Mishra wrote:
>>>>>
>>>>> crmd[20612]: 2010/09/24_15:29:57 ERROR: crm_log_init_worker: Cannot
>>>>> change active directory to /var/lib/heartbeat/cores/hacluster:
>>>>> Permission denied (13)
>>>>
>>>> ls -ald /var/lib/heartbeat/cores/hacluster /var/lib/heartbeat/cores/
>>>> /var/lib/heartbeat/ /var/lib/ /var/
>>>>
>>>> is haclient allowed to cd all the way into
>>>> /var/lib/heartbeat/cores/hacluster ?
>>>>
>>>> cheers,
>>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@...
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>
>>>
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@...
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Andreas Hofmeister | 1 Oct 2010 20:39
Favicon

resource stop timeout broken in 1.0 branch tip

Hi,

it seems to me that patch

   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/8241f689bf9f

broke  timeouts for stop operations. The observable effect is that the 
timeout for stop operations is always 125s, regardless what was 
specified in the CIB. Reverting the part of the patch that changes 
"crmd/lrm.c" seems to fix the problem.

The attached patch reverts the change to "crmd/lrm.c" and also

   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/66df1404fdcb

which dealt with another side effect of the change in "crmd/lrm.c" .

The stuff around the  place in the code seems to assume that stop 
operations do not always have attributes.  I'm wondering when this 
happens as I cannot remember to see any of the assertions that appear 
w/o patch "66df1404fdcb"

Ciao
   Andi
Attachment (016-stop-timeout.patch): text/x-patch, 861 bytes
_______________________________________________
Pacemaker mailing list: Pacemaker@...
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Gmane