<p dir="ltr">&gt; Hi guys!<br>

&gt;<br>

&gt; I&#39;m having a strange problem with a Pacemaker/Heartbeat cluster when I put it into maintenance mode.<br>

&gt;<br>

&gt; First of all, let me show my configuration:<br>

&gt;<br>

&gt; [ Pacemaker ]<br>

&gt;<br>

&gt; node $id=&quot;23ade9ce-d274-4e56-aa91-9e95a8c08cf7&quot; test-lb02 \<br>

&gt;         attributes standby=&quot;off&quot;<br>

&gt; node $id=&quot;52ac429f-2b78-4630-bbd8-fb73a8152ab3&quot; test-lb01<br>

&gt; primitive ClusterMon ocf:pacemaker:ClusterMon \<br>

&gt;         params user=&quot;root&quot; update=&quot;30&quot; extra_options=&quot;-T somemail -F somemail -P PACEMAKER&quot; \<br>

&gt;         op monitor interval=&quot;60&quot; timeout=&quot;20&quot; on-fail=&quot;standby&quot;<br>

&gt; primitive IP-rsc_apache ocf:heartbeat:IPaddr2 \<br>

&gt;         params ip=&quot;xx.xx.xx.yy&quot; nic=&quot;eth0&quot; cidr_netmask=&quot;255.255.255.192&quot; \<br>

&gt;         meta migration-threshold=&quot;2&quot; target-role=&quot;Started&quot; \<br>

&gt;         op monitor interval=&quot;20&quot; timeout=&quot;20&quot; on-fail=&quot;standby&quot;<br>

&gt; primitive Nginx-rsc ocf:heartbeat:nginx \<br>

&gt;         meta migration-threshold=&quot;2&quot; is-managed=&quot;true&quot; target-role=&quot;Started&quot; \<br>

&gt;         op monitor interval=&quot;20&quot; timeout=&quot;20&quot; on-fail=&quot;standby&quot;<br>

&gt; clone ClusterMon-clone ClusterMon \<br>

&gt;         meta taget-role=&quot;Started&quot;<br>

&gt; colocation lb-loc inf: IP-rsc_apache Nginx-rsc<br>

&gt; order lb-ord inf: IP-rsc_apache Nginx-rsc<br>

&gt; property $id=&quot;cib-bootstrap-options&quot; \<br>

&gt;         stonith-enabled=&quot;no&quot; \<br>

&gt;         dc-version=&quot;1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff&quot; \<br>

&gt;         cluster-infrastructure=&quot;Heartbeat&quot; \<br>

&gt;         maintenance-mode=&quot;false&quot; \<br>

&gt;         cluster-recheck-interval=&quot;60s&quot;<br>

&gt;<br>

&gt; [heartbeat]<br>

&gt;<br>

&gt; crm yes<br>

&gt;<br>

&gt; logfile /var/log/ha-log<br>

&gt;<br>

&gt; logfacility     local0<br>

&gt;<br>

&gt; keepalive 2<br>

&gt; deadtime 30<br>

&gt; warntime 10<br>

&gt; initdead 120<br>

&gt;<br>

&gt; auto_failback off<br>

&gt;<br>

&gt; ucast   eth0 xx.xx.xx.xx<br>

&gt; ucast   eth0 xx.xx.xx.xy<br>

&gt;<br>

&gt; node    test-lb01<br>

&gt; node    test-lb02<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; [STATUS]<br>

&gt;<br>

&gt; Last updated: Fri May 29 18:04:11 2015<br>

&gt; Last change: Fri May 29 18:01:57 2015 via cibadmin on test-lb01<br>

&gt; Stack: Heartbeat<br>

&gt; Current DC: test-lb01 (52ac429f-2b78-4630-bbd8-fb73a8152ab3) - partition with quorum<br>

&gt; Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff<br>

&gt; 2 Nodes configured, unknown expected votes<br>

&gt; 4 Resources configured.<br>

&gt; ============<br>

&gt;<br>

&gt; Online: [ test-lb01 test-lb02 ]<br>

&gt;<br>

&gt;  IP-rsc_apache (ocf::heartbeat:IPaddr2): Started test-lb01<br>

&gt;  Nginx-rsc (ocf::heartbeat:nginx): Started test-lb01<br>

&gt;  Clone Set: ClusterMon-clone [ClusterMon]<br>

&gt;      Started: [ test-lb01 test-lb02 ]<br>

&gt;<br>

&gt;<br>

&gt; ====================<br>

&gt;<br>

&gt;<br>

&gt; So, everything is working. But now, if I enable maintenance-mode on the cluster, both nodes get rebooted:<br>

&gt;<br>

&gt;<br>

&gt; crm(live)# configure property maintenance-mode=&quot;true&quot;<br>

&gt;<br>

&gt; Then in the logs:<br>

&gt;<br>

&gt;<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 2: cancel IP-rsc_apache_monitor_20000 on test-lb01 (local)<br>

&gt; May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation monitor[8] on IP-rsc_apache for client 3240, its parameters: cidr_netmask=[255.255.255.192] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] CRM_meta_on_fail=[standby] crm_feature_set=[3.0.6] CRM_meta_interval=[20000] ip=[xx.xx.xx.yy] nic=[eth0]  cancelled<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 3: cancel Nginx-rsc_monitor_20000 on test-lb01 (local)<br>

&gt; May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation monitor[10] on Nginx-rsc for client 3240, its parameters: crm_feature_set=[3.0.6] CRM_meta_on_fail=[standby] CRM_meta_name=[monitor] CRM_meta_interval=[20000] CRM_meta_timeout=[20000]  cancelled<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 4: cancel ClusterMon:0_monitor_60000 on test-lb01 (local)<br>

&gt; May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation monitor[6] on ClusterMon:0 for client 3240, its parameters: CRM_meta_timeout=[20000] CRM_meta_name=[monitor] CRM_meta_on_fail=[standby] crm_feature_set=[3.0.6] CRM_meta_notify=[false] extra_options=[-T somemail -F somemail -P PACEMAKuser=[root] CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_clone_node_max=[1] CRM_meta_interval=[60000] CRM_meta_globally_unique=[false] update=[30]  cancelled<br>

&gt;<br>

&gt; May 29 18:06:39 test-lb02 lrmd: [3223]: info: cancel_op: operation monitor[6] on ClusterMon:1 for client 3226, its parameters: CRM_meta_timeout=[20000] CRM_meta_name=[monitor] CRM_meta_on_fail=[standby] crm_feature_set=[3.0.6] CRM_meta_notify=[false] extra_options=[-T somemail -F some mail -P PACEMAKuser=[root] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_clone_node_max=[1] CRM_meta_interval=[60000] CRM_meta_globally_unique=[false] update=[30]  cancelled<br>

&gt;<br>

&gt;<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 1: cancel ClusterMon:1_monitor_60000 on test-lb02<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM operation IP-rsc_apache_monitor_20000 (call=8, status=1, cib-update=0, confirmed=true) Cancelled<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_20000 (call=10, status=1, cib-update=0, confirmed=true) Cancelled<br>

&gt; May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM operation ClusterMon:0_monitor_60000 (call=6, status=1, cib-update=0, confirmed=true) Cancelled<br>

&gt; May 29 18:06:39 test-lb02 crmd: [3226]: info: process_lrm_event: LRM operation ClusterMon:1_monitor_60000 (call=6, status=1, cib-update=0, confirmed=true) Cancelled<br>

&gt;<br>

&gt;<br>

&gt; But after 60s, when the recheck timer pops, the whole cluster goes down:<br>

&gt;<br>

&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (60000ms)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: notice: do_state_transition: State transition S_IDLE -&gt; S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: notice: do_state_transition: State transition S_POLICY_ENGINE -&gt; S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling timer for action 2 (src=98)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling timer for action 3 (src=99)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling timer for action 4 (src=100)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: do_te_invoke: Processing graph 8 (ref=pe_calc-dc-1432915660-42) derived from /var/lib/pengine/pe-input-242.bz2<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 2: cancel IP-rsc_apache_monitor_20000 on test-lb01 (local)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: cancel_op: No pending op found for IP-rsc_apache:8<br>

&gt; May 29 18:07:40 test-lb01 lrmd: [3237]: info: on_msg_cancel_op: no operation with id 8<br>

&gt; May 29 18:07:40 test-lb01 cib: [3236]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname=&#39;test-lb01&#39;]//lrm_resource[@id=&#39;IP-rsc_apache&#39;]/lrm_rsc_op[@id=&#39;IP-rsc_apache_monitor_20000&#39; and @call-id=&#39;8&#39;] (origin=local/crmd/73, version=0.124.27): ok (rc=0)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 3: cancel Nginx-rsc_monitor_20000 on test-lb01 (local)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: get_lrm_resource: Could not add resource Nginx-rsc to LRM<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: do_lrm_invoke: Invalid resource definition<br>

&gt;<br>

&gt;<br>

&gt; And this message is repeated for every resource I have:<br>

&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input &lt;create_request_adv origin=&quot;te_rsc_command&quot; t=&quot;crmd&quot; version=&quot;3.0.6&quot; subt=&quot;request&quot; reference=&quot;lrm_invoke-tengine-1432915660-45&quot; crm_task=&quot;lrm_invoke&quot; crm_sys_to=&quot;lrmd&quot; crm_sys_from=&quot;tengine&quot; crm_host_to=&quot;test-lb01&quot; &gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input   &lt;crm_xml &gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input     &lt;rsc_op id=&quot;3&quot; operation=&quot;cancel&quot; operation_key=&quot;Nginx-rsc_monitor_20000&quot; on_node=&quot;test-lb01&quot; on_node_uuid=&quot;52ac429f-2b78-4630-bbd8-fb73a8152ab3&quot; transition-key=&quot;3:8:0:3edaee69-5093-4538-8d12-90e0db0658ba&quot; &gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input       &lt;primitive id=&quot;Nginx-rsc&quot; long-id=&quot;Nginx-rsc&quot; class=&quot;ocf&quot; provider=&quot;heartbeat&quot; type=&quot;nginx&quot; /&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input       &lt;attributes CRM_meta_call_id=&quot;10&quot; CRM_meta_interval=&quot;20000&quot; CRM_meta_name=&quot;monitor&quot; CRM_meta_on_fail=&quot;standby&quot; CRM_meta_operation=&quot;monitor&quot; CRM_meta_timeout=&quot;20000&quot; crm_feature_set=&quot;3.0.6&quot; /&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input     &lt;/rsc_op&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input   &lt;/crm_xml&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input &lt;/create_request_adv&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 4: cancel ClusterMon:0_monitor_60000 on test-lb01 (local)<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.<br>

&gt;<br>

&gt; And then the real crash:<br>

&gt;<br>

&gt; May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL<br>

&gt; May 29 18:07:40 test-lb01 heartbeat: [3155]: WARN: Managed /usr/lib/heartbeat/crmd process 3240 killed by signal 11 [SIGSEGV - Segmentation violation].<br>

&gt; May 29 18:07:40 test-lb01 heartbeat: [3155]: ERROR: Managed /usr/lib/heartbeat/crmd process 3240 dumped core<br>

&gt; May 29 18:07:40 test-lb01 heartbeat: [3155]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/crmd<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; At the same time, I found this on the other node:<br>

&gt;<br>

&gt;<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: CRIT: lrm_connection_destroy: LRM Connection failed<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: info: lrm_connection_destroy: LRM Connection disconnected<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_ELECTION<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: notice: do_state_transition: State transition S_ELECTION -&gt; S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: info: do_dc_release: DC role released<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: info: do_te_control: Transitioner is now inactive<br>

&gt; May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY<br>

&gt;<br>

&gt;<br>

&gt; May 29 18:07:41 test-lb02 heartbeat: [3148]: WARN: Managed /usr/lib/heartbeat/crmd process 3226 exited with return code 2.<br>

&gt; May 29 18:07:41 test-lb02 heartbeat: [3148]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/crmd<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; For some reason, Pacemaker found something that it didn&#39;t like and crashed, but I don&#39;t understand what.<br>

&gt; Could someone throw me some hints about that?<br>

&gt;<br>

&gt; Thanks in advance<br>

&gt; Have a nice weekend!<br>

&gt; Best Regards<br>

&gt;<br>

&gt;<br>

&gt;<br>

</p>
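When reading excerpts like the ones quoted above, it can help to pull out the first high-severity entry per node, since everything after it is usually fallout. The following is only a minimal sketch: it assumes the `Mon DD HH:MM:SS host daemon: [pid]: LEVEL: message` layout seen in the quoted log lines, and the function name `first_high_severity` and the set of levels treated as "high" are illustrative choices, not part of Pacemaker or Heartbeat.

```python
import re

# Layout assumed from the quoted excerpts, e.g.
# "May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: message"
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (\S+): \[(\d+)\]: (\w+): (.*)$"
)

# Levels treated as "high" in this sketch (an assumption, adjust as needed).
HIGH = {"ERROR", "CRIT", "EMERG"}


def first_high_severity(lines):
    """Return {host: (timestamp, daemon, level, message)} for the first
    high-severity entry seen on each host, in input order."""
    first = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blanks and lines that do not fit the layout
        stamp, host, daemon, _pid, level, message = match.groups()
        if level in HIGH and host not in first:
            first[host] = (stamp, daemon, level, message)
    return first


# Sample lines copied from the quoted logs.
sample = [
    "May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating action 3: cancel Nginx-rsc_monitor_20000 on test-lb01 (local)",
    "May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.",
    "May 29 18:07:41 test-lb02 crmd: [3226]: CRIT: lrm_connection_destroy: LRM Connection failed",
    "May 29 18:07:41 test-lb02 heartbeat: [3148]: EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/crmd",
]

for host, entry in first_high_severity(sample).items():
    print(host, *entry)
```

To run it against a real log, pass an open file instead of `sample`, for example the `/var/log/ha-log` path named in the quoted heartbeat configuration.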