<div dir="ltr"><div>Ken,</div><div><br></div><div>I have another set of logs : </div><div><br></div><div><font face="monospace" size="2">Sep 01 09:10:05 [1328] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>       crmd:     info: do_lrm_rsc_op: Performing key=5:50864:0:86160921-abd7-4e14-94d4-f53cee278858 op=SVSDEHA_monitor_2000<br>SvsdeStateful(SVSDEHA)[6174]:   2017/09/01_09:10:06 ERROR: Resource is in failed state<br>Sep 01 09:10:06 [1328] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>       crmd:     info: action_synced_wait:    Managed SvsdeStateful_meta-data_0 process 6274 exited with rc=4<br>Sep 01 09:10:06 [1328] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>       crmd:    error: generic_get_metadata:  Failed to receive meta-data for ocf:pacemaker:SvsdeStateful<br>Sep 01 09:10:06 [1328] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>       crmd:    error: build_operation_update:    No metadata for ocf::pacemaker:SvsdeStateful<br>Sep 01 09:10:06 [1328] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>       crmd:     info: process_lrm_event: Result of monitor operation for SVSDEHA on <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>: 0 (ok) | call=939 key=SVSDEHA_monitor_2000 confirmed=false cib-update=476<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_process_request:   Forwarding cib_modify operation for section status to all (origin=local/crmd/476)<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: --- 0.37.4054 2<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: +++ 0.37.4055 (null)<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4055<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id=&#39;1&#39;]/lrm[@id=&#39;1&#39;]/lrm_resources/lrm_resource[@id=&#39;SVSDEHA&#39;]:  &lt;lrm_rsc_op id=&quot;SVSDEHA_monitor_2000&quot; operation_key=&quot;SVSDEHA_monitor_2000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_resource&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;5:50864:0:86160921-abd7-4e14-94d4-f53cee278858&quot; transition-magic=&quot;0:0;5:50864:0:86160921-abd7-4e14-94d4-f53cee278858&quot; on_node=&quot;TPC-F9-26.phaedrus.sandvi<br>Sep 01 09:10:06 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=<a href="http://TPC-F9-26.phaedrus.sandvine.com/crmd/476">TPC-F9-26.phaedrus.sandvine.com/crmd/476</a>, version=0.37.4055)<br><b>Sep 01 09:10:12 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_process_ping:  Reporting our current digest to <a href="http://TPC-E9-23.phaedrus.sandvine.com">TPC-E9-23.phaedrus.sandvine.com</a>: 74bbb7e9f35fabfdb624300891e32018 for 
0.37.4055 (0x7f5719954560 0)<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: --- 0.37.4055 2</b><br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: +++ 0.37.4056 (null)<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4056<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id=&#39;2&#39;]/lrm[@id=&#39;2&#39;]/lrm_resources/lrm_resource[@id=&#39;SVSDEHA&#39;]:  &lt;lrm_rsc_op id=&quot;SVSDEHA_last_failure_0&quot; operation_key=&quot;SVSDEHA_monitor_1000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_resource&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;7:50662:8:86160921-abd7-4e14-94d4-f53cee278858&quot; transition-magic=&quot;2:1;7:50662:8:86160921-abd7-4e14-94d4-f53cee278858&quot; on_node=&quot;TPC-E9-23.phaedrus.sand<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=<a href="http://TPC-E9-23.phaedrus.sandvine.com/crmd/53508">TPC-E9-23.phaedrus.sandvine.com/crmd/53508</a>, version=0.37.4056)<br>Sep 01 09:15:33 [1327] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>      attrd:     info: attrd_peer_update: Setting fail-count-SVSDEHA[<a href="http://TPC-E9-23.phaedrus.sandvine.com">TPC-E9-23.phaedrus.sandvine.com</a>]: (null) -&gt; 1 from <a href="http://TPC-E9-23.phaedrus.sandvine.com">TPC-E9-23.phaedrus.sandvine.com</a><br>Sep 01 09:15:33 [1327] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>      attrd:     info: attrd_peer_update: Setting last-failure-SVSDEHA[<a href="http://TPC-E9-23.phaedrus.sandvine.com">TPC-E9-23.phaedrus.sandvine.com</a>]: (null) -&gt; 1504271733 from <a href="http://TPC-E9-23.phaedrus.sandvine.com">TPC-E9-23.phaedrus.sandvine.com</a><br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: --- 0.37.4056 2<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    Diff: +++ 0.37.4057 (null)<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    +  /cib:  @num_updates=4057<br>Sep 01 09:15:33 [1325] <a href="http://TPC-F9-26.phaedrus.sandvine.com">TPC-F9-26.phaedrus.sandvine.com</a>        cib:     info: cib_perform_op:    ++ /cib/status/node_state[@id=&#39;2&#39;]/transient_attributes[@id=&#39;2&#39;]/instance_attributes[@id=&#39;status-2&#39;]:  &lt;nvpair id=&quot;status-2-fail-count-SVSDEHA&quot; name=&quot;fail-count-SVSDEHA&quot; </font></div><div><font face="monospace" size="2">value=&quot;1&quot;/&gt;</font></div><div><font face="Consolas" size="2"><br></font></div><div>I was suspecting around the highlighted parts of the logs above. </div><div><font face="sans-serif" size="2">After 09:10:12 the next log is at 09:15:33. 
During this time the resource on the other node failed several times, but it was never migrated here.</font></div><div><font size="2"><br></font></div><div><font size="2">I have yet to test sbd fencing with the patch Klaus shared.</font></div><div><font size="2">I am on CentOS. </font></div><div><font face="monospace" size="2"><br></font></div><div><font face="monospace" size="2"># cat /etc/centos-release<br>CentOS Linux release 7.3.1611 (Core)</font></div><div><font face="sans-serif" size="2"><br></font></div><div><font face="sans-serif" size="2">Regards,</font></div><div><font face="sans-serif" size="2">Abhay</font></div><div><font face="sans-serif" size="2"><br></font></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Sat, 2 Sep 2017 at 15:23 Klaus Wenninger &lt;<a href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 09/01/2017 11:45 PM, Ken Gaillot wrote:<br>
&gt; On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:<br>
&gt;&gt;         Are you sure the monitor stopped? Pacemaker only logs<br>
&gt;&gt;         recurring monitors<br>
&gt;&gt;         when the status changes. Any successful monitors after this<br>
&gt;&gt;         wouldn&#39;t be<br>
&gt;&gt;         logged.<br>
&gt;&gt;<br>
&gt;&gt; Yes. Since there  were no logs which said &quot;RecurringOp:  Start<br>
&gt;&gt; recurring monitor&quot; on the node after it had failed.<br>
&gt;&gt; Also there were no logs for any actions pertaining to<br>
&gt;&gt; The problem was that even though the one node was failing, the<br>
&gt;&gt; resources were never moved to the other node (the node on which I<br>
&gt;&gt; suspect monitoring had stopped).<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;         There are a lot of resource action failures, so I&#39;m not sure<br>
&gt;&gt;         where the<br>
&gt;&gt;         issue is, but I&#39;m guessing it has to do with<br>
&gt;&gt;         migration-threshold=1 --<br>
&gt;&gt;         once a resource has failed once on a node, it won&#39;t be allowed<br>
&gt;&gt;         back on<br>
&gt;&gt;         that node until the failure is cleaned up. Of course you also<br>
&gt;&gt;         have<br>
&gt;&gt;         failure-timeout=1s, which should clean it up immediately, so<br>
&gt;&gt;         I&#39;m not<br>
&gt;&gt;         sure.<br>
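<br>
A quick way to see whether a lingering failure is still blocking a node, and to clear it by<br>
hand, is something like this (SVSDEHA being the resource from this thread):<br>
<br>
# pcs resource failcount show SVSDEHA<br>
# pcs resource cleanup SVSDEHA<br>
<br>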
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; migration-threshold=1<br>
&gt;&gt; failure-timeout=1s<br>
&gt;&gt;<br>
&gt;&gt; cluster-recheck-interval=2s<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;         first, set &quot;two_node:<br>
&gt;&gt;         1&quot; in corosync.conf and let no-quorum-policy default in<br>
&gt;&gt;         pacemaker<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; This is already configured.<br>
&gt;&gt; # cat /etc/corosync/corosync.conf<br>
&gt;&gt; totem {<br>
&gt;&gt;     version: 2<br>
&gt;&gt;     secauth: off<br>
&gt;&gt;     cluster_name: SVSDEHA<br>
&gt;&gt;     transport: udpu<br>
&gt;&gt;     token: 5000<br>
&gt;&gt; }<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; nodelist {<br>
&gt;&gt;     node {<br>
&gt;&gt;         ring0_addr: 2.0.0.10<br>
&gt;&gt;         nodeid: 1<br>
&gt;&gt;     }<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;     node {<br>
&gt;&gt;         ring0_addr: 2.0.0.11<br>
&gt;&gt;         nodeid: 2<br>
&gt;&gt;     }<br>
&gt;&gt; }<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; quorum {<br>
&gt;&gt;     provider: corosync_votequorum<br>
&gt;&gt;     two_node: 1<br>
&gt;&gt; }<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; logging {<br>
&gt;&gt;     to_logfile: yes<br>
&gt;&gt;     logfile: /var/log/cluster/corosync.log<br>
&gt;&gt;     to_syslog: yes<br>
&gt;&gt; }<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;         let no-quorum-policy default in pacemaker; then,<br>
&gt;&gt;         get stonith configured, tested, and enabled<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; By not configuring no-quorum-policy, would it ignore quorum for a 2<br>
&gt;&gt; node cluster?<br>
&gt; With two_node, corosync always provides quorum to pacemaker, so<br>
&gt; pacemaker doesn&#39;t see any quorum loss. The only significant difference<br>
&gt; from ignoring quorum is that corosync won&#39;t form a cluster from a cold<br>
&gt; start unless both nodes can reach each other (a safety feature).<br>
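<br>
As a quick check, corosync-quorumtool should reflect this (trimmed example, exact output will vary):<br>
<br>
# corosync-quorumtool -s<br>
...<br>
Quorate:          Yes<br>
Flags:            2Node Quorate WaitForAll<br>
<br>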
&gt;<br>
&gt;&gt; For my use case I don&#39;t need stonith enabled. My intention is to have<br>
&gt;&gt; a highly available system all the time.<br>
&gt; Stonith is the only way to recover from certain types of failure, such<br>
&gt; as the &quot;split brain&quot; scenario, and a resource that fails to stop.<br>
&gt;<br>
&gt; If your nodes are physical machines with hardware watchdogs, you can set<br>
&gt; up sbd for fencing without needing any extra equipment.<br>
<br>
Small caveat here:<br>
If I get it right you have a 2-node-setup. In this case the watchdog-only<br>
sbd-setup would not be usable as it relies on &#39;real&#39; quorum.<br>
In 2-node-setups sbd needs at least a single shared disk.<br>
For the sbd-single-disk-setup working with 2-node<br>
you need the patch from <a href="https://github.com/ClusterLabs/sbd/pull/23" rel="noreferrer" target="_blank">https://github.com/ClusterLabs/sbd/pull/23</a><br>
in place. (Saw you mentioning RHEL documentation - RHEL 7.4 has<br>
included it since GA.)<br>
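<br>
In case it helps, a rough sketch of that disk-based setup (the device path is only a<br>
placeholder and this is not a complete recipe):<br>
<br>
# one-time initialization of the shared device, run from one node<br>
sbd -d /dev/disk/by-id/SHARED_DISK create<br>
# /etc/sysconfig/sbd on both nodes (excerpt)<br>
SBD_DEVICE=&quot;/dev/disk/by-id/SHARED_DISK&quot;<br>
SBD_WATCHDOG_DEV=&quot;/dev/watchdog&quot;<br>
# enable the sbd service and add a matching stonith resource<br>
systemctl enable sbd<br>
pcs stonith create fence-sbd fence_sbd devices=&quot;/dev/disk/by-id/SHARED_DISK&quot;<br>
pcs property set stonith-enabled=true<br>
<br>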
<br>
Regards,<br>
Klaus<br>
<br>
&gt;<br>
&gt;&gt; I will test my RA again as suggested with no-quorum-policy=default.<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; One more doubt.<br>
&gt;&gt; Why do we see this in &#39;pcs property&#39;?<br>
&gt;&gt; last-lrm-refresh: 1504090367<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; Never seen this on a healthy cluster.<br>
&gt;&gt; From RHEL documentation:<br>
&gt;&gt; last-lrm-refresh<br>
&gt;&gt;<br>
&gt;&gt; Last refresh of the Local Resource Manager, given in units of<br>
&gt;&gt; seconds since epoch. Used for diagnostic purposes; not<br>
&gt;&gt; user-configurable.<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; Doesn&#39;t explain much.<br>
&gt; Whenever a cluster property changes, the cluster rechecks the current<br>
&gt; state to see if anything needs to be done. last-lrm-refresh is just a<br>
&gt; dummy property that the cluster uses to trigger that. It&#39;s set in<br>
&gt; certain rare circumstances when a resource cleanup is done. You should<br>
&gt; see a line in your logs like &quot;Triggering a refresh after ... deleted ...<br>
&gt; from the LRM&quot;. That might give some idea of why.<br>
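<br>
If you want to track down what set it, grepping for that message should work, and the<br>
property can be removed afterwards if it bothers you (log path taken from the corosync.conf<br>
above; treat the crm_attribute call as a best guess):<br>
<br>
# grep &quot;Triggering a refresh&quot; /var/log/cluster/corosync.log<br>
# crm_attribute --name last-lrm-refresh --delete<br>
<br>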
&gt;<br>
&gt;&gt; Also, does avg. CPU load impact resource monitoring?<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; Regards,<br>
&gt;&gt; Abhay<br>
&gt; Well, it could cause the monitor to take so long that it times out. The<br>
&gt; only direct effect of load on pacemaker is that the cluster might lower<br>
&gt; the number of agent actions that it can execute simultaneously.<br>
&gt;<br>
&gt;<br>
&gt;&gt; On Thu, 31 Aug 2017 at 20:11 Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt; wrote:<br>
&gt;&gt;<br>
&gt;&gt;         On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:<br>
&gt;&gt;         &gt; Hi,<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; I have a 2 node HA cluster configured on CentOS 7 with pcs<br>
&gt;&gt;         command.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; Below are the properties of the cluster :<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; # pcs property<br>
&gt;&gt;         &gt; Cluster Properties:<br>
&gt;&gt;         &gt;  cluster-infrastructure: corosync<br>
&gt;&gt;         &gt;  cluster-name: SVSDEHA<br>
&gt;&gt;         &gt;  cluster-recheck-interval: 2s<br>
&gt;&gt;         &gt;  dc-deadtime: 5<br>
&gt;&gt;         &gt;  dc-version: 1.1.15-11.el7_3.5-e174ec8<br>
&gt;&gt;         &gt;  have-watchdog: false<br>
&gt;&gt;         &gt;  last-lrm-refresh: 1504090367<br>
&gt;&gt;         &gt;  no-quorum-policy: ignore<br>
&gt;&gt;         &gt;  start-failure-is-fatal: false<br>
&gt;&gt;         &gt;  stonith-enabled: false<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; PFA the cib.<br>
&gt;&gt;         &gt; Also attached is the corosync.log around the time the below<br>
&gt;&gt;         issue<br>
&gt;&gt;         &gt; happened.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; After around 10 hrs and multiple failures, pacemaker stops<br>
&gt;&gt;         monitoring<br>
&gt;&gt;         &gt; resource on one of the nodes in the cluster.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; So even though the resource on other node fails, it is never<br>
&gt;&gt;         migrated<br>
&gt;&gt;         &gt; to the node on which the resource is not monitored.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; Wanted to know what could have triggered this and how to<br>
&gt;&gt;         avoid getting<br>
&gt;&gt;         &gt; into such scenarios.<br>
&gt;&gt;         &gt; I am going through the logs and couldn&#39;t find why this<br>
&gt;&gt;         happened.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; After this log the monitoring stopped.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; Aug 29 11:01:44 [16500] <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.sandvine.com</a><br>
&gt;&gt;         &gt; crmd:     info: process_lrm_event:   Result of monitor<br>
&gt;&gt;         operation for<br>
&gt;&gt;         &gt; SVSDEHA on <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.sandvine.com</a>: 0 (ok) |<br>
&gt;&gt;         call=538<br>
&gt;&gt;         &gt; key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013<br>
&gt;&gt;<br>
&gt;&gt;         Are you sure the monitor stopped? Pacemaker only logs<br>
&gt;&gt;         recurring monitors<br>
&gt;&gt;         when the status changes. Any successful monitors after this<br>
&gt;&gt;         wouldn&#39;t be<br>
&gt;&gt;         logged.<br>
&gt;&gt;<br>
&gt;&gt;         &gt; Below log says the resource is leaving the cluster.<br>
&gt;&gt;         &gt; Aug 29 11:01:44 [16499] <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.sandvine.com</a><br>
&gt;&gt;         &gt; pengine:     info: LogActions:  Leave   SVSDEHA:0<br>
&gt;&gt;          (Slave<br>
&gt;&gt;         &gt; <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.sandvine.com</a>)<br>
&gt;&gt;<br>
&gt;&gt;         This means that the cluster will leave the resource where it<br>
&gt;&gt;         is (i.e. it<br>
&gt;&gt;         doesn&#39;t need a start, stop, move, demote, promote, etc.).<br>
&gt;&gt;<br>
&gt;&gt;         &gt; Let me know if anything more is needed.<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; Regards,<br>
&gt;&gt;         &gt; Abhay<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt;<br>
&gt;&gt;         &gt; PS:&#39;pcs resource cleanup&#39; brought the cluster back into good<br>
&gt;&gt;         state.<br>
&gt;&gt;<br>
&gt;&gt;         There are a lot of resource action failures, so I&#39;m not sure<br>
&gt;&gt;         where the<br>
&gt;&gt;         issue is, but I&#39;m guessing it has to do with<br>
&gt;&gt;         migration-threshold=1 --<br>
&gt;&gt;         once a resource has failed once on a node, it won&#39;t be allowed<br>
&gt;&gt;         back on<br>
&gt;&gt;         that node until the failure is cleaned up. Of course you also<br>
&gt;&gt;         have<br>
&gt;&gt;         failure-timeout=1s, which should clean it up immediately, so<br>
&gt;&gt;         I&#39;m not<br>
&gt;&gt;         sure.<br>
&gt;&gt;<br>
&gt;&gt;         My gut feeling is that you&#39;re trying to do too many things at<br>
&gt;&gt;         once. I&#39;d<br>
&gt;&gt;         start over from scratch and proceed more slowly: first, set<br>
&gt;&gt;         &quot;two_node:<br>
&gt;&gt;         1&quot; in corosync.conf and let no-quorum-policy default in<br>
&gt;&gt;         pacemaker; then,<br>
&gt;&gt;         get stonith configured, tested, and enabled; then, test your<br>
&gt;&gt;         resource<br>
&gt;&gt;         agent manually on the command line to make sure it conforms to<br>
&gt;&gt;         the<br>
&gt;&gt;         expected return values<br>
&gt;&gt;         ( <a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf" rel="noreferrer" target="_blank">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf</a> ); then add your resource to the cluster without migration-threshold or failure-timeout, and work out any issues with frequent failures; then finally set migration-threshold and failure-timeout to reflect how you want recovery to proceed.<br>
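<br>
For the manual agent test, something along these lines usually does it (parameter names and<br>
values are placeholders for whatever SvsdeStateful actually takes):<br>
<br>
# export OCF_ROOT=/usr/lib/ocf<br>
# export OCF_RESKEY_&lt;param&gt;=&lt;value&gt;   # repeat for each agent parameter<br>
# /usr/lib/ocf/resource.d/pacemaker/SvsdeStateful monitor ; echo $?<br>
#   expect 0 (running as slave), 8 (running as master) or 7 (not running)<br>
# ocf-tester -n SVSDEHA /usr/lib/ocf/resource.d/pacemaker/SvsdeStateful<br>
<br>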
&gt;&gt;         --<br>
&gt;&gt;         Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;         _______________________________________________<br>
&gt;&gt;         Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
&gt;&gt;         <a href="http://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/mailman/listinfo/users</a><br>
&gt;&gt;<br>
&gt;&gt;         Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
&gt;&gt;         Getting started:<br>
&gt;&gt;         <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
&gt;&gt;         Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
<br>
<br>
_______________________________________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div>