<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="color:rgb(33,33,33)">Are you sure the monitor stopped? Pacemaker only logs recurring monitors<br></span><span style="color:rgb(33,33,33)">when the status changes. Any successful monitors after this wouldn&#39;t be<br></span><span style="color:rgb(33,33,33)">logged.</span>  </blockquote><div> </div><div>Yes, since there were no logs saying &quot;RecurringOp:  Start recurring monitor&quot; on the node after it had failed.</div><div>Also there were no logs for any actions pertaining to </div><div>The problem was that even though one node was failing, the resources were never moved to the other node (the node on which I suspect monitoring had stopped).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-size:12.8px">There are a lot of resource action failures, so I&#39;m not sure where the<br></span><span style="font-size:12.8px">issue is, but I&#39;m guessing it has to do with migration-threshold=1 --<br></span><span style="font-size:12.8px">once a resource has failed once on a node, it won&#39;t be allowed back on<br></span><span style="font-size:12.8px">that node until the failure is cleaned up. 
Of course you also have<br></span><span style="font-size:12.8px">failure-timeout=1s, which should clean it up immediately, so I&#39;m not<br></span><span style="font-size:12.8px">sure.</span></blockquote><div><br></div><div>migration-threshold=1</div><div><span style="font-size:12.8px">failure-timeout=1s</span><br></div><div><span style="font-size:12.8px">cluster-recheck-interval=2s</span></div><div><span style="font-size:12.8px"><br></span></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-size:12.8px">first, set &quot;two_node:<br></span><span style="font-size:12.8px">1&quot; in corosync.conf and let no-quorum-policy default in pacemaker</span></blockquote><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">This is already configured.</span></div><div><div><span style="font-size:12.8px"><font face="monospace, monospace"># cat /etc/corosync/corosync.conf</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">totem {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    version: 2</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    secauth: off</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    cluster_name: SVSDEHA</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    transport: udpu</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    token: 5000</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">}</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace"><br></font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">nodelist {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, 
monospace">    node {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">        ring0_addr: 2.0.0.10</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">        nodeid: 1</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    }</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace"><br></font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    node {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">        ring0_addr: 2.0.0.11</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">        nodeid: 2</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    }</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">}</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace"><br></font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">quorum {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    provider: corosync_votequorum</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    <b>two_node: 1</b></font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">}</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace"><br></font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">logging {</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    to_logfile: yes</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    logfile: /var/log/cluster/corosync.log</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">    
to_syslog: yes</font></span></div><div><span style="font-size:12.8px"><font face="monospace, monospace">}</font></span></div><div style="font-size:12.8px"><br></div></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span style="font-size:12.8px">let no-quorum-policy default in pacemaker; then,<br></span><span style="font-size:12.8px">get stonith configured, tested, and enabled</span></blockquote><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">By not configuring </span><span style="font-size:12.8px">no-quorum-policy, would it ignore quorum for a 2 node cluster? </span></div><div><span style="font-size:12.8px">For my use case I don&#39;t need stonith enabled. My intention is to have a highly available system all the time.</span></div><div>I will test my RA again as suggested with <span style="font-size:12.8px">no-quorum-policy=default.</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">One more question. </span></div><div><span style="font-size:12.8px">Why do we see this in &#39;pcs property&#39;?</span></div><div><font face="monospace, monospace">last-lrm-refresh: 1504090367</font><span style="font-size:12.8px"><br></span></div><div><font face="monospace, monospace"><br></font></div><div>I have never seen this on a healthy cluster.</div><div>From the RHEL documentation: </div><div><table class="gmail-lt-4-cols gmail-gt-14-rows" summary="Cluster Properties"><tbody><tr><td><code class="gmail-literal">last-lrm-refresh</code></td><td> </td><td><div class="gmail-para"><a class="gmail-indexterm" id="gmail-idm83017008"></a><a class="gmail-indexterm" id="gmail-idm83016208"></a><a class="gmail-indexterm" id="gmail-idm83015728"></a><a class="gmail-indexterm" id="gmail-idm83014608"></a><a class="gmail-indexterm" id="gmail-idm83013808"></a> Last refresh of the Local Resource Manager, given in units of seconds since epoch. 
Used for diagnostic purposes; not user-configurable. </div></td></tr></tbody></table></div><div><br></div><div>That doesn&#39;t explain much.</div><div><br></div><div>Also, does average CPU load impact resource monitoring?</div><div><br></div><div><font face="arial, helvetica, sans-serif">Regards,</font></div><div><font face="arial, helvetica, sans-serif">Abhay</font></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div class="gmail_quote"><div dir="ltr">On Thu, 31 Aug 2017 at 20:11 Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:<br>

&gt; Hi,<br>

&gt;<br>

&gt;<br>

&gt; I have a 2 node HA cluster configured on CentOS 7 with pcs command.<br>

&gt;<br>

&gt;<br>

&gt; Below are the properties of the cluster :<br>

&gt;<br>

&gt;<br>

&gt; # pcs property<br>

&gt; Cluster Properties:<br>

&gt;  cluster-infrastructure: corosync<br>

&gt;  cluster-name: SVSDEHA<br>

&gt;  cluster-recheck-interval: 2s<br>

&gt;  dc-deadtime: 5<br>

&gt;  dc-version: 1.1.15-11.el7_3.5-e174ec8<br>

&gt;  have-watchdog: false<br>

&gt;  last-lrm-refresh: 1504090367<br>

&gt;  no-quorum-policy: ignore<br>

&gt;  start-failure-is-fatal: false<br>

&gt;  stonith-enabled: false<br>

&gt;<br>

&gt;<br>

&gt; PFA the cib.<br>

&gt; Also attached is the corosync.log around the time the below issue<br>

&gt; happened.<br>

&gt;<br>

&gt;<br>

&gt; After around 10 hrs and multiple failures, pacemaker stops monitoring<br>

&gt; resource on one of the nodes in the cluster.<br>

&gt;<br>

&gt;<br>

&gt; So even though the resource on other node fails, it is never migrated<br>

&gt; to the node on which the resource is not monitored.<br>

&gt;<br>

&gt;<br>

&gt; Wanted to know what could have triggered this and how to avoid getting<br>

&gt; into such scenarios.<br>

&gt; I am going through the logs and couldn&#39;t find why this happened.<br>

&gt;<br>

&gt;<br>

&gt; After this log the monitoring stopped.<br>

&gt;<br>

&gt; Aug 29 11:01:44 [16500] <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.<wbr>sandvine.com</a><br>

&gt; crmd:     info: process_lrm_event:   Result of monitor operation for<br>

&gt; SVSDEHA on <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.<wbr>sandvine.com</a>: 0 (ok) | call=538<br>

&gt; key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013<br>

<br>

Are you sure the monitor stopped? Pacemaker only logs recurring monitors<br>

when the status changes. Any successful monitors after this wouldn&#39;t be<br>

logged.<br>

<br>

&gt; Below log says the resource is leaving the cluster.<br>

&gt; Aug 29 11:01:44 [16499] <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.<wbr>sandvine.com</a><br>

&gt; pengine:     info: LogActions:  Leave   SVSDEHA:0       (Slave<br>

&gt; <a href="http://TPC-D12-10-002.phaedrus.sandvine.com" rel="noreferrer" target="_blank">TPC-D12-10-002.phaedrus.<wbr>sandvine.com</a>)<br>

<br>

This means that the cluster will leave the resource where it is (i.e. it<br>

doesn&#39;t need a start, stop, move, demote, promote, etc.).<br>

<br>

&gt; Let me know if anything more is needed.<br>

&gt;<br>

&gt;<br>

&gt; Regards,<br>

&gt; Abhay<br>

&gt;<br>

&gt;<br>

&gt; PS:&#39;pcs resource cleanup&#39; brought the cluster back into good state.<br>

<br>

There are a lot of resource action failures, so I&#39;m not sure where the<br>

issue is, but I&#39;m guessing it has to do with migration-threshold=1 --<br>

once a resource has failed once on a node, it won&#39;t be allowed back on<br>

that node until the failure is cleaned up. Of course you also have<br>

failure-timeout=1s, which should clean it up immediately, so I&#39;m not<br>

sure.<br>

<br>

My gut feeling is that you&#39;re trying to do too many things at once. I&#39;d<br>

start over from scratch and proceed more slowly: first, set &quot;two_node:<br>

1&quot; in corosync.conf and let no-quorum-policy default in pacemaker; then,<br>

get stonith configured, tested, and enabled; then, test your resource<br>

agent manually on the command line to make sure it conforms to the<br>

expected return values<br>

( <a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf" rel="noreferrer" target="_blank">http://clusterlabs.org/doc/en-<wbr>US/Pacemaker/1.1-pcs/html-<wbr>single/Pacemaker_Explained/<wbr>index.html#ap-ocf</a> ); then add your resource to the cluster without migration-threshold or failure-timeout, and work out any issues with frequent failures; then finally set migration-threshold and failure-timeout to reflect how you want recovery to proceed.<br>

--<br>

Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;<br>

<br>

______________________________<wbr>_________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>

<a href="http://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.clusterlabs.org/<wbr>mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

</blockquote></div></div>