<div dir="ltr">Hi Ken,<div><br></div><div>Please see inline comments of last mail</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 8, 2015 at 8:25 PM, Pritam Kharat <span dir="ltr">&lt;<a href="mailto:pritam.kharat@oneconvergence.com" target="_blank">pritam.kharat@oneconvergence.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Ken,<div><br></div><div>Thanks for reply.</div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Thu, Oct 8, 2015 at 8:13 PM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>On 10/02/2015 01:47 PM, Pritam Kharat wrote:<br>
&gt; Hi,<br>
&gt;<br>
&gt; I have set up a ACTIVE/PASSIVE HA<br>
&gt;<br>
</span>&gt; *Issue 1) *<br>
&gt;<br>
&gt; *corosync.conf*  file is<br>
<div><div>&gt;<br>
&gt; # Please read the openais.conf.5 manual page<br>
&gt;<br>
&gt; totem {<br>
&gt;<br>
&gt;         version: 2<br>
&gt;<br>
&gt;         # How long before declaring a token lost (ms)<br>
&gt;         token: 10000<br>
&gt;<br>
&gt;         # How many token retransmits before forming a new configuration<br>
&gt;         token_retransmits_before_loss_const: 20<br>
&gt;<br>
&gt;         # How long to wait for join messages in the membership protocol (ms)<br>
&gt;         join: 10000<br>
&gt;<br>
&gt;         # How long to wait for consensus to be achieved before starting a<br>
&gt; new round of membership configuration (ms)<br>
&gt;         consensus: 12000<br>
&gt;<br>
&gt;         # Turn off the virtual synchrony filter<br>
&gt;         vsftype: none<br>
&gt;<br>
&gt;         # Number of messages that may be sent by one processor on receipt<br>
&gt; of the token<br>
&gt;         max_messages: 20<br>
&gt;<br>
&gt;         # Limit generated nodeids to 31-bits (positive signed integers)<br>
&gt;         clear_node_high_bit: yes<br>
&gt;<br>
&gt;         # Disable encryption<br>
&gt;         secauth: off<br>
&gt;<br>
&gt;         # How many threads to use for encryption/decryption<br>
&gt;         threads: 0<br>
&gt;<br>
&gt;         # Optionally assign a fixed node id (integer)<br>
&gt;         # nodeid: 1234<br>
&gt;<br>
&gt;         # This specifies the mode of redundant ring, which may be none,<br>
&gt; active, or passive.<br>
&gt;         rrp_mode: none<br>
&gt;         interface {<br>
&gt;                 # The following values need to be set based on your<br>
&gt; environment<br>
&gt;                 ringnumber: 0<br>
&gt;                 bindnetaddr: 192.168.101.0<br>
&gt; mcastport: 5405<br>
&gt;         }<br>
&gt;<br>
&gt;         transport: udpu<br>
&gt; }<br>
&gt;<br>
&gt; amf {<br>
&gt;         mode: disabled<br>
&gt; }<br>
&gt;<br>
&gt; quorum {<br>
&gt;         # Quorum for the Pacemaker Cluster Resource Manager<br>
&gt;         provider: corosync_votequorum<br>
&gt;         expected_votes: 1<br>
<br>
</div></div>If you&#39;re using a recent version of corosync, use &quot;two_node: 1&quot; instead<br>
of &quot;expected_votes: 1&quot;, and get rid of &quot;no-quorum-policy: ignore&quot; in the<br>
pacemaker cluster options.<br>
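
For reference, a minimal sketch of that quorum section for corosync 2.x
(note that enabling two_node also turns on wait_for_all by default in
votequorum):

    quorum {
            provider: corosync_votequorum
            two_node: 1
    }

and the corresponding cluster-option change from the crm shell, resetting
no-quorum-policy to its default of "stop":

    crm configure property no-quorum-policy=stop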
<div><div><br></div></div></blockquote></div></div><div>   -&gt; We are using corosync version 2.3.3. Do we above mentioned change for this version ?</div><div><div class="h5"><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>
&gt; }<br>
&gt;<br>
&gt;<br>
&gt; nodelist {<br>
&gt;<br>
&gt;         node {<br>
&gt;                 ring0_addr: 192.168.101.73<br>
&gt;         }<br>
&gt;<br>
&gt;         node {<br>
&gt;                 ring0_addr: 192.168.101.74<br>
&gt;         }<br>
&gt; }<br>
&gt;<br>
&gt; aisexec {<br>
&gt;         user:   root<br>
&gt;         group:  root<br>
&gt; }<br>
&gt;<br>
&gt;<br>
&gt; logging {<br>
&gt;         fileline: off<br>
&gt;         to_stderr: yes<br>
&gt;         to_logfile: yes<br>
&gt;         to_syslog: yes<br>
&gt;         syslog_facility: daemon<br>
&gt;         logfile: /var/log/corosync/corosync.log<br>
&gt;         debug: off<br>
&gt;         timestamp: on<br>
&gt;         logger_subsys {<br>
&gt;                 subsys: AMF<br>
&gt;                 debug: off<br>
&gt;                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6<br>
&gt;         }<br>
&gt; }<br>
&gt;<br>
&gt; And I have added 5 resources - 1 is VIP and 4 are upstart jobs<br>
&gt; Node names are configured as -&gt; sc-node-1(ACTIVE) and sc-node-2(PASSIVE)<br>
&gt; Resources are running on ACTIVE node<br>
&gt;<br>
&gt; Default cluster properties -<br>
&gt;<br>
&gt;       &lt;cluster_property_set id=&quot;cib-bootstrap-options&quot;&gt;<br>
&gt;         &lt;nvpair id=&quot;cib-bootstrap-options-dc-version&quot; name=&quot;dc-version&quot;<br>
&gt; value=&quot;1.1.10-42f2063&quot;/&gt;<br>
&gt;         &lt;nvpair id=&quot;cib-bootstrap-options-cluster-infrastructure&quot;<br>
&gt; name=&quot;cluster-infrastructure&quot; value=&quot;corosync&quot;/&gt;<br>
&gt;         &lt;nvpair name=&quot;no-quorum-policy&quot; value=&quot;ignore&quot;<br>
&gt; id=&quot;cib-bootstrap-options-no-quorum-policy&quot;/&gt;<br>
&gt;         &lt;nvpair name=&quot;stonith-enabled&quot; value=&quot;false&quot;<br>
&gt; id=&quot;cib-bootstrap-options-stonith-enabled&quot;/&gt;<br>
&gt;         &lt;nvpair name=&quot;cluster-recheck-interval&quot; value=&quot;3min&quot;<br>
&gt; id=&quot;cib-bootstrap-options-cluster-recheck-interval&quot;/&gt;<br>
&gt;         &lt;nvpair name=&quot;default-action-timeout&quot; value=&quot;120s&quot;<br>
&gt; id=&quot;cib-bootstrap-options-default-action-timeout&quot;/&gt;<br>
&gt;       &lt;/cluster_property_set&gt;<br>
&gt;<br>
&gt;<br>
&gt; But sometimes after 2-3 migrations from ACTIVE to STANDBY and then from<br>
&gt; STANDBY to ACTIVE,<br>
&gt; both nodes become OFFLINE and Current DC becomes None, I have disabled the<br>
&gt; stonith property and even quorum is ignored<br>
<br>
</div></div>Disabling stonith isn&#39;t helping you. The cluster needs stonith to<br>
recover from difficult situations, so it&#39;s easier to get into weird<br>
states like this without it.<br>
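
As an illustration only, fencing for libvirt-managed VMs could look
something like the sketch below (the fence_virsh agent, hypervisor
address, and credentials are all assumptions; substitute whatever fence
agent matches your platform):

    # one fence device per node, banned from running on its own target
    crm configure primitive fence-sc-node-1 stonith:fence_virsh \
            params ipaddr=<hypervisor-ip> login=<user> passwd=<password> \
            port=sc-node-1 pcmk_host_list=sc-node-1 \
            op monitor interval=60s
    crm configure location l-fence-sc-node-1 fence-sc-node-1 -inf: sc-node-1
    crm configure property stonith-enabled=true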
<span><br>
&gt; root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status<br>
&gt; Last updated: Sat Oct  3 00:01:40 2015<br>
&gt; Last change: Fri Oct  2 23:38:28 2015 via crm_resource on sc-node-1<br>
&gt; Stack: corosync<br>
&gt; Current DC: NONE<br>
&gt; 2 Nodes configured<br>
&gt; 5 Resources configured<br>
&gt;<br>
&gt; OFFLINE: [ sc-node-1 sc-node-2 ]<br>
&gt;<br>
&gt; What is going wrong here ? What is the reason for node Current DC becoming<br>
&gt; None suddenly ? Is corosync.conf okay ? Are default cluster properties fine<br>
&gt; ? Help will be appreciated.<br>
<br>
</span>I&#39;d recommend seeing how the problem behaves with stonith enabled, but<br>
in any case you&#39;ll need to dive into the logs to figure what starts the<br>
chain of events.<br>
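
When both nodes drop to OFFLINE, it may also help to compare quorum and
membership state on each node and to search the logfile configured in
corosync.conf above (the grep pattern is only a starting point):

    corosync-quorumtool -s
    grep -iE 'error|totem|quorum' /var/log/corosync/corosync.log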
<br></blockquote><div><br></div></div></div><div>   -&gt; We are seeing this issue when we try rebooting the vms</div><div><div class="h5"><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
&gt;<br>
&gt; *Issue 2)*<br>
<span>&gt; Command used to add upstart job is<br>
&gt;<br>
&gt; crm configure primitive service upstart:service meta allow-migrate=true<br>
&gt; migration-threshold=5 failure-timeout=30s op monitor interval=15s<br>
&gt;  timeout=60s<br>
&gt;<br>
&gt; But still sometimes I see fail count going to INFINITY. Why ? How can we<br>
&gt; avoid it ? Resource should have migrated as soon as it reaches migration<br>
&gt; threshold.<br>
&gt;<br>
&gt; * Node sc-node-2:<br>
&gt;    service: migration-threshold=5 fail-count=1000000 last-failure=&#39;Fri Oct<br>
&gt;  2 23:38:53 2015&#39;<br>
&gt;    service1: migration-threshold=5 fail-count=1000000 last-failure=&#39;Fri Oct<br>
&gt;  2 23:38:53 2015&#39;<br>
&gt;<br>
&gt; Failed actions:<br>
&gt;     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,<br>
&gt; last-rc-change=Fri Oct  2 23:38:53 2015<br>
&gt; , queued=0ms, exec=0ms<br>
&gt; ): unknown error<br>
&gt;     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,<br>
&gt; last-rc-change=Fri Oct  2 23:38:53 2015<br>
&gt; , queued=0ms, exec=0ms<br>
<br>
</span>migration-threshold is used for monitor failures, not (by default) start<br>
or stop failures.<br>
<br>
This is a start failure, which (by default) makes the fail-count go to<br>
infinity. The rationale is that a monitor failure indicates some sort of<br>
temporary error, but failing to start could well mean that something is<br>
wrong with the installation or configuration.<br>
<br>
You can tell the cluster to apply migration-threshold to start failures<br>
too, by setting the start-failure-is-fatal=false cluster option.<br>
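
For example, with the crm shell (resource and node names taken from the
status output above):

    # apply migration-threshold to start failures as well
    crm configure property start-failure-is-fatal=false

    # then clear the accumulated fail-counts so the resources may run again
    crm resource cleanup service sc-node-2
    crm resource cleanup service1 sc-node-2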
<br>
<br>
_______________________________________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div></div></div><br><br clear="all"><span class=""><div><br></div>-- <br><div>Thanks and Regards,<br>Pritam Kharat.<br></div>
</span></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Thanks and Regards,<br>Pritam Kharat.<br></div>
</div>