<div dir="ltr">I checked the node_state of the node that is killed and brought back (test3). in_ccm == true and crmd == online <span style="font-size:12.8px">for a second or two between &quot;pcs cluster start test3&quot; &quot;monitor&quot;:</span><div><span style="font-size:12.8px"><br></span></div><div><div><span style="font-size:12.8px">    &lt;node_state id=&quot;3&quot; uname=&quot;test3&quot; in_ccm=&quot;true&quot; crmd=&quot;online&quot; crm-debug-origin=&quot;peer_update_<wbr>callback&quot; join=&quot;member&quot; expected=&quot;member&quot;&gt;</span></div><div style="font-size:12.8px"><br></div></div><div><span style="font-size:12.8px"><br></span><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin <span dir="ltr">&lt;<a href="mailto:ludovicvp@gmail.com" target="_blank">ludovicvp@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Yes I haven&#39;t been using the &quot;nodes&quot; element in the XML, only the &quot;resources&quot; element. I couldn&#39;t find &quot;<span style="font-size:12.8px">node_state&quot; elements or attributes in the XML, so after some searching I found that it is in the CIB that can be gotten with &quot;pcs cluster cib foo.xml&quot;. I will start exploring this as an alternative to  crm_mon/&quot;pcs status&quot;.</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">However I still find what happens to be confusing, so below I try</span> to better explain what I see:</div><div><br></div><div><br></div><div>Before &quot;pcs cluster start test3&quot; at 10:45:36.362 (test3 has been HW shutdown a minute ago):</div><div><br></div><div>crm_mon -1:</div><div><br></div><div>    Stack: corosync</div><div>    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum</div><div>    Last updated: Fri May 12 10:45:36 2017          Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1</div><div><br></div><div>    3 nodes and 4 resources configured</div><div><br></div><div>    Online: [ test1 test2 ]</div><div>    OFFLINE: [ test3 ]</div><div><br></div><div>    Active resources:</div><div><br></div><div>     Master/Slave Set: pgsql-ha [pgsqld]</div><div>         Masters: [ test1 ]</div><div>         Slaves: [ test2 ]</div><div>     pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1</div><div><br></div><div>     </div><div>crm_mon -X:</div><span class="m_-1029886906927212801gmail-"><div><br></div><div>    &lt;resources&gt;</div><div>    &lt;clone id=&quot;pgsql-ha&quot; multi_state=&quot;true&quot; unique=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; &gt;</div></span><span class="m_-1029886906927212801gmail-"><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Master&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; f</div><div>    ailed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Slave&quot; 
active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; fa</div><div>    iled=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test2&quot; id=&quot;2&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div></span><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Stopped&quot; active=&quot;false&quot; orphaned=&quot;false&quot; managed=&quot;true&quot;</div><div>    failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;0&quot; /&gt;</div><div>    &lt;/clone&gt;</div><div>    &lt;resource id=&quot;pgsql-master-ip&quot; resource_agent=&quot;ocf::heartbeat<wbr>:IPaddr2&quot; role=&quot;Started&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed</div><span class="m_-1029886906927212801gmail-"><div>    =&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>        &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>    &lt;/resource&gt;</div></span><div>    &lt;/resources&gt;</div><div><br></div><div><br></div><div><br></div><div>At 10:45:39.440, after &quot;pcs cluster start test3&quot;, before first &quot;monitor&quot; on test3 (this is where I can&#39;t seem to know that resources on test3 are down):</div><div><br></div><div>crm_mon -1:</div><div><br></div><div>    Stack: corosync</div><div>    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum</div><div>    Last updated: Fri May 12 10:45:39 2017          Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1</div><div><br></div><div>    3 nodes and 4 resources configured</div><div><br></div><div>    Online: [ test1 test2 test3 ]</div><div><br></div><div>    Active resources:</div><div><br></div><div>     Master/Slave Set: pgsql-ha [pgsqld]</div><div>         Masters: [ test1 ]</div><div>         Slaves: [ test2 test3 ]</div><div>     pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1</div><div><br></div><div><br></div><div>crm_mon -X:</div><span class="m_-1029886906927212801gmail-"><div><br></div><div>    &lt;resources&gt;</div><div>    &lt;clone id=&quot;pgsql-ha&quot; multi_state=&quot;true&quot; unique=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; &gt;</div></span><span class="m_-1029886906927212801gmail-"><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Master&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Slave&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test2&quot; id=&quot;2&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Slave&quot; 
active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test3&quot; id=&quot;3&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div></span><div>    &lt;/clone&gt;</div><div>    &lt;resource id=&quot;pgsql-master-ip&quot; resource_agent=&quot;ocf::heartbeat<wbr>:IPaddr2&quot; role=&quot;Started&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><span class="m_-1029886906927212801gmail-"><div>        &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>    &lt;/resource&gt;</div></span><div>    &lt;/resources&gt;</div><div><br></div><div><br></div><div>    </div><div>At 10:45:41.606, after first &quot;monitor&quot; on test3 (I can now tell the resources on test3 are not ready):</div><div><br></div><div>crm_mon -1:</div><div><br></div><div>    Stack: corosync</div><div>    Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum</div><div>    Last updated: Fri May 12 10:45:41 2017          Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1</div><div><br></div><div>    3 nodes and 4 resources configured</div><div><br></div><div>    Online: [ test1 test2 test3 ]</div><div><br></div><div>    Active resources:</div><div><br></div><div>     Master/Slave Set: pgsql-ha [pgsqld]</div><div>         Masters: [ test1 ]</div><div>         Slaves: [ test2 ]</div><div>     pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started test1</div><div><br></div><div><br></div><div>crm_mon -X:</div><span class="m_-1029886906927212801gmail-"><div><br></div><div>    &lt;resources&gt;</div><div>    &lt;clone id=&quot;pgsql-ha&quot; multi_state=&quot;true&quot; unique=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; &gt;</div></span><span class="m_-1029886906927212801gmail-"><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Master&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Slave&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><div>            &lt;node name=&quot;test2&quot; id=&quot;2&quot; cached=&quot;false&quot;/&gt;</div><div>        &lt;/resource&gt;</div></span><div>        &lt;resource id=&quot;pgsqld&quot; resource_agent=&quot;ocf::heartbeat<wbr>:pgha&quot; role=&quot;Stopped&quot; active=&quot;false&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;0&quot; /&gt;</div><div>    &lt;/clone&gt;</div><div>    &lt;resource id=&quot;pgsql-master-ip&quot; resource_agent=&quot;ocf::heartbeat<wbr>:IPaddr2&quot; role=&quot;Started&quot; active=&quot;true&quot; orphaned=&quot;false&quot; managed=&quot;true&quot; 
failed=&quot;false&quot; failure_ignored=&quot;false&quot; nodes_running_on=&quot;1&quot; &gt;</div><span class="m_-1029886906927212801gmail-"><div>        &lt;node name=&quot;test1&quot; id=&quot;1&quot; cached=&quot;false&quot;/&gt;</div><div>    &lt;/resource&gt;</div></span><div>    &lt;/resources&gt;</div></div><div class="gmail_extra"><div><div class="m_-1029886906927212801gmail-h5"><br><div class="gmail_quote">On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_-1029886906927212801gmail-m_1902586051693798442HOEnZb"><div class="m_-1029886906927212801gmail-m_1902586051693798442h5">On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:<br>
>>> Hi
>>> I translated a PostgreSQL multi-state RA
>>> (https://github.com/dalibo/PAF) into Python
>>> (https://github.com/ulodciv/deploy_cluster), and I have been editing it
>>> heavily.
>>>
>>> In parallel I am writing unit tests and functional tests.
>>>
>>> I am having an issue with a functional test that abruptly powers off a
>>> slave named "test3" (a hot standby PG instance). Later on I start the
>>> slave back up. Once it is started, I run "pcs cluster start test3". And
>>> this is where I start having a problem.
>>>
>>> I check the output of "pcs status xml" every second until test3 is said
>>> to be ready as a slave again. In the following I assume that test3 is
>>> ready as a slave:
>>>
>>>     <nodes>
>>>         <node name="test1" id="1" online="true" standby="false"
>>>             standby_onfail="false" maintenance="false" pending="false"
>>>             unclean="false" shutdown="false" expected_up="true" is_dc="false"
>>>             resources_running="2" type="member" />
>>>         <node name="test2" id="2" online="true" standby="false"
>>>             standby_onfail="false" maintenance="false" pending="false"
>>>             unclean="false" shutdown="false" expected_up="true" is_dc="true"
>>>             resources_running="1" type="member" />
>>>         <node name="test3" id="3" online="true" standby="false"
>>>             standby_onfail="false" maintenance="false" pending="false"
>>>             unclean="false" shutdown="false" expected_up="true" is_dc="false"
>>>             resources_running="1" type="member" />
>>>     </nodes>
>>>
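
A minimal sketch of the once-a-second poll described above (in Python, like the rest of the project), treating online="true" in the <nodes> section as "ready"; as the reply below explains, that is exactly the check that turns out to be misleading. The function name and timeout value are illustrative, not from the original code:

    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def wait_until_online(node_name, timeout=60):
        """Poll "pcs status xml" until the node reports online="true"
        in the <nodes> section, or give up after timeout seconds."""
        for _ in range(timeout):
            out = subprocess.run(["pcs", "status", "xml"],
                                 capture_output=True, check=True,
                                 text=True).stdout
            # Look only at the <nodes> section of the status XML.
            for nodes in ET.fromstring(out).iter("nodes"):
                for node in nodes:
                    if (node.get("name") == node_name
                            and node.get("online") == "true"):
                        return True
            time.sleep(1)
        return False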
>> The <nodes> section says nothing about the current state of the nodes.
>> Look at the <node_state> entries for that: in_ccm means the cluster
>> stack level, and crmd means the Pacemaker level -- both need to be up.
>>
>>>     <resources>
>>>         <clone id="pgsql-ha" multi_state="true" unique="false"
>>>             managed="true" failed="false" failure_ignored="false" >
>>>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>>>                 role="Slave" active="true" orphaned="false" managed="true"
>>>                 failed="false" failure_ignored="false" nodes_running_on="1" >
>>>                 <node name="test3" id="3" cached="false"/>
>>>             </resource>
>>>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>>>                 role="Master" active="true" orphaned="false" managed="true"
>>>                 failed="false" failure_ignored="false" nodes_running_on="1" >
>>>                 <node name="test1" id="1" cached="false"/>
>>>             </resource>
>>>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>>>                 role="Slave" active="true" orphaned="false" managed="true"
>>>                 failed="false" failure_ignored="false" nodes_running_on="1" >
>>>                 <node name="test2" id="2" cached="false"/>
>>>             </resource>
>>>         </clone>
>>>
>>> By "ready to go" I mean that upon running "pcs cluster start test3", the
>>> following occurs before test3 appears ready in the XML:
>>>
>>> pcs cluster start test3
>>> monitor          -> RA returns unknown error (1)
>>> notify/pre-stop  -> RA returns ok (0)
>>> stop             -> RA returns ok (0)
>>> start            -> RA returns ok (0)
>>>
>>> The problem I have is that between "pcs cluster start test3" and
>>> "monitor", it seems that the XML returned by "pcs status xml" says test3
>>> is ready (the XML extract above is what I get at that moment). Once
>>> "monitor" occurs, the returned XML shows test3 to be offline, and not
>>> until the start is finished do I once again have test3 shown as ready.
>>>
>>> Am I getting anything wrong? Is there a simpler or better way to check
>>> if test3 is fully functional again, i.e. that the OCF start was
>>> successful?
>>>
>>> Thanks
>>>
>>> Ludovic
>>
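
As for a simpler or better way: per the rest of this thread there does not seem to be a single flag to poll, but the two checks discussed here can be combined, requiring that the node be up in its node_state entry and that crm_mon report an active, unfailed instance of the resource on it. A sketch under those assumptions (with the caveat from the top of this message: both can look good for a second or two before the first monitor runs):

    import subprocess
    import xml.etree.ElementTree as ET

    def _xml(cmd):
        """Run a command and parse its stdout as XML."""
        out = subprocess.run(cmd, capture_output=True, check=True,
                             text=True).stdout
        return ET.fromstring(out)

    def slave_ready(node_name, resource_id="pgsqld"):
        """True if the node is up per its node_state entry and an active,
        unfailed instance of the resource is running on it."""
        cib = _xml(["cibadmin", "--query"])
        up = any(s.get("uname") == node_name
                 and s.get("in_ccm") == "true"
                 and s.get("crmd") == "online"
                 for s in cib.iter("node_state"))
        if not up:
            return False
        for res in _xml(["crm_mon", "-X"]).iter("resource"):
            if (res.get("id") == resource_id
                    and res.get("active") == "true"
                    and res.get("failed") == "false"
                    and any(n.get("name") == node_name
                            for n in res.iter("node"))):
                return True
        return False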
>
> --
> Ludovic Vaugeois-Pepin

--
Ludovic Vaugeois-Pepin