<div dir="ltr">I will look into adding alerts, thanks for the info. <div><br></div><div>For now I introduced a 5 seconds sleep after &quot;pcs cluster start ...&quot;. It seems enough for <span style="font-size:12.8px">monitor to be run.</span></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Another possibility you might want to look into is alerts. Pacemaker can<br>
On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <kgaillot@redhat.com> wrote:

Another possibility you might want to look into is alerts. Pacemaker can
call a script of your choosing whenever a resource is started or
stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3). in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" and "monitor":
>
>     <node_state id="3" uname="test3" in_ccm="true" crmd="online"
> crm-debug-origin="peer_update_callback" join="member" expected="member">
>
>
>
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>
>     Yes, I haven't been using the "nodes" element in the XML, only the
>     "resources" element. I couldn't find "node_state" elements or
>     attributes in the XML, so after some searching I found that they are
>     in the CIB, which can be obtained with "pcs cluster cib foo.xml". I
>     will start exploring this as an alternative to crm_mon/"pcs status".
>
>
>     However, I still find what happens confusing, so below I try to
>     explain better what I see:
>
>
>     Before "pcs cluster start test3" at 10:45:36.362 (test3 was HW
>     shut down a minute earlier):
>
>     crm_mon -1:
>
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:36 2017          Last change: Fri
>     May 12 09:18:13 2017 by root via crm_attribute on test1
>
>         3 nodes and 4 resources configured
>
>         Online: [ test1 test2 ]
>         OFFLINE: [ test3 ]
>
>         Active resources:
>
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
>
>
>     crm_mon -X:
>
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Stopped" active="false" orphaned="false" managed="true"
>         failed="false" failure_ignored="false" nodes_running_on="0" />
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
>
>
>
>     At 10:45:39.440, after "pcs cluster start test3", before the first
>     "monitor" on test3 (this is where I can't seem to tell that the
>     resources on test3 are down):
>
>     crm_mon -1:
>
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:39 2017          Last change: Fri
>     May 12 10:45:39 2017 by root via crm_attribute on test1
>
>         3 nodes and 4 resources configured
>
>         Online: [ test1 test2 test3 ]
>
>         Active resources:
>
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 test3 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
>
>
>     crm_mon -X:
>
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test3" id="3" cached="false"/>
>             </resource>
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
>
>
>
>     At 10:45:41.606, after the first "monitor" on test3 (I can now tell
>     that the resources on test3 are not ready):
>
>     crm_mon -1:
>
>         Stack: corosync
>         Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
>     partition with quorum
>         Last updated: Fri May 12 10:45:41 2017          Last change: Fri
>     May 12 10:45:39 2017 by root via crm_attribute on test1
>
>         3 nodes and 4 resources configured
>
>         Online: [ test1 test2 test3 ]
>
>         Active resources:
>
>          Master/Slave Set: pgsql-ha [pgsqld]
>              Masters: [ test1 ]
>              Slaves: [ test2 ]
>          pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started
>     test1
>
>
>     crm_mon -X:
>
>         <resources>
>         <clone id="pgsql-ha" multi_state="true" unique="false"
>     managed="true" failed="false" failure_ignored="false" >
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Master" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test1" id="1" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Slave" active="true" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="1" >
>                 <node name="test2" id="2" cached="false"/>
>             </resource>
>             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>     role="Stopped" active="false" orphaned="false" managed="true"
>     failed="false" failure_ignored="false" nodes_running_on="0" />
>         </clone>
>         <resource id="pgsql-master-ip"
>     resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true"
>     orphaned="false" managed="true" failed="false"
>     failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" id="1" cached="false"/>
>         </resource>
>         </resources>
>
>     On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot
>     <kgaillot@redhat.com> wrote:
>
>         On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
>         > Hi
>         > I translated a Postgresql multi state RA
>         > (https://github.com/dalibo/PAF) into Python
>         > (https://github.com/ulodciv/deploy_cluster), and I have been
>         > editing it heavily.
>         >
>         > In parallel I am writing unit tests and functional tests.
>         >
>         > I am having an issue with a functional test that abruptly
>         > powers off a slave named "host3" (a hot standby PG instance).
>         > Later on I start the slave back up. Once it is started, I run
>         > "pcs cluster start host3". And this is where I start having a
>         > problem.
>         >
>         > I check every second the output of "pcs status xml" until
>         > host3 is said to be ready as a slave again. In the following I
>         > assume that test3 is ready as a slave:
>         >
>         >     <nodes>
>         >         <node name="test1" id="1" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         > resources_running="2" type="member" />
>         >         <node name="test2" id="2" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="true"
>         > resources_running="1" type="member" />
>         >         <node name="test3" id="3" online="true" standby="false"
>         > standby_onfail="false" maintenance="false" pending="false"
>         > unclean="false" shutdown="false" expected_up="true" is_dc="false"
>         > resources_running="1" type="member" />
>         >     </nodes>
>
>         The <nodes> section says nothing about the current state of the
>         nodes. Look at the <node_state> entries for that. in_ccm means
>         the cluster stack level, and crmd means the pacemaker level --
>         both need to be up.
>
>         >     <resources>
>         >         <clone id="pgsql-ha" multi_state="true" unique="false"
>         > managed="true" failed="false" failure_ignored="false" >
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Slave" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test3" id="3" cached="false"/>
>         >             </resource>
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Master" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test1" id="1" cached="false"/>
>         >             </resource>
>         >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha"
>         > role="Slave" active="true" orphaned="false" managed="true"
>         > failed="false" failure_ignored="false" nodes_running_on="1" >
>         >                 <node name="test2" id="2" cached="false"/>
>         >             </resource>
>         >         </clone>
>         > By "ready to go" I mean that upon running "pcs cluster start
>         > test3", the following occurs before test3 appears ready in the
>         > XML:
>         >
>         > pcs cluster start test3
>         > monitor          -> RA returns unknown error (1)
>         > notify/pre-stop  -> RA returns ok (0)
>         > stop             -> RA returns ok (0)
>         > start            -> RA returns ok (0)
>         >
>         > The problem I have is that between "pcs cluster start test3"
>         > and "monitor", it seems that the XML returned by "pcs status
>         > xml" says test3 is ready (the XML extract above is what I get
>         > at that moment). Once "monitor" occurs, the returned XML shows
>         > test3 to be offline, and not until the start is finished do I
>         > once again have test3 shown as ready.
>         >
>         > Am I getting anything wrong? Is there a simpler or better way
>         > to check whether test3 is fully functional again, i.e. the OCF
>         > start was successful?
>         >
>         > Thanks
>         >
>         > Ludovic

--
Ludovic Vaugeois-Pepin