<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 8, 2016 at 7:03 AM, Ferenc Wágner <span dir="ltr">&lt;<a href="mailto:wferi@niif.hu" target="_blank">wferi@niif.hu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>&gt; writes:<br>
<br>
&gt; On 03/07/2016 07:31 AM, Ferenc Wágner wrote:<br>
&gt;<br>
</span><span class="">&gt;&gt; 12:55:13 vhbl07 crmd[8484]: notice: Transition aborted by vm-eiffel_monitor_60000 &#39;create&#39; on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)<br>
&gt;<br>
&gt; That means the action was initiated by a different node (the previous DC<br>
&gt; presumably),</span></blockquote><div><br></div><div>I suspect s/previous/other/</div><div><br></div><div>With a stuck machine, it&#39;s entirely possible that the other nodes elected a new leader.</div><div>Would I be right in guessing that fencing is disabled?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""> so the new DC wants to recalculate everything.<br>
<br>
</span>Time travel was sort of possible in that situation, and recurring<br>
monitor operations are not logged, so this could indeed have happened.  The main<br>
thing is that it wasn&#39;t mishandled.<br>
<span class=""><br>
&gt;&gt; recovery actions turned into start actions for the resources stopped<br>
&gt;&gt; during the previous transition.  However, almost all other recovery<br>
&gt;&gt; actions just disappeared without any comment.  This was actually<br>
&gt;&gt; correct, but I really wonder why the cluster decided to paper over<br>
&gt;&gt; the previous monitor operation timeouts.  Maybe the operations<br>
&gt;&gt; finished meanwhile and got accounted somehow, just not logged?<br>
&gt;<br>
&gt; I&#39;m not sure why the PE decided recovery was not necessary. Operation<br>
&gt; results wouldn&#39;t be accepted without being logged.<br>
<br>
</span>At which logging level?  I can&#39;t see recurring monitor operation logs in<br>
syslog (at default logging level: notice) nor in /var/log/pacemaker.log<br>
(which contains info level messages as well).<br></blockquote><div><br></div><div>The DC will log that the recurring monitor was successfully started, but to avoid noise it doesn&#39;t log subsequent successes.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
However, the info level logs contain more &quot;Transition aborted&quot; lines, as<br>
if only the first of them got logged with notice level.  This would make<br>
sense, since the later ones don&#39;t make any difference to an already<br>
aborted transition, so they aren&#39;t that important.  And in fact such<br>
lines were suppressed from the syslog I checked first, for example:<br>
<br>
12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: --- 0.613.120 2<br>
12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     Diff: +++ 0.613.121 (null)<br>
12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     +  /cib:  @num_updates=121<br>
12:55:39 [8479] vhbl07        cib:     info: cib_perform_op:     ++ /cib/status/node_state[@id=&#39;167773707&#39;]/lrm[@id=&#39;167773707&#39;]/lrm_resources/lrm_resource[@id=&#39;vm-elm&#39;]:  &lt;lrm_rsc_op id=&quot;vm-elm_monitor_60000&quot; operation_key=&quot;vm-elm_monitor_60000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;do_update_resource&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; transition-magic=&quot;0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; on_node=&quot;vhbl05&quot; call-id=&quot;645&quot; rc-code=&quot;0&quot; op-st<br>
12:55:39 [8479] vhbl07        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=vhbl05/crmd/362, version=0.613.121)<br>
12:55:39 [8484] vhbl07       crmd:     info: abort_transition_graph:     Transition aborted by vm-elm_monitor_60000 &#39;create&#39; on vhbl05: Foreign event (magic=0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.121, source=process_graph_event:600, 0)<br>
12:55:39 [8484] vhbl07       crmd:     info: process_graph_event:        Detected action (0.473) vm-elm_monitor_60000.645=ok: initiated by a different node<br>
<br>
I can very much imagine this cancelling the FAILED state induced by a<br>
monitor timeout like:<br>
<br>
12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                               &lt;lrm_resource id=&quot;vm-elm&quot; type=&quot;TransientDomain&quot; class=&quot;ocf&quot; provider=&quot;niif&quot;&gt;<br>
12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                                 &lt;lrm_rsc_op id=&quot;vm-elm_last_failure_0&quot; operation_key=&quot;vm-elm_monitor_60000&quot; operation=&quot;monitor&quot; crm-debug-origin=&quot;build_active_RAs&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; transition-magic=&quot;2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; on_node=&quot;vhbl05&quot; call-id=&quot;645&quot; rc-code=&quot;1&quot; op-status=&quot;2&quot; interval=&quot;60000&quot; last-rc-change=&quot;1456833279&quot; exe<br>
12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                                 &lt;lrm_rsc_op id=&quot;vm-elm_last_0&quot; operation_key=&quot;vm-elm_start_0&quot; operation=&quot;start&quot; crm-debug-origin=&quot;build_active_RAs&quot; crm_feature_set=&quot;3.0.10&quot; transition-key=&quot;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; transition-magic=&quot;0:0;472:0:0:634eef05-39c1-4093-94d4-8d624b423bb7&quot; on_node=&quot;vhbl05&quot; call-id=&quot;602&quot; rc-code=&quot;0&quot; op-status=&quot;0&quot; interval=&quot;0&quot; last-run=&quot;1456091121&quot; last-rc-change=&quot;1456091121&quot; e<br>
12:54:52 [8479] vhbl07        cib:     info: cib_perform_op:     ++                                               &lt;/lrm_resource&gt;<br>
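</blockquote><div><br></div><div>For what it&#39;s worth, those magic strings can be pulled apart by hand. A rough sketch (not a Pacemaker API, just string splitting), assuming the usual op-status:rc;action:transition:target-rc:uuid layout, which matches the op-status/rc-code/transition-key attributes in the log lines above:</div><div><br></div>

```python
# Rough sketch: split a Pacemaker transition-magic string into its parts,
# assuming the layout "op-status:rc;action:transition:target-rc:uuid".
def parse_magic(magic):
    status_part, key = magic.split(";", 1)
    op_status, rc_code = (int(f) for f in status_part.split(":"))
    action, transition, target_rc, uuid = key.split(":", 3)
    return {
        "op-status": op_status,   # executor status: 0 = done, 2 = timed out
        "rc-code": rc_code,       # agent exit code: 0 = ok, 1 = generic error
        "key": key,               # the transition key proper
        "action": int(action),
        "transition": int(transition),
        "target-rc": int(target_rc),
        "uuid": uuid,
    }

# transition-magic of the timed-out monitor record (call-id 645)...
timeout = parse_magic("2:1;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7")
# ...and of the later successful result recorded for the same operation
success = parse_magic("0:0;473:0:0:634eef05-39c1-4093-94d4-8d624b423bb7")

# Identical keys: both records describe action 473 of transition 0,
# i.e. the very same monitor instance, just with different outcomes.
assert timeout["key"] == success["key"]
print(timeout["op-status"], timeout["rc-code"], "->", success["op-status"])
```

<div><br></div><div>Here op-status 2 marks the timed-out record and 0 the later success; the identical embedded keys are why the two log excerpts look like the same operation reported twice.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">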
<br>
The transition-keys match, does this mean that the above is a late<br>
result from the monitor operation which was considered timed-out<br>
previously?  How did it reach vhbl07, if the DC at that time was vhbl03?</blockquote><div><br></div><div>Everything goes into the CIB (the replicated datastore), and the DC(s) get notified.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
&gt; The pe-input files from the transitions around here should help.<br>
<br>
</span>They are available.  What shall I look for?<br>
<span class=""><br>
&gt;&gt; Basically, the cluster responded beyond my expectations, sparing lots of<br>
&gt;&gt; unnecessary recoveries or fencing.  I&#39;m happy, thanks for this wonderful<br>
&gt;&gt; software!  But I&#39;m left with these &quot;Processing failed op monitor&quot;<br>
&gt;&gt; warnings emitted every 15 minutes (timer pops).  Is it safe and clever<br>
&gt;&gt; to cleanup the affected resources?  Would that get rid of them without<br>
&gt;&gt; invoking other effects, like recoveries for example?<br>
&gt;<br>
&gt; That&#39;s normal; it&#39;s how the cluster maintains the effect of a failure<br>
&gt; that has not been cleared. The logs can be confusing, because it&#39;s not<br>
&gt; apparent from that message alone whether the failure is new or old.<br>
<br>
</span>Ah, do you mean that these are the same thing that appears after &quot;Failed<br>
Actions:&quot; at the end of the crm_mon output?  They certainly match, and<br>
the logs are confusing indeed.<br>
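</blockquote><div><br></div><div>They are the same records, yes: both crm_mon&#39;s &quot;Failed Actions&quot; section and those warnings are rendered from failure entries (like vm-elm_last_failure_0) kept in the CIB status section until they are cleaned up. A rough sketch of how such an entry is classified, using a hypothetical, heavily trimmed fragment modeled on the vm-elm excerpt above:</div><div><br></div>

```python
# Hypothetical, heavily trimmed CIB status fragment modeled on the vm-elm
# log excerpt; real lrm_rsc_op entries carry many more attributes.
import xml.etree.ElementTree as ET

fragment = """
<lrm_resource id="vm-elm" type="TransientDomain" class="ocf" provider="niif">
  <lrm_rsc_op id="vm-elm_last_failure_0" operation_key="vm-elm_monitor_60000"
              operation="monitor" call-id="645" rc-code="1" op-status="2"
              interval="60000"/>
  <lrm_rsc_op id="vm-elm_last_0" operation_key="vm-elm_start_0"
              operation="start" call-id="602" rc-code="0" op-status="0"
              interval="0"/>
</lrm_resource>
"""

# An operation record counts as a failure when either the executor status
# (op-status, 2 = timed out) or the agent exit code (rc-code) is non-zero.
failed = [
    op.get("operation_key")
    for op in ET.fromstring(fragment)
    if op.get("op-status") != "0" or op.get("rc-code") != "0"
]
print(failed)  # only the monitor record is classified as a failure
```

<div><br></div><div>Clearing the entry with crm_resource --cleanup removes the record, which is why the recurring &quot;Processing failed op monitor&quot; warnings stop afterwards.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">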
<span class=""><br>
&gt; Cleaning up the resource will end the failure condition, so what happens<br>
&gt; next depends on the configuration and state of the cluster. If the<br>
&gt; failure was preventing a preferred node from running the resource, the<br>
&gt; resource could move, depending on other factors such as stickiness.<br>
<br>
</span>These resources are (still) running fine, suffered only monitor failures<br>
and are node-neutral, so it should be safe to clean them up, I suppose.<br>
<span class="HOEnZb"><font color="#888888">--<br>
Thanks for your quick and enlightening answer!  I feared the mere length<br>
of my message would scare everybody away...<br>
Regards,<br>
Feri<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</div></div></blockquote></div><br></div></div>