<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi everybody,<br>

<br>

Currently, Pacemaker&#39;s on-fail property allows you to configure how the<br>

cluster reacts to operation failures. The default &quot;restart&quot; means try to<br>

restart on the same node, optionally moving to another node once<br>

migration-threshold is reached. Other possibilities are &quot;ignore&quot;,<br>

&quot;block&quot;, &quot;stop&quot;, &quot;fence&quot;, and &quot;standby&quot;.<br>

<br>

Occasionally, we get requests for something like migration-threshold<br>

that works with on-fail values besides restart. For example, try restarting<br>

the resource on the same node 3 times, then fence.<br>

<br>

I&#39;d like to get your feedback on two alternative approaches we&#39;re<br>

considering.<br>

<br>

###<br>

<br>

Our first proposed approach would add a new hard-fail-threshold<br>

operation property. If specified, the cluster would first try restarting<br>

the resource on the same node, </blockquote><div><br></div><div>Well, just as now, it would be _allowed_ to start on the same node, but this is not guaranteed.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">before doing the on-fail handling.<br>

<br>

For example, you could configure a promote operation with<br>

hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

One point that&#39;s not settled is whether failures of *any* operation<br>

would count toward the 3 failures (which is how migration-threshold<br>

works now), or only failures of the specified operation.<br></blockquote><div><br></div><div>I think if hard-fail-threshold is per-op, then only failures of that operation should count.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Currently, if a start fails (but is retried successfully), then a<br>

promote fails (but is retried successfully), then a monitor fails, the<br>

resource will move to another node if migration-threshold=3. We could<br>

keep that behavior with hard-fail-threshold, or only count monitor<br>

failures toward monitor&#39;s hard-fail-threshold. Each alternative has<br>

advantages and disadvantages.<br>
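
<br>

To make the difference concrete, here is a minimal sketch of the two counting alternatives applied to the start/promote/monitor scenario above. It is an illustration only, not Pacemaker code: the function names and the list-of-failures representation are hypothetical.<br>

<pre>
```python
from collections import Counter


def reaches_threshold_any_op(failures, threshold):
    """Count failures of every operation together (how migration-threshold works now)."""
    return len(failures) >= threshold


def reaches_threshold_per_op(failures, op, threshold):
    """Count only failures of the given operation toward that operation's threshold."""
    return Counter(failures)[op] >= threshold


# The scenario from the text: a start fails, then a promote, then a monitor.
failures = ["start", "promote", "monitor"]

any_op = reaches_threshold_any_op(failures, 3)
per_op = reaches_threshold_per_op(failures, "monitor", 3)

print("any-operation counting reaches the threshold:", any_op)
print("monitor-only counting reaches the threshold:", per_op)
```
</pre>

With a threshold of 3, counting every operation triggers the hard-fail handling on the monitor failure, while counting only monitor failures leaves the monitor at 1 of 3.<br>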

<br>

###<br>

<br>

The second proposed approach would add a new on-restart-fail resource<br>

property.<br>

<br>

As now, any on-fail value other than restart would take effect<br>

immediately after the first failure. A new value, &quot;ban&quot;, would<br>

immediately move the resource to another node. (on-fail=ban would behave<br>

like on-fail=restart with migration-threshold=1.)<br>

<br>

When on-fail=restart, and restarting on the same node doesn&#39;t work, the<br>

cluster would do the on-restart-fail handling. on-restart-fail would<br>

allow the same values as on-fail (minus &quot;restart&quot;), and would default to<br>

&quot;ban&quot;. </blockquote><div><br></div><div>I do wish you well tracking &quot;is this a restart&quot; across demote -&gt; stop -&gt; start -&gt; promote in 4 different transitions :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

So, if you want to fence immediately after any promote failure, you<br>

would still configure on-fail=fence; if you want to try restarting a few<br>

times first, you would configure on-fail=restart and on-restart-fail=fence.<br>

<br>

This approach keeps the current threshold behavior -- failures of any<br>

operation count toward the threshold. We&#39;d rename migration-threshold to<br>

something like hard-fail-threshold, since it would apply to more than<br>

just migration, but unlike the first approach, it would stay a resource<br>

property.<br>
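
<br>

The decision flow of this second approach could be sketched roughly as follows. This is a hedged illustration, not an implementation: the function name and its parameters are hypothetical, the threshold of 3 is only an example value, and &quot;restarting on the same node doesn&#39;t work&quot; is modeled, as an assumption, as the failure count reaching the renamed threshold.<br>

<pre>
```python
def next_action(on_fail, on_restart_fail="ban", failures=1, threshold=3):
    """Return the handling for the latest failure under the second proposal.

    Hypothetical helper: `failures` is the number of failures counted so far
    (for any operation) and `threshold` stands in for the renamed
    migration-threshold.
    """
    if on_restart_fail == "restart":
        # The proposal allows the same values as on-fail, minus "restart".
        raise ValueError("on-restart-fail cannot be 'restart'")
    if on_fail != "restart":
        # Any other on-fail value applies immediately, on the first failure.
        return on_fail
    if failures >= threshold:
        # Restarting did not help: fall back to the on-restart-fail handling.
        return on_restart_fail
    return "restart"


print(next_action("fence"))                          # fence right away
print(next_action("restart", "fence", failures=1))   # still restarting
print(next_action("restart", "fence", failures=3))   # threshold reached
print(next_action("restart", failures=3))            # default on-restart-fail
```
</pre>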

<br>

###<br>

<br>

Comparing the two approaches, the first is more flexible, but also more<br>

complex and potentially confusing.<br></blockquote><div><br></div><div>More complex to implement or more complex to configure?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

With either approach, we would deprecate the start-failure-is-fatal<br>

cluster property. start-failure-is-fatal=true would be equivalent to<br>

hard-fail-threshold=1 with the first approach, and on-fail=ban with the<br>

second approach. This would be both simpler and more useful -- it allows<br>

the value to be set differently per resource.<br>

<span class="gmail-HOEnZb"><font color="#888888">--<br>

Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>&gt;<br>

<br>

______________________________<wbr>_________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>

<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

</font></span></blockquote></div><br></div></div>