<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 9, 2021 at 6:19 AM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com">arvidjaar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 09.08.2021 16:00, Andreas Janning wrote:<br>

> Hi,<br>

> <br>

> yes, by "service" I meant the apache-clone resource.<br>

> <br>

> Maybe I can give a more stripped down and detailed example:<br>

> <br>

> *Given the following configuration:*<br>

> [root@pacemaker-test-1 cluster]# pcs cluster cib --config<br>

> <configuration><br>

>   <crm_config><br>

>     <cluster_property_set id="cib-bootstrap-options"><br>

>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog"<br>

> value="false"/><br>

>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"<br>

> value="1.1.23-1.el7_9.1-9acf116022"/><br>

>       <nvpair id="cib-bootstrap-options-cluster-infrastructure"<br>

> name="cluster-infrastructure" value="corosync"/><br>

>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name"<br>

> value="pacemaker-test"/><br>

>       <nvpair id="cib-bootstrap-options-stonith-enabled"<br>

> name="stonith-enabled" value="false"/><br>

>       <nvpair id="cib-bootstrap-options-symmetric-cluster"<br>

> name="symmetric-cluster" value="false"/><br>

>       <nvpair id="cib-bootstrap-options-last-lrm-refresh"<br>

> name="last-lrm-refresh" value="1628511747"/><br>

>     </cluster_property_set><br>

>   </crm_config><br>

>   <nodes><br>

>     <node id="1" uname="pacemaker-test-1"/><br>

>     <node id="2" uname="pacemaker-test-2"/><br>

>   </nodes><br>

>   <resources><br>

>     <clone id="apache-clone"><br>

>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache"><br>

>         <instance_attributes id="apache-instance_attributes"><br>

>           <nvpair id="apache-instance_attributes-port" name="port"<br>

> value="80"/><br>

>           <nvpair id="apache-instance_attributes-statusurl"<br>

> name="statusurl" value="<a href="http://localhost/server-status" rel="noreferrer" target="_blank">http://localhost/server-status</a>"/><br>

>         </instance_attributes><br>

>         <operations><br>

>           <op id="apache-monitor-interval-10s" interval="10s"<br>

> name="monitor" timeout="20s"/><br>

>           <op id="apache-start-interval-0s" interval="0s" name="start"<br>

> timeout="40s"/><br>

>           <op id="apache-stop-interval-0s" interval="0s" name="stop"<br>

> timeout="60s"/><br>

>         </operations><br>

>       </primitive><br>

>       <meta_attributes id="apache-meta_attributes"><br>

>         <nvpair id="apache-clone-meta_attributes-clone-max"<br>

> name="clone-max" value="2"/><br>

>         <nvpair id="apache-clone-meta_attributes-clone-node-max"<br>

> name="clone-node-max" value="1"/><br>

>         <nvpair id="apache-clone-meta_attributes-interleave"<br>

> name="interleave" value="true"/><br>

>       </meta_attributes><br>

>     </clone><br>

>   </resources><br>

>   <constraints><br>

>     <rsc_location id="location-apache-clone-pacemaker-test-1-100"<br>

> node="pacemaker-test-1" rsc="apache-clone" score="100"<br>

> resource-discovery="exclusive"/><br>

>     <rsc_location id="location-apache-clone-pacemaker-test-2-0"<br>

> node="pacemaker-test-2" rsc="apache-clone" score="0"<br>

> resource-discovery="exclusive"/><br>

>   </constraints><br>

>   <rsc_defaults><br>

>     <meta_attributes id="rsc_defaults-options"><br>

>       <nvpair id="rsc_defaults-options-resource-stickiness"<br>

> name="resource-stickiness" value="50"/><br>

>     </meta_attributes><br>

>   </rsc_defaults><br>

> </configuration><br>

> <br>

> <br>

> *With the cluster in a running state:*<br>

> <br>

> [root@pacemaker-test-1 cluster]# pcs status<br>

> Cluster name: pacemaker-test<br>

> Stack: corosync<br>

> Current DC: pacemaker-test-2 (version 1.1.23-1.el7_9.1-9acf116022) -<br>

> partition with quorum<br>

> Last updated: Mon Aug  9 14:45:38 2021<br>

> Last change: Mon Aug  9 14:43:14 2021 by hacluster via crmd on<br>

> pacemaker-test-1<br>

> <br>

> 2 nodes configured<br>

> 2 resource instances configured<br>

> <br>

> Online: [ pacemaker-test-1 pacemaker-test-2 ]<br>

> <br>

> Full list of resources:<br>

> <br>

>  Clone Set: apache-clone [apache]<br>

>      Started: [ pacemaker-test-1 pacemaker-test-2 ]<br>

> <br>

> Daemon Status:<br>

>   corosync: active/disabled<br>

>   pacemaker: active/disabled<br>

>   pcsd: active/enabled<br>

> <br>

> *When simulating an error by killing the apache-resource on<br>

> pacemaker-test-1:*<br>

> <br>

> [root@pacemaker-test-1 ~]# killall httpd<br>

> <br>

> *After a few seconds, the cluster notices that the apache-resource is down<br>

> on pacemaker-test-1 and restarts it on pacemaker-test-1 (this is fine):*<br>

> <br>

> [root@pacemaker-test-1 cluster]# cat corosync.log | grep crmd:<br>

<br>

Never ever filter logs that you show unless you know what you are doing.<br>

<br>

You skipped the most interesting part, which is the intended actions.<br>

They are:<br>

<br>

Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:<br>

 * Recover    apache:0     ( ha1 -> ha2 )<br>

Aug 09 15:59:37.889 ha1 pacemaker-schedulerd[3783] (LogAction)  notice:<br>

 * Move       apache:1     ( ha2 -> ha1 )<br>

<br>

So Pacemaker decides to "swap" the nodes on which the current instances are running.<br></blockquote><div><br></div><div>Correct. I've only skimmed this thread, but it looks like the issue tracked here:<br></div><div><br></div><div><a href="https://github.com/ClusterLabs/pacemaker/pull/2313">https://github.com/ClusterLabs/pacemaker/pull/2313</a></div><div><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1931023">https://bugzilla.redhat.com/show_bug.cgi?id=1931023</a></div><div><br></div><div>Personal matters have kept me from following up on the PR for a while. In my experience, configuring resource-stickiness has worked around the issue.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Looking at scores<br>

<br>

Using the original execution date of: 2021-08-09 12:59:37Z<br>

<br>

Current cluster status:<br>

Online: [ ha1 ha2 ]<br>

<br>

 vip    (ocf::pacemaker:Dummy):  Started ha1<br>

 Clone Set: apache-clone [apache]<br>

     apache     (ocf::pacemaker:Dummy):  FAILED ha1<br>

     Started: [ ha2 ]<br>

<br>

Allocation scores:<br>

pcmk__clone_allocate: apache-clone allocation score on ha1: 200<br>

pcmk__clone_allocate: apache-clone allocation score on ha2: 0<br>

pcmk__clone_allocate: apache:0 allocation score on ha1: 101<br>

pcmk__clone_allocate: apache:0 allocation score on ha2: 0<br>

pcmk__clone_allocate: apache:1 allocation score on ha1: 100<br>

pcmk__clone_allocate: apache:1 allocation score on ha2: 1<br>

pcmk__native_allocate: apache:1 allocation score on ha1: 100<br>

pcmk__native_allocate: apache:1 allocation score on ha2: 1<br>

pcmk__native_allocate: apache:1 allocation score on ha1: 100<br>

pcmk__native_allocate: apache:1 allocation score on ha2: 1<br>

pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY<br>

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^<br>

pcmk__native_allocate: apache:0 allocation score on ha2: 0<br>

pcmk__native_allocate: vip allocation score on ha1: 100<br>

pcmk__native_allocate: vip allocation score on ha2: 0<br>

<br>

Transition Summary:<br>

 * Recover    apache:0     ( ha1 -> ha2 )<br>

 * Move       apache:1     ( ha2 -> ha1 )<br>
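The -INFINITY entry in the allocation scores above is easy to miss in a longer listing. As a minimal sketch, assuming the score lines keep the "function: resource allocation score on node: score" shape shown here, a short Python script can pick out the placements the scheduler has ruled out. The function names and the SAMPLE text are illustrative only and are not part of Pacemaker.

```python
import re

# Matches lines such as:
#   pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY
SCORE_RE = re.compile(
    r"^(?P<func>\S+): (?P<rsc>\S+) allocation score on (?P<node>\S+): (?P<score>\S+)$"
)


def parse_scores(text):
    """Return a list of (function, resource, node, score) tuples."""
    rows = []
    for line in text.splitlines():
        m = SCORE_RE.match(line.strip())
        if m:
            rows.append((m["func"], m["rsc"], m["node"], m["score"]))
    return rows


def banned_placements(rows):
    """Return (resource, node) pairs whose score is -INFINITY."""
    return [(rsc, node) for _, rsc, node, score in rows if score == "-INFINITY"]


# A few lines copied from the allocation scores quoted in this thread.
SAMPLE = """\
pcmk__clone_allocate: apache:0 allocation score on ha1: 101
pcmk__native_allocate: apache:1 allocation score on ha1: 100
pcmk__native_allocate: apache:0 allocation score on ha1: -INFINITY
pcmk__native_allocate: apache:0 allocation score on ha2: 0
"""

if __name__ == "__main__":
    for rsc, node in banned_placements(parse_scores(SAMPLE)):
        print(f"{rsc} cannot run on {node}")
```

This only reports which placements were excluded; it does not explain why the scheduler arrived at that score.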

<br>

<br>

No, I do not have an explanation for why Pacemaker decides that apache:0 cannot<br>

run on ha1 in this case and so decides to move it to another node. It<br>

most certainly has something to do with the asymmetric cluster and the location<br>

scores. If you set the same location score for apache-clone on both<br>

nodes, Pacemaker will recover the failed instance and won't attempt to move<br>

it. For example:<br>

<br>

location location-apache-clone-ha1-100 apache-clone<br>

resource-discovery=exclusive 100: ha1<br>

location location-apache-clone-ha2-100 apache-clone<br>

resource-discovery=exclusive 100: ha2<br>
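To check an existing configuration for the situation described above, the following Python sketch reads the constraints section of a CIB and lists resources whose location scores differ between nodes. It assumes simple rsc_location elements with node and score attributes, as in the configuration quoted earlier in this thread; the function names are hypothetical, and rule-based constraints are skipped.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def location_scores(cib_xml):
    """Map each resource id to {node: score} from simple rsc_location constraints."""
    root = ET.fromstring(cib_xml)
    scores = defaultdict(dict)
    for loc in root.iter("rsc_location"):
        rsc, node, score = loc.get("rsc"), loc.get("node"), loc.get("score")
        # Rule-based constraints carry no node/score attributes; skip them.
        if rsc and node and score is not None:
            scores[rsc][node] = score
    return dict(scores)


def uneven_resources(scores):
    """Return resource ids whose per-node scores are not all identical."""
    return [rsc for rsc, by_node in scores.items() if len(set(by_node.values())) > 1]


# Constraints section as posted in this thread (scores 100 and 0).
SAMPLE = """\
<constraints>
  <rsc_location id="location-apache-clone-pacemaker-test-1-100"
      node="pacemaker-test-1" rsc="apache-clone" score="100"
      resource-discovery="exclusive"/>
  <rsc_location id="location-apache-clone-pacemaker-test-2-0"
      node="pacemaker-test-2" rsc="apache-clone" score="0"
      resource-discovery="exclusive"/>
</constraints>
"""

if __name__ == "__main__":
    for rsc in uneven_resources(location_scores(SAMPLE)):
        print(f"{rsc}: location scores differ between nodes")
```

With both scores set to 100, as suggested above, the script reports nothing for apache-clone.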

<br>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div>Regards,<br><br></div>Reid Wahl, RHCA<br></div><div>Senior Software Maintenance Engineer, Red Hat<br></div>CEE - Platform Support Delivery - ClusterHA</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>