<html><body><p>A quick addendum... <br><br>After sending this post, I decided to stop pacemaker on the single, Online node in the cluster, <br>and this effectively killed the corosync daemon: <br><br>[root@zs93kl VD]# date;pcs cluster stop<br>Wed Sep 28 16:39:22 EDT 2016<br>Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...<br><br><br>[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep<br>Wed Sep 28 16:46:19 EDT 2016<br>[root@zs93kl VD]#<br><br><br><br>Next, I went to a node in &quot;Pending&quot; state, and sure enough... the pcs cluster stop killed the daemon there, too: <br><br>[root@zs95kj VD]# date;pcs cluster stop<br>Wed Sep 28 16:48:15 EDT 2016<br>Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...<br><br>[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep<br>Wed Sep 28 16:48:38 EDT 2016<br>[root@zs95kj VD]#<br><br>So, this answers my own question...  cluster stop should kill corosync.  So, why is `pcs cluster stop --all` failing to <br>kill corosync? <br><br>Thanks... <br><br><br>Scott Greenlese ... IBM KVM on System Z Test,  Poughkeepsie, N.Y.<br>  INTERNET:  swgreenl@us.ibm.com  <br><br><br><br>
<font size="2" color="#5F5F5F">From:        </font><font size="2">Scott Greenlese/Poughkeepsie/IBM</font><br><font size="2" color="#5F5F5F">To:        </font><font size="2">kgaillot@redhat.com, Cluster Labs - All topics related to open-source clustering welcomed &lt;users@clusterlabs.org&gt;</font><br><font size="2" color="#5F5F5F">Date:        </font><font size="2">09/28/2016 04:30 PM</font><br><font size="2" color="#5F5F5F">Subject:        </font><font size="2">Re: [ClusterLabs] Pacemaker quorum behavior</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br>Hi folks..<br><br>I have some follow-up questions about corosync daemon status after cluster shutdown. <br><br>Basically, what should happen to corosync on a cluster node when pacemaker is shut down on that node? <br>On my 5 node cluster, when I do a global shutdown, the pacemaker processes exit, but corosync processes remain active. <br><br>Here's an example of where this led me into some trouble... <br><br>My cluster is still configured to use the &quot;symmetric&quot; resource distribution.   I don't have any location constraints in place, so pacemaker tries to evenly distribute resources across all Online nodes. <br><br>With one cluster node (KVM host) powered off, I did the global cluster stop: <br><br>[root@zs90KP VD]# date;pcs cluster stop --all<br>Wed Sep 28 15:07:40 EDT 2016<br>zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)<br>zs90kppcs1: Stopping Cluster (pacemaker)...<br>zs95KLpcs1: Stopping Cluster (pacemaker)...<br>zs95kjpcs1: Stopping Cluster (pacemaker)...<br>zs93kjpcs1: Stopping Cluster (pacemaker)...<br>Error: unable to stop all nodes<br>zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)<br><br>Note:  The &quot;No route to host&quot; messages are expected because that node / LPAR is powered down. 
<br><br>(I don't show it here, but the corosync daemon is still running on the 4 active nodes. I do show it later). <br><br>I then powered on the one zs93KLpcs1 LPAR,  so in theory I should not have quorum when it comes up and activates<br>pacemaker, which is enabled to autostart at boot time on all 5 cluster nodes.  At this point, only 1 out of 5<br>nodes should be Online to the cluster, and therefore ... no quorum. <br><br>I log in to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending', and the cluster as &quot;partition with quorum&quot;: <br><br>[root@zs93kl ~]# date;pcs status |less<br>Wed Sep 28 15:25:13 EDT 2016<br>Cluster name: test_cluster_2<br>Last updated: Wed Sep 28 15:25:13 2016          Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1<br>Stack: corosync<br>Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum<br>106 nodes and 304 resources configured<br><br>Node zs90kppcs1: pending<br>Node zs93kjpcs1: pending<br>Node zs95KLpcs1: pending<br>Node zs95kjpcs1: pending<br>Online: [ zs93KLpcs1 ]<br><br>Full list of resources:<br><br> zs95kjg109062_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br>.<br>.<br>.<br><br><br>Here you can see that corosync is up on all 5 nodes: <br><br>[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host &quot;hostname;ps -ef |grep corosync |grep -v grep&quot;; done<br>Wed Sep 28 15:22:21 EDT 2016<br>zs90KP<br>root     155374      1  0 Sep26 ?        00:10:17 corosync<br>zs95KL<br>root      22933      1  0 11:51 ?        00:00:54 corosync<br>zs95kj<br>root      19382      1  0 Sep26 ?        00:10:15 corosync<br>zs93kj<br>root     129102      1  0 Sep26 ?        00:12:10 corosync<br>zs93kl<br>root      21894      1  0 15:19 ?        
00:00:00 corosync<br><br><br>But, pacemaker is only running on the one, online node: <br><br>[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host &quot;hostname;ps -ef |grep pacemakerd |grep -v grep&quot;; done<br>Wed Sep 28 15:23:29 EDT 2016<br>zs90KP<br>zs95KL<br>zs95kj<br>zs93kj<br>zs93kl<br>root      23005      1  0 15:19 ?        00:00:00 /usr/sbin/pacemakerd -f<br>You have new mail in /var/spool/mail/root<br>[root@zs95kj VD]#<br><br><br>This situation wreaks havoc on my VirtualDomain resources, as the majority of them are in FAILED or Stopped state, and to my<br>surprise... many of them show as Started: <br><br>[root@zs93kl VD]# date;pcs resource show |grep zs93KL<br>Wed Sep 28 15:55:29 EDT 2016<br> zs95kjg109062_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109064_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109065_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109066_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109068_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109069_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109070_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109071_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109072_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109073_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109074_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109075_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109076_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109077_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109078_res      (ocf::heartbeat:VirtualDomain): Started 
zs93KLpcs1<br> zs95kjg109079_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109080_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109081_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109082_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109083_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109084_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109085_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109086_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109087_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109088_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109089_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109090_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109092_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109095_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109096_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109097_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109101_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109102_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg109104_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110065_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110066_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110067_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110068_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110069_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110070_res      
(ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110071_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110072_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110073_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110074_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110075_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110076_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110079_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110080_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110081_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110082_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110084_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110086_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110087_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110088_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110089_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110103_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110104_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110093_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110094_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110095_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110097_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110099_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110100_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110101_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110102_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> 
zs95kjg110098_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110105_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110106_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110107_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110108_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110109_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110110_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110111_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110112_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110113_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110114_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110115_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110116_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110117_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110118_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110119_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110120_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110121_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110122_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110123_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110124_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110125_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110126_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110128_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110129_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110130_res      (ocf::heartbeat:VirtualDomain): Started 
zs93KLpcs1<br> zs95kjg110131_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110132_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110133_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110134_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110135_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110137_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110138_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110139_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110140_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110141_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110142_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110143_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110144_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110145_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110146_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110148_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110149_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110150_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110152_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110154_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110155_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110156_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110159_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1<br> zs95kjg110160_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110161_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110164_res      (ocf::heartbeat:VirtualDomain): 
Started zs93KLpcs1<br> zs95kjg110165_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br> zs95kjg110166_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br><br><br>Pacemaker is attempting to activate all VirtualDomain resources on the one cluster node. <br><br>So back to my original question... what should happen to corosync when I do a cluster stop? <br>If corosync should be stopping as well, what would prevent this? <br><br>Also,  I have tried simulating a failed cluster node (to trigger a STONITH action) by killing the <br>corosync daemon on one node, but all that does is respawn the daemon ... causing a temporary / transient<br>failure condition, and no fence takes place.   Is there a way to kill corosync in such a way<br>that it stays down?   Is there a best practice for STONITH testing?<br><br>As usual, thanks in advance for your advice. <br><br>Scott Greenlese ... IBM KVM on System Z -  Solutions Test,  Poughkeepsie, N.Y.<br>  INTERNET:  swgreenl@us.ibm.com  <br> <br><br><br><br><font size="2" color="#5F5F5F">From:        </font><font size="2">Ken Gaillot &lt;kgaillot@redhat.com&gt;</font><br><font size="2" color="#5F5F5F">To:        </font><font size="2">users@clusterlabs.org</font><br><font size="2" color="#5F5F5F">Date:        </font><font size="2">09/09/2016 06:23 PM</font><br><font size="2" color="#5F5F5F">Subject:        </font><font size="2">Re: [ClusterLabs] Pacemaker quorum behavior</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br><tt>On 09/09/2016 04:27 AM, Klaus Wenninger wrote:<br>&gt; On 09/08/2016 07:31 
PM, Scott Greenlese wrote:<br>&gt;&gt;<br>&gt;&gt; Hi Klaus, thanks for your prompt and thoughtful feedback...<br>&gt;&gt;<br>&gt;&gt; Please see my answers nested below (sections entitled, &quot;Scott's<br>&gt;&gt; Reply&quot;). Thanks!<br>&gt;&gt;<br>&gt;&gt; - Scott<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.<br>&gt;&gt; INTERNET: swgreenl@us.ibm.com<br>&gt;&gt; PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; From: Klaus Wenninger &lt;kwenning@redhat.com&gt;<br>&gt;&gt; To: users@clusterlabs.org<br>&gt;&gt; Date: 09/08/2016 10:59 AM<br>&gt;&gt; Subject: Re: [ClusterLabs] Pacemaker quorum behavior<br>&gt;&gt;<br>&gt;&gt; ------------------------------------------------------------------------<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; On 09/08/2016 03:55 PM, Scott Greenlese wrote:<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Hi all...<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; I have a few very basic questions for the group.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100<br>&gt;&gt; &gt; VirtualDomain pacemaker-remote nodes<br>&gt;&gt; &gt; plus 100 &quot;opaque&quot; VirtualDomain resources. The cluster is configured<br>&gt;&gt; &gt; to be 'symmetric' and I have no<br>&gt;&gt; &gt; location constraints on the 200 VirtualDomain resources (other than to<br>&gt;&gt; &gt; prevent the opaque guests<br>&gt;&gt; &gt; from running on the pacemaker remote node resources). 
My quorum is set<br>&gt;&gt; &gt; as:<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; quorum {<br>&gt;&gt; &gt; provider: corosync_votequorum<br>&gt;&gt; &gt; }<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; As an experiment, I powered down one LPAR in the cluster, leaving 4<br>&gt;&gt; &gt; powered up with the pcsd service up on the 4 survivors<br>&gt;&gt; &gt; but corosync/pacemaker down (pcs cluster stop --all) on the 4<br>&gt;&gt; &gt; survivors. I then started pacemaker/corosync on a single cluster<br>&gt;&gt; &gt;<br>&gt;&gt;<br>&gt;&gt; &quot;pcs cluster stop&quot; shuts down pacemaker &amp; corosync on my test-cluster but<br>&gt;&gt; did you check the status of the individual services?<br>&gt;&gt;<br>&gt;&gt; Scott's reply:<br>&gt;&gt;<br>&gt;&gt; No, I only assumed that pacemaker was down because I got this back on<br>&gt;&gt; my pcs status<br>&gt;&gt; command from each cluster node:<br>&gt;&gt;<br>&gt;&gt; [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1<br>&gt;&gt; zs93kjpcs1 ; do ssh $host pcs status; done<br>&gt;&gt; Wed Sep 7 15:49:27 EDT 2016<br>&gt;&gt; Error: cluster is not currently running on this node<br>&gt;&gt; Error: cluster is not currently running on this node<br>&gt;&gt; Error: cluster is not currently running on this node<br>&gt;&gt; Error: cluster is not currently running on this node<br><br>In my experience, this is sufficient to say that pacemaker and corosync<br>aren't running.<br><br>&gt;&gt;<br>&gt;&gt; What else should I check? &nbsp;The pcsd.service service was still up,<br>&gt;&gt; since I did not stop that<br>&gt;&gt; anywhere. 
Should I have done, &nbsp;ps -ef |grep -e pacemaker -e corosync<br>&gt;&gt; &nbsp;to check the state before<br>&gt;&gt; assuming it was really down?<br>&gt;&gt;<br>&gt;&gt;<br>&gt; Guess the answer from Poki should guide you well here ...<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; &gt; node (pcs cluster start), and this resulted in the 200 VirtualDomain<br>&gt;&gt; &gt; resources activating on the single node.<br>&gt;&gt; &gt; This was not what I was expecting. I assumed that no resources would<br>&gt;&gt; &gt; activate / start on any cluster nodes<br>&gt;&gt; &gt; until 3 out of the 5 total cluster nodes had pacemaker/corosync running.<br><br>Your expectation is correct; I'm not sure what happened in this case.<br>There are some obscure corosync options (e.g. last_man_standing,<br>allow_downscale) that could theoretically lead to this, but I don't get<br>the impression you're using anything unusual.<br><br>&gt;&gt; &gt; After starting pacemaker/corosync on the single host (zs95kjpcs1),<br>&gt;&gt; &gt; this is what I see :<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; [root@zs95kj VD]# date;pcs status |less<br>&gt;&gt; &gt; Wed Sep 7 15:51:17 EDT 2016<br>&gt;&gt; &gt; Cluster name: test_cluster_2<br>&gt;&gt; &gt; Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12<br>&gt;&gt; &gt; 2016 by hacluster via crmd on zs93kjpcs1<br>&gt;&gt; &gt; Stack: corosync<br>&gt;&gt; &gt; Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -<br>&gt;&gt; &gt; partition with quorum<br>&gt;&gt; &gt; 106 nodes and 304 resources configured<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Node zs93KLpcs1: pending<br>&gt;&gt; &gt; Node zs93kjpcs1: pending<br>&gt;&gt; &gt; Node zs95KLpcs1: pending<br>&gt;&gt; &gt; Online: [ zs95kjpcs1 ]<br>&gt;&gt; &gt; OFFLINE: [ zs90kppcs1 ]<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; .<br>&gt;&gt; &gt; .<br>&gt;&gt; &gt; .<br>&gt;&gt; &gt; PCSD Status:<br>&gt;&gt; &gt; zs93kjpcs1: Online<br>&gt;&gt; &gt; zs95kjpcs1: Online<br>&gt;&gt; &gt; zs95KLpcs1: 
Online<br>&gt;&gt; &gt; zs90kppcs1: Offline<br>&gt;&gt; &gt; zs93KLpcs1: Online<br><br>FYI the Online/Offline above refers only to pcsd, which doesn't have any<br>effect on the cluster itself -- just the ability to run pcs commands.<br><br>&gt;&gt; &gt; So, what exactly constitutes an &quot;Online&quot; vs. &quot;Offline&quot; cluster node<br>&gt;&gt; &gt; w.r.t. quorum calculation? Seems like in my case, it's &quot;pending&quot; on 3<br>&gt;&gt; &gt; nodes,<br>&gt;&gt; &gt; so where does that fall? And why &quot;pending&quot;? What does that mean?<br><br>&quot;pending&quot; means that the node has joined the corosync cluster (which<br>allows it to contribute to quorum), but it has not yet completed the<br>pacemaker join process (basically a handshake with the DC).<br><br>I think the corosync and pacemaker detail logs would be essential to<br>figuring out what's going on. Check the logs on the &quot;pending&quot; nodes to<br>see whether corosync somehow started up by this point, and check the<br>logs on this node to see what the most recent references to the pending<br>nodes were.<br><br>&gt;&gt; &gt; Also, what exactly is the cluster's expected reaction to quorum loss?<br>&gt;&gt; &gt; Cluster resources will be stopped or something else?<br>&gt;&gt; &gt;<br>&gt;&gt; Depends on how you configure it using cluster property no-quorum-policy<br>&gt;&gt; (default: stop).<br>&gt;&gt;<br>&gt;&gt; Scott's reply:<br>&gt;&gt;<br>&gt;&gt; This is how the policy is configured:<br>&gt;&gt;<br>&gt;&gt; [root@zs95kj VD]# date;pcs config |grep quorum<br>&gt;&gt; Thu Sep &nbsp;8 13:18:33 EDT 2016<br>&gt;&gt; &nbsp;no-quorum-policy: stop<br>&gt;&gt;<br>&gt;&gt; What should I expect with the 'stop' setting?<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Where can I find this documentation?<br>&gt;&gt; &gt;<br>&gt;&gt; </tt><tt><a 
href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/">http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/</a></tt><tt><br>&gt;&gt;<br>&gt;&gt; Scott's reply:<br>&gt;&gt;<br>&gt;&gt; OK, I'll keep looking thru this doc, but I don't easily find the<br>&gt;&gt; no-quorum-policy explained.<br>&gt;&gt;<br>&gt; Well, the index leads you to:<br>&gt; </tt><tt><a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html">http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html</a></tt><tt><br>&gt; where you find an exhaustive description of the option.<br>&gt; <br>&gt; In short:<br>&gt; you are running the default and that leads to all resources being<br>&gt; stopped in a partition without quorum<br>&gt; <br>&gt;&gt; Thanks..<br>&gt;&gt;<br>&gt;&gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Thanks!<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Scott Greenlese - IBM Solution Test Team.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.<br>&gt;&gt; &gt; INTERNET: swgreenl@us.ibm.com<br>&gt;&gt; &gt; PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966<br><br>_______________________________________________<br>Users mailing list: Users@clusterlabs.org<br></tt><tt><a href="http://clusterlabs.org/mailman/listinfo/users">http://clusterlabs.org/mailman/listinfo/users</a></tt><tt><br><br>Project Home: </tt><tt><a href="http://www.clusterlabs.org">http://www.clusterlabs.org</a></tt><tt><br>Getting started: </tt><tt><a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a></tt><tt><br>Bugs: </tt><tt><a href="http://bugs.clusterlabs.org">http://bugs.clusterlabs.org</a></tt><tt><br><br></tt><br><br><br><BR>

</body></html>