<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello,<br>

<br>

Our team has been using corosync + pacemaker successfully for the last year or two, but last week we ran into an issue that I would like some more insight on. We have a 2-node cluster that uses the WaitForAll votequorum parameter, so all nodes must have been seen at least once before resources are started. We have two layers of fencing configured: IPMI and SBD (storage-based death, using shared storage). We have tested our fencing extensively in the past and it works great, but here the fencing never got called.<br>

<br>

One of our QA testers managed to pull the network cable at a very particular time during startup, and it seems to have resulted in corosync telling pacemaker that all nodes had been seen and that the cluster was in a normal state with one node up. No fencing was ever triggered, and all resources were started normally. The other node was NOT marked unclean. This resulted in a split-brain scenario: our master database (pgsql replication) was still running as master on the other node, and had now also been started and promoted on this node. Luckily this is all in a test environment, so there was no production impact. Below are the test specifics and some relevant logs.<br>

<br>

Procedure:<br>

1. Allow both nodes to come up fully.<br>

2. Reboot the current master node.<br>

3. As the node boots up again (during corosync startup), pull the interconnect cable.

<p><br>

</p>

<p>Expected Behavior:<br>

1. The node either a) fails to start any resources, or b) fences the other node and promotes to master.</p>

<p><br>

</p>

<p>Actual behavior:<br>

1. The node promotes to master without fencing its peer, resulting in both nodes running the master database.</p>

<br>

<p>Module-2 is rebooted at 12:57:42 and comes back up at about 12:59.<br>

When corosync starts up, both nodes are visible and all vote counts are normal.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>Jul 15 12:59:00 module-2 corosync[2906]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]

Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2

Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2

Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2

Jul 15 12:59:00 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.

Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired

</pre>

</div>

</div>

<p>3 seconds later, the interconnect network cable is pulled.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down

</pre>

</div>

</div>

<p>Corosync recognizes this immediately, and crmd reports the peer as dead.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead

</pre>

</div>

</div>

<p>Almost immediately afterwards, corosync completes initialization, says it has quorum, and declares the system ready for use.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2

Jul 15 12:59:10 module-2 corosync[2906]: [MAIN  ] Completed service synchronization, ready to provide service.

</pre>

</div>

</div>

<p>Pacemaker starts resources normally, including Postgres.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   fence_sbd        (module-2)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ipmi-1        (module-2)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   SlaveIP        (module-2)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   postgres:0        (module-2)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ethmonitor:0        (module-2)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   tomcat-instance:0        (module-2 - blocked)

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start   ClusterMonitor:0        (module-2 - blocked)

</pre>

</div>

</div>

<p>Votequorum shows 1 vote per node, and WaitForAll is set. Pacemaker should not be able to start ANY resources until it has seen all nodes once.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre>module-2 ~ # corosync-quorumtool 

Quorum information

------------------

Date:             Wed Jul 15 18:15:34 2015

Quorum provider:  corosync_votequorum

Nodes:            1

Node ID:          2

Ring ID:          64

Quorate:          Yes

Votequorum information

----------------------

Expected votes:   2

Highest expected: 2

Total votes:      1

Quorum:           1  

Flags:            2Node Quorate WaitForAll 

Membership information

----------------------

    Nodeid      Votes Name

         2          1 module-2 (local)

</pre>

</div>

</div>
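
<p>For reference, the 2Node and WaitForAll flags reported above would normally come from a quorum section along the following lines. The corosync.conf from these nodes is not included in this report, so this is only an assumed sketch built from the option names documented in votequorum(5), not the actual file. Per that man page, two_node: 1 enables wait_for_all by default, so the last option is listed only to make the setting explicit.</p>

<div class="preformatted panel" style="border-width: 1px;">

<div class="preformattedContent panelContent">

<pre># Assumed sketch of the quorum section in corosync.conf
# (the real file from module-1/module-2 is not shown in this report)
quorum {
    provider: corosync_votequorum
    # two-node mode: the cluster stays quorate with a single vote
    two_node: 1
    # implied by two_node: 1; written out here for clarity
    wait_for_all: 1
}
</pre>

</div>

</div>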

<br>

Package versions:<br>

<br>

-bash-4.3# rpm -qa | grep corosync<br>

corosynclib-2.3.4-1.fc22.x86_64<br>

corosync-2.3.4-1.fc22.x86_64<br>

<br>

-bash-4.3# rpm -qa | grep pacemaker<br>

pacemaker-cluster-libs-1.1.12-2.fc22.x86_64<br>

pacemaker-libs-1.1.12-2.fc22.x86_64<br>

pacemaker-cli-1.1.12-2.fc22.x86_64<br>

pacemaker-1.1.12-2.fc22.x86_64<br>

<br>


</div>

</body>

</html>