[ClusterLabs] Restart of 2-node cluster causes split brain ?

Stefan Schloesser sschloesser at enomic.com
Thu May 7 10:25:03 CEST 2015


Hi,

I have a 2-node DRBD cluster. If for some reason one node is killed, I am unable to restart the cluster without a split brain. I am running Ubuntu 14.04. This is what happens:
After rebooting the downed node (named sec), I start corosync and pacemaker on it. Sec then immediately fences prim without waiting for DRBD to resync, and starts all services on itself, causing the split brain.
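For reference, my understanding is that on corosync 2.x two-node quorum is handled by votequorum, and the options below are what decide whether a freshly booted node may gain quorum (and fence its peer) on its own. This is only a sketch of the relevant corosync.conf section, not my exact config:

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implies wait_for_all, which should keep a rebooted
        # node from gaining quorum until it has seen the other node
        # at least once
        wait_for_all: 1
    }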

In the log I see:
pengine:     info: determine_online_status_fencing:      Node sec is active
May 07 09:25:08 [7061] sec    pengine:     info: determine_online_status:      Node sec is online
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         stonith_sec     (stonith:external/hetzner):     Stopped
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         stonith_prim    (stonith:external/hetzner):     Stopped
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         ip      (ocf::kumina:hetzner-failover-ip):      Stopped
May 07 09:25:08 [7061] sec    pengine:     info: clone_print:   Master/Slave Set: ms_drbd [drbd]
May 07 09:25:08 [7061] sec    pengine:     info: short_print:       Stopped: [ prim sec ]
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         fs      (ocf::heartbeat:Filesystem):    Stopped
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         mysql   (ocf::heartbeat:mysql): Stopped
May 07 09:25:08 [7061] sec    pengine:     info: native_print:         apache  (ocf::heartbeat:apache):        Stopped
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource stonith_sec cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource drbd:1 cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: master_color:         ms_drbd: Promoted 0 instances of a possible 1 to master
May 07 09:25:08 [7061] sec    pengine:     info: rsc_merge_weights:    fs: Rolling back scores from mysql
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource fs cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: rsc_merge_weights:    mysql: Rolling back scores from apache
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource mysql cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: rsc_merge_weights:    apache: Rolling back scores from ip
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource apache cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: native_color:         Resource ip cannot run anywhere
May 07 09:25:08 [7061] sec    pengine:     info: RecurringOp:   Start recurring monitor (3600s) for stonith_prim on sec
May 07 09:25:08 [7061] sec    pengine:     info: RecurringOp:   Start recurring monitor (31s) for drbd:0 on sec
May 07 09:25:08 [7061] sec    pengine:     info: RecurringOp:   Start recurring monitor (31s) for drbd:0 on sec
May 07 09:25:08 [7061] sec    pengine:  warning: stage6:       Scheduling Node prim for STONITH

Version Info:
14.04: corosync Version: 2.3.3-1ubuntu1
       Pacemaker Version: 1.1.10+git20130802-1ubuntu2.3
12.04: corosync Version: 1.4.2-2ubuntu0.2
       Pacemaker Version: 1.1.6-2ubuntu3.3
I run other clusters with an identical setup on Ubuntu 12.04 without such problems, so I believe something major changed between these versions that I missed.
Maybe during the original failure both nodes wanted to fence each other; prim won the race, but sec remembered that it wanted to fence prim and does so at the first possible opportunity, i.e. on restart. Would that be possible? If so, how can I stop this behavior?
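To illustrate what I am asking about, these are the knobs I would expect to be involved (a sketch only, from the documentation as I understand it, so please correct me if the names are wrong):

    # on the rebooted node: does it claim quorum on its own?
    corosync-quorumtool -s

    # Pacemaker property controlling whether nodes whose state is
    # unknown at startup get fenced (defaults to true)
    crm configure property startup-fencing=false

    # DRBD-level protection so a stale node cannot be promoted,
    # in the DRBD resource file:
    #   disk     { fencing resource-and-stonith; }
    #   handlers { fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    #              after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; }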


Stefan
