[ClusterLabs] problems with a CentOS7 SBD cluster

Andrei Borzenkov arvidjaar at gmail.com
Sun Jun 26 06:54:11 CEST 2016


25.06.2016 23:05, Marcin Dulak wrote:
> Hi,
> 
> I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> CentOS7 cluster built in VirtualBox.
> The complete setup is available at
> https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> so hopefully with some help I'll be able to make it work.
> 
> Question 1:
> The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
> (https://www.virtualbox.org/manual/ch05.html#hdimagewrites).
> Will SBD fencing work with that type of storage?
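
As a sanity check - assuming the VirtualBox disk really is presented as one
and the same shared device on all three nodes - the sbd header should dump
the identical UUID everywhere. Roughly (illustrative only, assumes
passwordless sudo on the boxes):

$ for n in node-1 node-2 node-3; do ssh $n sudo sbd -d /dev/sdb1 dump; done
# compare the UUID lines from the three dumps
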
> 
> I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> $ vagrant up  # takes ~15 minutes
> 
> The setup brings up the nodes, installs the necessary packages, and
> prepares for the configuration of the pcs cluster.
> You can see which scripts the nodes execute at the bottom of the
> Vagrantfile.
> While sbd itself can be installed on CentOS7 with 'yum -y install sbd', the
> fence_sbd agent has not been packaged there yet.
> Therefore I rebuilt the Fedora 24 package using the latest
> https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> plus the update to the fence_sbd from
> https://github.com/ClusterLabs/fence-agents/pull/73
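
For anyone reproducing this without a prebuilt package, a rough sketch of
building the agent straight from the tarball plus that pull request (the
steps and paths are illustrative, not the exact rebuild used above):

$ curl -LO https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
$ tar xf v4.0.22.tar.gz && cd fence-agents-4.0.22
$ curl -L https://github.com/ClusterLabs/fence-agents/pull/73.patch | patch -p1
$ ./autogen.sh && ./configure && make
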
> 
> The configuration is inspired by
> https://www.novell.com/support/kb/doc.php?id=7009485 and
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> 
> Question 2:
> After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> expect that, with just one stonith resource configured,
> a node will be fenced when I stop pacemaker and corosync with `pcs cluster stop
> node-1`, or when I simply run `stonith_admin -F node-1`, but this is not the case.
> 
> As can be seen from the uptime output below, node-1 is not shut down by `pcs
> cluster stop node-1` executed on node-1 itself.
> I found some discussions on users at clusterlabs.org about whether a node
> running the SBD stonith resource can fence itself,
> but the conclusion was not clear to me.
> 

I am not familiar with pcs, but stopping the pacemaker services manually
makes the node leave the cluster in a controlled manner and does not result
in fencing, at least in my experience.
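
If the goal is to see fencing actually happen, it has to look like an
uncontrolled failure rather than a clean shutdown. Two illustrative ways to
simulate that (a sketch, adjust to your setup):

# on node-1: kill the cluster stack hard, so it cannot leave cleanly
killall -9 corosync

# or, from node-2: explicitly request a reboot-type fence
stonith_admin --reboot node-1
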

> Question 3:
> Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> despite the fact that
> /var/log/messages on node-2 (the node currently running MyStonith) reports:
> ...
> notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> 'node-1' with device 'MyStonith' returned: 0 (OK)
> ...
> What is happening here?
> 

Do you have the sbd daemon running? SBD is based on self-fencing - the only
thing the fence agent does is write a request to the shared disk asking the
other node to kill itself. The sbd daemon running on that node is expected
to respond to the request by committing suicide.
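
A quick way to check is to make sure the daemon is actually up on every node
and pointed at the shared device. Roughly (the sysconfig keys shown are the
usual ones, adjust to your setup):

# on each node: is the sbd daemon running?
systemctl status sbd
ps axf | grep '[s]bd'

# and /etc/sysconfig/sbd should reference the shared device, e.g.
SBD_DEVICE="/dev/sdb1"
SBD_WATCHDOG_DEV="/dev/watchdog"

If the daemon is not running, enabling sbd.service and restarting the
cluster stack is usually the first thing to try.
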

> Question 4 (for the future):
> Assuming node-1 was fenced, what is the proper way of operating SBD afterwards?
> The sbd slot list now shows:
> 0       node-3  clear
> 1       node-1  off    node-2
> 2       node-2  clear
> How do I clear the status of node-1?
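
For the record, a slot is normally reset by writing a "clear" message to it,
e.g. from any node that can see the device:

sbd -d /dev/sdb1 message node-1 clear
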
> 
> Question 5 (also for the future):
> While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented
> at
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> is clearly described, I wonder about the relation of 'stonith-timeout'
> to other timeouts like the 'monitor interval=60s' reported by `pcs stonith
> show MyStonith`.
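
For what it is worth, the numbers in this very setup already follow that
rule: the header dump below shows msgwait = 20 s, and 20 s x 1.2 = 24 s,
which is exactly the stonith-timeout set later on. The monitor interval=60s
is, as far as I understand, an independent knob - it only controls how often
Pacemaker runs the fence agent's status check, not how long a fence action
is allowed to take.
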
> 
> Here is how I configure the cluster and test it. The run.sh script is
> attached.
> 
> $ sh -x run01.sh 2>&1 | tee run01.txt
> 
> with the result:
> 
> $ cat run01.txt
> 
> Each block below shows the executed ssh command and the result.
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password
> node-1 node-2 node-3'
> node-1: Authorized
> node-3: Authorized
> node-2: Authorized
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1
> node-2 node-3'
> Shutting down pacemaker/corosync services...
> Redirecting to /bin/systemctl stop  pacemaker.service
> Redirecting to /bin/systemctl stop  corosync.service
> Killing any remaining services...
> Removing all cluster configuration files...
> node-1: Succeeded
> node-2: Succeeded
> node-3: Succeeded
> Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> node-1: Success
> node-3: Success
> node-2: Success
> Restaring pcsd on the nodes in order to reload the certificates...
> node-1: Success
> node-3: Success
> node-2: Success
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> node-3: Starting Cluster...
> node-2: Starting Cluster...
> node-1: Starting Cluster...
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> Printing ring status.
> Local node ID 1
> RING ID 0
>     id    = 192.168.10.11
>     status    = ring 0 active with no faults
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs status corosync'
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          1          1 node-1 (local)
>          2          1 node-2
>          3          1 node-3
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> WARNING: no stonith devices and stonith-enabled is not false
> Last updated: Sat Jun 25 15:40:51 2016        Last change: Sat Jun 25
> 15:40:33 2016 by hacluster via crmd on node-2
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 0 resources configured
> Online: [ node-1 node-2 node-3 ]
> Full list of resources:
> PCSD Status:
>   node-1: Online
>   node-2: Online
>   node-3: Online
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0    node-3    clear
> 1    node-2    clear
> 2    node-1    clear
> 
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> ==Dumping header on disk /dev/sdb1
> Header version     : 2.1
> UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> Number of slots    : 255
> Sector size        : 512
> Timeout (watchdog) : 10
> Timeout (allocate) : 2
> Timeout (loop)     : 1
> Timeout (msgwait)  : 20
> ==Header on disk /dev/sdb1 is dumped
> 
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith list'
> fence_sbd - Fence agent for sbd
> 
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd
> devices=/dev/sdb1 power_timeout=21 action=off'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> ssh node-1 -c sudo su - -c 'pcs property'
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: mycluster
>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>  have-watchdog: true
>  stonith-enabled: true
>  stonith-timeout: 24s
>  stonith-watchdog-timeout: 10s
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
>  Resource: MyStonith (class=stonith type=fence_sbd)
>   Attributes: devices=/dev/sdb1 power_timeout=21 action=off
>   Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> node-1: Stopping Cluster (pacemaker)...
> node-1: Stopping Cluster (corosync)...
> 
> 
> 
> ############################
> ssh node-2 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> Last updated: Sat Jun 25 15:42:29 2016        Last change: Sat Jun 25
> 15:41:09 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 1 resource configured
> Online: [ node-2 node-3 ]
> OFFLINE: [ node-1 ]
> Full list of resources:
>  MyStonith    (stonith:fence_sbd):    Started node-2
> PCSD Status:
>   node-1: Online
>   node-2: Online
>   node-3: Online
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> 
> 
> ############################
> ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '
> 
> 
> 
> ############################
> ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Additional logging
> available in /var/log/cluster/corosync.log
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Connecting to cluster
> infrastructure: corosync
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-2[2] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Watching for stonith
> topology changes
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Added 'watchdog' to
> the device list (1 active devices)
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-3[3] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-1[1] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: New watchdog timeout
> 10s (was 0s)
> Jun 25 15:41:03 localhost stonith-ng[3102]:  notice: Relying on watchdog
> integration for fencing
> Jun 25 15:41:04 localhost stonith-ng[3102]:  notice: Added 'MyStonith' to
> the device list (2 active devices)
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-1[1] - state is now lost (was member)
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Removing node-1/1 from
> the membership list
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Purged 1 peers with
> id=1 and/or uname=node-1 from the membership cache
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Client
> stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device
> '(any)'
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Initiating remote
> operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: MyStonith can fence
> (off) node-1: dynamic-list
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation 'off' [3309]
> (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith'
> returned: 0 (OK)
> Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation off of
> node-1 by node-2 for stonith_admin.3288 at node-2.848cd1e9: OK
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification
> (3102-3288-12): Broken pipe (32)
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence
> notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0    node-3    clear
> 1    node-2    clear
> 2    node-1    off    node-2
> 
> 
> 
> ############################
> ssh node-1 -c sudo su - -c 'uptime'
>  15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11
> 
> 
> 
> Cheers,
> 
> Marcin
> 
> 
> 



