<div dir="ltr"><div>Hi,<br></div>The problem was caused by a bad STONITH configuration. The configuration from my earlier message, quoted below, is an example of a working active/active NFS configuration.<br></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Regards,<br>Arek</div></div></div></div>
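<div>For anyone checking for the same symptom: the status output quoted below shows two signs that fencing did not complete, a node left in the UNCLEAN state and a failed monitor operation on the stonith resource. The following is only a minimal sketch, assuming the output of pcs status has been saved to a text file; the function names unclean_nodes and failed_fence_ops are hypothetical helpers, not part of pcs.</div>

```shell
#!/bin/sh
# Sketch: scan saved `pcs status` text (read from stdin) for two symptoms
# of broken fencing. Function names are illustrative, not part of pcs.

# Print the names of nodes reported as UNCLEAN, one per line.
unclean_nodes() {
    sed -n 's/^Node \([^:]*\): UNCLEAN.*/\1/p'
}

# Print failed operations that belong to a stonith resource.
# Stonith resources are recognised by "(stonith:" in the resource list;
# failed actions are the lines that start with "* ".
failed_fence_ops() {
    awk '
        /\(stonith:/ { fence[$1] = 1 }
        /^\* / {
            op = $2
            for (r in fence)
                if (index(op, r "_") == 1) print op
        }
    '
}

# Example use (status.txt is a hypothetical file name):
#   pcs status > status.txt
#   unclean_nodes < status.txt
#   failed_fence_ops < status.txt
```

<div>If either function prints anything, the fencing device is worth testing by hand before trusting a failover. On pcs versions that provide it, pcs stonith fence nfsnode2 asks the cluster to fence that node, so it should only be run on a node that may be power-cycled.</div>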

<br><div class="gmail_quote">2017-07-10 12:59 GMT+02:00 ArekW <span dir="ltr">&lt;<a href="mailto:arkaduis@gmail.com" target="_blank">arkaduis@gmail.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Hi,<br>I&#39;ve created a 2-node active-active HA cluster with an NFS resource. The resources are active on both nodes. The cluster passes a failover test with the pcs standby command but does not work when a &quot;real&quot; node shutdown occurs.<br><br>Test scenario with cluster standby:<br>- start the cluster<br>- mount the NFS share on client1<br>- start copying a file from client1 to the NFS share<br>- during the copy, put node1/node2 into standby mode (pcs cluster standby nfsnode2)<br>- the copy continues<br>- unstandby node1/node2<br>- the copy continues and the storage re-syncs (DRBD)<br>- the copy finishes with no errors<br><br>I can standby and unstandby the cluster many times and it works. The problem begins when I do a &quot;true&quot; failover test by hard-shutting down one of the nodes. Test results:<br>- start the cluster<br>- mount the NFS share on client1<br>- start copying a file from client1 to the NFS share<br>- during the copy, shut down node2 by stopping the node&#39;s virtual machine (hard stop)<br>- the system hangs:<br><br></div>&lt;Start copying the file at client1&gt;<br><div># rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/<br><br>&lt;everything works ok. 
There is temp file .testfile.dat.9780fH&gt;<br><br>[root@nfsnode1 nfs]# ls -lah<br>razem 9,8M<br>drwxr-xr-x 2 root root 3,8K 07-10 11:07 .<br>drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..<br>-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt<br>-rw-r----- 1 root root    0 07-10 11:07 .rmtab<br>-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH<br><br>[root@nfsnode1 nfs]# pcs status<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 11:07:29 2017          Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Online: [ nfsnode1 nfsnode2 ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Masters: [ nfsnode1 nfsnode2 ]<br> Clone Set: dlm-clone [dlm]<br>     Started: [ nfsnode1 nfsnode2 ]<br> vbox-fencing   (stonith:fence_vbox):   Started nfsnode1<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode2<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode1<br> Clone Set: StorageFS-clone [StorageFS]<br>     Started: [ nfsnode1 nfsnode2 ]<br> Clone Set: WebSite-clone [WebSite]<br>     Started: [ nfsnode1 nfsnode2 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Started: [ nfsnode1 nfsnode2 ]<br><br>&lt;Hard poweroff vm machine: nfsnode2&gt;<br><br>[root@nfsnode1 nfs]# pcs status<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 11:07:43 2017          Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Node nfsnode2: UNCLEAN (offline)<br>Online: [ nfsnode1 ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Storage    
(ocf::linbit:drbd):     Master nfsnode2 (UNCLEAN)<br>     Masters: [ nfsnode1 ]<br> Clone Set: dlm-clone [dlm]<br>     dlm        (ocf::pacemaker:controld):    <wbr>  Started nfsnode2 (UNCLEAN)<br>     Started: [ nfsnode1 ]<br> vbox-fencing   (stonith:fence_vbox):   Started nfsnode1<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode2 (UNCLEAN)<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode1<br> Clone Set: StorageFS-clone [StorageFS]<br>     StorageFS  (ocf::heartbeat:Filesystem):  <wbr>  Started nfsnode2 (UNCLEAN)<br>     Started: [ nfsnode1 ]<br> Clone Set: WebSite-clone [WebSite]<br>     WebSite    (ocf::heartbeat:apache):      <wbr>  Started nfsnode2 (UNCLEAN)<br>     Started: [ nfsnode1 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Resource Group: nfs-group:1<br>         nfs    (ocf::heartbeat:nfsserver):   <wbr>  Started nfsnode2 (UNCLEAN)<br>         nfs-export     (ocf::heartbeat:exportfs):    <wbr>  Started nfsnode2 (UNCLEAN)<br>     Started: [ nfsnode1 ]<br><br>&lt;ssh console hangs on client1&gt;<br>[root@nfsnode1 nfs]# ls -lah<br>&lt;nothing happen&gt;<br><br>&lt;drbd status is ok in this situation&gt;<br>[root@nfsnode1 ~]# drbdadm status<br>storage role:Primary<br>  disk:UpToDate<br>  nfsnode2 connection:Connecting<br><br>&lt;the nfs export is still active on node1&gt;<br>[root@nfsnode1 ~]# exportfs<br>/mnt/drbd/nfs   <a href="http://10.0.2.0/255.255.255.0" target="_blank">10.0.2.0/255.255.255.0</a><br><br>&lt;After ssh to client1 the nfs mount is not accessible&gt;<br>login as: root<br><a href="mailto:root@127.0.0.1" target="_blank">root@127.0.0.1</a>&#39;s password:<br>Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2<br># cd /mnt/<br># ls<br>&lt;console hangs&gt;<br><br># mount<br>10.0.2.7:/ on /mnt/nfsshare type nfs4 
(rw,relatime,vers=4.0,rsize=<wbr>131072,wsize=131072,namlen=<wbr>255,hard,proto=tcp,timeo=600,<wbr>retrans=2,sec=sys,clientaddr=<wbr>10.0.2.20,local_lock=none,<wbr>addr=10.0.2.7)<br><br>&lt;Power on the nfsnode2 VM&gt;<br>&lt;After nfsnode2 boots, the console on nfsnode1 starts responding again, but the copy does not proceed&gt;<br>&lt;The temp file is visible but not active&gt;<br>[root@nfsnode1 ~]# ls -lah<br>razem 9,8M<br>drwxr-xr-x 2 root root 3,8K 07-10 11:07 .<br>drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..<br>-rw-r--r-- 1 root root    9 07-10 08:20 client1.txt<br>-rw-r----- 1 root root    0 07-10 11:16 .rmtab<br>-rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH<br><br>&lt;The copy at client1 hangs&gt;<br><br>&lt;Cluster status:&gt;<br>[root@nfsnode1 ~]# pcs status<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 11:17:19 2017          Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Online: [ nfsnode1 nfsnode2 ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Masters: [ nfsnode1 ]<br>     Stopped: [ nfsnode2 ]<br> Clone Set: dlm-clone [dlm]<br>     Started: [ nfsnode1 nfsnode2 ]<br> vbox-fencing   (stonith:fence_vbox):   Started nfsnode1<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Stopped<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode1<br> Clone Set: StorageFS-clone [StorageFS]<br>     Started: [ nfsnode1 ]<br>     Stopped: [ nfsnode2 ]<br> Clone Set: WebSite-clone [WebSite]<br>     Started: [ nfsnode1 ]<br>     Stopped: [ nfsnode2 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Resource Group: nfs-group:0<br>         nfs    (ocf::heartbeat:nfsserver):   <wbr>  Started nfsnode1<br>         nfs-export     (ocf::heartbeat:exportfs):    
<wbr>  FAILED nfsnode1<br>     Stopped: [ nfsnode2 ]<br><br>Failed Actions:<br>* nfs-export_monitor_30000 on nfsnode1 &#39;unknown error&#39; (1): call=61, status=Timed Out, exitreason=&#39;none&#39;,<br>    last-rc-change=&#39;Mon Jul 10 11:11:50 2017&#39;, queued=0ms, exec=0ms<br>* vbox-fencing_monitor_60000 on nfsnode1 &#39;unknown error&#39; (1): call=22, status=Error, exitreason=&#39;none&#39;,<br>    last-rc-change=&#39;Mon Jul 10 11:06:41 2017&#39;, queued=0ms, exec=11988ms<br><br>&lt;Try to cleanup&gt;<br><br># pcs resource cleanup<br># pcs status<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 11:20:38 2017          Last change: Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Online: [ nfsnode1 nfsnode2 ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Masters: [ nfsnode1 ]<br>     Stopped: [ nfsnode2 ]<br> Clone Set: dlm-clone [dlm]<br>     Started: [ nfsnode1 nfsnode2 ]<br> vbox-fencing   (stonith:fence_vbox):   Stopped<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Stopped<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Stopped<br> Clone Set: StorageFS-clone [StorageFS]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br> Clone Set: WebSite-clone [WebSite]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br><br>Daemon Status:<br>  corosync: active/enabled<br>  pacemaker: active/enabled<br>  pcsd: active/enabled<br><br>&lt;Reboot of both nfsnode1 and nfsnode2&gt;<br>&lt;After reboot:&gt;<br><br># pcs status<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 11:24:10 2017          Last change: 
Mon Jul 10 10:28:12 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Online: [ nfsnode1 nfsnode2 ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Slaves: [ nfsnode2 ]<br>     Stopped: [ nfsnode1 ]<br> Clone Set: dlm-clone [dlm]<br>     Started: [ nfsnode1 nfsnode2 ]<br> vbox-fencing   (stonith:fence_vbox):   Stopped<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Stopped<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Stopped<br> Clone Set: StorageFS-clone [StorageFS]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br> Clone Set: WebSite-clone [WebSite]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Stopped: [ nfsnode1 nfsnode2 ]<br><br>Daemon Status:<br>  corosync: active/enabled<br>  pacemaker: active/enabled<br>  pcsd: active/enabled<br><br>&lt;Eventually the cluster was recovered with:&gt;<br>pcs cluster stop --all<br>&lt;Resolve the DRBD split-brain&gt;<br>pcs cluster start --all<br><br>client1 could not be rebooted with &#39;reboot&#39;, presumably because of the hung NFS mount. It had to be hard-reset from the VirtualBox hypervisor.<br>What&#39;s wrong with this configuration? 
I can send CIB configuration if necessary.<br><br>---------------<br>Full cluster configuration (working state):<br><br># pcs status --full<br>Cluster name: nfscluster<br>Stack: corosync<br>Current DC: nfsnode1 (1) (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum<br>Last updated: Mon Jul 10 12:44:03 2017          Last change: Mon Jul 10 11:37:13 2017 by root via crm_attribute on nfsnode1<br><br>2 nodes and 15 resources configured<br><br>Online: [ nfsnode1 (1) nfsnode2 (2) ]<br><br>Full list of resources:<br><br> Master/Slave Set: StorageClone [Storage]<br>     Storage    (ocf::linbit:drbd):     Master nfsnode1<br>     Storage    (ocf::linbit:drbd):     Master nfsnode2<br>     Masters: [ nfsnode1 nfsnode2 ]<br> Clone Set: dlm-clone [dlm]<br>     dlm        (ocf::pacemaker:controld):    <wbr>  Started nfsnode1<br>     dlm        (ocf::pacemaker:controld):    <wbr>  Started nfsnode2<br>     Started: [ nfsnode1 nfsnode2 ]<br> vbox-fencing   (stonith:fence_vbox):   Started nfsnode1<br> Clone Set: ClusterIP-clone [ClusterIP] (unique)<br>     ClusterIP:0        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode2<br>     ClusterIP:1        (ocf::heartbeat:IPaddr2):     <wbr>  Started nfsnode1<br> Clone Set: StorageFS-clone [StorageFS]<br>     StorageFS  (ocf::heartbeat:Filesystem):  <wbr>  Started nfsnode1<br>     StorageFS  (ocf::heartbeat:Filesystem):  <wbr>  Started nfsnode2<br>     Started: [ nfsnode1 nfsnode2 ]<br> Clone Set: WebSite-clone [WebSite]<br>     WebSite    (ocf::heartbeat:apache):      <wbr>  Started nfsnode1<br>     WebSite    (ocf::heartbeat:apache):      <wbr>  Started nfsnode2<br>     Started: [ nfsnode1 nfsnode2 ]<br> Clone Set: nfs-group-clone [nfs-group]<br>     Resource Group: nfs-group:0<br>         nfs    (ocf::heartbeat:nfsserver):   <wbr>  Started nfsnode1<br>         nfs-export     (ocf::heartbeat:exportfs):    <wbr>  Started nfsnode1<br>     Resource Group: nfs-group:1<br>         nfs    (ocf::heartbeat:nfsserver):   <wbr>  
Started nfsnode2<br>         nfs-export     (ocf::heartbeat:exportfs):    <wbr>  Started nfsnode2<br>     Started: [ nfsnode1 nfsnode2 ]<br><br>Node Attributes:<br>* Node nfsnode1 (1):<br>    + master-Storage                <wbr>    : 10000<br>* Node nfsnode2 (2):<br>    + master-Storage                <wbr>    : 10000<br><br>Migration Summary:<br>* Node nfsnode1 (1):<br>* Node nfsnode2 (2):<br><br>PCSD Status:<br>  nfsnode1: Online<br>  nfsnode2: Online<br><br>Daemon Status:<br>  corosync: active/enabled<br>  pacemaker: active/enabled<br>  pcsd: active/enabled<br><br>]# pcs resource --full<br> Master: StorageClone<br>  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2 clone-node-max=1<br>  Resource: Storage (class=ocf provider=linbit type=drbd)<br>   Attributes: drbd_resource=storage<br>   Operations: start interval=0s timeout=240 (Storage-start-interval-0s)<br>               promote interval=0s timeout=90 (Storage-promote-interval-0s)<br>               demote interval=0s timeout=90 (Storage-demote-interval-0s)<br>               stop interval=0s timeout=100 (Storage-stop-interval-0s)<br>               monitor interval=60s (Storage-monitor-interval-60s)<br> Clone: dlm-clone<br>  Meta Attrs: clone-max=2 clone-node-max=1<br>  Resource: dlm (class=ocf provider=pacemaker type=controld)<br>   Operations: start interval=0s timeout=90 (dlm-start-interval-0s)<br>               stop interval=0s timeout=100 (dlm-stop-interval-0s)<br>               monitor interval=60s (dlm-monitor-interval-60s)<br> Clone: ClusterIP-clone<br>  Meta Attrs: clona-node-max=2 clone-max=2 globally-unique=true clone-node-max=2<br>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)<br>   Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip<br>   Meta Attrs: resource-stickiness=0<br>   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)<br>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)<br>               monitor 
interval=5s (ClusterIP-monitor-interval-<wbr>5s)<br> Clone: StorageFS-clone<br>  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)<br>   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2<br>   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)<br>               stop interval=0s timeout=60 (StorageFS-stop-interval-0s)<br>               monitor interval=20 timeout=40 (StorageFS-monitor-interval-<wbr>20)<br> Clone: WebSite-clone<br>  Resource: WebSite (class=ocf provider=heartbeat type=apache)<br>   Attributes: configfile=/etc/httpd/conf/<wbr>httpd.conf statusurl=<a href="http://localhost/server-status" target="_blank">http://localhost/<wbr>server-status</a><br>   Operations: start interval=0s timeout=40s (WebSite-start-interval-0s)<br>               stop interval=0s timeout=60s (WebSite-stop-interval-0s)<br>               monitor interval=1min (WebSite-monitor-interval-<wbr>1min)<br> Clone: nfs-group-clone<br>  Meta Attrs: interleave=true<br>  Group: nfs-group<br>   Resource: nfs (class=ocf provider=heartbeat type=nfsserver)<br>    Attributes: nfs_ip=10.0.2.7 nfs_no_notify=true<br>    Operations: start interval=0s timeout=40 (nfs-start-interval-0s)<br>                stop interval=0s timeout=20s (nfs-stop-interval-0s)<br>                monitor interval=30s (nfs-monitor-interval-30s)<br>   Resource: nfs-export (class=ocf provider=heartbeat type=exportfs)<br>    Attributes: clientspec=<a href="http://10.0.2.0/255.255.255.0" target="_blank">10.0.2.0/255.255.<wbr>255.0</a> options=rw,sync,no_root_squash directory=/mnt/drbd/nfs fsid=0<br>    Operations: start interval=0s timeout=40 (nfs-export-start-interval-0s)<br>                stop interval=0s timeout=120 (nfs-export-stop-interval-0s)<br>                monitor interval=30s (nfs-export-monitor-interval-<wbr>30s)<br><br># pcs constraint --full<br>Location Constraints:<br>Ordering Constraints:<br>  start ClusterIP-clone then start WebSite-clone (kind:Mandatory) 
(id:order-ClusterIP-WebSite-<wbr>mandatory)<br>  promote StorageClone then start StorageFS-clone (kind:Mandatory) (id:order-StorageClone-<wbr>StorageFS-mandatory)<br>  start StorageFS-clone then start WebSite-clone (kind:Mandatory) (id:order-StorageFS-WebSite-<wbr>mandatory)<br>  start dlm-clone then start StorageFS-clone (kind:Mandatory) (id:order-dlm-clone-StorageFS-<wbr>mandatory)<br>  start StorageFS-clone then start nfs-group-clone (kind:Mandatory) (id:order-StorageFS-clone-nfs-<wbr>group-clone-mandatory)<br>Colocation Constraints:<br>  WebSite-clone with ClusterIP-clone (score:INFINITY) (id:colocation-WebSite-<wbr>ClusterIP-INFINITY)<br>  StorageFS-clone with StorageClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-StorageFS-<wbr>StorageClone-INFINITY)<br>  WebSite-clone with StorageFS-clone (score:INFINITY) (id:colocation-WebSite-<wbr>StorageFS-INFINITY)<br>  StorageFS-clone with dlm-clone (score:INFINITY) (id:colocation-StorageFS-dlm-<wbr>clone-INFINITY)<br>  StorageFS-clone with nfs-group-clone (score:INFINITY) (id:colocation-StorageFS-<wbr>clone-nfs-group-clone-<wbr>INFINITY)<br><br></div></div>

</blockquote></div><br></div>
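<div>The recovery sequence in the quoted message (stop the cluster, resolve the DRBD split-brain, start the cluster) does not say how the split-brain was resolved. The following is a print-only sketch of the commonly documented manual procedure, not necessarily what was done here. It assumes the administrator has decided which node is the "victim" whose changes are discarded; print_recovery_plan is a hypothetical helper, and the drbdadm up step is an assumption based on the cluster being stopped at that point.</div>

```shell
#!/bin/sh
# Print-only sketch of a manual DRBD split-brain recovery. It runs nothing;
# it only lists the commands and the node each one would be run on.
# Assumption: the administrator chooses the victim (data on it is discarded)
# and the survivor (data on it is kept).

print_recovery_plan() {
    resource=$1   # DRBD resource name, e.g. "storage"
    victim=$2     # node whose changes will be thrown away
    survivor=$3   # node whose data is kept

    echo "any node: pcs cluster stop --all"
    # With the cluster stopped, DRBD is down, so bring it up by hand.
    echo "$victim: drbdadm up $resource"
    echo "$survivor: drbdadm up $resource"
    # On the victim: drop the connection, demote, reconnect discarding data.
    echo "$victim: drbdadm disconnect $resource"
    echo "$victim: drbdadm secondary $resource"
    echo "$victim: drbdadm connect --discard-my-data $resource"
    # On the survivor: reconnect so the victim can resync from it.
    echo "$survivor: drbdadm connect $resource"
    echo "any node: pcs cluster start --all"
}

# Example: print_recovery_plan storage nfsnode2 nfsnode1
```

<div>Printing the plan instead of executing it is deliberate: --discard-my-data throws away the victim's changes, so each step should be checked against drbdadm status before it is run, and the cluster should only be started again once the resync has finished.</div>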