
4.3.4. Testing Recovery and Fencing

Pacemaker’s policy engine is smart enough to know that fencing a guest node means shutting off or rebooting the underlying virtual machine. No special configuration is necessary to make this happen. If you want to test this functionality, try stopping the guest’s pacemaker_remote daemon. This is the equivalent of abruptly terminating a cluster node’s corosync membership without properly shutting it down.
SSH into the guest and run this command:
# kill -9 `pidof pacemaker_remoted`
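If you would rather watch recovery happen in real time instead of re-running pcs status, the crm_mon tool on the cluster host displays a continuously refreshing view of the same status information; this is only a convenience and not part of the original walk-through.
# crm_mon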
Within a few seconds, your pcs status output will show a monitor failure, and the guest1 node will not be shown while it is being recovered.
# pcs status
Cluster name: mycluster
Last updated: Fri Oct  9 18:08:35 2015          Last change: Fri Oct  9 18:07:00 2015 by root via cibadmin on example-host
Stack: corosync
Current DC: example-host (version 1.1.13-a14efad) - partition with quorum
2 nodes and 7 resources configured

Online: [ example-host ]

Full list of resources:

 vm-guest1      (ocf::heartbeat:VirtualDomain): Started example-host
 FAKE1  (ocf::pacemaker:Dummy): Stopped
 FAKE2  (ocf::pacemaker:Dummy): Stopped
 FAKE3  (ocf::pacemaker:Dummy): Stopped
 FAKE4  (ocf::pacemaker:Dummy): Started example-host
 FAKE5  (ocf::pacemaker:Dummy): Started example-host

Failed Actions:
* guest1_monitor_30000 on example-host 'unknown error' (1): call=8, status=Error, exitreason='none',
    last-rc-change='Fri Oct  9 18:08:29 2015', queued=0ms, exec=0ms


PCSD Status:
  example-host: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Note

A guest node involves two resources: the one you explicitly configured creates the guest, and Pacemaker creates an implicit resource for the pacemaker_remote connection, which will be named the same as the value of the remote-node attribute of the explicit resource. When we killed pacemaker_remote, it was the implicit resource that failed, which is why the failed action starts with guest1 and not vm-guest1.
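If you want to see where the guest1 name comes from, inspect the explicit resource’s configuration. The command below is a suggestion using the pcs 0.9-era syntax assumed throughout this example; its output should include a remote-node=guest1 meta attribute on vm-guest1.
# pcs resource show vm-guest1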
Once recovery of the guest is complete, you’ll see it automatically get re-integrated into the cluster. The final pcs status output should look something like this.
Cluster name: mycluster
Last updated: Fri Oct  9 18:18:30 2015          Last change: Fri Oct  9 18:07:00 2015 by root via cibadmin on example-host
Stack: corosync
Current DC: example-host (version 1.1.13-a14efad) - partition with quorum
2 nodes and 7 resources configured

Online: [ example-host ]
GuestOnline: [ guest1@example-host ]

Full list of resources:

 vm-guest1      (ocf::heartbeat:VirtualDomain): Started example-host
 FAKE1  (ocf::pacemaker:Dummy): Started guest1
 FAKE2  (ocf::pacemaker:Dummy): Started guest1
 FAKE3  (ocf::pacemaker:Dummy): Started guest1
 FAKE4  (ocf::pacemaker:Dummy): Started example-host
 FAKE5  (ocf::pacemaker:Dummy): Started example-host

Failed Actions:
* guest1_monitor_30000 on example-host 'unknown error' (1): call=8, status=Error, exitreason='none',
    last-rc-change='Fri Oct  9 18:08:29 2015', queued=0ms, exec=0ms


PCSD Status:
  example-host: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
Normally, once you’ve investigated and addressed a failed action, you can clear the failure. However, Pacemaker does not yet support cleanup for the implicitly created connection resource while the explicit resource is active. If you want to clear the failed action from the status output, stop the guest resource before clearing it. For example:
# pcs resource disable vm-guest1 --wait
# pcs resource cleanup guest1
# pcs resource enable vm-guest1
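As a final check (not part of the original example), running pcs status once more should show vm-guest1 and guest1 back online with nothing listed under Failed Actions.
# pcs status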