9.3.2. Moving Resources Due to Failure
Normally, if a running resource fails, Pacemaker will try to start it again on the same node. However, if a resource fails repeatedly, there may be an underlying problem on that node, and you might prefer to try a different node in such a case.
Pacemaker allows you to set your preference via the migration-threshold resource option.
Simply define migration-threshold=N for a resource and it will migrate to a new node after N failures. There is no threshold defined by default. To determine the resource’s current failure status and limits, run crm_mon --failcounts.
By default, once the threshold has been reached, the troublesome node will no longer be allowed to run the failed resource until the administrator manually resets the resource’s failcount using crm_failcount (ideally after first fixing the cause of the failure). Alternatively, failures can be made to expire automatically by setting the failure-timeout option for the resource.
For example, a setting of migration-threshold=2 and failure-timeout=60s would cause the resource to move to a new node after 2 failures, and allow it to move back (depending on stickiness and constraint scores) after one minute.
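As a rough illustration of how the two options interact, the following Python sketch models a single resource on a single node. It is not Pacemaker code: the FailureTracker class and its method names are hypothetical, and the sketch assumes that the whole failcount expires once failure-timeout seconds have passed since the most recent failure. Stickiness and constraint scores are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Toy model of one resource's failures on one node (hypothetical, not Pacemaker code)."""
    migration_threshold: Optional[int] = None  # None: no threshold (the default)
    failure_timeout: Optional[float] = None    # seconds; None: failures never expire
    count: int = 0
    last_failure: Optional[float] = None

    def _expired(self, now: float) -> bool:
        # Assumption of this sketch: the count expires as a whole once
        # failure_timeout seconds have passed since the most recent failure.
        return (
            self.failure_timeout is not None
            and self.last_failure is not None
            and now - self.last_failure >= self.failure_timeout
        )

    def record_failure(self, now: float) -> None:
        if self._expired(now):
            self.count = 0
        self.count += 1
        self.last_failure = now

    def failcount(self, now: float) -> int:
        return 0 if self._expired(now) else self.count

    def node_allowed(self, now: float) -> bool:
        """True while the node may still run the resource."""
        if self.migration_threshold is None:
            return True
        return self.failcount(now) < self.migration_threshold


# migration-threshold=2 and failure-timeout=60s, with times in seconds
tracker = FailureTracker(migration_threshold=2, failure_timeout=60.0)
tracker.record_failure(now=0.0)
print(tracker.node_allowed(now=1.0))    # True: one failure, below the threshold
tracker.record_failure(now=10.0)
print(tracker.node_allowed(now=11.0))   # False: threshold reached, resource moves
print(tracker.node_allowed(now=70.0))   # True: failures expired, node eligible again
```

In this model, "eligible again" only means the node is no longer ruled out; whether the resource actually moves back is a separate decision that the sketch does not attempt to cover.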
There are two exceptions to the migration threshold concept: when a resource either fails to start or fails to stop.
If the cluster property start-failure-is-fatal is set to true (which is the default), start failures cause the failcount to be set to INFINITY and thus always cause the resource to move immediately.
Stop failures are handled differently, and the difference is crucial. If a resource fails to stop and STONITH is enabled, the cluster will fence the node so that it can start the resource elsewhere. If STONITH is not enabled, the cluster has no way to continue: it will not try to start the resource elsewhere, but will try to stop it again after the failure timeout.
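The two exceptions can be summarized as a small decision function. This is a hedged Python sketch of the behavior described above, not Pacemaker's implementation: the Recovery enum, the function names, and the use of math.inf as a stand-in for INFINITY are all assumptions made for illustration.

```python
import math
from enum import Enum

# Stand-in for Pacemaker's INFINITY score (an assumption of this sketch).
INFINITY = math.inf


class Recovery(Enum):
    RESTART_ON_SAME_NODE = "restart on the same node"
    MOVE_TO_OTHER_NODE = "move to another node"
    FENCE_THEN_MOVE = "fence the node, then start elsewhere"
    BLOCKED_RETRY_STOP = "stay blocked; retry the stop after the failure timeout"


def next_failcount(failed_op, failcount, start_failure_is_fatal=True):
    """Failcount after a failed start or monitor operation."""
    if failed_op == "start" and start_failure_is_fatal:
        return INFINITY
    return failcount + 1


def recovery_for(failed_op, failcount, migration_threshold=None,
                 start_failure_is_fatal=True, stonith_enabled=True):
    """Pick the recovery described in the text for one failed operation."""
    # Stop failures bypass the threshold logic entirely in this sketch.
    if failed_op == "stop":
        if stonith_enabled:
            return Recovery.FENCE_THEN_MOVE
        return Recovery.BLOCKED_RETRY_STOP

    new_count = next_failcount(failed_op, failcount, start_failure_is_fatal)
    # A failcount of INFINITY forces a move even when no threshold is set.
    if new_count == INFINITY:
        return Recovery.MOVE_TO_OTHER_NODE
    if migration_threshold is not None and new_count >= migration_threshold:
        return Recovery.MOVE_TO_OTHER_NODE
    return Recovery.RESTART_ON_SAME_NODE


print(recovery_for("monitor", 0, migration_threshold=2).value)
print(recovery_for("start", 0).value)
print(recovery_for("stop", 0, stonith_enabled=False).value)
```

Handling the stop case first reflects the point made above: a failed stop is not a matter of counting failures, because the cluster cannot safely start the resource anywhere else until the node has been fenced or the stop succeeds.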