
9.3.2. Moving Resources Due to Failure

Normally, if a running resource fails, Pacemaker will try to start it again on the same node. However, if a resource fails repeatedly, there may be an underlying problem on that node, and you might prefer to try a different node in that case.
Pacemaker allows you to set your preference via the migration-threshold resource option. [16]
Simply define migration-threshold=N for a resource and it will migrate to a new node after N failures. There is no threshold defined by default. To determine the resource’s current failure status and limits, run crm_mon --failcounts.
By default, once the threshold has been reached, the troublesome node will no longer be allowed to run the failed resource until the administrator manually resets the resource’s failcount using crm_failcount (ideally after first fixing the cause of the failure). Alternatively, failures can be made to expire automatically by setting the failure-timeout option for the resource.
For example, a setting of migration-threshold=2 and failure-timeout=60s would cause the resource to move to a new node after 2 failures, and allow it to move back (depending on stickiness and constraint scores) after one minute.
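As a minimal sketch of the example above, the two options could be set as meta-attributes on the resource in the CIB. The resource ID my-rsc, the nvpair IDs, and the ocf:pacemaker:Dummy agent are placeholders chosen for illustration and are not taken from this section.

```xml
<!-- Hypothetical resource: the IDs and the agent are placeholders -->
<primitive id="my-rsc" class="ocf" provider="pacemaker" type="Dummy">
  <meta_attributes id="my-rsc-meta_attributes">
    <!-- Move the resource to another node after 2 failures on this node -->
    <nvpair id="my-rsc-migration-threshold" name="migration-threshold" value="2"/>
    <!-- Treat recorded failures as expired after 60 seconds -->
    <nvpair id="my-rsc-failure-timeout" name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
```

Whether the resource actually returns to the original node once the failures expire still depends on stickiness and constraint scores, as noted above.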
There are two exceptions to the migration threshold concept: when a resource either fails to start or fails to stop.
If the cluster property start-failure-is-fatal is set to true (which is the default), start failures cause the failcount to be set to INFINITY and thus always cause the resource to move immediately.
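If start failures should instead be weighed against the resource's failcount and migration-threshold like other failures, the property can be set to false. A minimal sketch of the corresponding CIB fragment follows. The nvpair ID is a placeholder, and cib-bootstrap-options is assumed as the property set ID only because it is the conventional one.

```xml
<crm_config>
  <!-- Assumed property set ID; an existing cluster may use a different one -->
  <cluster_property_set id="cib-bootstrap-options">
    <!-- Placeholder nvpair ID; with "false", a start failure no longer
         sets the failcount to INFINITY -->
    <nvpair id="opt-start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
  </cluster_property_set>
</crm_config>
```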
Stop failures are handled differently, and their handling is crucial. If a resource fails to stop and STONITH is enabled, the cluster will fence the node so that it can start the resource elsewhere. If STONITH is not enabled, the cluster has no way to continue and will not try to start the resource elsewhere, but it will try to stop the resource again after the failure timeout.

Important

Please read Section 8.7, “Ensuring Time-Based Rules Take Effect” to understand how timeouts work before configuring a failure-timeout.


[16] The naming of this option was perhaps unfortunate as it is easily confused with live migration, the process of moving a resource from one node to another without stopping it. Xen virtual guests are the most common example of resources that can be migrated in this manner.