9.3.2. Failure Response

Normally, if a running resource fails, pacemaker will try to stop it and start it again. Pacemaker will choose the best location to start it each time, which may be the same node that it failed on.

However, if a resource fails repeatedly, it is possible that there is an underlying problem on that node, and you might desire trying a different node in such a case. Pacemaker allows you to set your preference via the migration-threshold resource meta-attribute. ^[16]

If you define migration-threshold=N for a resource, it will be banned from the original node after N failures.

Note

The migration-threshold is per resource, even though fail counts are tracked per operation. The operation fail counts are added together to compare against the migration-threshold.

By default, fail counts remain until manually cleared by an administrator using crm_resource --cleanup or crm_failcount --delete (hopefully after first fixing the failure’s cause). It is possible to have fail counts expire automatically by setting the failure-timeout resource meta-attribute.

Important

A successful operation does not clear past failures. If a recurring monitor operation fails once, succeeds many times, then fails again days later, its fail count is 2. Fail counts are cleared only by manual intervention or falure timeout.

For example, a setting of migration-threshold=2 and failure-timeout=60s would cause the resource to move to a new node after 2 failures, and allow it to move back (depending on stickiness and constraint scores) after one minute.

Note

failure-timeout is measured since the most recent failure. That is, older failures do not individually time out and lower the fail count. Instead, all failures are timed out simultaneously (and the fail count is reset to 0) if there is no new failure for the timeout period.

There are two exceptions to the migration threshold concept: when a resource either fails to start or fails to stop.

If the cluster property start-failure-is-fatal is set to true (which is the default), start failures cause the fail count to be set to INFINITY and thus always cause the resource to move immediately.

Stop failures are slightly different and crucial. If a resource fails to stop and STONITH is enabled, then the cluster will fence the node in order to be able to start the resource elsewhere. If STONITH is not enabled, then the cluster has no way to continue and will not try to start the resource elsewhere, but will try to stop it again after the failure timeout.

Important

Please read Section 8.7, “Ensuring Time-Based Rules Take Effect” to understand how timeouts work before configuring a failure-timeout.

^[16] The naming of this option was perhaps unfortunate as it is easily confused with live migration, the process of moving a resource from one node to another without stopping it. Xen virtual guests are the most common example of resources that can be migrated in this manner.