3. Cluster-Wide Configuration

3.1. Configuration Layout

The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this:

An empty configuration

<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>

The empty configuration above contains the major sections that make up a CIB:

  • cib: The entire CIB is enclosed in a cib element. Certain fundamental settings are defined as attributes of this element.

    • configuration: This section – the primary focus of this document – contains traditional configuration information such as what resources the cluster serves and the relationships among them.

      • crm_config: cluster-wide configuration options

      • nodes: the machines that host the cluster

      • resources: the services run by the cluster

      • constraints: indications of how resources should be placed

    • status: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way.

In this document, configuration settings will be described as properties or options based on how they are defined in the CIB:

  • Properties are XML attributes of an XML element.

  • Options are name-value pairs expressed as nvpair child elements of an XML element.

Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak.

Options can appear within four types of enclosing elements:

  • cluster_property_set

  • instance_attributes

  • meta_attributes

  • utilization

We will refer to a set of options and its enclosing element as a block.
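
As a minimal sketch of a block, the following shows where properties and options sit in the XML. The id values and the choice of maintenance-mode are illustrative assumptions, not required names.

```xml
<!-- id and score are properties: XML attributes of the enclosing element -->
<cluster_property_set id="example-options" score="10">
  <!-- maintenance-mode is an option: a name-value pair in an nvpair child -->
  <nvpair id="example-options-maintenance-mode"
          name="maintenance-mode" value="false"/>
</cluster_property_set>
```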

Properties of an Option Block’s Enclosing Element

Name     Type     Default   Description
id       id                 A unique name for the block (required)
score    score    0         Priority with which to process the block

Each block may optionally contain a rule.

3.2. Option Precedence

This subsection describes the precedence of options within a set of blocks and within a single block.

Options are processed as follows:

  • All option blocks of a given type are processed in order of their score attribute, from highest to lowest. For cluster_property_set, if there is a block whose enclosing element has id="cib-bootstrap-options", then that block is always processed first regardless of score.

  • If a block contains a rule that evaluates to false, that block is skipped.

  • Within a block, options are processed in order from first to last.

  • The first value found for a given option is applied, and the rest are ignored.

Note that this means it is pointless to configure the same option twice in a single block, because occurrences after the first one would be ignored.

For example, in the following configuration snippet, the no-quorum-policy value demote is applied. property-set2 has a higher score than property-set1, so it’s processed first. There are no rules in this snippet, so both sets are processed. Within property-set2, the value demote appears first, so the later value freeze is ignored. We’ve already found a value for no-quorum-policy before we begin processing property-set1, so its value stop is ignored.

<cluster_property_set id="property-set1" score="500">
  <nvpair id="no-quorum-policy1" name="no-quorum-policy" value="stop"/>
</cluster_property_set>
<cluster_property_set id="property-set2" score="1000">
  <nvpair id="no-quorum-policy2a" name="no-quorum-policy" value="demote"/>
  <nvpair id="no-quorum-policy2b" name="no-quorum-policy" value="freeze"/>
</cluster_property_set>
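
The cib-bootstrap-options special case can be sketched the same way. In the hypothetical snippet below, the ids other than cib-bootstrap-options and all option values are assumptions chosen for illustration. The block with id="cib-bootstrap-options" has no score attribute, so its score defaults to 0, yet it is still processed before property-set3. The value stop is therefore applied for no-quorum-policy, while batch-limit takes the value 30 because property-set3 is the only block that sets it.

```xml
<crm_config>
  <!-- Always processed first, regardless of score -->
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="bootstrap-no-quorum-policy" name="no-quorum-policy" value="stop"/>
  </cluster_property_set>
  <!-- Higher score, but processed after cib-bootstrap-options -->
  <cluster_property_set id="property-set3" score="1000">
    <!-- Ignored: a value for no-quorum-policy was already found -->
    <nvpair id="set3-no-quorum-policy" name="no-quorum-policy" value="freeze"/>
    <!-- Applied: no earlier block sets batch-limit -->
    <nvpair id="set3-batch-limit" name="batch-limit" value="30"/>
  </cluster_property_set>
</crm_config>
```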

3.3. CIB Properties

Certain settings are defined by CIB properties (that is, attributes of the cib tag) rather than with the rest of the cluster configuration in the configuration section.

The reason is simply a matter of parsing. These properties are used by the configuration database, which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location.

CIB Properties

Name

Type

Default

Description

admin_epoch

nonnegative integer

0

When a node joins the cluster, the cluster asks the node with the highest (admin_epoch, epoch, num_updates) tuple to replace the configuration on all the nodes – which makes setting them correctly very important. admin_epoch is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete.

epoch

nonnegative integer

0

The cluster increments this every time the CIB’s configuration section is updated.

num_updates

nonnegative integer

0

The cluster increments this every time the CIB’s configuration or status sections are updated, and resets it to 0 when epoch changes.

validate-with

enumeration

Determines the type of XML validation that will be done on the configuration. Allowed values are none (in which case the cluster will not require that updates conform to expected syntax) and the base names of schema files installed on the local machine (for example, “pacemaker-3.9”).

remote-tls-port

port

If set, the CIB manager will listen for anonymously encrypted remote connections on this port, to allow CIB administration from hosts not in the cluster. No key is used, so this should be used only on a protected network where man-in-the-middle attacks can be avoided.

remote-clear-port

port

If set to a TCP port number, the CIB manager will listen for remote connections on this port, to allow for CIB administration from hosts not in the cluster. No encryption is used, so this should be used only on a protected network.

cib-last-written

date/time

Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only.

have-quorum

boolean

Indicates whether the cluster has quorum. If false, the cluster’s response is determined by no-quorum-policy (see below). Maintained by the cluster.

dc-uuid

text

Node ID of the cluster’s current designated controller (DC). Used and maintained by the cluster.

execution-date

epoch time

Time to use when evaluating rules.
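
To illustrate how the version tuple is compared, consider a hypothetical copy of the CIB whose admin_epoch has been raised by an administrator. The specific numbers are assumptions for illustration. The tuple is compared element by element, starting with admin_epoch, so this copy, with (1, 5, 0), outranks a copy with (0, 100, 7) despite its lower epoch.

```xml
<!-- Tuple (admin_epoch, epoch, num_updates) = (1, 5, 0) -->
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5"
     admin_epoch="1" epoch="5" num_updates="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>
```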

3.4. Cluster Options

Cluster options, as you might expect, control how the cluster behaves when confronted with various situations.

They are grouped into sets within the crm_config section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on Rules where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once.

You can obtain an up-to-date list of cluster options, including their default values, by running the man pacemaker-schedulerd and man pacemaker-controld commands.
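
For the simple case, a crm_config section with a single set might look like the following sketch. The cluster name and the nvpair ids are hypothetical. The option names come from the table below, and the values shown for no-quorum-policy and symmetric-cluster are their documented defaults.

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <!-- Hypothetical cluster name, for illustration only -->
    <nvpair id="cib-bootstrap-options-cluster-name"
            name="cluster-name" value="example-cluster"/>
    <nvpair id="cib-bootstrap-options-no-quorum-policy"
            name="no-quorum-policy" value="stop"/>
    <nvpair id="cib-bootstrap-options-symmetric-cluster"
            name="symmetric-cluster" value="true"/>
  </cluster_property_set>
</crm_config>
```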

Cluster Options

Name

Type

Default

Description

cluster-name

text

An (optional) name for the cluster as a whole. This is mostly for users’ convenience for use as desired in administration, but can be used in the Pacemaker configuration in Rules (as the #cluster-name node attribute). It may also be used by higher-level tools when displaying cluster information, and by certain resource agents (for example, the ocf:heartbeat:GFS2 agent stores the cluster name in filesystem meta-data).

dc-version

version

detected

Version of Pacemaker on the cluster’s designated controller (DC). Maintained by the cluster, and intended for diagnostic purposes.

cluster-infrastructure

text

detected

The messaging layer with which Pacemaker is currently running. Maintained by the cluster, and intended for informational and diagnostic purposes.

no-quorum-policy

enumeration

stop

What to do when the cluster does not have quorum. Allowed values:

  • ignore: continue all resource management

  • freeze: continue resource management, but don’t recover resources from nodes not in the affected partition

  • stop: stop all resources in the affected cluster partition

  • demote: demote promotable resources and stop all other resources in the affected cluster partition (since 2.0.5)

  • fence: fence all nodes in the affected cluster partition (since 2.1.9)

  • suicide: same as fence (deprecated since 2.1.9)

batch-limit

integer

0

The maximum number of actions that the cluster may execute in parallel across all nodes. The ideal value will depend on the speed and load of your network and cluster nodes. If zero, the cluster will impose a dynamically calculated limit only when any node has high load. If -1, the cluster will not impose any limit.

migration-limit

integer

-1

The number of live migration actions that the cluster is allowed to execute in parallel on a node. A value of -1 means unlimited.

load-threshold

percentage

80%

Maximum amount of system load that should be used by cluster nodes. The cluster will slow down its recovery process when the amount of system resources used (currently CPU) approaches this limit.

node-action-limit

integer

0

Maximum number of jobs that can be scheduled per node. If nonpositive or invalid, double the number of cores is used as the maximum number of jobs per node. PCMK_node_action_limit overrides this option on a per-node basis.

symmetric-cluster

boolean

true

If true, resources can run on any node by default. If false, a resource is allowed to run on a node only if a location constraint enables it.

stop-all-resources

boolean

false

Whether all resources should be disallowed from running (can be useful during maintenance or troubleshooting)

stop-orphan-resources

boolean

true

Whether resources that have been deleted from the configuration should be stopped. This value takes precedence over is-managed (that is, even unmanaged resources will be stopped when orphaned if this value is true).

stop-orphan-actions

boolean

true

Whether recurring operations that have been deleted from the configuration should be cancelled

start-failure-is-fatal

boolean

true

Whether a failure to start a resource on a particular node prevents further start attempts on that node. If false, the cluster will decide whether the node is still eligible based on the resource’s current failure count and migration-threshold.

enable-startup-probes

boolean

true

Whether the cluster should check the pre-existing state of resources when the cluster starts

maintenance-mode

boolean

false

If true, the cluster will not start or stop any resource in the cluster, and any recurring operations (except those specifying role as Stopped) will be paused. When true, this option overrides the maintenance node attribute, the is-managed and maintenance resource meta-attributes, and the enabled operation meta-attribute.

stonith-enabled

boolean

true

Whether the cluster is allowed to fence nodes (for example, failed nodes and nodes with resources that can’t be stopped).

If true, at least one fence device must be configured before resources are allowed to run.

If false, unresponsive nodes are immediately assumed to be running no resources, and resource recovery on online nodes starts without any further protection (which can mean data loss if the unresponsive node still accesses shared storage, for example). See also the requires resource meta-attribute.

This option applies only to fencing scheduled by the cluster, not to requests initiated externally (such as with the stonith_admin command-line tool).

stonith-action

enumeration

reboot

Action the cluster should send to the fence agent when a node must be fenced. Allowed values are reboot and off.

stonith-timeout

duration

60s

How long to wait for on, off, and reboot fence actions to complete by default.

stonith-max-attempts

score

10

How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it. Any value below 1 will be ignored, and the default will be used instead.

have-watchdog

boolean

detected

Whether watchdog integration is enabled. This is set automatically by the cluster according to whether SBD is detected to be in use. User-configured values are ignored. The value true is meaningful if diskless SBD is used and stonith-watchdog-timeout is nonzero. In that case, if fencing is required, watchdog-based self-fencing will be performed via SBD without requiring an explicitly configured fencing resource.

stonith-watchdog-timeout

timeout

0

If nonzero, and the cluster detects have-watchdog as true, then watchdog-based self-fencing will be performed via SBD when fencing is required.

If this is set to a positive value, lost nodes are assumed to achieve self-fencing within this much time.

This does not require a fencing resource to be explicitly configured, though a fence_watchdog resource can be configured to limit use to specific nodes.

If this is set to 0 (the default), the cluster will never assume watchdog-based self-fencing.

If this is set to a negative value, the cluster will use twice the local value of the SBD_WATCHDOG_TIMEOUT environment variable if that is positive, or otherwise treat this as 0.

Warning: When used, this timeout must be larger than SBD_WATCHDOG_TIMEOUT on all nodes that use watchdog-based SBD, and Pacemaker will refuse to start on any of those nodes where this is not true for the local value or SBD is not active. When this is set to a negative value, SBD_WATCHDOG_TIMEOUT must be set to the same value on all nodes that use SBD, otherwise data corruption or loss could occur.

concurrent-fencing

boolean

false

Whether the cluster is allowed to initiate multiple fence actions concurrently. Fence actions initiated externally, such as via the stonith_admin tool or an application such as DLM, or by the fencer itself such as recurring device monitors and status and list commands, are not limited by this option.

fence-reaction

enumeration

stop

How should a cluster node react if notified of its own fencing? A cluster node may receive notification of a “succeeded” fencing that targeted it if fencing is misconfigured, or if fabric fencing is in use that doesn’t cut cluster communication. Allowed values are stop to attempt to immediately stop Pacemaker and stay stopped, or panic to attempt to immediately reboot the local node, falling back to stop on failure. The default is likely to be changed to panic in a future release. (since 2.0.3)

priority-fencing-delay

duration

0

Apply this delay to any fencing targeting the lost nodes with the highest total resource priority in case we don’t have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match (especially meaningful in a split-brain of a 2-node cluster). A promoted resource instance takes the resource’s priority plus 1 if the resource’s priority is not 0. Any static or random delays introduced by pcmk_delay_base and pcmk_delay_max configured for the corresponding fencing resources will be added to this delay. This delay should be significantly greater than (safely twice) the maximum delay from those parameters. (since 2.0.4)

node-pending-timeout

duration

0

Fence nodes that do not join the controller process group within this much time after joining the cluster, to allow the cluster to continue managing resources. A value of 0 means never fence pending nodes. Setting the value to 2h means fence nodes after 2 hours. (since 2.1.7)

cluster-delay

duration

60s

If the DC requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node within this time (beyond the action’s own timeout). The ideal value will depend on the speed and load of your network and cluster nodes.

dc-deadtime

duration

20s

How long to wait for a response from other nodes when electing a DC. The ideal value will depend on the speed and load of your network and cluster nodes.

cluster-ipc-limit

nonnegative integer

500

The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see “Evicting client” log messages for cluster daemon process IDs.

pe-error-series-max

integer

-1

The number of scheduler inputs resulting in errors to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

pe-warn-series-max

integer

5000

The number of scheduler inputs resulting in warnings to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

pe-input-series-max

integer

4000

The number of “normal” scheduler inputs to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

enable-acl

boolean

false

Whether access control lists should be used to authorize CIB modifications

placement-strategy

enumeration

default

How the cluster should assign resources to nodes (see Utilization and Placement Strategy). Allowed values are default, utilization, balanced, and minimal.

node-health-strategy

enumeration

none

How the cluster should react to node health attributes. Allowed values are none, migrate-on-red, only-green, progressive, and custom.

node-health-base

score

0

The base health score assigned to a node. Only used when node-health-strategy is progressive.

node-health-green

score

0

The score to use for a node health attribute whose value is green. Only used when node-health-strategy is progressive or custom.

node-health-yellow

score

0

The score to use for a node health attribute whose value is yellow. Only used when node-health-strategy is progressive or custom.

node-health-red

score

-INFINITY

The score to use for a node health attribute whose value is red. Only used when node-health-strategy is progressive or custom.

cluster-recheck-interval

duration

15min

Pacemaker is primarily event-driven, and looks ahead to know when to recheck the cluster for failure-timeout settings and most time-based rules (since 2.0.3). However, it will also recheck the cluster after this amount of inactivity. This has three main effects:

  • Rules using date_spec are guaranteed to be checked only this often.

  • If fencing fails enough to reach stonith-max-attempts, attempts will begin again after at most this time.

  • It serves as a fail-safe in case of certain scheduler bugs. If the scheduler incorrectly determines only some of the actions needed to react to a particular event, it will often correctly determine the rest after at most this time.

A value of 0 disables this polling.

shutdown-lock

boolean

false

The default of false allows active resources to be recovered elsewhere when their node is cleanly shut down, which is what the vast majority of users will want. However, some users prefer to make resources highly available only for failures, with no recovery for clean shutdowns.

If this option is true, resources active on a node when it is cleanly shut down are kept “locked” to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most shutdown-lock-limit, if set).

Stonith resources and Pacemaker Remote connections are never locked. Clone and bundle instances and the promoted role of promotable clones are currently never locked, though support could be added in a future release.

Locks may be manually cleared using the --refresh option of crm_resource (both the resource and node must be specified; this works with remote nodes if their connection resource’s target-role is set to Stopped, but not if Pacemaker Remote is stopped on the remote node without disabling the connection resource). (since 2.0.4)

shutdown-lock-limit

duration

0

If shutdown-lock is true, and this is set to a nonzero time duration, locked resources will be allowed to start after this much time has passed since the node shutdown was initiated, even if the node has not rejoined. (This works with remote nodes only if their connection resource’s target-role is set to Stopped.) (since 2.0.4)

startup-fencing

boolean

true

Advanced Use Only: Whether the cluster should fence unseen nodes at start-up. Setting this to false is unsafe, because the unseen nodes could be active and running resources but unreachable. dc-deadtime acts as a grace period before this fencing, since a DC must be elected to schedule fencing.

election-timeout

duration

2min

Advanced Use Only: If a winner is not declared within this much time of starting an election, the node that initiated the election will declare itself the winner.

shutdown-escalation

duration

20min

Advanced Use Only: The controller will exit immediately if a shutdown does not complete within this much time.

join-integration-timeout

duration

3min

Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug.

join-finalization-timeout

duration

30min

Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug.

transition-delay

duration

0s

Advanced Use Only: Delay cluster recovery for the configured interval to allow for additional or related events to occur. This can be useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions.
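
As a closing sketch, several of the options from the table above can be combined in a crm_config section as follows. The block ids, nvpair ids, and all non-default values (off, 120s, 15s, 30min) are assumptions chosen for illustration, not recommendations. The two blocks set no options in common, so their relative order does not affect the result.

```xml
<crm_config>
  <!-- Fencing-related options -->
  <cluster_property_set id="fencing-options">
    <nvpair id="fencing-options-stonith-enabled"
            name="stonith-enabled" value="true"/>
    <!-- Power nodes off instead of the default reboot -->
    <nvpair id="fencing-options-stonith-action"
            name="stonith-action" value="off"/>
    <nvpair id="fencing-options-stonith-timeout"
            name="stonith-timeout" value="120s"/>
    <nvpair id="fencing-options-priority-fencing-delay"
            name="priority-fencing-delay" value="15s"/>
  </cluster_property_set>
  <!-- Keep resources locked to a cleanly shut down node for at most 30 minutes -->
  <cluster_property_set id="shutdown-options">
    <nvpair id="shutdown-options-shutdown-lock"
            name="shutdown-lock" value="true"/>
    <nvpair id="shutdown-options-shutdown-lock-limit"
            name="shutdown-lock-limit" value="30min"/>
  </cluster_property_set>
</crm_config>
```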