3. Cluster-Wide Configuration
3.1. Configuration Layout
The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this:
An empty configuration
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>
The empty configuration above contains the major sections that make up a CIB:
cib
: The entire CIB is enclosed with a cib element. Certain fundamental settings are defined as attributes of this element.
  configuration
  : This section, the primary focus of this document, contains traditional configuration information such as what resources the cluster serves and the relationships among them.
    crm_config
    : cluster-wide configuration options
    nodes
    : the machines that host the cluster
    resources
    : the services run by the cluster
    constraints
    : indications of how resources should be placed
  status
  : This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way.
In this document, configuration settings will be described as properties or options based on how they are defined in the CIB:
- Properties are XML attributes of an XML element.
- Options are name-value pairs expressed as nvpair child elements of an XML element.
Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak.
Options can appear within four types of enclosing elements:
- cluster_property_set
- instance_attributes
- meta_attributes
- utilization
We will refer to a set of options and its enclosing element as a block.
Name | Type | Default | Description |
---|---|---|---|
id | id | | A unique name for the block (required) |
score | score | 0 | Priority with which to process the block |
Each block may optionally contain a rule.
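To make the terminology concrete, here is a minimal sketch of a block. The id values and the score of 100 are made up for illustration; target-role and is-managed are resource meta-attributes used only as sample options. In this sketch, id and score are properties of the enclosing element, while each nvpair child carries one option.

```xml
<!-- A block: an enclosing element plus the options it holds.
     All id values and the score are hypothetical. -->
<meta_attributes id="example-meta" score="100">
  <!-- Each option is a name-value pair expressed as an nvpair child -->
  <nvpair id="example-meta-target-role" name="target-role" value="Started"/>
  <nvpair id="example-meta-is-managed" name="is-managed" value="true"/>
</meta_attributes>
```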
3.2. Option Precedence
This subsection describes the precedence of options within a set of blocks and within a single block.
Options are processed as follows:
- All option blocks of a given type are processed in order of their score attribute, from highest to lowest. For cluster_property_set, if there is a block whose enclosing element has id="cib-bootstrap-options", then that block is always processed first regardless of score.
- If a block contains a rule that evaluates to false, that block is skipped.
- Within a block, options are processed in order from first to last.
- The first value found for a given option is applied, and the rest are ignored.
Note that this means it is pointless to configure the same option twice in a single block, because occurrences after the first one would be ignored.
For example, in the following configuration snippet, the no-quorum-policy value demote is applied. property-set2 has a higher score than property-set1, so it’s processed first. There are no rules in this snippet, so both sets are processed. Within property-set2, the value demote appears first, so the later value freeze is ignored. We’ve already found a value for no-quorum-policy before we begin processing property-set1, so its value stop is ignored.
<cluster_property_set id="property-set1" score="500">
  <nvpair id="no-quorum-policy1" name="no-quorum-policy" value="stop"/>
</cluster_property_set>
<cluster_property_set id="property-set2" score="1000">
  <nvpair id="no-quorum-policy2a" name="no-quorum-policy" value="demote"/>
  <nvpair id="no-quorum-policy2b" name="no-quorum-policy" value="freeze"/>
</cluster_property_set>
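The special handling of cib-bootstrap-options is not exercised by the snippet above. As a minimal sketch (property-set3 and the nvpair ids are made up for illustration), applying the processing rules to the following configuration would select the value freeze: the cib-bootstrap-options block is processed first even though it has no explicit score and the other block has a score of 1000.

```xml
<!-- Processed first because of its id, regardless of score -->
<cluster_property_set id="cib-bootstrap-options">
  <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="freeze"/>
</cluster_property_set>
<!-- Hypothetical second block: higher score, but processed afterward,
     so its value for no-quorum-policy is ignored -->
<cluster_property_set id="property-set3" score="1000">
  <nvpair id="no-quorum-policy3" name="no-quorum-policy" value="stop"/>
</cluster_property_set>
```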
3.3. CIB Properties
Certain settings are defined by CIB properties (that is, attributes of the cib tag) rather than with the rest of the cluster configuration in the configuration section.
The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location.
Name | Type | Default | Description |
---|---|---|---|
admin_epoch | nonnegative integer | 0 | When a node joins the cluster, the cluster asks the node with the highest (admin_epoch, epoch, num_updates) tuple to replace the configuration on all the nodes, which makes setting them correctly very important. admin_epoch is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete. |
epoch | nonnegative integer | 0 | The cluster increments this every time the CIB’s configuration section is updated. |
num_updates | nonnegative integer | 0 | The cluster increments this every time the CIB’s configuration or status sections are updated, and resets it to 0 when epoch changes. |
validate-with | enumeration | | Determines the type of XML validation that will be done on the configuration. Allowed values are none (in which case the cluster will not require that updates conform to expected syntax) and the base names of schema files installed on the local machine (for example, “pacemaker-3.9”) |
remote-tls-port | port | | If set, the CIB manager will listen for anonymously encrypted remote connections on this port, to allow CIB administration from hosts not in the cluster. No key is used, so this should be used only on a protected network where man-in-the-middle attacks can be avoided. |
remote-clear-port | port | | If set to a TCP port number, the CIB manager will listen for remote connections on this port, to allow for CIB administration from hosts not in the cluster. No encryption is used, so this should be used only on a protected network. |
cib-last-written | date/time | | Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only. |
have-quorum | boolean | | Indicates whether the cluster has quorum. If false, the cluster’s response is determined by no-quorum-policy (see below). Maintained by the cluster. |
dc-uuid | text | | Node ID of the cluster’s current designated controller (DC). Used and maintained by the cluster. |
execution-date | epoch time | | Time to use when evaluating rules. |
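As a sketch of how admin_epoch can be used to make other copies of the configuration obsolete: assuming the (admin_epoch, epoch, num_updates) tuple is compared field by field from left to right, a CIB whose admin_epoch has been raised to 1 outranks any CIB with admin_epoch 0, whatever its epoch or num_updates. The attribute values below are illustrative only.

```xml
<!-- Hypothetical values: admin_epoch raised by an administrator from 0 to 1.
     Under the left-to-right comparison assumed above, the tuple (1, 0, 0)
     ranks higher than, for example, (0, 57, 3). -->
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" admin_epoch="1" epoch="0" num_updates="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>
```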
3.4. Cluster Options
Cluster options, as you might expect, control how the cluster behaves when confronted with various situations.
They are grouped into sets within the crm_config section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on Rules where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once.
You can obtain an up-to-date list of cluster options, including their default values, by running the man pacemaker-schedulerd and man pacemaker-controld commands.
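For orientation, the simple case might look like the following sketch: a crm_config section holding a single set in which each option appears once. The nvpair ids and the cluster name mycluster are made up for illustration; the option names are taken from the table below.

```xml
<crm_config>
  <!-- A single set of cluster options; ids and values are illustrative -->
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="mycluster"/>
    <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
    <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
  </cluster_property_set>
</crm_config>
```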
Name | Type | Default | Description |
---|---|---|---|
cluster-name | text | | An (optional) name for the cluster as a whole. This is mostly for users’ convenience for use as desired in administration, but can be used in the Pacemaker configuration in Rules (as the #cluster-name node attribute). It may also be used by higher-level tools when displaying cluster information, and by certain resource agents (for example, the ocf:heartbeat:GFS2 agent stores the cluster name in filesystem meta-data). |
dc-version | version | detected | Version of Pacemaker on the cluster’s designated controller (DC). Maintained by the cluster, and intended for diagnostic purposes. |
cluster-infrastructure | text | detected | The messaging layer with which Pacemaker is currently running. Maintained by the cluster, and intended for informational and diagnostic purposes. |
no-quorum-policy | enumeration | stop | What to do when the cluster does not have quorum. Allowed values: |
batch-limit | integer | 0 | The maximum number of actions that the cluster may execute in parallel across all nodes. The ideal value will depend on the speed and load of your network and cluster nodes. If zero, the cluster will impose a dynamically calculated limit only when any node has high load. If -1, the cluster will not impose any limit. |
migration-limit | integer | -1 | The number of live migration actions that the cluster is allowed to execute in parallel on a node. A value of -1 means unlimited. |
load-threshold | percentage | 80% | Maximum amount of system load that should be used by cluster nodes. The cluster will slow down its recovery process when the amount of system resources used (currently CPU) approaches this limit. |
node-action-limit | integer | 0 | Maximum number of jobs that can be scheduled per node. If nonpositive or invalid, double the number of cores is used as the maximum number of jobs per node. PCMK_node_action_limit overrides this option on a per-node basis. |
symmetric-cluster | boolean | true | If true, resources can run on any node by default. If false, a resource is allowed to run on a node only if a location constraint enables it. |
stop-all-resources | boolean | false | Whether all resources should be disallowed from running (can be useful during maintenance or troubleshooting) |
stop-orphan-resources | boolean | true | Whether resources that have been deleted from the configuration should be stopped. This value takes precedence over is-managed (that is, even unmanaged resources will be stopped when orphaned if this value is true). |
stop-orphan-actions | boolean | true | Whether recurring operations that have been deleted from the configuration should be cancelled |
start-failure-is-fatal | boolean | true | Whether a failure to start a resource on a particular node prevents further start attempts on that node. If false, the cluster will decide whether the node is still eligible based on the resource’s current failure count and migration-threshold. |
enable-startup-probes | boolean | true | Whether the cluster should check the pre-existing state of resources when the cluster starts |
maintenance-mode | boolean | false | If true, the cluster will not start or stop any resource in the cluster, and any recurring operations (except those specifying role as Stopped) will be paused. If true, this overrides the maintenance node attribute, is-managed and maintenance resource meta-attributes, and enabled operation meta-attribute. |
stonith-enabled | boolean | true | Whether the cluster is allowed to fence nodes (for example, failed nodes and nodes with resources that can’t be stopped). If true, at least one fence device must be configured before resources are allowed to run. If false, unresponsive nodes are immediately assumed to be running no resources, and resource recovery on online nodes starts without any further protection (which can mean data loss if the unresponsive node still accesses shared storage, for example). See also the requires resource meta-attribute. This option applies only to fencing scheduled by the cluster, not to requests initiated externally (such as with the |
stonith-action | enumeration | reboot | Action the cluster should send to the fence agent when a node must be fenced. Allowed values are reboot and off. |
stonith-timeout | duration | 60s | How long to wait for on, off, and reboot fence actions to complete by default. |
stonith-max-attempts | score | 10 | How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it. Any value below 1 will be ignored, and the default will be used instead. |
have-watchdog | boolean | detected | Whether watchdog integration is enabled. This is set automatically by the cluster according to whether SBD is detected to be in use. User-configured values are ignored. The value true is meaningful if diskless SBD is used and stonith-watchdog-timeout is nonzero. In that case, if fencing is required, watchdog-based self-fencing will be performed via SBD without requiring a fencing resource explicitly configured. |
stonith-watchdog-timeout | timeout | 0 | If nonzero, and the cluster detects If this is set to a positive value, lost nodes are assumed to achieve self-fencing within this much time. This does not require a fencing resource to be explicitly configured, though a fence_watchdog resource can be configured, to limit use to specific nodes. If this is set to 0 (the default), the cluster will never assume watchdog-based self-fencing. If this is set to a negative value, the cluster will use twice the local value of the Warning: When used, this timeout must be larger than |
concurrent-fencing | boolean | false | Whether the cluster is allowed to initiate multiple fence actions concurrently. Fence actions initiated externally, such as via the stonith_admin tool or an application such as DLM, or by the fencer itself such as recurring device monitors and status and list commands, are not limited by this option. |
fence-reaction | enumeration | stop | How should a cluster node react if notified of its own fencing? A cluster node may receive notification of a “succeeded” fencing that targeted it if fencing is misconfigured, or if fabric fencing is in use that doesn’t cut cluster communication. Allowed values are stop to attempt to immediately stop Pacemaker and stay stopped, or panic to attempt to immediately reboot the local node, falling back to stop on failure. The default is likely to be changed to panic in a future release. (since 2.0.3) |
priority-fencing-delay | duration | 0 | Apply this delay to any fencing targeting the lost nodes with the highest total resource priority in case we don’t have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match (especially meaningful in a split-brain of a 2-node cluster). A promoted resource instance takes the resource’s priority plus 1 if the resource’s priority is not 0. Any static or random delays introduced by pcmk_delay_base and pcmk_delay_max configured for the corresponding fencing resources will be added to this delay. This delay should be significantly greater than (safely twice) the maximum delay from those parameters. (since 2.0.4) |
node-pending-timeout | duration | 0 | Fence nodes that do not join the controller process group within this much time after joining the cluster, to allow the cluster to continue managing resources. A value of 0 means never fence pending nodes. Setting the value to 2h means fence nodes after 2 hours. (since 2.1.7) |
cluster-delay | duration | 60s | If the DC requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node within this time (beyond the action’s own timeout). The ideal value will depend on the speed and load of your network and cluster nodes. |
dc-deadtime | duration | 20s | How long to wait for a response from other nodes when electing a DC. The ideal value will depend on the speed and load of your network and cluster nodes. |
cluster-ipc-limit | nonnegative integer | 500 | The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see “Evicting client” log messages for cluster daemon process IDs. |
pe-error-series-max | integer | -1 | The number of scheduler inputs resulting in errors to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. |
pe-warn-series-max | integer | 5000 | The number of scheduler inputs resulting in warnings to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. |
pe-input-series-max | integer | 4000 | The number of “normal” scheduler inputs to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. |
enable-acl | boolean | false | Whether access control lists should be used to authorize CIB modifications |
placement-strategy | enumeration | default | How the cluster should assign resources to nodes (see Utilization and Placement Strategy). Allowed values are default, utilization, balanced, and minimal. |
node-health-strategy | enumeration | none | How the cluster should react to node health attributes. Allowed values are none, migrate-on-red, only-green, progressive, and custom. |
node-health-base | score | 0 | The base health score assigned to a node. Only used when node-health-strategy is progressive. |
node-health-green | score | 0 | The score to use for a node health attribute whose value is green. Only used when node-health-strategy is progressive or custom. |
node-health-yellow | score | 0 | The score to use for a node health attribute whose value is yellow. Only used when node-health-strategy is progressive or custom. |
node-health-red | score | -INFINITY | The score to use for a node health attribute whose value is red. Only used when node-health-strategy is progressive or custom. |
cluster-recheck-interval | duration | 15min | Pacemaker is primarily event-driven, and looks ahead to know when to recheck the cluster for failure-timeout settings and most time-based rules (since 2.0.3). However, it will also recheck the cluster after this amount of inactivity. This has three main effects: A value of 0 disables this polling. |
shutdown-lock | boolean | false | The default of false allows active resources to be recovered elsewhere when their node is cleanly shut down, which is what the vast majority of users will want. However, some users prefer to make resources highly available only for failures, with no recovery for clean shutdowns. If this option is true, resources active on a node when it is cleanly shut down are kept “locked” to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most shutdown-lock-limit, if set). Stonith resources and Pacemaker Remote connections are never locked. Clone and bundle instances and the promoted role of promotable clones are currently never locked, though support could be added in a future release. Locks may be manually cleared using the --refresh option of crm_resource (both the resource and node must be specified; this works with remote nodes if their connection resource’s target-role is set to Stopped, but not if Pacemaker Remote is stopped on the remote node without disabling the connection resource). (since 2.0.4) |
shutdown-lock-limit | duration | 0 | If shutdown-lock is true, and this is set to a nonzero time duration, locked resources will be allowed to start after this much time has passed since the node shutdown was initiated, even if the node has not rejoined. (This works with remote nodes only if their connection resource’s target-role is set to Stopped.) (since 2.0.4) |
startup-fencing | boolean | true | Advanced Use Only: Whether the cluster should fence unseen nodes at start-up. Setting this to false is unsafe, because the unseen nodes could be active and running resources but unreachable. dc-deadtime acts as a grace period before this fencing, since a DC must be elected to schedule fencing. |
election-timeout | duration | 2min | Advanced Use Only: If a winner is not declared within this much time of starting an election, the node that initiated the election will declare itself the winner. |
shutdown-escalation | duration | 20min | Advanced Use Only: The controller will exit immediately if a shutdown does not complete within this much time. |
join-integration-timeout | duration | 3min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
join-finalization-timeout | duration | 30min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
transition-delay | duration | 0s | Advanced Use Only: Delay cluster recovery for the configured interval to allow for additional or related events to occur. This can be useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions. |
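Some of the options above are meant to be used together. As one hedged illustration, the following sketch enables shutdown-lock and bounds it with shutdown-lock-limit, so that resources locked to a cleanly shut down node would be allowed to start elsewhere after 30 minutes. The nvpair ids and the 30min value are made up for illustration and are not a recommendation.

```xml
<cluster_property_set id="cib-bootstrap-options">
  <!-- Keep resources locked to a node that was cleanly shut down -->
  <nvpair id="cib-bootstrap-options-shutdown-lock" name="shutdown-lock" value="true"/>
  <!-- Hypothetical limit: release the lock 30 minutes after shutdown began -->
  <nvpair id="cib-bootstrap-options-shutdown-lock-limit" name="shutdown-lock-limit" value="30min"/>
</cluster_property_set>
```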