At its core, a cluster is a distributed finite state machine capable of co-ordinating the startup and recovery of inter-related services across a set of machines.
Even a distributed and/or replicated application that is able to survive the failure of one or more components can benefit from a higher-level cluster:
- awareness of other applications in the stack
- a shared quorum implementation and calculation
- data integrity through fencing (a non-responsive process does not imply it is not doing anything)
- automated recovery of instances to ensure capacity
System HA is possible without a cluster manager, but using one saves you many headaches.
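To make the shared quorum calculation mentioned above concrete, here is a minimal sketch of the plain majority rule. It is an illustration only, not Corosync's actual implementation; the function name `has_quorum` and its parameters are hypothetical, and it assumes every node carries an integer number of votes.

```python
def has_quorum(partition_votes: int, expected_votes: int) -> bool:
    """Return True if a partition holds a strict majority of the expected votes.

    Hypothetical helper for illustration; real quorum services add
    further options (tie-breakers, quorum devices, and so on).
    """
    if expected_votes <= 0:
        raise ValueError("expected_votes must be positive")
    # Strict majority: more than half of all expected votes.
    return partition_votes > expected_votes // 2


# A three-node cluster (one vote each) splits into partitions of 2 and 1.
majority_side = has_quorum(partition_votes=2, expected_votes=3)
minority_side = has_quorum(partition_votes=1, expected_votes=3)

# An even split of a four-node cluster leaves neither side with quorum.
even_split = has_quorum(partition_votes=2, expected_votes=4)
```

Under this rule at most one partition can ever hold quorum, which is why only that partition may safely continue running services.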
While SysV init replacements like systemd can provide deterministic recovery of a complex stack of services, that recovery is limited to one machine and lacks the context of what is happening on other machines. That context is crucial for distinguishing between a local failure, a clean startup, and recovery after a total site failure.
The ClusterLabs stack, incorporating Corosync and Pacemaker, defines an Open Source, High Availability cluster offering suitable for both small and large deployments.
- Detection and recovery of machine and application-level failures
- Supports practically any redundancy configuration
- Supports both quorate and resource-driven clusters
- Configurable strategies for dealing with quorum loss (when multiple machines fail)
- Supports application startup/shutdown ordering, regardless of which machine(s) the applications are on
- Supports applications that must/must-not run on the same machine
- Supports applications which need to be active on multiple machines
- Supports applications with multiple modes (e.g. master/slave)
- Provably correct response to any failure or cluster state.
The cluster's response to any stimulus can be tested offline before the condition exists.
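The startup/shutdown ordering feature listed above can be pictured as a dependency graph that is independent of where each application runs. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); it is not how Pacemaker computes its transitions, and the resource names and the `start_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def start_order(constraints):
    """Return one valid start order for (first, then) ordering constraints.

    Each pair means "start `first` before `then`". Which machine a
    resource runs on plays no part in the calculation.
    """
    sorter = TopologicalSorter()
    for first, then in constraints:
        # `then` depends on `first`, so `first` is its predecessor.
        sorter.add(then, first)
    return list(sorter.static_order())


# Hypothetical resources, for illustration only.
constraints = [
    ("storage", "database"),
    ("database", "webapp"),
    ("virtual-ip", "webapp"),
]

order = start_order(constraints)
# Shutdown proceeds in the opposite direction.
stop_order = list(reversed(order))
```

A cycle in the constraints makes `static_order()` raise `graphlib.CycleError`, which mirrors the fact that contradictory ordering rules cannot all be satisfied.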
A Pacemaker stack is built on five core components:
- libQB - Core services (logging, IPC, etc.)
- Corosync - Membership, messaging and quorum
- Resource agents - A collection of scripts that interact with the underlying services managed by the cluster
- Fencing agents - A collection of scripts that interact with network power switches and SAN devices to isolate cluster members
- Pacemaker itself
We describe each of these in more detail as well as other optional components such as CLIs and GUIs.
Pacemaker has been around since 2004 and is primarily a collaborative effort between Red Hat and Novell; however, we also receive considerable help and support from the folks at LinBit and the community in general.
Corosync also began life in 2004 but was then part of the OpenAIS project. It is primarily a Red Hat initiative; however, we also receive considerable help and support from the community.
The core ClusterLabs team is made up of full-time developers from Australia, Austria, Canada, China, Czech Republic, England, Germany and the USA. Contributions to the code or documentation are always welcome.
The ClusterLabs stack ships with most modern enterprise distributions and has been deployed in many critical environments, including Deutsche Flugsicherung GmbH (DFS), which uses Pacemaker with Heartbeat to ensure its air traffic control systems are always available.