Automatic failover relies on two additional components in an HDFS deployment: a ZooKeeper quorum and the ZKFailoverController process (abbreviated as ZKFC). In Cloudera Manager, the ZKFC process maps to the HDFS Failover Controller role.
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of HDFS automatic failover relies on ZooKeeper for the following functions:
- Failure detection – each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session expires, notifying the other NameNode that a failover should be triggered.
- Active NameNode election – ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node can take a special exclusive lock in ZooKeeper indicating that it should become the next active NameNode.
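The two functions above can be illustrated with a small in-memory model. This is only a sketch: `ToyZooKeeper`, `try_become_active`, and the path `/toy-ha/active-lock` are hypothetical names invented for the example, not part of ZooKeeper or HDFS, and no real ZooKeeper client is involved. It shows how an ephemeral lock node tied to a session gives both exclusive election and failure detection.

```python
class ToyZooKeeper:
    """In-memory stand-in for ZooKeeper (hypothetical; not a real client API)."""

    def __init__(self):
        self.znodes = {}    # path -> id of the session that owns the ephemeral node
        self.watchers = {}  # path -> callbacks to run when the node is deleted

    def create_ephemeral(self, path, session_id):
        """Create the node only if it is absent; return True on success."""
        if path in self.znodes:
            return False
        self.znodes[path] = session_id
        return True

    def watch_delete(self, path, callback):
        """Register a one-shot callback for deletion of the node at path."""
        self.watchers.setdefault(path, []).append(callback)

    def expire_session(self, session_id):
        """Delete every ephemeral node owned by the session, then notify watchers."""
        owned = [p for p, s in self.znodes.items() if s == session_id]
        for path in owned:
            del self.znodes[path]
            for callback in self.watchers.pop(path, []):
                callback()


LOCK_PATH = "/toy-ha/active-lock"  # hypothetical path chosen for this example


def try_become_active(zk, session_id):
    """Election: whoever creates the lock node first is the active node."""
    return zk.create_ephemeral(LOCK_PATH, session_id)


zk = ToyZooKeeper()
nn1_won = try_become_active(zk, "nn1")  # first contender takes the lock
nn2_won = try_become_active(zk, "nn2")  # second contender is refused

# The standby asks to be notified when the lock disappears, then retries.
retry_results = []
zk.watch_delete(LOCK_PATH, lambda: retry_results.append(try_become_active(zk, "nn2")))

# Simulate a crash of the active machine: its session expires.
zk.expire_session("nn1")
new_owner = zk.znodes[LOCK_PATH]
print(nn1_won, nn2_won, retry_results, new_owner)
```

In this model the standby never polls the active node directly. It learns of the failure only because the session expiry deletes the lock node, which matches the notification behaviour described in the list above.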
The ZKFailoverController (ZKFC) is a ZooKeeper client that also monitors and manages the state of the NameNode. Each host that runs a NameNode also runs a ZKFC. The ZKFC is responsible for:
- Health monitoring – the ZKFC contacts its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds promptly with a healthy status, the ZKFC considers the NameNode healthy. If the NameNode has crashed, frozen, or otherwise entered an unhealthy state, the health monitor marks it as unhealthy.
- ZooKeeper session management – when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, the ZKFC also holds a special lock znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node is automatically deleted.
- ZooKeeper-based election – if the local NameNode is healthy, and the ZKFC sees that no other NameNode currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The previous active NameNode is fenced if necessary, and then the local NameNode transitions to the active state.
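The three responsibilities above can be read as one periodic loop. The sketch below models that loop with toy objects. `ToyNameNode`, `ToyLock`, `ToyZKFC`, and `tick` are hypothetical names, and the `last_owner` field is an assumption of this example: a simple way for the election winner to know which node to fence. It is not the actual ZKFC implementation, and "fencing" here only sets a state string.

```python
class ToyNameNode:
    """Hypothetical NameNode with just a health flag and an HA state."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.state = "standby"


class ToyLock:
    """Stand-in for the ephemeral lock znode: at most one owner at a time."""

    def __init__(self):
        self.owner = None       # ZKFC currently holding the lock
        self.last_owner = None  # assumption: remembered so the winner can fence it


class ToyZKFC:
    """Hypothetical failover controller paired with one local NameNode."""

    def __init__(self, namenode, lock):
        self.nn = namenode
        self.lock = lock
        self.session_open = False

    def tick(self):
        """One monitoring cycle: health check, session management, election."""
        # Health monitoring: an unhealthy NameNode must not stay in the election.
        if not self.nn.healthy:
            # Session management: closing the session drops the ephemeral lock.
            self.session_open = False
            if self.lock.owner is self:
                self.lock.owner = None
            return
        self.session_open = True
        # Election: take the lock only if nobody currently holds it.
        if self.lock.owner is None:
            self.lock.owner = self
            previous = self.lock.last_owner
            if previous is not None and previous is not self:
                previous.nn.state = "fenced"  # fence the previous active first
            self.lock.last_owner = self
            self.nn.state = "active"          # then promote the local NameNode


lock = ToyLock()
zkfc1 = ToyZKFC(ToyNameNode("nn1"), lock)
zkfc2 = ToyZKFC(ToyNameNode("nn2"), lock)

zkfc1.tick()
zkfc2.tick()
states_before = (zkfc1.nn.state, zkfc2.nn.state)

zkfc1.nn.healthy = False  # simulate nn1 freezing or crashing
zkfc1.tick()              # zkfc1 releases the lock
zkfc2.tick()              # zkfc2 wins the election, fences nn1, promotes nn2
states_after = (zkfc1.nn.state, zkfc2.nn.state)
print(states_before, states_after)
```

The ordering inside `tick` mirrors the text: the winner fences the previous active NameNode before promoting its own, so the model never shows two nodes in the active state at once.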