Zab: What ZooKeeper Actually Uses

Every conversation about consensus protocols eventually arrives at ZooKeeper. Not because ZooKeeper is elegant — it is not — but because it is everywhere. ZooKeeper is the coordination service that half the distributed systems in the world depend on, and it does not use Paxos. It does not use Raft. It uses Zab: ZooKeeper Atomic Broadcast.

This fact surprises many people. ZooKeeper was built at Yahoo! in the late 2000s, when Paxos was the only consensus game in town (Raft wouldn’t arrive until 2014). The ZooKeeper team had the option to implement Paxos. They chose not to. Instead, they designed their own protocol, tailored to ZooKeeper’s specific requirements, and in doing so, they made a series of tradeoffs that illuminate the gap between academic consensus protocols and the systems that actually serve production traffic.

Zab is not as clean as Raft. It is not as theoretically minimal as Paxos. But it is what actually runs behind Kafka (historically), Hadoop, HBase, Solr, and hundreds of other systems that list ZooKeeper as a dependency. When your production system goes down at 3 AM, there is a non-trivial chance that Zab is involved somewhere in the dependency chain. You should understand what it does.

Why Not Paxos?

The ZooKeeper team’s decision to build their own protocol was not born of hubris. (Well, perhaps a little hubris. All good systems work starts with a little hubris.) It was born of a specific technical requirement that Paxos, as described in the literature, does not naturally provide: FIFO client ordering with prefix agreement.

ZooKeeper’s API requires:

  1. All updates from a given client are applied in the order the client issued them. If client C sends write A, then write B, every server that applies both must apply A before B.

  2. All updates are applied in a total order that is consistent with the per-client FIFO order. There is a single global order of operations, and it respects the causal ordering within each client’s session.

  3. A client that reads after a write must see the effect of that write (or a later state). This is session consistency — not full linearizability, but stronger than eventual consistency.

These properties are called causal ordering (or more precisely, FIFO ordering with respect to client sessions), and they map naturally to ZooKeeper’s use cases: distributed locks, leader election, configuration management, and service discovery.

Multi-Paxos can provide total ordering, but it does not inherently provide FIFO client ordering. You can build FIFO ordering on top of Multi-Paxos (by tracking per-client sequence numbers and enforcing ordering constraints), but it’s additional machinery. Zab provides it natively because the protocol was designed around it.
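That "additional machinery" is essentially a per-client hold-back buffer: each client stamps its writes with a session-local sequence number, and the state machine refuses to apply a write until all earlier writes from that session have been applied. A minimal sketch of the idea (class and method names here are illustrative, not from any real Paxos library):

```python
from typing import Dict, List, Tuple

class FifoReorderBuffer:
    """Applies each client's writes in its session order, even if the
    totally-ordered log delivers them interleaved or out of order."""

    def __init__(self):
        self.next_seq: Dict[str, int] = {}          # next expected seq per client
        self.held: Dict[str, Dict[int, str]] = {}   # client -> seq -> held op
        self.applied: List[Tuple[str, int, str]] = []

    def deliver(self, client: str, seq: int, op: str) -> None:
        # Hold the op until it is the next one in this client's session
        self.held.setdefault(client, {})[seq] = op
        expected = self.next_seq.setdefault(client, 0)
        while expected in self.held[client]:
            self.applied.append((client, expected, self.held[client].pop(expected)))
            expected += 1
        self.next_seq[client] = expected

buf = FifoReorderBuffer()
# Consensus delivered B's op between A's two ops, and A's seq 1 before seq 0:
buf.deliver("A", 1, "write /leader/config")
buf.deliver("B", 0, "write /other")
buf.deliver("A", 0, "write /leader")
```

After the three deliveries, A's two writes have been applied in session order despite arriving reversed. Zab makes this buffering unnecessary by never proposing a client's writes out of session order in the first place.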

The other reason is more pragmatic: in 2007, when ZooKeeper was being built, Multi-Paxos was poorly documented (as we discussed in Chapter 6), and the gap between the academic description and a production system was enormous. Building a protocol from scratch, tailored to their exact requirements, seemed (and probably was) less risky than trying to fill in all the blanks that Multi-Paxos left unspecified.

Zab’s System Model

Zab assumes:

  • A set of servers, one of which is the leader (primary) and the rest are followers.
  • Crash-recovery fault model: servers can crash and restart.
  • Messages can be lost, duplicated, or reordered, but not corrupted.
  • The system tolerates f failures out of 2f+1 servers.

Zab provides atomic broadcast: the ability to deliver messages to all servers in the same order, with the guarantee that if any server delivers a message, all operational servers eventually deliver it.
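The f-out-of-2f+1 arithmetic is worth making concrete: any two majorities of a 2f+1 ensemble must intersect, which is what lets a new leader learn every committed transaction. A throwaway check:

```python
def majority(n_servers: int) -> int:
    """Smallest quorum size such that any two quorums must intersect."""
    return n_servers // 2 + 1

def tolerated_failures(n_servers: int) -> int:
    """f such that n_servers = 2f + 1 (for odd-sized ensembles)."""
    return (n_servers - 1) // 2

# A 5-server ensemble commits with a quorum of 3 and tolerates 2 crashes;
# the classic 3-server ZooKeeper ensemble tolerates exactly 1.
```

This is why production ZooKeeper ensembles are almost always 3 or 5 servers: even numbers raise the quorum size without raising fault tolerance.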

The Two Modes of Zab

Zab operates in two distinct modes:

  1. Recovery mode — Used when the system starts up or when the leader fails. In this mode, the servers elect a new leader and synchronize their state.

  2. Broadcast mode — Used during normal operation. The leader receives client requests, broadcasts them to followers, and commits them when a majority acknowledges.

The transition between these modes is the heart of Zab, and getting it right is where most of the protocol’s complexity lives.

Zab Identifiers: Epochs and Zxids

Zab uses a two-part transaction identifier called a zxid (ZooKeeper transaction ID). A zxid is a 64-bit number with two 32-bit components:

  • Epoch (high 32 bits): Incremented each time a new leader is elected. Analogous to Raft’s “term” or VR’s “view number.”
  • Counter (low 32 bits): Incremented for each transaction within an epoch. Reset to 0 when a new epoch starts.
struct Zxid:
    epoch: uint32    // High 32 bits
    counter: uint32  // Low 32 bits

    function compare(other: Zxid) -> int:
        if self.epoch != other.epoch:
            return self.epoch - other.epoch
        return self.counter - other.counter

    function to_int64() -> int64:
        return (self.epoch << 32) | self.counter

The epoch serves the same purpose as Raft’s term: it identifies which leader issued a transaction. The counter provides ordering within a leader’s tenure. Together, they create a total order over all transactions, with a clean boundary between different leaders’ contributions.
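The zxid layout translates directly into bit arithmetic. A sketch in Python, mirroring the pseudocode above (not ZooKeeper's actual Java implementation):

```python
EPOCH_BITS = 32
COUNTER_MASK = 0xFFFFFFFF

def make_zxid(epoch: int, counter: int) -> int:
    """Pack epoch (high 32 bits) and counter (low 32 bits) into one 64-bit int."""
    return (epoch << EPOCH_BITS) | counter

def epoch_of(zxid: int) -> int:
    return zxid >> EPOCH_BITS

def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK

# Because the epoch occupies the high bits, plain integer comparison yields
# exactly the epoch-then-counter order that compare() defines: any zxid
# from epoch 3 is greater than every zxid from epoch 2.
```

This is the "clean boundary" property in code: sorting zxids as plain integers automatically groups each leader's transactions together, in epoch order.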

Phase 1: Leader Election (Discovery)

When the system starts or the current leader fails, Zab enters recovery mode, beginning with leader election. Zab’s leader election is conceptually simpler than you might expect:

  1. Each server broadcasts a vote containing its proposed leader and the zxid of its last committed transaction.
  2. Servers update their vote if they see a “better” vote (higher epoch, or same epoch with higher counter, or same zxid but higher server ID).
  3. When a server observes that a majority have voted for the same server, that server is the prospective leader.
class ZabElection:
    function elect_leader():
        // Initially vote for ourselves
        my_vote = Vote {
            proposed_leader: self.my_id,
            zxid: self.last_committed_zxid,
            election_epoch: self.election_epoch + 1
        }
        self.election_epoch += 1

        // Broadcast our vote
        broadcast(my_vote)

        received_votes = {self.my_id: my_vote}

        while true:
            msg = receive(timeout=ELECTION_TIMEOUT)

            if msg == null:
                // Timeout — re-broadcast our vote
                broadcast(my_vote)
                continue

            if msg.election_epoch < self.election_epoch:
                continue  // Stale

            if msg.election_epoch > self.election_epoch:
                // We're behind — adopt their epoch and restart vote counting
                self.election_epoch = msg.election_epoch
                received_votes = {}
                my_vote.election_epoch = self.election_epoch
                if self.is_better_candidate(msg, my_vote):
                    my_vote = Vote {
                        proposed_leader: msg.proposed_leader,
                        zxid: msg.zxid,
                        election_epoch: self.election_epoch
                    }
                broadcast(my_vote)

            // Record the vote
            received_votes[msg.from] = msg

            // Update our vote if we see a better candidate
            if self.is_better_candidate(msg, my_vote):
                my_vote = Vote {
                    proposed_leader: msg.proposed_leader,
                    zxid: msg.zxid,
                    election_epoch: self.election_epoch
                }
                broadcast(my_vote)
                received_votes[self.my_id] = my_vote

            // Check if a majority has voted for the same leader
            for candidate in self.all_servers:
                votes_for = count(v for v in received_votes.values()
                                  if v.proposed_leader == candidate)
                if votes_for > len(self.all_servers) / 2:
                    // Wait a bit for any late-arriving better votes
                    while more_votes_available(SHORT_TIMEOUT):
                        msg = receive(SHORT_TIMEOUT)
                        if msg and self.is_better_candidate(msg, my_vote):
                            // Better candidate appeared — record the vote and
                            // continue the election from the outer loop
                            received_votes[msg.from] = msg
                            break
                    else:
                        // Election complete
                        if candidate == self.my_id:
                            return LEADING
                        else:
                            self.current_leader = candidate
                            return FOLLOWING

    function is_better_candidate(vote_a, vote_b) -> bool:
        // Higher epoch wins, then higher counter, then higher server ID
        if vote_a.zxid.epoch != vote_b.zxid.epoch:
            return vote_a.zxid.epoch > vote_b.zxid.epoch
        if vote_a.zxid.counter != vote_b.zxid.counter:
            return vote_a.zxid.counter > vote_b.zxid.counter
        return vote_a.proposed_leader > vote_b.proposed_leader

The “better candidate” heuristic (higher zxid, breaking ties by server ID) ensures that the elected leader has the most complete transaction history. This is analogous to Raft’s election restriction but done through the voting protocol rather than vote-granting rules.
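The same tie-break chain can be written as a single tuple comparison. A Python sketch (the `Vote` fields are illustrative; ZooKeeper's FastLeaderElection performs the equivalent comparison in Java):

```python
from typing import NamedTuple

class Vote(NamedTuple):
    proposed_leader: int  # server ID of the candidate
    epoch: int            # epoch of the candidate's last committed zxid
    counter: int          # counter of that zxid

def better(a: Vote, b: Vote) -> bool:
    """True if vote a names a better candidate than vote b:
    higher epoch, then higher counter, then higher server ID."""
    return (a.epoch, a.counter, a.proposed_leader) > \
           (b.epoch, b.counter, b.proposed_leader)

# Server 1 has seen more of epoch 5 than server 3 has, so it wins:
assert better(Vote(1, epoch=5, counter=10), Vote(3, epoch=5, counter=7))
# With identical zxids, the higher server ID wins the tie-break:
assert better(Vote(3, epoch=5, counter=10), Vote(1, epoch=5, counter=10))
```

Tuple comparison gives the lexicographic order for free, which is exactly the "most complete history wins" rule the election needs.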

Phase 2: Synchronization (Recovery)

After a leader is elected, it must synchronize the followers’ state before entering broadcast mode. This is the discovery and synchronization phase, and it handles the messy reality that different servers may have processed different transactions before the old leader failed.

Step 1: Follower Connects to Leader

Each follower sends the leader its last committed zxid and its epoch.

// Follower side
function connect_to_leader(leader_id):
    send(leader_id, FollowerInfo {
        last_zxid: self.last_committed_zxid,
        current_epoch: self.accepted_epoch
    })

    // Wait for leader's response
    new_epoch_msg = receive_from(leader_id, timeout=SYNC_TIMEOUT)

    if new_epoch_msg.new_epoch > self.accepted_epoch:
        self.accepted_epoch = new_epoch_msg.new_epoch
        persist(self.accepted_epoch)
        send(leader_id, AckEpoch {
            last_zxid: self.last_committed_zxid,
            current_epoch: self.accepted_epoch
        })

Step 2: Leader Establishes New Epoch

The leader collects FollowerInfo from a majority of followers and determines the new epoch number (one more than the highest epoch it has seen).

// Leader side
function synchronize_followers():
    new_epoch = self.accepted_epoch + 1

    follower_infos = collect_from_majority(FollowerInfo)

    // Find the highest epoch and zxid among followers
    for info in follower_infos:
        new_epoch = max(new_epoch, info.current_epoch + 1)

    self.accepted_epoch = new_epoch

    // Send new epoch to all connected followers
    for follower in self.connected_followers:
        send(follower, NewEpoch { new_epoch: new_epoch })

    // Wait for AckEpoch from a majority
    ack_epochs = collect_from_majority(AckEpoch)

    // Now synchronize each follower's history
    self.sync_followers(ack_epochs)

Step 3: History Synchronization

The leader must bring each follower’s transaction history in line with its own. There are several cases:

function sync_follower(follower_id, follower_last_zxid):
    if follower_last_zxid == self.last_committed_zxid:
        // Follower is up to date — just send DIFF (empty)
        send(follower_id, Sync { type: DIFF, transactions: [] })

    elif follower_last_zxid < self.last_committed_zxid and
         self.has_transactions_since(follower_last_zxid):
        // Follower is behind but we have the needed transactions
        txns = self.get_transactions_since(follower_last_zxid)
        send(follower_id, Sync { type: DIFF, transactions: txns })

    elif follower_last_zxid > self.last_committed_zxid:
        // Follower has transactions we don't!
        // These must be from a failed leader — truncate them
        send(follower_id, Sync {
            type: TRUNC,
            truncate_to: self.last_committed_zxid
        })

    elif follower_last_zxid.epoch < self.last_committed_zxid.epoch:
        // Follower is from a previous epoch — might need TRUNC + DIFF.
        // First truncate any uncommitted transactions from the old epoch,
        // then send the missing committed transactions.
        // last_common_zxid: the highest zxid present in both histories.
        last_common_zxid = self.find_last_common_zxid(follower_last_zxid)
        send(follower_id, Sync {
            type: TRUNC_AND_DIFF,
            truncate_to: last_common_zxid,
            transactions: self.get_transactions_since(last_common_zxid)
        })

    else:
        // Follower is too far behind — send full snapshot
        send(follower_id, Sync {
            type: SNAP,
            snapshot: self.take_snapshot()
        })

This synchronization handles the crucial case where a follower has transactions that the new leader doesn’t — transactions that were proposed by the old leader but never committed. These transactions must be truncated (rolled back) because they are not part of the committed history. This is analogous to Raft truncating a follower’s log when it conflicts with the leader’s.
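The case analysis condenses into a pure decision function. This is a simplification of what ZooKeeper's leader actually does (which also consults on-disk log boundaries and committed-log caches), but it captures the shape; zxids are modeled as plain integers:

```python
def choose_sync(follower_zxid: int, leader_zxid: int,
                oldest_logged_zxid: int) -> str:
    """Pick how the leader brings a follower up to date.
    oldest_logged_zxid: earliest zxid still in the leader's in-memory log."""
    if follower_zxid == leader_zxid:
        return "DIFF"    # already current; the diff is empty
    if follower_zxid > leader_zxid:
        return "TRUNC"   # follower has uncommitted leftovers; roll them back
    if follower_zxid >= oldest_logged_zxid:
        return "DIFF"    # leader can replay just the missing suffix
    return "SNAP"        # too far behind; ship a full snapshot

assert choose_sync(100, 100, 50) == "DIFF"   # up to date
assert choose_sync(120, 100, 50) == "TRUNC"  # ahead of the leader
assert choose_sync(80, 100, 50) == "DIFF"    # a little behind
assert choose_sync(10, 100, 50) == "SNAP"    # far behind
```

The `oldest_logged_zxid` parameter is why SNAP exists at all: once the leader has garbage-collected old log entries, a DIFF is no longer possible and only a snapshot can catch the follower up.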

Step 4: Enter Broadcast Mode

Once the leader has synchronized a majority of followers, the system enters broadcast mode and can start processing client requests.

function enter_broadcast_mode():
    // Wait for sync acknowledgments from a majority
    sync_acks = collect_from_majority(SyncAck)

    // Reset the counter for the new epoch
    self.next_zxid = Zxid {
        epoch: self.accepted_epoch,
        counter: 0  // the counter restarts at 0 in each new epoch
    }

    self.mode = BROADCAST

    // Send UPTODATE to all synced followers
    for follower in self.synced_followers:
        send(follower, UpToDate {})

    // Now we can process client requests

Broadcast Mode: Normal Operation

Once in broadcast mode, Zab operates as a simple two-phase protocol that looks very much like two-phase commit (but with majority-based quorums, crucially):

Client          Leader              Follower 1        Follower 2
  |                |                    |                 |
  |-- Request ---->|                    |                 |
  |                |                    |                 |
  |                |-- PROPOSAL(zxid, txn) -->            |
  |                |-- PROPOSAL(zxid, txn) -------------->|
  |                |                    |                 |
  |                |<-- ACK(zxid) ------|                 |
  |                |<-- ACK(zxid) ------------------------|
  |                |                    |                 |
  |                |  (Majority ACKed — can commit)       |
  |                |                    |                 |
  |                |-- COMMIT(zxid) --->|                 |
  |                |-- COMMIT(zxid) --------------------->|
  |                |                    |                 |
  |<-- Response ---|                    |                 |
class ZabLeader:
    function handle_client_write(request):
        // Assign a zxid
        zxid = self.next_zxid
        self.next_zxid.counter += 1

        txn = Transaction {
            zxid: zxid,
            data: request.data,
            client_id: request.client_id,
            session_id: request.session_id
        }

        // Append to our own transaction log
        self.txn_log.append(txn)
        persist(txn)

        // Send PROPOSAL to all followers
        for follower in self.active_followers:
            send(follower, Proposal {
                zxid: zxid,
                transaction: txn
            })

        // Track acknowledgments
        self.pending_proposals[zxid] = PendingProposal {
            transaction: txn,
            acks: {self.my_id},  // Leader implicitly ACKs
            request: request
        }

    function on_ack(zxid, from_follower):
        if zxid not in self.pending_proposals:
            return  // Already committed or unknown

        proposal = self.pending_proposals[zxid]
        proposal.acks.add(from_follower)

        if len(proposal.acks) > len(self.all_servers) / 2:
            self.commit(zxid)

    function commit(zxid):
        // IMPORTANT: Zab commits in FIFO order. We must commit all preceding
        // transactions first. This is safe because followers ACK over FIFO
        // channels, so a majority of ACKs for this zxid implies a majority
        // for every earlier zxid.
        while self.next_commit_zxid <= zxid:
            txn = self.pending_proposals[self.next_commit_zxid].transaction

            // Send COMMIT to all followers
            for follower in self.active_followers:
                send(follower, Commit { zxid: self.next_commit_zxid })

            // Apply to our state machine
            result = self.state_machine.apply(txn)

            // Respond to client
            respond_to_client(self.pending_proposals[self.next_commit_zxid].request, result)

            delete self.pending_proposals[self.next_commit_zxid]
            self.next_commit_zxid.counter += 1

Follower Side

class ZabFollower:
    function on_proposal(msg):
        // Verify this is from the current leader in the current epoch
        if msg.zxid.epoch != self.accepted_epoch:
            return  // Wrong epoch

        // Append to transaction log
        self.txn_log.append(msg.transaction)
        persist(msg.transaction)

        // Send ACK
        send(self.leader, Ack { zxid: msg.zxid })

    function on_commit(msg):
        // Apply transaction to state machine
        // Commits MUST be processed in zxid order
        if msg.zxid != self.next_expected_commit:
            // Queue for later — we might have received commits out of order
            self.commit_queue.add(msg.zxid)
            return

        self.apply_commit(msg.zxid)
        self.next_expected_commit.counter += 1

        // Apply any queued commits
        while self.next_expected_commit in self.commit_queue:
            self.commit_queue.remove(self.next_expected_commit)
            self.apply_commit(self.next_expected_commit)
            self.next_expected_commit.counter += 1

    function apply_commit(zxid):
        txn = self.txn_log.get(zxid)
        self.state_machine.apply(txn)
        self.last_committed_zxid = zxid
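The follower's hold-back logic is easy to exercise in isolation. A sketch with plain integers standing in for zxids (names are illustrative):

```python
class CommitQueue:
    """Applies commits strictly in zxid order, buffering any that arrive early."""

    def __init__(self, first_zxid: int):
        self.next_expected = first_zxid
        self.pending = set()   # commits received ahead of their turn
        self.applied = []      # zxids applied to the state machine, in order

    def on_commit(self, zxid: int) -> None:
        self.pending.add(zxid)
        # Drain every commit that is now in sequence
        while self.next_expected in self.pending:
            self.pending.remove(self.next_expected)
            self.applied.append(self.next_expected)
            self.next_expected += 1

q = CommitQueue(first_zxid=1)
for zxid in (3, 1, 4, 2):   # COMMITs arriving out of order
    q.on_commit(zxid)
```

However the commits arrive, the state machine only ever sees 1, 2, 3, 4 — the queue converts an unordered arrival stream into the gap-free application order the protocol requires.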

FIFO Ordering: Why It Matters

The FIFO ordering guarantee is what makes Zab distinct from generic consensus protocols. Zab guarantees:

  1. If a leader broadcasts proposal a before proposal b, every server that delivers b must deliver a first. This is the FIFO broadcast property.

  2. If a leader in epoch e commits transaction a, and a later leader in epoch e’ commits transaction b, then a is delivered before b. This is the causal ordering across epochs.

  3. A new leader must commit all transactions that were committed by previous leaders before it can broadcast new transactions. This is the prefix property.

Together, these guarantee that the transaction history forms a consistent prefix — every server’s history is a prefix of the same global sequence. There are no gaps, no reorderings, no orphaned transactions.

This matters for ZooKeeper because clients depend on ordering. Consider this sequence:

Client A: create /leader with value "node-1"
Client A: create /leader/config with value "settings"

If the second write could be applied without the first (due to a leader change between them), a reader might see /leader/config exist without /leader, which would be semantically nonsensical.

Zab’s FIFO guarantee ensures this cannot happen. All transactions from a given client session are applied in order, and the total order respects this per-session ordering.

Zab vs. Raft: A Detailed Comparison

Zab and Raft are both leader-based consensus protocols with similar high-level architectures. The differences are in the details:

Aspect             | Zab                                  | Raft
-------------------|--------------------------------------|--------------------------------------
Transaction IDs    | Epoch + counter (zxid)               | Term + log index
Leader election    | Vote-based with best-zxid heuristic  | RequestVote with log up-to-date check
Synchronization    | Explicit DIFF/TRUNC/SNAP phases      | AppendEntries consistency check
Commit mechanism   | Explicit COMMIT message              | Piggybacked leader_commit in AppendEntries
Log structure      | Transaction log (no gaps)            | Contiguous log (no gaps)
FIFO guarantee     | Built into protocol                  | Achieved via log ordering
Read semantics     | Session consistency by default       | Linearizable with ReadIndex
Recovery           | Epoch-based with sync phase          | Term-based with log matching

Key Difference 1: Explicit COMMIT Messages

In Raft, commit information is piggybacked on AppendEntries messages — followers learn about commits from the leader_commit field in the next AppendEntries. This is efficient (no extra messages) but means followers might not learn about a commit immediately.

In Zab, the leader sends an explicit COMMIT message after receiving a majority of ACKs. This is an extra message but means followers learn about commits sooner, which is important for ZooKeeper’s read consistency model (followers serve reads, and they need to know what’s committed).

Key Difference 2: Read Model

This is arguably the biggest practical difference. In Raft, reads from followers are stale by default — only the leader can serve linearizable reads (and even that requires special handling). Raft is designed for linearizable systems.

In ZooKeeper/Zab, followers serve reads directly from their local state. This provides session consistency (a client’s reads reflect its own writes) but not linearizability across clients. A client connected to a slow follower might see stale data.

ZooKeeper provides a sync operation that forces the follower to catch up with the leader before serving the next read. This bridges the gap to linearizability when needed, but it’s opt-in.

// ZooKeeper follower handling a read
function handle_read(session, path):
    if session.pending_sync:
        // Wait for sync to complete
        wait_until(self.last_committed_zxid >= session.sync_zxid)
        session.pending_sync = false

    return self.state_machine.get(path)

// ZooKeeper sync operation
function handle_sync(session):
    // Ask leader for current committed zxid
    leader_zxid = query_leader_commit()
    session.sync_zxid = leader_zxid
    session.pending_sync = true

This read model is a pragmatic choice. In many ZooKeeper use cases (configuration management, service discovery), slightly stale reads are acceptable and the throughput benefit of reading from any server is significant. When freshness matters (distributed locks), the client uses sync.

Key Difference 3: Synchronization Complexity

Raft’s log synchronization is elegant: the leader sends AppendEntries, the follower rejects if the log doesn’t match, the leader backs up and retries. It’s simple and handles all cases uniformly.

Zab’s synchronization has four distinct modes (DIFF, TRUNC, SNAP, TRUNC_AND_DIFF), each handling a specific case. This is more complex to implement but potentially more efficient — sending a DIFF of the last few transactions is cheaper than resending the entire log from the divergence point, and the explicit TRUNC mode makes it clear what’s happening when uncommitted transactions need to be rolled back.

The Relationship Between Zab and Virtual Synchrony

Zab is sometimes described as being related to virtual synchrony, and the connection is worth understanding.

Virtual synchrony, developed by Ken Birman in the 1980s, provides ordered message delivery within “groups” of processes. When a group membership changes (a process joins or leaves), all surviving members receive the same set of messages from the old group before transitioning to the new group. This is called a “view change” — the same term used in VR (Chapter 7).

Zab’s epoch-based approach echoes virtual synchrony’s view-based approach:

  • An epoch in Zab corresponds to a view in virtual synchrony.
  • The synchronization phase (where the new leader ensures all followers have the same history) corresponds to the “flush” in virtual synchrony (where all pending messages are delivered before the view change completes).
  • The guarantee that transactions from previous epochs are committed before new transactions are accepted mirrors virtual synchrony’s guarantee that all messages from the old view are delivered before the new view begins.

The key difference is that Zab uses a centralized leader, while virtual synchrony is typically implemented with a distributed protocol (like total order broadcast based on sequencer or token-based approaches). Zab’s centralized leader is simpler to implement and reason about, at the cost of making the leader a bottleneck.

Practical Experience: What ZooKeeper Operators Actually Deal With

If you operate ZooKeeper in production, the consensus protocol is rarely your primary concern. The operational challenges are dominated by issues that sit above or beside Zab:

Session Management

ZooKeeper clients maintain a session with the server, and sessions have a timeout. If the session expires (because the client can’t reach any server for the timeout period), all ephemeral nodes created by that client are deleted. This is how distributed locks work in ZooKeeper — the lock is an ephemeral node, and if the lock holder dies, the node is automatically deleted.

But session management is fragile:

  • GC pauses on the client can cause session expiration even when the client is healthy. A 10-second GC pause with a 6-second session timeout means your locks just got released.
  • Network partitions between the client and the ZooKeeper cluster cause session expiration. The client can’t just reconnect and resume — it must re-establish all its ephemeral nodes.
  • ZooKeeper server failures during a client’s session require the client to reconnect to a different server. The session is maintained (because session state is replicated), but there’s a window where the client might not know what state its session is in.
// The ZooKeeper session state machine (simplified)
enum SessionState:
    CONNECTING      // Trying to establish connection
    CONNECTED       // Normal operation
    RECONNECTING    // Lost connection, trying to reconnect
    EXPIRED         // Session timed out — all ephemeral nodes deleted
    CLOSED          // Client explicitly closed

// What operators deal with at 3 AM:
// 1. Client GC pause -> session expired -> ephemeral nodes deleted
//    -> distributed lock released -> two processes think they hold the lock
//    -> data corruption

Ephemeral Nodes and the Herd Effect

ZooKeeper’s watch mechanism notifies clients when a node changes. A common anti-pattern is to have all clients watch the same node (e.g., the lock node). When the lock is released, ALL clients are notified simultaneously, and they all try to acquire the lock at the same time. This is the “herd effect” or “thundering herd,” and it can overwhelm the ZooKeeper cluster.

The solution is the “sequential node” pattern: clients create sequential ephemeral nodes and watch only the node immediately before theirs. When the lock holder releases, only the next client in line is notified.
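The core of the pattern is choosing which node to watch. Given the children of the lock's parent znode (names like lock-0000000007, as ZooKeeper's sequence flag produces), each client watches only its immediate predecessor. A sketch of just that selection logic, separate from any ZooKeeper client library:

```python
from typing import List, Optional

def node_to_watch(my_node: str, children: List[str]) -> Optional[str]:
    """Return the sequential node immediately before my_node, or None if
    my_node is first in line (i.e., it holds the lock)."""
    # Fixed-width sequence suffixes make lexicographic order = numeric order
    ordered = sorted(children)
    idx = ordered.index(my_node)
    return ordered[idx - 1] if idx > 0 else None

children = ["lock-0000000009", "lock-0000000007", "lock-0000000012"]
assert node_to_watch("lock-0000000007", children) is None  # holds the lock
assert node_to_watch("lock-0000000012", children) == "lock-0000000009"
```

Each release now wakes exactly one waiter instead of all of them, which is the whole point: the notification fan-out drops from O(clients) to O(1) per release.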

Disk Latency

Zab requires fsync on every write (the leader fsyncs the proposal, followers fsync the ACK). If the disk is slow (a saturated SSD, a virtual disk on a noisy neighbor cloud instance), write latency increases, heartbeats are delayed, and the leader might be declared dead — causing an unnecessary leader election.

ZooKeeper operators learn quickly that dedicated, fast disks for the ZooKeeper transaction log are not optional. A separate disk from the snapshots. A separate disk from the operating system. SSDs, not spinning disks. Seriously.

The Four-Letter Words

ZooKeeper exposes monitoring via “four-letter word” commands (literally: you telnet to the ZooKeeper port and type four letters like stat, ruok, mntr). These are crude but effective monitoring tools:

  • ruok — “Are you OK?” Responds with imok. If it doesn’t, you have a problem.
  • stat — Shows connected clients and outstanding requests.
  • mntr — Shows detailed metrics including proposal latency, sync latency, and outstanding requests.

Experienced operators watch the avg_proposal_latency and avg_sync_latency metrics. When these spike, something is wrong with the disk, the network, or both. The consensus protocol itself is almost never the problem — it’s the infrastructure underneath it.
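mntr emits one tab-separated key/value pair per line, which makes it trivial to scrape into a monitoring system. A tiny parser (the sample output below is abbreviated and illustrative, not a complete mntr dump):

```python
def parse_mntr(output: str) -> dict:
    """Parse `mntr` output: one tab-separated key/value pair per line.
    Integer-looking values are converted to int; others stay strings."""
    metrics = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition("\t")
        metrics[key] = int(value) if value.lstrip("-").isdigit() else value
    return metrics

sample = ("zk_version\t3.8.0\n"
          "zk_avg_latency\t1\n"
          "zk_outstanding_requests\t0\n"
          "zk_server_state\tleader")
m = parse_mntr(sample)
```

Pointing this at each ensemble member and alerting on latency and outstanding-request spikes is a common first line of ZooKeeper monitoring.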

Zab’s Strengths and Weaknesses

Strengths

Native FIFO ordering. Zab’s ordering guarantees are exactly what ZooKeeper needs. Building the same guarantees on top of Raft or Paxos is possible but requires additional infrastructure.

Explicit synchronization. The DIFF/TRUNC/SNAP synchronization modes handle recovery efficiently. A follower that’s only a few transactions behind gets a quick DIFF rather than a full log replay.

Battle-tested. Zab has been running in production at massive scale for over fifteen years. The bugs have been found and fixed. The edge cases have been discovered and handled. This is not a theoretical protocol — it’s a production-hardened one.

Weaknesses

Tight coupling to ZooKeeper. Zab was designed for ZooKeeper and it shows. The protocol is not easily extracted and used as a general-purpose consensus library. If you want a reusable consensus implementation, etcd/raft or a Paxos library is a better choice.

Complex recovery. The four synchronization modes (DIFF, TRUNC, SNAP, TRUNC_AND_DIFF) are more complex than Raft’s uniform AppendEntries-based catch-up. More modes means more code paths means more potential for bugs.

Leader bottleneck. Like all leader-based protocols, the leader is a throughput bottleneck. All writes must go through the leader, and the leader must communicate with all followers for every write. ZooKeeper mitigates this by serving reads from followers, but writes are still centralized.

Documentation. The original Zab paper is reasonable, but the actual ZooKeeper implementation has diverged from the paper in various ways over the years. Understanding what ZooKeeper actually does requires reading the source code, not just the paper. This is a common problem in long-lived systems, but it’s particularly acute for Zab because there’s no “Zab Made Simple” equivalent — no accessible secondary description to complement the original paper.

Snapshotting in ZooKeeper

ZooKeeper’s snapshotting mechanism is worth discussing because it takes an unusual approach: fuzzy snapshots.

Most systems require a consistent snapshot — a snapshot that represents the state at a specific point in time. Taking a consistent snapshot of a large data structure while continuing to process requests is expensive (it requires either copy-on-write or a pause).

ZooKeeper takes a different approach: it takes a “fuzzy” snapshot by iterating over the in-memory data tree and writing it to disk without acquiring a global lock. This means the snapshot might include some transactions that were applied after the snapshot started but not others — it’s an inconsistent view of the state.

This is fine because ZooKeeper replays the transaction log from the snapshot point forward when recovering. The fuzzy snapshot gives a (potentially inconsistent) starting state, and the transaction log replay brings it to a consistent state. As long as the transactions in the log are idempotent (which ZooKeeper’s are), replaying them on top of a fuzzy snapshot produces the correct result.

function take_fuzzy_snapshot():
    // Record the current last-committed zxid
    snapshot_zxid = self.last_committed_zxid

    // Iterate over the data tree WITHOUT a global lock
    // This means the snapshot might include transactions
    // committed after snapshot_zxid — that's OK
    snapshot_data = {}
    for path, node in self.data_tree.iterate():
        snapshot_data[path] = serialize(node)

    write_to_disk(snapshot_data, snapshot_zxid)

    // On recovery:
    // 1. Load fuzzy snapshot
    // 2. Replay transaction log from snapshot_zxid forward
    // 3. Result is consistent state

This is a clever optimization. Consistent snapshots require either a pause (bad for latency) or copy-on-write (complex and memory-expensive). Fuzzy snapshots avoid both by relying on the idempotency of the transaction log replay. It’s the kind of practical engineering trick that you develop only by building and operating real systems.
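The idempotency argument is easy to demonstrate: take a snapshot that races with the writer and captures some post-snapshot-point writes, then replay the log from the snapshot point anyway. Because each replayed transaction is an idempotent "set", re-applying the ones the snapshot already absorbed is harmless. A toy model, with a dict standing in for the data tree and integers for zxids:

```python
def replay(state: dict, log: list, from_zxid: int) -> dict:
    """Re-apply every logged set-operation after from_zxid.
    Idempotent: applying the same 'set' twice leaves the same value."""
    result = dict(state)
    for zxid, path, value in log:
        if zxid > from_zxid:
            result[path] = value
    return result

log = [(1, "/a", "x"), (2, "/b", "y"), (3, "/a", "z")]
snapshot_zxid = 1
# The fuzzy snapshot started at zxid 1 but raced with the writer and
# already includes the effect of zxid 2:
fuzzy_snapshot = {"/a": "x", "/b": "y"}
recovered = replay(fuzzy_snapshot, log, from_zxid=snapshot_zxid)
```

Replaying zxid 2 on top of its own effect changes nothing, and zxid 3 is applied once, so the recovered state matches what a consistent snapshot plus replay would produce. This would not work if the log held non-idempotent operations like "increment /a".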

The ZooKeeper Ecosystem

Zab’s influence extends beyond ZooKeeper itself. Systems that depend on ZooKeeper (and thus transitively on Zab) include:

  • Apache Kafka (historically, for broker coordination and topic metadata — Kafka has since introduced KRaft to remove this dependency)
  • Apache HBase (for region server coordination)
  • Apache Solr (for cluster state management)
  • Apache Hadoop (HDFS NameNode high availability)
  • Kubernetes (via etcd, though etcd uses Raft — some older Kubernetes-adjacent systems used ZooKeeper)

The trend in recent years has been away from ZooKeeper as a dependency. Kafka’s KRaft mode eliminates the ZooKeeper dependency entirely. New systems tend to embed Raft rather than depend on an external ZooKeeper cluster. But the installed base is enormous, and ZooKeeper (and thus Zab) will be running in production data centers for many years to come.

Summary

Zab is a purpose-built consensus protocol for ZooKeeper. It provides FIFO-ordered atomic broadcast with epoch-based recovery, designed specifically for the coordination primitives that ZooKeeper exposes (locks, configuration, service discovery). It is not the most elegant consensus protocol, nor the most general, but it is among the most battle-tested.

The decision to build a custom protocol rather than using Paxos reflects a pragmatic reality: generic consensus protocols are building blocks, not systems. The gap between the protocol and the system must be filled, and sometimes it’s easier to design a protocol that fills the gap from the start than to retrofit a generic protocol to your specific requirements.

Whether this was the right decision is debatable — the ZooKeeper team has certainly spent more time maintaining and debugging Zab than they would have spent adapting Multi-Paxos. But the result works, it has worked for fifteen years, and it has handled the coordination needs of some of the largest distributed systems ever built. In the agony of consensus algorithms, that is a success by any measure.