The Problem of Agreement

You would think that getting a handful of computers to agree on a single value would be straightforward. After all, we solved distributed telephony in the 1960s, we landed on the moon with less computing power than your thermostat, and your average database handles millions of transactions per day. Surely “pick a value and tell everyone” is a solved problem.

It is not. It is, in fact, one of the deepest problems in computer science, and the source of more production outages, data loss incidents, and engineer-years of debugging than perhaps any other single class of problem. Welcome to consensus.

What We Mean by “Agreement”

Let us be precise, because imprecision is what gets systems killed.

Consensus is the problem of getting a set of N processes (nodes, servers, replicas — pick your preferred term) to agree on a single value. That is the one-sentence version. The formal version requires three properties:

Agreement. No two correct processes decide on different values. If node A decides the value is “X” and node B decides the value is “Y”, your system is broken. This is the non-negotiable property. Everything else is optimization.

Validity. The decided value must have been proposed by some process. This sounds trivially obvious, but without it, you could satisfy Agreement by having every node always decide “42” regardless of input. Validity prevents degenerate solutions.

Termination. Every correct process eventually decides some value. This is the liveness property, and it is where all the trouble lives. A protocol that satisfies Agreement and Validity by never deciding anything is technically safe but entirely useless.

Some formulations add Integrity (a process decides at most once) as a separate property. Others fold it into Agreement. The distinction matters less than understanding that these three properties — safety (Agreement + Validity) and liveness (Termination) — are in fundamental tension with each other when failures enter the picture. We will see exactly why in Chapter 3.
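The three properties are concrete enough to state as executable checks over a finished run. A minimal sketch, with my own framing of a run as dictionaries mapping each node to its proposed and decided values:

```python
def check_agreement(decisions):
    """Agreement: no two processes decide different values."""
    decided = {v for v in decisions.values() if v is not None}
    return len(decided) <= 1

def check_validity(decisions, proposals):
    """Validity: every decided value was proposed by someone."""
    proposed = set(proposals.values())
    return all(v in proposed
               for v in decisions.values() if v is not None)

def check_termination(decisions):
    """Termination: every correct process has decided."""
    return all(v is not None for v in decisions.values())

proposals = {"A": "X", "B": "Y", "C": "X"}
# A run where B decided differently from A and C: broken.
assert not check_agreement({"A": "X", "B": "Y", "C": "X"})
# A run that satisfies all three properties:
run = {"A": "X", "B": "X", "C": "X"}
assert check_agreement(run)
assert check_validity(run, proposals)
assert check_termination(run)
```

Note that Termination is a check you can only run after the fact: a protocol that never decides will simply never hand you a run to check, which is exactly why liveness is the slippery property.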

The Two Generals Problem: A Parable

Before we tackle distributed consensus proper, let us consider its simpler, more hopeless cousin.

Two armies, each commanded by a general, are camped on opposite sides of a valley. In the valley sits an enemy city. The armies can only win if they attack simultaneously. If only one attacks, it will be destroyed. The generals can communicate only by sending messengers through the valley, but messengers may be captured by the enemy — that is, messages may be lost.

General A sends a message: “Attack at dawn.” But how does A know that B received the message? B could send an acknowledgment. But how does B know that A received the acknowledgment? A could acknowledge the acknowledgment. You see where this is going.

No finite number of message exchanges can give both generals certainty that they agree on the plan. This is provably impossible, and the proof is elegant: suppose some protocol P solves the problem using k messages. Consider the last message sent. The sender must have already decided to attack before knowing whether this last message was received (otherwise, the sender’s decision depends on a response that might never come). But if the sender can decide without the last message being received, then the receiver’s participation in this last exchange is unnecessary. Remove it. Now the protocol uses k-1 messages. Apply the same argument. Eventually you reach zero messages, which is absurd.

The Two Generals Problem tells us something fundamental: in a system where messages can be lost, you cannot achieve guaranteed agreement in a finite number of steps. Full stop.

“But wait,” you say, “TCP gives us reliable message delivery.” Does it? TCP gives you reliable delivery or a timeout. When the timeout fires, you do not know whether your message was delivered, only that you did not receive a response in time. You are right back with the generals.

What Goes Wrong Without Consensus

If the theoretical argument does not move you, perhaps some war stories will.

Split-Brain in Databases

Consider a primary-replica database setup. The primary handles writes. The replica handles reads and stands by to take over if the primary fails. Some monitoring system watches the primary and, if it appears dead, promotes the replica.

Now: the network between the monitor and the primary develops packet loss. The primary is fine — it is happily serving writes. But the monitor cannot reach it, declares it dead, and promotes the replica. You now have two nodes that both believe they are the primary. Both accept writes. The data diverges. When the network heals, you have two incompatible copies of your database, and the merge process — if one even exists — will lose data.

This is split-brain, and it happens in production with depressing regularity. Every major database vendor has a post-mortem involving this scenario. The root cause is always the same: the system made a decision (promote the replica) without achieving consensus among all participants about the state of the world.

Leader Election Gone Wrong

Distributed systems frequently elect a leader to coordinate work. ZooKeeper, etcd, and Consul all provide leader election primitives. But leader election is consensus — you are getting N nodes to agree on which one is the leader.

A naive approach: each node broadcasts “I am the leader” and the first one to receive a majority of acknowledgments wins. Sounds reasonable. Now consider:

  1. Nodes A and B both broadcast simultaneously.
  2. Nodes C, D, and E each receive one of these broadcasts first (due to network delays).
  3. C and D acknowledge A. D and E acknowledge B.
  4. Node D acknowledged both because messages arrived in different orders on different network paths.
  5. A believes it has a majority (C, D, A = 3 of 5). B believes it has a majority (D, E, B = 3 of 5).
  6. Two leaders. Data corruption ensues.

The bug is that node D’s acknowledgments were not mutually exclusive. A proper consensus protocol ensures that once a node votes for a proposal, that vote cannot be reused for a conflicting proposal. This sounds simple to bolt on. It is not.
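The fix the paragraph describes — once cast, a vote cannot be reused for a conflicting proposal — amounts to a per-round vote guard. A simplified fragment (class and method names are my own, and this is a single building block, not a complete election protocol):

```python
class Voter:
    """Grants at most one vote per round; a later conflicting
    request in the same round is refused."""

    def __init__(self):
        self.voted_for = {}  # round number -> candidate voted for

    def request_vote(self, round_num, candidate):
        # First request in this round wins the vote.
        if round_num not in self.voted_for:
            self.voted_for[round_num] = candidate
            return True
        # Repeat requests from the same candidate are idempotent;
        # conflicting candidates are refused.
        return self.voted_for[round_num] == candidate

d = Voter()
assert d.request_vote(1, "A") is True    # D votes for A in round 1
assert d.request_vote(1, "B") is False   # D cannot also vote for B
assert d.request_vote(2, "B") is True    # a new round, a new vote
```

With this guard, the scenario above cannot produce two majorities in the same round: node D's vote goes to exactly one of A or B. What the guard alone does not give you is everything else — persistence of `voted_for` across crashes, round numbering, and recovery — which is where the remaining difficulty lives.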

Inconsistent Reads Across Replicas

You write a value to a replicated data store. You read it back from a different replica. The value is not there yet. Or worse: you read the value, make a decision based on it, and then another process reads from a third replica that has not received the write and makes an incompatible decision.

Without consensus on the order of operations, replicas can disagree about which writes have been applied, which have been rolled back, and which are still in flight. Linearizability — the gold standard for consistency — requires consensus.

A Naive Protocol (and Why It Fails)

Let us try to build a consensus protocol from scratch. We have N nodes, each with a proposed value. We want them all to agree on one value.

Attempt 1: Broadcast and Majority

// Run on each node i
procedure NAIVE_CONSENSUS(my_value):
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from all nodes  // already doomed
    values = collect all received proposals
    decision = most_common(values)  // tie-breaking by node ID
    return decision

How it breaks: The “wait for PROPOSE messages from all nodes” step is the problem. If even one node has crashed, we wait forever. Termination is violated.

Attempt 2: Wait for a Majority Instead

procedure NAIVE_CONSENSUS_V2(my_value):
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from a majority of nodes
    values = collect all received proposals
    decision = most_common(values)
    return decision

How it breaks: Different nodes may receive different subsets of proposals, depending on message delays and crash timing. Node A might see proposals from {A, B, C} while node D sees proposals from {B, D, E}. They can compute different most_common values. Agreement is violated.

Let us trace through an example. Five nodes, each proposing a different value:

Node A proposes: "red"
Node B proposes: "blue"
Node C proposes: "red"
Node D proposes: "blue"
Node E proposes: "green"

Due to network delays, Node A reaches its majority threshold having received proposals from {A, B, C}: two “red”, one “blue”. Decision: “red.”

Node D reaches its threshold having received proposals from {B, D, E}: two “blue”, one “green”. Decision: “blue.”

Agreement violated. The system is broken.
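The divergence is easy to reproduce in a few lines. A sketch that replays the two majority subsets from the trace against Attempt 2's decision rule (the `decide` helper is my own rendering of `most_common` with deterministic tie-breaking):

```python
from collections import Counter

def decide(proposals_seen):
    """Attempt 2's rule: plurality among the proposals this node
    happened to receive, ties broken by smallest value."""
    counts = Counter(proposals_seen.values())
    top = max(counts.values())
    # Deterministic tie-break -- but only over what THIS node saw.
    return min(v for v, c in counts.items() if c == top)

proposals = {"A": "red", "B": "blue", "C": "red",
             "D": "blue", "E": "green"}

seen_by_A = {n: proposals[n] for n in ("A", "B", "C")}
seen_by_D = {n: proposals[n] for n in ("B", "D", "E")}

assert decide(seen_by_A) == "red"    # A: two red, one blue
assert decide(seen_by_D) == "blue"   # D: two blue, one green
# Same rule, same deterministic tie-break, different inputs:
# Agreement is violated anyway.
```

The deterministic tie-break is not the bug. The bug is that determinism only helps when every node applies the rule to the same set of inputs, and nothing in the protocol guarantees that.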

Attempt 3: Two-Phase Approach

Fine. Let us add a round to fix this.

procedure NAIVE_CONSENSUS_V3(my_value):
    // Phase 1: Collect proposals
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from a majority of nodes
    values = collect all received proposals
    candidate = most_common(values)

    // Phase 2: Vote on the candidate
    broadcast (VOTE, candidate) to all nodes
    wait for VOTE messages from a majority of nodes
    if all received votes agree:
        return candidate
    else:
        // ???
        restart from Phase 1?  // divergence risk

How it breaks: We have improved matters — now at least we have a confirmation step. But the “else” branch is the problem. If votes disagree, what do we do? If we restart, we may never terminate. Two nodes might keep proposing different candidates, each getting enough votes from their local neighborhood to proceed to Phase 2 but never achieving unanimous votes. This is a livelock, and it is exactly the failure mode that makes consensus hard.

What if we just retry forever with random backoff? You have reinvented an unreliable version of Paxos without the properties that make Paxos correct. The issue is that between phases, the set of participating nodes can change (due to crashes and recoveries), and different nodes can be in different phases simultaneously. Without careful bookkeeping about which round you are in and which proposals have been accepted, you get inconsistency.

Attempt 4: Designated Leader

procedure NAIVE_CONSENSUS_V4(my_value):
    if i am the leader:
        broadcast (DECIDE, my_value) to all nodes
        return my_value
    else:
        wait for (DECIDE, v) from the leader
        return v

How it breaks: This actually works perfectly — as long as the leader never fails. The moment the leader crashes before sending or during sending its DECIDE message, some nodes have decided and some have not. Now you need to elect a new leader, which is itself a consensus problem. You have defined consensus in terms of itself.

Also: some nodes may have received the DECIDE and some may not. The new leader needs to figure out what, if anything, the old leader decided. Without a way to query the other nodes and reconcile their states, the new leader might propose a different value, violating Agreement for any node that already decided the old leader’s value.

This, by the way, is essentially the starting point for Paxos and Raft. They are what you get when you take the designated-leader approach and solve all of the problems I just described. The solving takes about 30 pages of proofs.

The Shape of the Real Problem

Every naive attempt above fails for one or more of these reasons:

  1. Asynchrony. You do not know whether a node is crashed or just slow. Setting a timeout is a heuristic, not a guarantee.

  2. Partial failures. A node can crash mid-broadcast, delivering its message to some nodes but not others. This creates asymmetric knowledge — different nodes have different information about the same event.

  3. State divergence during recovery. After a failure, nodes must reconcile their states before making progress. But reconciliation requires communication, which brings us back to the consensus problem.

  4. No global clock. Without a shared notion of time, you cannot determine the order of events across nodes. Lamport showed us this in 1978, and we have been dealing with the consequences ever since.

Let us illustrate the asynchrony problem more concretely.

Synchronous vs. Asynchronous Models

A synchronous system provides known upper bounds on message delivery time and processing speed. If I send you a message, it will arrive within delta time units. If a node takes a step, it completes within phi time units. These bounds are known to all participants.

In a synchronous system, consensus is straightforward — even with failures. Here is a protocol for crash failures in a synchronous system:

// Synchronous consensus tolerating f crash failures
// Requires f+1 rounds
procedure SYNC_CONSENSUS(my_value, f):
    known_values = {my_value}

    for round = 1 to f + 1:
        broadcast known_values to all nodes
        wait delta time units  // guaranteed delivery bound
        for each received message values_from_j:
            known_values = known_values UNION values_from_j

    return min(known_values)  // deterministic tie-breaking

This works because after f+1 rounds, even if f nodes crash, at least one round had no crashes (pigeonhole principle). In that round, all surviving nodes exchanged complete information. The min function is a deterministic tiebreaker that all nodes apply to the same set of values.
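The pigeonhole argument can be exercised with a toy lock-step simulation. The sketch below is my own (the `partial` map models a sender crashing mid-broadcast, reaching only some recipients — the nasty case from the protocol's perspective); it shows one round failing where f+1 rounds succeed:

```python
def run_rounds(values, rounds, partial):
    """Lock-step rounds of full-information exchange.
    partial[(rnd, sender)] lists the only recipients that sender
    reaches before crashing mid-broadcast in that round."""
    known = {n: {v} for n, v in values.items()}
    alive = set(values)
    for rnd in range(1, rounds + 1):
        # Snapshot payloads and recipients before delivering.
        msgs = [(set(known[s]), partial.get((rnd, s), set(alive)))
                for s in sorted(alive)]
        for payload, recipients in msgs:
            for r in recipients:
                known[r] |= payload
        alive -= {s for (r, s) in partial if r == rnd}
    # Each survivor decides deterministically: min of what it knows.
    return {n: min(known[n]) for n in alive}

values = {"A": 1, "B": 2, "C": 3}
crash = {(1, "A"): {"B"}}   # A crashes mid-broadcast, reaching only B

# One round is not enough: B knows A's value 1, C does not.
assert run_rounds(values, rounds=1, partial=crash) == {"B": 1, "C": 2}
# f + 1 = 2 rounds: B relays what it learned, survivors agree.
assert set(run_rounds(values, rounds=2, partial=crash).values()) == {1}
```

The simulation leans entirely on the synchrony assumption: each round is a clean barrier at which every message from a still-live sender has arrived. Remove that assumption and the round structure, and the proof with it, collapses.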

The catch: real networks are not synchronous. TCP retransmission timeouts, garbage collection pauses, VM migrations, switch buffer overflows, cosmic rays flipping bits in router memory — any of these can cause message delays that exceed any bound you care to set. Set the bound too low and you falsely suspect crashed nodes. Set it too high and your system stalls waiting for a genuinely crashed node.

An asynchronous system makes no timing assumptions whatsoever. Messages are delivered eventually, but there is no bound on how long “eventually” takes. This is the model that most closely matches real-world networks (though it is overly pessimistic in some ways).

In the asynchronous model, as we will see in Chapter 3, deterministic consensus is impossible even with a single crash failure. This is the FLP impossibility result, and it is perhaps the most important theorem in distributed systems.

The practical sweet spot is the partially synchronous model: the system is asynchronous, but eventually becomes synchronous (or there exists an unknown upper bound on message delivery that holds after some unknown point in time). This captures the behavior of real networks: they are usually well-behaved, occasionally terrible, but eventually recover. Protocols like Paxos and Raft are designed for this model — they are always safe, and they make progress once the network stabilizes.

Message Flow: When Things Go Right and Wrong

Let us trace through a simple three-node scenario to build intuition about why message ordering creates problems.

Scenario 1: Everything Works

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
                                        receive("X")
t2                      ack("X") ->     ack("X") ->
t3      receive acks
        DECIDE("X") ->
t4                      decide("X")     decide("X")

Everyone agrees. Life is good. Now let us break things.

Scenario 2: Proposer Crashes Mid-Broadcast

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
        ** CRASH **                     (message lost)
t2                      waiting...      waiting...
t3                      timeout         timeout
t4                      ???             ???

Node B received the proposal. Node C did not. Node B knows about “X”, Node C does not. If we elect a new leader and it happens to be Node C, it might propose “Y”. Node B is now in a bind: it already knows about “X” — should it accept “Y”?

This is the core dilemma that Paxos resolves with its two-phase prepare/accept mechanism. The new leader must first learn what the old leader might have decided before proposing anything new. Without this, you get inconsistency.

Scenario 3: Network Partition

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
                                        (partitioned from A and B)
t2                      ack("X") ->
t3      has 2/3 acks
        DECIDE("X") ->
t4                      decide("X")     propose("Y")  // thinks A is dead

Node C, partitioned from A and B, might time out and start its own proposal. If C can reach a majority (it cannot, in a 3-node cluster with a 1:2 partition), it would decide a different value. The majority requirement is what saves us here: C cannot get a majority because it is alone on its side of the partition. But this only works because we defined “majority” as more than N/2. With N=4 and a 2:2 partition, neither side can make progress. The system stalls until the partition heals.

This is the CAP theorem in action: during a partition, you can have Consistency (all nodes agree) or Availability (all nodes can make progress), but not both. Consensus protocols choose Consistency.
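The quorum arithmetic behind these scenarios is worth making concrete. A small sketch (helper names are my own) that checks which sides of a partition can still make progress:

```python
def quorum(n):
    """Strict majority of an n-node cluster."""
    return n // 2 + 1

def sides_that_can_progress(n, partition):
    """Given a partition (group sizes summing to n), count how
    many groups hold a quorum on their own."""
    q = quorum(n)
    return sum(1 for group_size in partition if group_size >= q)

# 3-node cluster, C alone: the {A, B} side keeps quorum, C stalls.
assert sides_that_can_progress(3, (2, 1)) == 1
# 4-node cluster split 2:2: neither side can progress.
assert sides_that_can_progress(4, (2, 2)) == 0
# No partition of 5 nodes ever yields two quorate sides:
# any two majorities of the same cluster must overlap.
assert all(sides_that_can_progress(5, p) <= 1
           for p in [(3, 2), (4, 1), (2, 2, 1)])
```

The last assertion is the whole safety story in miniature: because a majority is strictly more than N/2 nodes, two disjoint groups cannot both hold one, so at most one side of any partition can ever decide.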

Real-World Consequences

The theoretical problems described above manifest as real, costly production failures.

GitHub’s 2018 outage was triggered by a brief network partition that separated their primary database from its replicas. The failover system promoted a replica, creating two primaries with divergent writes. Reconciling those writes after the partition healed took over 24 hours of degraded service.

Amazon’s 2011 EBS outage involved a cascading failure in their distributed replication system. A network configuration change created a “re-mirroring storm” where nodes could not agree on which replicas were authoritative. The lack of clean consensus about replica state turned a minor network issue into a major multi-day outage.

MongoDB’s (formerly frequent) rollback behavior in older versions used a primary-secondary replication model without proper consensus for write acknowledgment. If a primary accepted a write, crashed, and a secondary was promoted before replicating that write, the write was silently rolled back when the old primary rejoined. This is exactly the scenario our Attempt 4 protocol fails at.

All of these are consensus failures. The systems either did not use consensus when they should have, or used consensus protocols that made incorrect assumptions about the failure model.

What Consensus Prevents

When used correctly, consensus provides the following guarantees:

Consistent leader election. At most one node believes it is the leader at any given logical time. Split-brain is prevented because the leader must be elected by a majority, and two majorities necessarily overlap.

Atomic broadcast. All nodes deliver the same messages in the same order. This is equivalent to consensus (provably so) and is the foundation of replicated state machines.

Consistent configuration changes. Adding or removing nodes from a cluster is itself a consensus problem. If nodes disagree about who is in the cluster, majority calculations break down. This is why Raft dedicates a significant portion of its paper to joint consensus for membership changes, and why getting this wrong has caused some of the most notorious distributed systems bugs.

Linearizable reads and writes. Clients observe a single, consistent ordering of operations, as if there were one copy of the data. This requires consensus to order concurrent writes and to ensure reads reflect the latest committed write.

The Cost of Consensus

Consensus is not free. The minimum cost is:

  • Latency: At least one round-trip to a majority of nodes for each decision. In a geographically distributed cluster, this can be tens or hundreds of milliseconds. Paxos and Raft both require at least two message delays in the common case (leader to followers and back).

  • Throughput: The leader is a bottleneck. Every decision goes through it. Multi-Paxos and Raft both pipeline decisions to amortize the cost, but the leader’s network and CPU remain the ceiling.

  • Availability: The system cannot make progress unless a majority of nodes are reachable. In a 5-node cluster, you can tolerate 2 failures. In a 3-node cluster, you can tolerate 1. The math is unforgiving: to tolerate f failures, you need 2f+1 nodes.
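The 2f+1 arithmetic in the availability bullet follows directly from quorum sizes. As a sketch (function names are illustrative):

```python
def nodes_needed(f):
    """Minimum cluster size to tolerate f crash failures while a
    majority quorum remains reachable: f may fail, f + 1 survive,
    and f + 1 out of 2f + 1 is a strict majority."""
    return 2 * f + 1

def failures_tolerated(n):
    """Largest f such that n - f survivors still form a strict
    majority of n."""
    return (n - 1) // 2

assert nodes_needed(1) == 3 and failures_tolerated(3) == 1
assert nodes_needed(2) == 5 and failures_tolerated(5) == 2
# Even cluster sizes buy nothing: 4 nodes tolerate the same
# single failure as 3, at higher cost.
assert failures_tolerated(4) == failures_tolerated(3) == 1
```

This is why production consensus clusters are almost always 3 or 5 nodes: even sizes add cost without adding fault tolerance, and 7 or more nodes widen the quorum enough to hurt write latency.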

These costs are why many systems that claim to use consensus actually cut corners: they use consensus for metadata (who is the leader, what is the configuration) but not for the data path. This is a legitimate architecture, but the boundary between “consensus-protected” and “not consensus-protected” operations is where bugs hide.

Looking Ahead

The rest of Part I will build the theoretical foundation you need to understand why consensus protocols are designed the way they are:

  • Chapter 2 examines failure models — the assumptions about what can go wrong that determine which protocols are possible.
  • Chapter 3 covers the FLP impossibility result — the theorem that says deterministic asynchronous consensus is impossible, and how practical protocols sidestep it.
  • Chapter 4 addresses Byzantine failures — what happens when nodes can lie, not just crash.

These are not abstract curiosities. Every design decision in every consensus protocol — every quorum size, every timeout, every phase of message exchange — exists because of the constraints these results establish. Understanding the constraints is prerequisite to understanding the solutions.

If you take one thing from this chapter, let it be this: consensus is hard not because we are bad at engineering, but because the problem itself has fundamental lower bounds imposed by the physics of distributed communication. Messages take time. Nodes can fail. You cannot distinguish a slow node from a dead one. Every consensus protocol is a different set of tradeoffs within these constraints, and understanding which tradeoffs your system makes is the difference between an architecture and a hope.