Introduction

There is a particular kind of suffering reserved for engineers who must make multiple computers agree on something. It’s not the clean suffering of a hard mathematical proof or the dramatic suffering of a production outage at 3 AM (though consensus algorithms cause plenty of both). It’s the slow, grinding suffering of reading a paper that says “the protocol is straightforward” and then spending six months discovering all the ways it isn’t.

This book is about that suffering.

What This Book Covers

We’re going to walk through the major consensus algorithms — Paxos, Raft, PBFT, and their many cousins, variants, and competitors — with a level of honesty that most textbooks avoid. Academic papers have an incentive to make their protocols sound elegant. Vendor documentation has an incentive to make their implementations sound easy. We have neither incentive.

Specifically, we’ll cover:

  • Why consensus is hard — not just the impossibility theorems, but the practical reasons that make every implementation a minefield
  • The classic protocols — Paxos, Multi-Paxos, Viewstamped Replication, Raft, and Zab, compared on what matters: understandability, performance, and how much you’ll hate debugging them
  • Byzantine fault tolerance — PBFT, HotStuff, Tendermint, and the question of whether BFT is worth the overhead outside of blockchains
  • Modern variants — EPaxos, Flexible Paxos, Kafka’s ISR, CRDTs, and virtual consensus approaches that try to sidestep the problem entirely
  • The honest comparison — tradeoff matrices, decision frameworks, and a frank discussion of why so many teams end up just using whatever Kafka does

Who This Book Is For

This book assumes you know what a distributed system is and have at least a passing familiarity with concepts like replication, partitions, and the general unfairness of network communication. You don’t need to have implemented a consensus algorithm before — but if you have, you’ll appreciate the commiseration.

The ideal reader is someone who:

  • Is evaluating consensus algorithms for a real system and wants to understand the tradeoffs without reading twenty papers
  • Has read the Raft paper and wonders why it was supposedly the “understandable” one
  • Needs to explain to their manager why “just use Paxos” is not a complete engineering plan
  • Wants to understand why their ZooKeeper cluster keeps doing weird things
  • Is tired of blog posts that explain consensus with analogies about pizza delivery

How to Read This Book

The book is structured in five parts that build on each other, but you don’t have to read it linearly.

If you’re new to consensus, start from Part I. The foundations matter more than you think, and skipping the impossibility results will leave you confused about why every protocol has the limitations it does.

If you’re comparing specific protocols, jump to the relevant chapters in Parts II through IV, then read Part V for the comparison framework. Each protocol chapter is designed to be relatively self-contained.

If you’re making a decision right now and don’t have time for theory, go straight to Chapter 20: “When to Use What (and When to Give Up).” It’s the chapter I wish someone had written for me years ago.

If you’re here for the Byzantine stuff, Part III stands on its own, though you’ll want to read Chapter 4 (The Byzantine Generals Problem) first.

A Note on Pseudocode

Throughout this book, we use pseudocode to describe protocol behavior. The pseudocode is designed to be readable by anyone with programming experience — it’s not tied to any particular language. Where the pseudocode necessarily simplifies the actual protocol, we say so explicitly. Where the papers simplify the actual protocol and don’t say so, we also say so explicitly.

A Note on Honesty

Every chapter title in this book contains editorial commentary. “Lamport’s Beautiful Nightmare.” “The One Nobody Reads.” “When You Can’t Trust Anyone.” These aren’t clickbait — they’re honest descriptions of what it’s like to work with these protocols. The academic community has produced brilliant work on consensus, and we respect that work deeply. But respecting it doesn’t mean pretending it’s easy. The gap between a consensus algorithm on a whiteboard and a consensus algorithm in production is where careers go to age prematurely.

Let’s begin.

The Problem of Agreement

You would think that getting a handful of computers to agree on a single value would be straightforward. After all, we solved distributed telephony in the 1960s, we landed on the moon with less computing power than your thermostat, and your average database handles millions of transactions per day. Surely “pick a value and tell everyone” is a solved problem.

It is not. It is, in fact, one of the deepest problems in computer science, and the source of more production outages, data loss incidents, and engineer-years of debugging than perhaps any other single class of problem. Welcome to consensus.

What We Mean by “Agreement”

Let us be precise, because imprecision is what gets systems killed.

Consensus is the problem of getting a set of N processes (nodes, servers, replicas — pick your preferred term) to agree on a single value. That is the one-sentence version. The formal version requires three properties:

Agreement. No two correct processes decide on different values. If node A decides the value is “X” and node B decides the value is “Y”, your system is broken. This is the non-negotiable property. Everything else is optimization.

Validity. The decided value must have been proposed by some process. This sounds trivially obvious, but without it, you could satisfy Agreement by having every node always decide “42” regardless of input. Validity prevents degenerate solutions.

Termination. Every correct process eventually decides some value. This is the liveness property, and it is where all the trouble lives. A protocol that satisfies Agreement and Validity by never deciding anything is technically safe but entirely useless.

Some formulations add Integrity (a process decides at most once) as a separate property. Others fold it into Agreement. The distinction matters less than understanding that these three properties — safety (Agreement + Validity) and liveness (Termination) — are in fundamental tension with each other when failures enter the picture. We will see exactly why in Chapter 3.
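These properties are mechanical enough to check against a recorded execution. Here is a small Python sketch (the process names and values are illustrative): given what each process proposed and what each correct process decided, we can test Agreement, Validity, and Termination directly.

```python
# Check the three consensus properties over one recorded execution.
# `proposals` maps each process to its proposed value; `decisions` maps each
# correct process to its decided value (None = never decided).

def check_agreement(decisions):
    # No two correct processes decide different values.
    decided = [v for v in decisions.values() if v is not None]
    return len(set(decided)) <= 1

def check_validity(proposals, decisions):
    # Every decided value was actually proposed by someone.
    proposed = set(proposals.values())
    return all(v in proposed for v in decisions.values() if v is not None)

def check_termination(decisions):
    # Every correct process eventually decided something.
    return all(v is not None for v in decisions.values())

# A run where every correct process decides a proposed value passes all three:
proposals = {"A": "X", "B": "Y", "C": "X"}
decisions = {"A": "X", "B": "X", "C": "X"}
```

Note that the degenerate "always decide 42" protocol passes `check_agreement` but fails `check_validity`, which is exactly why Validity exists.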

The Two Generals Problem: A Parable

Before we tackle distributed consensus proper, let us consider its simpler, more hopeless cousin.

Two armies, each commanded by a general, are camped on opposite sides of a valley. In the valley sits an enemy city. The armies can only win if they attack simultaneously. If only one attacks, it will be destroyed. The generals can communicate only by sending messengers through the valley, but messengers may be captured by the enemy — that is, messages may be lost.

General A sends a message: “Attack at dawn.” But how does A know that B received the message? B could send an acknowledgment. But how does B know that A received the acknowledgment? A could acknowledge the acknowledgment. You see where this is going.

No finite number of message exchanges can give both generals certainty that they agree on the plan. This is provably impossible, and the proof is elegant: suppose some protocol P solves the problem using k messages. Consider the last message sent. The sender must have already decided to attack before knowing whether this last message was received (otherwise, the sender’s decision depends on a response that might never come). But if the sender can decide without the last message being received, then the receiver’s participation in this last exchange is unnecessary. Remove it. Now the protocol uses k-1 messages. Apply the same argument. Eventually you reach zero messages, which is absurd.

The Two Generals Problem tells us something fundamental: in a system where messages can be lost, you cannot achieve guaranteed agreement in a finite number of steps. Full stop.

“But wait,” you say, “TCP gives us reliable message delivery.” Does it? TCP gives you reliable delivery or a timeout. When the timeout fires, you do not know whether your message was delivered, only that you did not receive a response in time. You are right back with the generals.
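The induction argument can be made concrete with a toy model, assuming a simple alternating acknowledgment chain: however long the chain, an adversary who drops the last message leaves the generals with unequal knowledge, and the sender of that last message cannot tell which run it is in.

```python
# Toy model of the Two Generals induction step. Generals A and B alternate
# messages; `received[g]` counts how many messages general g has observed.
# An adversary may drop the final message of the chain.

def run_chain(k, deliver_last=True):
    received = {"A": 0, "B": 0}
    for i in range(1, k + 1):
        sender, receiver = ("A", "B") if i % 2 == 1 else ("B", "A")
        if deliver_last or i < k:
            received[receiver] += 1
    return received

full = run_chain(5)                      # -> {'A': 2, 'B': 3}
cut  = run_chain(5, deliver_last=False)  # -> {'A': 2, 'B': 2}
# A's view is identical in both runs (it received 2 messages either way),
# so A cannot know whether its last message arrived. No finite k fixes this.
```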

What Goes Wrong Without Consensus

If the theoretical argument does not move you, perhaps some war stories will.

Split-Brain in Databases

Consider a primary-replica database setup. The primary handles writes. The replica handles reads and stands by to take over if the primary fails. Some monitoring system watches the primary and, if it appears dead, promotes the replica.

Now: the network between the monitor and the primary develops packet loss. The primary is fine — it is happily serving writes. But the monitor cannot reach it, declares it dead, and promotes the replica. You now have two nodes that both believe they are the primary. Both accept writes. The data diverges. When the network heals, you have two incompatible copies of your database, and the merge process — if one even exists — will lose data.

This is split-brain, and it happens in production with depressing regularity. Every major database vendor has a post-mortem involving this scenario. The root cause is always the same: the system made a decision (promote the replica) without achieving consensus among all participants about the state of the world.
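The failure sequence compresses into a few lines of Python. This is a deliberately minimal sketch with made-up node names, not any real failover product; the point is that the monitor's promotion decision is local, taken without agreement from the rest of the system.

```python
# Minimal replay of the split-brain scenario: a monitor that cannot reach
# the primary declares it dead, even though the primary is alive and serving.

class Node:
    def __init__(self, name, role="replica"):
        self.name = name
        self.role = role
        self.writes = []

def monitor_tick(replica, can_reach_primary):
    if not can_reach_primary:
        replica.role = "primary"   # promote -- without consensus!

primary, replica = Node("db1", role="primary"), Node("db2")

# Packet loss between monitor and primary, not an actual crash:
monitor_tick(replica, can_reach_primary=False)

# Both nodes now believe they are primary, and both accept writes.
for node, value in ((primary, "x=1"), (replica, "x=2")):
    if node.role == "primary":
        node.writes.append(value)

primaries = [n.name for n in (primary, replica) if n.role == "primary"]
# primaries == ["db1", "db2"]: two primaries, two divergent write sets
```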

Leader Election Gone Wrong

Distributed systems frequently elect a leader to coordinate work. ZooKeeper, etcd, and Consul all provide leader election primitives. But leader election is consensus — you are getting N nodes to agree on which one is the leader.

A naive approach: each node broadcasts “I am the leader” and the first one to receive a majority of acknowledgments wins. Sounds reasonable. Now consider:

  1. Nodes A and B both broadcast simultaneously.
  2. Nodes C, D, and E each receive one of these broadcasts first (due to network delays).
  3. C and D acknowledge A. D and E acknowledge B.
  4. Node D acknowledged both because messages arrived in different orders on different network paths.
  5. A believes it has a majority (C, D, A = 3 of 5). B believes it has a majority (D, E, B = 3 of 5).
  6. Two leaders. Data corruption ensues.

The bug is that node D’s acknowledgments were not mutually exclusive. A proper consensus protocol ensures that once a node votes for a proposal, that vote cannot be reused for a conflicting proposal. This sounds simple to bolt on. It is not.
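The double-count is easy to reproduce. Here is a sketch of the buggy tally alongside the obvious repair (one vote per node per round); the node names match the trace above, and the helper names are mine, not from any real election library.

```python
# Buggy election: node D acknowledges every proposal it sees, so both
# candidates count D toward "their" majority.

def tally(acks_by_candidate, n):
    majority = n // 2 + 1
    return [c for c, acks in acks_by_candidate.items() if len(acks) >= majority]

# A counts itself plus acks from C and D; B counts itself plus D and E.
acks = {"A": {"A", "C", "D"}, "B": {"B", "D", "E"}}
leaders = tally(acks, n=5)   # both A and B "win" -- two leaders

# The repair: votes are exclusive, one per voter per round.
def tally_exclusive(votes, n):
    # votes maps each voter to the single candidate it voted for this round
    majority = n // 2 + 1
    counts = {}
    for candidate in votes.values():
        counts[candidate] = counts.get(candidate, 0) + 1
    return [c for c, k in counts.items() if k >= majority]

votes = {"A": "A", "C": "A", "D": "A", "B": "B", "E": "B"}
# With exclusive votes, at most one candidate can reach a majority.
```

With exclusive votes the total vote count is at most n, so two candidates cannot both exceed n/2. That invariant is trivial to state here and surprisingly hard to preserve across crashes and restarts, which is what the real protocols spend their complexity on.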

Inconsistent Reads Across Replicas

You write a value to a replicated data store. You read it back from a different replica. The value is not there yet. Or worse: you read the value, make a decision based on it, and then another process reads from a third replica that has not received the write and makes an incompatible decision.

Without consensus on the order of operations, replicas can disagree about which writes have been applied, which have been rolled back, and which are still in flight. Linearizability — the gold standard for consistency — requires consensus.

A Naive Protocol (and Why It Fails)

Let us try to build a consensus protocol from scratch. We have N nodes, each with a proposed value. We want them all to agree on one value.

Attempt 1: Broadcast and Majority

// Run on each node i
procedure NAIVE_CONSENSUS(my_value):
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from all nodes  // already doomed
    values = collect all received proposals
    decision = most_common(values)  // tie-breaking by node ID
    return decision

How it breaks: The “wait for PROPOSE messages from all nodes” step is the problem. If even one node has crashed, we wait forever. Termination is violated.

Attempt 2: Wait for a Majority Instead

procedure NAIVE_CONSENSUS_V2(my_value):
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from a majority of nodes
    values = collect all received proposals
    decision = most_common(values)
    return decision

How it breaks: Different nodes may receive different subsets of proposals, depending on message delays and crash timing. Node A might see proposals from {A, B, C} while node D sees proposals from {B, D, E}. They can compute different most_common values. Agreement is violated.

Let us trace through an example. Five nodes, each proposing a different value:

Node A proposes: "red"
Node B proposes: "blue"
Node C proposes: "red"
Node D proposes: "blue"
Node E proposes: "green"

Due to network delays, Node A receives proposals from {A, B, C} before its majority threshold: two “red”, one “blue”. Decision: “red.”

Node D receives proposals from {B, D, E}: one “blue”, one “blue”, one “green”. Decision: “blue.”

Agreement violated. The system is broken.
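The trace above runs as a direct simulation of Attempt 2. Each node decides from whichever majority subset of proposals reaches it first; the tie-breaking rule here (smallest value among the most common) is a stand-in for the pseudocode's "tie-breaking by node ID".

```python
# Simulation of NAIVE_CONSENSUS_V2 with the five proposals above.
from collections import Counter

proposals = {"A": "red", "B": "blue", "C": "red", "D": "blue", "E": "green"}

def naive_v2(received_from):
    # Decide from whatever majority subset of proposals arrived first.
    counts = Counter(proposals[n] for n in received_from)
    top = max(counts.values())
    # Deterministic tie-break: smallest value among the most common.
    return min(v for v, c in counts.items() if c == top)

decision_A = naive_v2({"A", "B", "C"})  # two "red", one "blue"   -> "red"
decision_D = naive_v2({"B", "D", "E"})  # two "blue", one "green" -> "blue"
# decision_A != decision_D: Agreement violated, exactly as in the trace.
```

Note that both nodes followed the protocol faithfully and both waited for a majority. The disagreement comes purely from seeing different majorities.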

Attempt 3: Two-Phase Approach

Fine. Let us add a round to fix this.

procedure NAIVE_CONSENSUS_V3(my_value):
    // Phase 1: Collect proposals
    broadcast (PROPOSE, my_value) to all nodes
    wait for PROPOSE messages from a majority of nodes
    values = collect all received proposals
    candidate = most_common(values)

    // Phase 2: Vote on the candidate
    broadcast (VOTE, candidate) to all nodes
    wait for VOTE messages from a majority of nodes
    if all received votes agree:
        return candidate
    else:
        // ???
        restart from Phase 1?  // divergence risk

How it breaks: We have improved matters — now at least we have a confirmation step. But the “else” branch is the problem. If votes disagree, what do we do? If we restart, we may never terminate. Two nodes might keep proposing different candidates, each getting enough votes from their local neighborhood to proceed to Phase 2 but never achieving unanimous votes. This is a livelock, and it is exactly the failure mode that makes consensus hard.

What if we just retry forever with random backoff? You have reinvented an unreliable version of Paxos without the properties that make Paxos correct. The issue is that between phases, the set of participating nodes can change (due to crashes and recoveries), and different nodes can be in different phases simultaneously. Without careful bookkeeping about which round you are in and which proposals have been accepted, you get inconsistency.

Attempt 4: Designated Leader

procedure NAIVE_CONSENSUS_V4(my_value):
    if i am the leader:
        broadcast (DECIDE, my_value) to all nodes
        return my_value
    else:
        wait for (DECIDE, v) from the leader
        return v

How it breaks: This actually works perfectly — as long as the leader never fails. The moment the leader crashes before or partway through broadcasting its DECIDE message, some nodes have decided and some have not. Now you need to elect a new leader, which is itself a consensus problem. You have defined consensus in terms of itself.

Also: some nodes may have received the DECIDE and some may not. The new leader needs to figure out what, if anything, the old leader decided. Without a way to query the other nodes and reconcile their states, the new leader might propose a different value, violating Agreement for any node that already decided the old leader’s value.

This, by the way, is essentially the starting point for Paxos and Raft. They are what you get when you take the designated-leader approach and solve all of the problems I just described. The solving takes about 30 pages of proofs.

The Shape of the Real Problem

Every naive attempt above fails for one or more of these reasons:

  1. Asynchrony. You do not know whether a node is crashed or just slow. Setting a timeout is a heuristic, not a guarantee.

  2. Partial failures. A node can crash mid-broadcast, delivering its message to some nodes but not others. This creates asymmetric knowledge — different nodes have different information about the same event.

  3. State divergence during recovery. After a failure, nodes must reconcile their states before making progress. But reconciliation requires communication, which brings us back to the consensus problem.

  4. No global clock. Without a shared notion of time, you cannot determine the order of events across nodes. Lamport showed us this in 1978, and we have been dealing with the consequences ever since.

Let us illustrate the asynchrony problem more concretely.

Synchronous vs. Asynchronous Models

A synchronous system provides known upper bounds on message delivery time and processing speed. If I send you a message, it will arrive within delta time units. If a node takes a step, it completes within phi time units. These bounds are known to all participants.

In a synchronous system, consensus is straightforward — even with failures. Here is a protocol for crash failures in a synchronous system:

// Synchronous consensus tolerating f crash failures
// Requires f+1 rounds
procedure SYNC_CONSENSUS(my_value, f):
    known_values = {my_value}

    for round = 1 to f + 1:
        broadcast known_values to all nodes
        wait delta time units  // guaranteed delivery bound
        for each received message values_from_j:
            known_values = known_values UNION values_from_j

    return min(known_values)  // deterministic tie-breaking

This works because after f+1 rounds, even if f nodes crash, at least one round had no crashes (pigeonhole principle). In that round, all surviving nodes exchanged complete information. The min function is a deterministic tiebreaker that all nodes apply to the same set of values.
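To see the pigeonhole argument work, here is a small round-based harness for SYNC_CONSENSUS (an illustrative simulation, not a networked implementation). A crash schedule maps a node to the round it dies in and the subset of nodes its dying broadcast reaches, modeling a crash mid-broadcast.

```python
# Round-based simulation of SYNC_CONSENSUS under crash-stop failures.
# crash_schedule: node -> (round it crashes in, nodes its last broadcast reaches)

def sync_consensus(values, f, crash_schedule=None):
    crash_schedule = crash_schedule or {}
    nodes = list(values)
    known = {n: {values[n]} for n in nodes}
    alive = set(nodes)
    for rnd in range(1, f + 2):            # f + 1 rounds
        inboxes = {n: set() for n in nodes}
        for sender in list(alive):
            crash = crash_schedule.get(sender)
            if crash and crash[0] == rnd:
                reached = crash[1]         # partial broadcast, then crash
                alive.discard(sender)
            else:
                reached = nodes            # full broadcast (synchronous bound)
            for receiver in reached:
                inboxes[receiver] |= known[sender]
        for n in alive:
            known[n] |= inboxes[n]
    return {n: min(known[n]) for n in alive}  # deterministic tie-break

# f = 1, two rounds. Node C's dying broadcast reaches only A, but the
# crash-free round spreads C's value everywhere: all survivors agree.
decisions = sync_consensus(
    {"A": "x", "B": "y", "C": "a"}, f=1,
    crash_schedule={"C": (1, ["A"])},
)
# decisions == {"A": "a", "B": "a"}
```

Delete the second round (set f=0 with the same crash) and A and B decide differently, which is the pigeonhole argument running in reverse.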

The catch: real networks are not synchronous. TCP retransmission timeouts, garbage collection pauses, VM migrations, switch buffer overflows, cosmic rays flipping bits in router memory — any of these can cause message delays that exceed any bound you care to set. Set the bound too low and you falsely suspect crashed nodes. Set it too high and your system stalls waiting for a genuinely crashed node.

An asynchronous system makes no timing assumptions whatsoever. Messages are delivered eventually, but there is no bound on how long “eventually” takes. This is the model that most closely matches real-world networks (though it is overly pessimistic in some ways).

In the asynchronous model, as we will see in Chapter 3, deterministic consensus is impossible even with a single crash failure. This is the FLP impossibility result, and it is perhaps the most important theorem in distributed systems.

The practical sweet spot is the partially synchronous model: the system is asynchronous, but eventually becomes synchronous (or there exists an unknown upper bound on message delivery that holds after some unknown point in time). This captures the behavior of real networks: they are usually well-behaved, occasionally terrible, but eventually recover. Protocols like Paxos and Raft are designed for this model — they are always safe, and they make progress once the network stabilizes.

Message Flow: When Things Go Right and Wrong

Let us trace through a simple three-node scenario to build intuition about why message ordering creates problems.

Scenario 1: Everything Works

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
                                        receive("X")
t2                      ack("X") ->     ack("X") ->
t3      receive acks
        DECIDE("X") ->
t4                      decide("X")     decide("X")

Everyone agrees. Life is good. Now let us break things.

Scenario 2: Proposer Crashes Mid-Broadcast

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
        ** CRASH **                     (message lost)
t2                      waiting...      waiting...
t3                      timeout         timeout
t4                      ???             ???

Node B received the proposal. Node C did not. Node B knows about “X”, Node C does not. If we elect a new leader and it happens to be Node C, it might propose “Y”. Node B is now in a bind: it already knows about “X” — should it accept “Y”?

This is the core dilemma that Paxos resolves with its two-phase prepare/accept mechanism. The new leader must first learn what the old leader might have decided before proposing anything new. Without this, you get inconsistency.

Scenario 3: Network Partition

Time    Node A          Node B          Node C
----    ------          ------          ------
t1      propose("X") ->
                        receive("X")
                                        (partitioned from A and B)
t2                      ack("X") ->
t3      has 2/3 acks
        DECIDE("X") ->
t4                      decide("X")     propose("Y")  // thinks A is dead

Node C, partitioned from A and B, might time out and start its own proposal. If C can reach a majority (it cannot, in a 3-node cluster with a 1:2 partition), it would decide a different value. The majority requirement is what saves us here: C cannot get a majority because it is alone on its side of the partition. But this only works because we defined “majority” as more than N/2. With N=4 and a 2:2 partition, neither side can make progress. The system stalls until the partition heals.

This is the CAP theorem in action: during a partition, you can have Consistency (all nodes agree) or Availability (all nodes can make progress), but not both. Consensus protocols choose Consistency.
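The quorum arithmetic behind these scenarios is worth checking directly. A side of a partition can make progress only if it holds a strict majority, and any two majorities of the same cluster must share at least one node — which is the overlap property every quorum-based protocol leans on. A brute-force check for small clusters:

```python
from itertools import combinations

def has_majority(side, n):
    # A strict majority: more than half the cluster.
    return side > n // 2

# 3-node cluster, 1:2 partition: only the 2-node side can proceed.
# 4-node cluster, 2:2 partition: neither side can -- the system stalls.
stalled_3 = [side for side in (1, 2) if not has_majority(side, 3)]   # [1]
stalled_4 = [side for side in (2, 2) if not has_majority(side, 4)]   # [2, 2]

def majorities_overlap(n):
    # Every pair of majority quorums shares at least one node.
    nodes = range(n)
    q = n // 2 + 1
    return all(set(a) & set(b)
               for a in combinations(nodes, q)
               for b in combinations(nodes, q))
# majorities_overlap(n) holds for every n -- the pigeonhole at work.
```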

Real-World Consequences

The theoretical problems described above manifest as real, costly production failures.

GitHub’s 2018 outage began with a brief network partition — well under a minute — between their data centers. The failover system promoted replicas, leaving two sides that both accepted writes. The divergence itself was short, but reconciling it was not: restoring consistency required rebuilding databases from backups and replaying writes, and the degraded service lasted more than 24 hours.

Amazon’s 2011 EBS outage involved a cascading failure in their distributed replication system. A network configuration change created a “re-mirroring storm” where nodes could not agree on which replicas were authoritative. The lack of clean consensus about replica state turned a minor network issue into a major multi-day outage.

MongoDB’s (formerly frequent) rollback behavior in older versions used a primary-secondary replication model without proper consensus for write acknowledgment. If a primary accepted a write, crashed, and a secondary was promoted before replicating that write, the write was silently rolled back when the old primary rejoined. This is exactly the scenario our Attempt 4 protocol fails at.

All of these are consensus failures. The systems either did not use consensus when they should have, or used consensus protocols that made incorrect assumptions about the failure model.

What Consensus Prevents

When used correctly, consensus provides the following guarantees:

Consistent leader election. At most one node believes it is the leader at any given logical time. Split-brain is prevented because the leader must be elected by a majority, and two majorities necessarily overlap.

Atomic broadcast. All nodes deliver the same messages in the same order. This is equivalent to consensus (provably so) and is the foundation of replicated state machines.

Consistent configuration changes. Adding or removing nodes from a cluster is itself a consensus problem. If nodes disagree about who is in the cluster, majority calculations break down. This is why Raft dedicates a significant portion of its paper to joint consensus for membership changes, and why getting this wrong has caused some of the most notorious distributed systems bugs.

Linearizable reads and writes. Clients observe a single, consistent ordering of operations, as if there were one copy of the data. This requires consensus to order concurrent writes and to ensure reads reflect the latest committed write.

The Cost of Consensus

Consensus is not free. The minimum cost is:

  • Latency: At least one round-trip to a majority of nodes for each decision. In a geographically distributed cluster, this can be tens or hundreds of milliseconds. Paxos and Raft both require at least two message delays in the common case (leader to followers and back).

  • Throughput: The leader is a bottleneck. Every decision goes through it. Multi-Paxos and Raft both pipeline decisions to amortize the cost, but the leader’s network and CPU remain the ceiling.

  • Availability: The system cannot make progress unless a majority of nodes are reachable. In a 5-node cluster, you can tolerate 2 failures. In a 3-node cluster, you can tolerate 1. The math is unforgiving: to tolerate f failures, you need 2f+1 nodes.

These costs are why many systems that claim to use consensus actually cut corners: they use consensus for metadata (who is the leader, what is the configuration) but not for the data path. This is a legitimate architecture, but the boundary between “consensus-protected” and “not consensus-protected” operations is where bugs hide.

Looking Ahead

The rest of Part I will build the theoretical foundation you need to understand why consensus protocols are designed the way they are:

  • Chapter 2 examines failure models — the assumptions about what can go wrong that determine which protocols are possible.
  • Chapter 3 covers the FLP impossibility result — the theorem that says deterministic asynchronous consensus is impossible, and how practical protocols sidestep it.
  • Chapter 4 addresses Byzantine failures — what happens when nodes can lie, not just crash.

These are not abstract curiosities. Every design decision in every consensus protocol — every quorum size, every timeout, every phase of message exchange — exists because of the constraints these results establish. Understanding the constraints is prerequisite to understanding the solutions.

If you take one thing from this chapter, let it be this: consensus is hard not because we are bad at engineering, but because the problem itself has fundamental lower bounds imposed by the physics of distributed communication. Messages take time. Nodes can fail. You cannot distinguish a slow node from a dead one. Every consensus protocol is a different set of tradeoffs within these constraints, and understanding which tradeoffs your system makes is the difference between an architecture and a hope.

Failure Models and Assumptions

Every distributed systems paper begins with a model. This is not academic throat-clearing — the model is the contract between you and reality. Get it wrong, and your beautifully proven protocol provides guarantees about a universe that does not exist. Get it right, and you have a fighting chance.

The model specifies two things: what kinds of failures can occur, and what timing assumptions the system provides. These two axes — failure model and synchrony model — form a grid, and your position on that grid determines which consensus protocols are available to you, how many nodes you need, and how much performance you sacrifice.

Most engineers pick a model implicitly, by choosing a consensus library and hoping the library’s authors made reasonable assumptions. This chapter is about making those assumptions explicit, because the assumptions will bite you eventually, and it is better to know where the teeth are.

The Failure Model Spectrum

Failures range from polite to adversarial. The more adversarial the failure model, the more expensive your protocol.

Crash-Stop Failures

A crash-stop process executes correctly until it fails, at which point it stops forever. It does not send corrupted messages. It does not come back to life with amnesia. It simply ceases to exist.

// A crash-stop process
procedure CRASH_STOP_NODE():
    while true:
        msg = receive()
        response = process(msg)  // always correct
        send(response)           // always correct
        // At some point, this loop simply stops executing.
        // No partial messages. No corruption. Just silence.

This is the gentlest failure model, and it is the one assumed by the original Paxos paper and by Raft. It is also the least realistic. Real nodes crash and come back. Disks persist data across restarts. Networks drop some messages but not others.

The appeal of crash-stop is that it is simple to reason about. A node is either correct (executing the protocol faithfully) or dead (not participating at all). There is no middle ground, no weird states, no partial failures. Protocols designed for crash-stop failures are elegant. They are also insufficient for real systems, which is why nobody actually deploys them — they deploy crash-recovery protocols instead.

Crash-Recovery Failures

A crash-recovery process can crash and later resume execution. The critical question is: what does it remember?

If the process has stable storage (disk) that survives crashes, it can recover its state and rejoin the protocol. If not, it is equivalent to a new process that knows nothing — potentially dangerous if other nodes think it still holds promises from before the crash.

// A crash-recovery process with stable storage
procedure CRASH_RECOVERY_NODE():
    state = recover_from_disk()  // may be empty on first boot

    while true:
        msg = receive()
        response, new_state = process(msg, state)
        write_to_disk(new_state)  // MUST be durable before sending
        fsync()                    // yes, really
        state = new_state
        send(response)
        // May crash at any point. After crash:
        // - state on disk is consistent (due to fsync before send)
        // - any message sent was backed by durable state
        // - any unsent message is safe to re-derive from disk state

The fsync() before send() ordering is critical, and getting it wrong is one of the most common bugs in consensus implementations. If you send a response before the state is durable, a crash loses the state but the recipient already acted on the response. The protocol’s invariants are violated.

This is the model that practical consensus implementations use. Raft’s persistent state — currentTerm, votedFor, and the log — must be durable before any messages are sent. The Raft paper says this clearly. Many implementations get it wrong anyway, either by batching fsync calls (correct but adds latency) or by skipping them entirely (fast but will corrupt data on power loss).

The fsync tax. A single fsync to a spinning disk takes 5-10ms. To an SSD, 0.1-1ms. To an NVMe drive, 10-100 microseconds. This is the floor on your consensus latency. Every consensus decision requires at least one fsync on the leader and one on each follower in the majority. Many production systems “optimize” by disabling fsync and relying on battery-backed write caches or UPS systems. This works until it does not.
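The write-then-fsync-then-send ordering looks like this in practice. This is a minimal sketch with illustrative names, not any real Raft implementation: the invariant is that no message leaves the node unless the state that justifies it is already durable on disk.

```python
import json
import os
import tempfile

def persist_then_reply(path, new_state, send):
    # Write to a temp file, fsync, then atomically rename into place.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(new_state, f)
        f.flush()
        os.fsync(f.fileno())   # state is durable BEFORE any promise is made
    os.replace(tmp, path)      # atomic rename on POSIX; a fully robust
                               # version would also fsync the directory entry
    # Only now is it safe to let the outside world act on this state.
    send({"ack": new_state["term"]})

path = os.path.join(tempfile.gettempdir(), "consensus_state_demo.json")
sent = []
persist_then_reply(path, {"term": 7, "voted_for": "B"}, sent.append)
# sent holds the ack, and the file on disk holds the state that backs it.
```

Reverse the order — send first, fsync later — and a crash in the gap produces exactly the violated-promise scenario described above: the recipient acted on state the node no longer remembers.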

Omission Failures

An omission failure occurs when a process fails to send or receive a message it should have. The process itself continues executing correctly — it just misses some messages.

// A process with omission failures
procedure OMISSION_NODE():
    while true:
        msg = receive()  // might silently drop the message
        if msg != NULL:
            response = process(msg)  // always correct if msg received
            send(response)           // might silently fail to deliver

Omission failures model real network behavior: packets get dropped, TCP connections silently break, firewalls eat messages without sending RSTs. A process experiencing omission failures is harder to detect than a crashed process because it is still alive and partially functional.

Send-omission and receive-omission are sometimes distinguished. A process that can neither send nor receive is equivalent to a crashed process. A process that can send but not receive (or vice versa) creates interesting asymmetries: it may act on stale information while still appearing alive to its peers.

In practice, omission failures are often handled by treating them as crash failures with extra steps. If a node misses enough heartbeats (because it is not receiving them, or its responses are not being delivered), the rest of the cluster treats it as crashed. This is conservative but safe.
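The "treat it as crashed after enough missed heartbeats" policy can be sketched in a few lines. The threshold and names are illustrative, not from any particular implementation:

```python
class HeartbeatMonitor:
    """Treats a node as crashed after enough consecutive missed heartbeats.

    Note: this cannot distinguish a crashed node from one suffering
    omission failures -- both simply stop being heard from.
    """

    def __init__(self, miss_threshold=3):
        self.miss_threshold = miss_threshold
        self.missed = {}  # node id -> consecutive missed heartbeats

    def heartbeat_received(self, node):
        self.missed[node] = 0

    def heartbeat_missed(self, node):
        self.missed[node] = self.missed.get(node, 0) + 1

    def suspected_crashed(self, node):
        return self.missed.get(node, 0) >= self.miss_threshold
```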

Byzantine Failures

A Byzantine process can do anything: send contradictory messages to different nodes, lie about its state, selectively delay messages, or behave correctly for years before turning malicious at the worst possible moment.

// A Byzantine process (from the protocol's perspective)
procedure BYZANTINE_NODE():
    while true:
        msg = receive()
        // May do any of the following:
        //   - Process correctly and respond honestly
        //   - Respond with fabricated data
        //   - Send different responses to different nodes
        //   - Respond to some nodes and not others
        //   - Respond after an arbitrary delay
        //   - Coordinate with other Byzantine nodes
        //   - Do nothing
        response = ADVERSARY_CHOOSES()
        send(response)

Byzantine fault tolerance is the strongest (and most expensive) failure model. We will cover it in depth in Chapter 4. For now, the key fact is: tolerating f Byzantine faults requires 3f+1 nodes, compared to 2f+1 for crash faults. Those extra f nodes are the price of not trusting anyone.
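The cluster-size arithmetic, as one hypothetical helper:

```python
def nodes_required(f: int, byzantine: bool = False) -> int:
    """Minimum cluster size to tolerate f faults under each model."""
    return 3 * f + 1 if byzantine else 2 * f + 1

# Tolerating 2 faults: 5 nodes for crash faults, 7 for Byzantine faults.
```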

Most production distributed systems do not use Byzantine fault tolerance, and for good reason. Your servers are not actively trying to sabotage each other. (If they are, you have bigger problems than consensus.) Byzantine tolerance is primarily relevant in blockchain systems, multi-tenant systems with untrusted participants, and safety-critical systems where hardware bit-flips could corrupt behavior.

The Failure Model Comparison

| Property | Crash-Stop | Crash-Recovery | Omission | Byzantine |
|---|---|---|---|---|
| Nodes for f faults | 2f+1 | 2f+1 | 2f+1 | 3f+1 |
| Can node recover? | No | Yes (with stable storage) | N/A (still running) | N/A |
| Message integrity | Guaranteed | Guaranteed | Guaranteed | Not guaranteed |
| Node behavior | Correct or silent | Correct, silent, or recovering | Correct but lossy | Arbitrary |
| Real-world match | Power loss, kill -9 | Server reboot, process restart | Network issues, GC pauses | Bugs, hackers, bit-flips |
| Protocol complexity | Low | Medium | Medium | High |
| Performance overhead | Low | Medium (fsync costs) | Low | High (extra rounds, signatures) |
| Common protocols | Basic Paxos, Raft (simplified) | Multi-Paxos, Raft (production) | Paxos with retransmission | PBFT, HotStuff, Tendermint |

The Synchrony Spectrum

Orthogonal to the failure model is the timing model: what assumptions does the protocol make about how long things take?

Synchronous Model

In a synchronous system, there exist known upper bounds:

  • Message delivery: Every message is delivered within delta time units.
  • Processing time: Every computation step completes within phi time units.
  • Clock drift: All clocks advance at the same rate (or within a known bound of each other).

These bounds are known to all participants and are never violated.

// In a synchronous system, this is safe:
procedure SYNC_FAILURE_DETECTION(node_j):
    send PING to node_j
    start timer(2 * delta + phi)  // round-trip + processing
    if receive PONG from node_j before timer expires:
        node_j is alive
    else:
        node_j has DEFINITELY crashed  // this conclusion is sound

The power of synchrony is that you can make definitive conclusions from the absence of messages. If a response does not arrive within the known bound, the sender has failed. No ambiguity.

Synchronous consensus is relatively straightforward. The classic algorithm by Dolev and Strong achieves Byzantine consensus in f+1 synchronous rounds using digital signatures. This matches the f+1 round lower bound proved by Fischer and Lynch in 1982 for Byzantine failures — a bound that holds even for simple crash failures, so no deterministic protocol can do better in the worst case.

The problem: real systems are not synchronous. Your “known upper bound” is a fiction. GC pauses, page faults, noisy neighbors on shared hardware, congested network links — any of these can blow past your assumed bounds. And when they do, a synchronous protocol makes incorrect conclusions (“that node is dead”) that can violate safety.

The Jepsen testing project has found numerous real-world bugs caused by exactly this: systems that assume synchrony and break when the assumption is violated. Clock skew, in particular, is a perennial source of bugs in protocols that use timestamps for ordering.

Asynchronous Model

An asynchronous system makes no timing assumptions. Messages are eventually delivered, but there is no bound on delivery time. Processes take steps, but there is no bound on the time between steps.

// In an asynchronous system, this is the best you can do:
procedure ASYNC_FAILURE_DETECTION(node_j):
    send PING to node_j
    // wait... how long?
    // If no response after T seconds:
    //   - Maybe node_j crashed
    //   - Maybe the network is slow
    //   - Maybe node_j is in a GC pause
    //   - Maybe our PING was delayed
    //   - We genuinely cannot tell
    return MAYBE_FAILED  // the best we can offer

The asynchronous model is brutally honest about what you can determine. You can never conclude that a node has failed — only that you have not heard from it recently. This makes failure detection fundamentally unreliable in asynchronous systems.

And, as the FLP impossibility result shows (Chapter 3), deterministic consensus is impossible in a purely asynchronous system, even with just one crash failure. Not “hard” — impossible.

Partially Synchronous Model

The partially synchronous model, introduced by Dwork, Lynch, and Stockmeyer in 1988, provides the practical middle ground. It comes in two flavors:

Flavor 1: Unknown bound. There exists a fixed upper bound delta on message delivery, but this bound is not known to the participants. The protocol must be correct regardless of delta’s value, but may use it implicitly (e.g., by assuming that eventually messages arrive before the timeout fires).

Flavor 2: Global Stabilization Time (GST). The system is asynchronous until some unknown time GST, after which it becomes synchronous with known bounds. The protocol must be safe at all times (even before GST) but is only required to make progress (liveness) after GST.

// Partially synchronous failure detection
procedure PARTIAL_SYNC_FAILURE_DETECTION(node_j):
    timeout = INITIAL_TIMEOUT

    while true:
        send PING to node_j
        start timer(timeout)
        if receive PONG from node_j before timer:
            node_j is SUSPECTED_ALIVE
            timeout = max(timeout / 2, MIN_TIMEOUT)  // decrease timeout
        else:
            node_j is SUSPECTED_FAILED  // not certain!
            timeout = timeout * 2  // back off: maybe we were too aggressive
            // Continue pinging — it might come back

The key insight is that the protocol never makes safety-critical decisions based on timeout conclusions. Timeouts are used for liveness only — to trigger leader elections, retransmissions, and view changes. Safety is ensured by quorum-based voting that does not depend on timing.

This is the model used by Paxos, Raft, PBFT, and virtually every practical consensus protocol. They guarantee:

  • Safety always. Even during arbitrary asynchrony (before GST), no two nodes decide different values.
  • Liveness after GST. Once the network stabilizes, the protocol will eventually make progress.

In practice, “after GST” means “when the network is behaving reasonably.” During a network partition or severe congestion, the system stalls but does not produce incorrect results. When the network heals, it resumes. This is exactly the behavior you want from a distributed database or coordination service.

The Synchrony Comparison

| Property | Synchronous | Partially Synchronous | Asynchronous |
|---|---|---|---|
| Timing bounds | Known and fixed | Unknown or eventual | None |
| Failure detection | Perfect | Eventually accurate | Impossible |
| Deterministic consensus | Possible | Possible (after GST) | Impossible (FLP) |
| Safety guarantee | Depends on bound correctness | Always | Always |
| Liveness guarantee | Always (if bounds hold) | After GST | Not guaranteed |
| Real-world match | LAN (sometimes) | Most networks (usually) | Adversarial networks |
| Round complexity (crash) | f+1 rounds | Unbounded (but finite) | N/A |

Network Partitions

A network partition divides the nodes into two or more groups that can communicate within their group but not across groups. Partitions are the most common real-world failure that consensus must handle, and they are frequently misunderstood.

What Partitions Look Like

Before partition:
    A <---> B <---> C <---> D <---> E
    All nodes can reach all other nodes.

During partition:
    A <---> B <---> C          D <---> E
    {A, B, C} can communicate.  {D, E} can communicate.
    No messages cross the partition boundary.

Asymmetric partition:
    A ----> B       (A can send to B)
    A <-/-- B       (B cannot send to A)
    This happens more often than you'd think.

Asymmetric partitions are particularly nasty. Node B receives messages from A and believes A is alive. Node A never hears from B and suspects B is dead. If A is the leader, it might step down (believing it has lost its majority), while B still thinks A is leading. The system can oscillate between states as the asymmetry plays out.

How Consensus Handles Partitions

The majority quorum mechanism is the primary defense against partitions:

procedure PARTITION_SAFE_DECISION(value):
    // To commit a value, need acks from a majority
    acks = {self}
    broadcast (PROPOSE, value) to all nodes

    while |acks| <= N / 2:
        msg = receive_with_timeout()
        if msg is ACK:
            acks = acks + {msg.sender}
        if timeout:
            // Cannot reach majority — stall, do not decide
            retry or wait

    // If we reach here, we have a majority
    commit(value)

During a partition, at most one side has a majority. The minority side stalls. This is the correct behavior: it is better to be unavailable than inconsistent.
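Why at most one side can have a majority is pure counting: any two majorities of the same N-node cluster must share at least one node. A throwaway script (illustrative, not protocol code) that verifies every pair of minimal majorities overlaps:

```python
from itertools import combinations

def majorities(n):
    """All minimal majorities of an n-node cluster: subsets of size n//2 + 1.

    Checking minimal majorities suffices -- larger majorities are supersets.
    """
    return combinations(range(n), n // 2 + 1)

def all_majorities_intersect(n):
    # An empty intersection is falsy, so all() fails if any pair is disjoint.
    return all(set(a) & set(b)
               for a in majorities(n)
               for b in majorities(n))
```

Because every pair of majorities overlaps, two sides of a partition can never both assemble a quorum, which is exactly the property the protocol leans on.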

But partitions interact with leader election in subtle ways:

  1. Leader is on the majority side: system continues operating. The minority side cannot elect a new leader (no majority). Everything is fine.

  2. Leader is on the minority side: the minority side stalls (leader cannot commit, no majority for acks). The majority side eventually times out and elects a new leader. This works, but there is a window where the old leader might still be trying to commit entries that will never succeed. Raft handles this with term numbers — the old leader’s proposals are from an older term and will be rejected by nodes that have moved on.

  3. Leader is on the partition boundary (asymmetric): this is where things get ugly. The leader might be able to send to some nodes but not receive from them, or vice versa. Different nodes have different views of whether the leader is alive. Multiple leader elections can trigger in rapid succession, leading to livelock if timeouts are not carefully randomized.

Message Delays, Reordering, and Duplication

Real networks mangle messages in every conceivable way.

Message Delays

TCP provides reliable, ordered delivery within a connection. But TCP connections break and are re-established. And at the application level, a node might batch messages, introduce processing delays, or buffer responses. The end-to-end delay between “application sends” and “application receives” is variable and unbounded.

Consensus protocols must be correct under arbitrary delays. This means:

  • A message from round 1 might arrive after messages from round 5.
  • A response to a proposal might arrive after the proposer has moved on to a new proposal.
  • An “I voted for you” message might arrive after a different leader has already been elected.

Every message in a consensus protocol must carry enough context (round number, term, ballot, epoch — different protocols use different names) for the recipient to determine whether the message is still relevant.
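A minimal sketch of that relevance check, using a Raft-style term number (class and field names are illustrative):

```python
class TermGuard:
    """Drops messages from superseded rounds before they touch protocol state."""

    def __init__(self):
        self.current_term = 0

    def admit(self, msg_term: int) -> bool:
        if msg_term < self.current_term:
            return False                   # stale: from a round we have moved past
        if msg_term > self.current_term:
            self.current_term = msg_term   # newer round: adopt its term
        return True
```

Every inbound message passes through a check like this first; only messages from the current (or a newer) round are allowed to affect state.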

Message Reordering

Even though TCP preserves order within a connection, consensus protocols use multiple connections (one to each peer). Messages sent to different peers travel different network paths and arrive in different orders. And if a node uses UDP or a custom transport, all ordering bets are off.

// This is unsafe:
procedure UNSAFE_LEADER():
    send (PREPARE, ballot=5) to all followers
    send (ACCEPT, ballot=5, value="X") to all followers
    // A follower might receive ACCEPT before PREPARE!

// This is safe:
procedure SAFE_LEADER():
    send (PREPARE, ballot=5) to all followers
    wait for PROMISE responses from majority
    // Only then:
    send (ACCEPT, ballot=5, value="X") to all followers

The SAFE_LEADER version works not because messages cannot be reordered, but because waiting for PROMISE responses creates a causal ordering: every follower in the responding majority has already processed the PREPARE by the time the leader sends ACCEPT. (A follower outside that majority can still see ACCEPT first — which is why the ACCEPT carries the ballot number and can be validated on its own.)

This is a general pattern in consensus protocols: causal ordering is established by waiting for responses, not by assuming network ordering.

Message Duplication

Networks can duplicate messages. TCP usually prevents this, but at-least-once delivery semantics at the application level (retries after timeout) can cause duplicates. Consensus protocols must be idempotent: processing the same message twice must not change the outcome.

// Idempotent vote handling
procedure HANDLE_VOTE_REQUEST(ballot, candidate):
    if ballot < current_ballot:
        ignore  // stale message, possibly duplicate
    else if ballot == current_ballot and voted_for == candidate:
        send VOTE_GRANTED  // duplicate request, safe to re-ack
    else if ballot == current_ballot and voted_for != candidate:
        send VOTE_DENIED   // already voted for someone else
    else:  // ballot > current_ballot
        current_ballot = ballot
        voted_for = candidate
        persist(current_ballot, voted_for)
        send VOTE_GRANTED

Which Model Matches Your System?

This is the practical question, and the answer is almost always: crash-recovery failures in a partially synchronous network.

Here is why:

Your servers crash and restart. They do not crash and stay dead forever (crash-stop). They have disks. They have persistent state. When they come back, they need to rejoin the protocol with their previous promises intact. This is crash-recovery.

Your servers are not malicious. They are running your code, in your datacenter, on your hardware. They might have bugs (which can look Byzantine), but the threat model of arbitrary adversarial behavior does not apply. Byzantine tolerance costs you an extra f nodes and significant performance overhead. Unless you are building a blockchain or a system with mutually untrusting participants, you do not need it.

Your network is eventually reliable. It drops packets sometimes. It has variable latency. Occasionally, a switch dies and creates a partition. But eventually, the network heals. This is partial synchrony. You do not have guaranteed bounds (synchronous), but you are not in a perpetual adversarial network (asynchronous).

There are exceptions:

  • Safety-critical systems (aircraft, medical devices, nuclear reactors) might use synchronous models with validated timing bounds on dedicated hardware.
  • Blockchain and cryptocurrency systems use Byzantine models because participants are untrusted.
  • Systems crossing trust boundaries (multi-cloud, federated systems) might need Byzantine tolerance for the cross-boundary communication, even if each individual cluster uses crash-recovery.

How Failure Models Affect Protocol Behavior

Let us trace through the same simple protocol under different failure models to see how the model changes everything.

The Protocol: Simple Majority Vote

// A leader tries to commit a value
procedure COMMIT(value):
    send (PROPOSE, value) to all followers
    acks = {self}
    while |acks| <= N/2:
        response = receive()
        if response == ACK:
            acks = acks + {response.sender}
    broadcast (COMMITTED, value) to all

Under Crash-Stop

Node C crashes before receiving the PROPOSE. The leader never receives C’s ACK, but still reaches a majority with {A, B, D, E} in a 5-node cluster. Node C never recovers. The remaining four nodes have a consistent view. Simple, clean.

Leader A: PROPOSE("X") --> B, C, D, E
Node C: ** CRASH ** (never receives)
Leader A: receives ACK from B, D, E (majority = 3 of 5)
Leader A: COMMITTED("X") --> B, D, E
Result: A, B, D, E agree on "X". C is gone forever.

No complications. C’s crash is permanent and clean. The system shrinks by one node.

Under Crash-Recovery

Node C crashes, but comes back 30 seconds later. Now what?

Leader A: PROPOSE("X") --> B, C, D, E
Node C: ** CRASH ** (after receiving PROPOSE but before sending ACK)
Leader A: receives ACK from B, D, E (majority). Commits.
Leader A: COMMITTED("X") --> B, D, E
Node C: ** RECOVERS **
Node C: reads disk... did I ack anything? What is the current state?

Node C must recover its state from disk. If it wrote the PROPOSE to disk before crashing, it knows about “X” and can catch up. If not, it is missing a committed entry and needs to be brought up to date. The protocol must have a mechanism for C to learn about committed values it missed.

In Raft, this is handled by the leader sending missing log entries to recovered followers. In Paxos, it is handled by the recovered node running the Paxos protocol for each missing slot (or by state transfer from another node).
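Stripped to its essence, the catch-up step is "ship the suffix the follower is missing." A deliberately simplified sketch — it assumes the follower's prefix already matches the leader's, whereas real Raft verifies that with term checks at the divergence point:

```python
def missing_entries(leader_log, follower_len):
    """Entries the recovered follower needs, assuming its prefix matches."""
    return leader_log[follower_len:]

def catch_up(leader_log, follower_log):
    """Bring a recovered follower's log up to date with the leader's."""
    follower_log.extend(missing_entries(leader_log, len(follower_log)))
    return follower_log
```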

The fsync ordering matters here. If C had acknowledged the PROPOSE before fsyncing and then crashed, the leader counted C’s ack toward the majority. On recovery, C has no record of its acknowledgment. If the leader’s majority depended on C’s ack (e.g., C was the deciding vote), and the leader also crashes, a new leader might not find a majority that accepted “X” and could choose a different value. Safety violated.

Under Omission Failures

Node C is alive but its network interface is dropping packets.

Leader A: PROPOSE("X") --> B, C, D, E
Node C: receives PROPOSE, sends ACK (but ACK is dropped by network)
Leader A: receives ACK from B, D, E (majority). Commits.
Leader A: COMMITTED("X") --> B, C, D, E
Node C: might or might not receive COMMITTED
Node C: believes it acked but never sees commit confirmation

C is in an awkward state: it accepted the proposal but does not know if it was committed. It cannot unilaterally decide “X” (maybe the leader chose a different value after another round). It must wait for more information, or proactively ask the leader for the current state.

Under omission failures, protocols need more aggressive retransmission and state synchronization. A node that suspects it is experiencing omission failures (because it is not receiving expected messages) should request retransmission from multiple peers, not just the leader.

Under Byzantine Failures

Node C is compromised and actively trying to break the protocol.

Leader A: PROPOSE("X") --> B, C, D, E
Node C: receives PROPOSE("X")
Node C: sends ACK("X") to Leader A (plays along)
Node C: sends PROPOSE("Y") to Nodes D and E (pretending to be leader)
Node C: sends NACK to Node B (trying to slow down the real commit)

In a crash-failure protocol, this causes chaos. Nodes D and E might accept “Y” from the fake leader. The protocol has no defense against forged messages.

A Byzantine fault tolerant protocol handles this through:

  1. Digital signatures: C cannot forge messages from A because it does not have A’s private key.
  2. Quorum intersection: The protocol requires 2f+1 out of 3f+1 nodes to agree. Even if f nodes are Byzantine, the remaining 2f+1 honest nodes’ quorums overlap sufficiently.
  3. View change protocol: If the leader is Byzantine, honest nodes can detect misbehavior and trigger a leader change.

The cost: more nodes, more messages per round, cryptographic operations on every message, and significantly more complex protocol logic.
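The quorum-intersection claim is pigeonhole counting: two quorums of size 2f+1 drawn from 3f+1 nodes must share at least f+1 members, so at least one honest node sits in both. A quick illustrative check:

```python
def byzantine_quorum_overlap(f: int) -> int:
    """Guaranteed overlap of two quorums of size 2f+1 in a 3f+1 cluster."""
    n, q = 3 * f + 1, 2 * f + 1
    # Pigeonhole: |A ∩ B| = |A| + |B| - |A ∪ B| >= 2q - n
    return 2 * q - n

# The overlap is f+1, so with at most f Byzantine nodes,
# at least one honest node is in both quorums.
```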

The “Eventually Synchronous” Sweet Spot

Let us be concrete about why partial synchrony is the right model for most systems.

Consider a 5-node Raft cluster running across three availability zones in a cloud provider. Under normal conditions:

  • Intra-zone message latency: 0.1-1ms
  • Cross-zone message latency: 1-5ms
  • fsync latency: 0.1-1ms (SSD)
  • Leader heartbeat interval: 150ms
  • Election timeout: 300-500ms (randomized)

This system is effectively synchronous 99.9% of the time. Messages arrive in under 5ms. Heartbeats arrive well before the election timeout. Consensus decisions complete in under 10ms.

But 0.1% of the time:

  • A GC pause stalls a node for 200ms, causing it to miss heartbeats.
  • A cross-zone link drops packets for 2 seconds during a routing reconvergence.
  • An NVMe drive stalls for 500ms due to wear leveling.

During these events, the system triggers unnecessary leader elections, commits stall, and latency spikes. But — and this is the crucial part — no incorrect decisions are made. The system’s safety relies on quorum voting, not on timing. The timing assumptions only affect liveness: the system might stall during the disruption, but it never produces inconsistent results.

When the disruption ends (the network stabilizes, the GC pause completes, the drive finishes wear leveling), the system resumes normal operation automatically. No manual intervention, no data reconciliation, no split-brain.

This is partial synchrony in action, and it is why every serious consensus implementation targets this model.

Common Misconceptions

“Crash failures are enough because our servers are reliable.” Your servers are reliable 99.99% of the time. The consensus protocol exists for the 0.01%. And “server” is not the only failure domain — NICs, switches, power supplies, cables, hypervisors, kernels, and your own application code all fail in ways that consensus must handle.

“We don’t need to worry about Byzantine failures because our nodes are trusted.” Mostly true, but bugs can cause Byzantine-like behavior. A bug that causes a node to send different values to different peers is, from the protocol’s perspective, a Byzantine failure. If you have ever seen a bug where a serialization library produces different output on different platforms, you have seen a non-malicious Byzantine failure.

“Our network is synchronous because we use TCP.” TCP provides reliable delivery, not bounded-time delivery. A TCP connection can be alive but stalled for arbitrary periods due to congestion control, buffer bloat, or retransmission backoff. And TCP connections break, requiring re-establishment, during which messages are delayed.

“Partial synchrony means the network is usually good.” Not quite. Partial synchrony means there exists a time after which the network behaves synchronously. It says nothing about when that time arrives — the asynchronous period can last arbitrarily long. The guarantee is existential, not probabilistic.

“We can detect failures with heartbeats.” You can detect suspected failures with heartbeats. In an asynchronous system, you cannot distinguish a crashed node from a slow one. Your heartbeat timeout is a guess — make it too short and you get false positives (unnecessary elections, wasted work); make it too long and you get slow failure detection (long unavailability windows).

Practical Recommendations

  1. Assume crash-recovery with persistent state. Your protocol must handle nodes that crash and rejoin. Every promise, vote, and log entry must be durable before being acted upon.

  2. Assume partial synchrony. Never make safety depend on timing. Use timeouts for liveness only.

  3. Design for network partitions. Assume your network will partition. Test under partition conditions. Jepsen-style testing is not optional — it is how you find the bugs that only manifest under failure.

  4. Use Byzantine tolerance only when you need it. The overhead is real: 3f+1 instead of 2f+1 nodes, more message rounds, cryptographic overhead, and dramatically more complex code. Most internal systems do not need it.

  5. Validate your assumptions. If your protocol assumes fsync is durable, test it. (It is not always durable — some filesystems lie, some disks have buggy firmware, and virtual machines add another layer of uncertainty.) If your protocol assumes clocks are roughly synchronized, measure the actual skew. If your protocol assumes a maximum message size, enforce it.

  6. Test the transitions. The steady state is easy. The transitions — node joins, node leaves, leader changes, partition forms, partition heals — are where bugs live. Every combination of “these nodes are up, these are down, this link works, that one does not” is a potential test case. You will not test all of them. Test the ones that have caused outages before.

The failure model is not a theoretical exercise. It is the foundation on which your system’s correctness rests. Choose it deliberately, validate it continuously, and be prepared for reality to exceed your model’s assumptions — because it will.

FLP Impossibility and What It Means for You

In 1985, Michael Fischer, Nancy Lynch, and Michael Paterson published a short paper that changed distributed computing forever. “Impossibility of Distributed Consensus with One Faulty Process” proved that no deterministic protocol can guarantee consensus in an asynchronous system if even a single process can crash.

Read that again. One faulty process. Not a majority. Not a Byzantine adversary. One crash. That is all it takes to make deterministic consensus impossible in an asynchronous system.

This result, universally known as “FLP” after its authors’ initials, won the Dijkstra Prize in 2001. It is the most important impossibility result in distributed computing, and it is widely misunderstood. People cite it to argue that consensus is impossible (it is not), that distributed systems are doomed (they are not), or that their particular hack to avoid consensus is justified (it usually is not).

Let us understand what FLP actually says, why it is true, and what it means for practical system design.

What FLP Actually Says

Theorem (FLP, 1985). There is no deterministic protocol that solves consensus in an asynchronous system with reliable channels if even one process may crash.

Let us unpack every word:

  • Deterministic. The protocol’s next step is entirely determined by its current state and the messages it has received. No coin flips, no random timeouts, no external oracles.
  • Consensus. Agreement, Validity, and Termination as defined in Chapter 1. All three. At the same time.
  • Asynchronous. No bounds on message delivery time or processing speed. Messages are eventually delivered, but “eventually” has no upper bound.
  • Reliable channels. Messages are not lost, duplicated, or corrupted. They are delivered exactly once, eventually. (This makes the result stronger — even with perfect channels, consensus is impossible.)
  • One process may crash. Not “one process will crash.” The protocol must be correct in executions where up to one process crashes, and also in executions where no process crashes. The adversary gets to choose.

The result says nothing about:

  • Randomized protocols (which can solve consensus in asynchronous systems)
  • Partially synchronous systems (which can solve consensus deterministically)
  • Synchronous systems (which trivially solve consensus)
  • Safety alone (which is achievable; it is the combination of safety and liveness that is impossible)

The Proof Intuition

The full proof is only a few pages but quite dense. I will walk through the intuition, which is more valuable than memorizing the formal argument.

Configurations and Decisions

Imagine the global state of the system at any point in time: the state of every process and every message in transit. Call this a configuration.

A configuration is 0-valent if, no matter what happens from this point forward, the only possible decision value is 0. It is 1-valent if the only possible decision is 1. It is bivalent if both 0 and 1 are still possible outcomes, depending on what happens next.

Think of it like a ball on a ridge. A 0-valent configuration has the ball firmly on the left slope — it can only roll left (decide 0). A 1-valent configuration has it on the right slope. A bivalent configuration has the ball balanced on the ridge — it could go either way.

Step 1: The Initial Configuration Is Bivalent

The first claim is that there exists an initial configuration that is bivalent. This follows from the Validity property: if all processes propose 0, the decision must be 0. If all processes propose 1, the decision must be 1. Now consider a sequence of initial configurations where we change one process’s input at a time, from all-0 to all-1:

Config C0: All propose 0       -> must decide 0 (0-valent)
Config C1: One proposes 1      -> decides 0 or 1
Config C2: Two propose 1       -> decides 0 or 1
...
Config Cn: All propose 1       -> must decide 1 (1-valent)

Somewhere in this sequence, there must be adjacent configurations Ck (0-valent or bivalent) and Ck+1 (1-valent or bivalent) that differ in only one process’s input. If Ck is 0-valent and Ck+1 is 1-valent, consider what happens if the process that differs crashes immediately. The remaining processes cannot tell whether they are in Ck or Ck+1 (the crashed process’s input is the only difference, and it never communicated). So they must decide the same value in both cases. But Ck requires 0 and Ck+1 requires 1. Contradiction — unless one of them is bivalent.

Therefore, at least one initial configuration is bivalent. The system starts in a state of genuine uncertainty about what it will decide.

Step 2: You Can Always Stay Bivalent

This is the core of the proof and the most subtle part.

Suppose the system is in a bivalent configuration C. There is some message m that is the “earliest” pending message — the one that the asynchronous scheduler could choose to deliver next. Consider two scenarios:

  • Scenario A: Deliver message m, reaching configuration C’.
  • Scenario B: Deliver some other message first, then eventually deliver m.

The proof shows that from any bivalent configuration, there exists a sequence of steps that keeps the configuration bivalent. The adversary (the asynchronous scheduler) can always choose to delay the “deciding” message just long enough to prevent the system from committing.

Here is the argument in more detail. Suppose C is bivalent, and m is a message to process p. Let D be the set of configurations reachable from C by delivering messages other than m first (keeping m pending). Then consider delivering m to each configuration in D.

Either some configuration in D is still bivalent (and we are done — the adversary stays in the bivalent region), or all configurations in D are univalent. If all are univalent, then there must be two “adjacent” configurations in D — say D0 (which becomes 0-valent after m) and D1 (which becomes 1-valent after m) — that differ by a single step e (delivering a message to some process q).

Now there are two cases:

Case 1: p and q are different processes. The step e (message to q) and the delivery of m (message to p) are independent — they involve different processes. So delivering e then m gives the same result as delivering m then e. But delivering m to D0 gives a 0-valent configuration, and delivering e then m starting from the same configuration gives a 1-valent configuration. Since these are the same configuration (by commutativity), we have a contradiction.

Case 2: p and q are the same process. This is where the crash comes in. If p crashes before either e or m is delivered, the remaining processes cannot distinguish the two configurations (they are identical without p’s state). So from those processes’ perspective, the system must decide the same value in both cases. But one is 0-valent and the other is 1-valent. Contradiction — unless the crashed scenario is bivalent, which means the adversary can keep the system bivalent by crashing p.

// The FLP adversary's strategy (conceptual)
procedure FLP_ADVERSARY(protocol):
    start in a bivalent initial configuration C  // Step 1 guarantees this exists

    while protocol has not terminated:
        // Look at all pending messages
        // Find the one that would force a univalent configuration
        // Delay it (asynchrony permits this)
        // Deliver a different message that preserves bivalency
        // If forced, crash the critical process (we get one crash)

        m = find_deciding_message(C)
        if can_avoid(m):
            deliver some other message  // stay bivalent
        else:
            crash the process that m is addressed to
            // remaining processes are in a bivalent state

The adversary does, of course, know the protocol — it must, to identify which deliveries would force a decision — but it needs no lookahead beyond the current configuration: at each step it simply examines the pending messages and chooses which to deliver next. The asynchronous model gives it this power: any message ordering is a valid execution.

Step 3: Putting It Together

We have shown:

  1. The system can start in a bivalent configuration.
  2. From any bivalent configuration, the adversary can keep it bivalent forever (possibly using one crash).

Therefore, the adversary can construct an execution where the system never decides. Termination is violated. QED.

The beauty of the proof is its economy: it uses only one crash, only reliable channels, and only the inherent nondeterminism of asynchronous message delivery. It identifies the fundamental problem: in an asynchronous system, you cannot force a decision because you cannot distinguish “the message is delayed” from “the sender crashed.” As long as this ambiguity exists, the adversary can exploit it to prevent progress.

What FLP Does NOT Say

FLP is frequently over-interpreted. Here is what it does not prohibit:

FLP Does Not Prohibit Safety

You can always have Agreement and Validity in an asynchronous system. A protocol that never decides satisfies both trivially (vacuously true Agreement, and no incorrect decisions). More usefully, you can design protocols that maintain safety invariants under all executions — they just might not always terminate.

Paxos, for example, never violates safety. Two different values are never committed. This holds regardless of asynchrony, partitions, crashes, or any other failure. What the FLP result says is only that Paxos might fail to terminate in some executions. And indeed, in pathological executions with competing proposers, Paxos can livelock indefinitely. But it never produces an incorrect result.

FLP Does Not Prohibit Randomized Consensus

The FLP result applies to deterministic protocols. Ben-Or (1983) showed that randomized consensus is solvable in asynchronous systems. The trick: when the protocol reaches a point where it cannot determine the decision, flip a coin.

// Ben-Or's randomized binary consensus (simplified)
procedure BEN_OR_CONSENSUS(my_value):
    v = my_value
    round = 0

    while true:
        round = round + 1

        // Phase 1: Broadcast value
        broadcast (ROUND1, round, v) to all
        wait for N - f ROUND1 messages for this round

        if more than N/2 messages carry the same value w:
            proposal = w
        else:
            proposal = UNDECIDED

        // Phase 2: Broadcast proposal
        broadcast (ROUND2, round, proposal) to all
        wait for N - f ROUND2 messages for this round

        if more than f messages carry the same decided value w:
            v = w
            if more than 2f messages carry w:
                return w  // DECIDE
        else:
            v = random_coin_flip()  // THIS breaks FLP

The randomized coin flip means the adversary cannot predict what the process will do, so it cannot keep the system in a bivalent state forever. With probability 1, the system eventually decides. But “with probability 1” is different from “definitely” — there is no upper bound on how many rounds it might take. In expectation, binary consensus takes O(2^n) rounds with Ben-Or’s protocol (later improved to O(1) expected rounds with common coins by Rabin and others).
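To build intuition for why independent coin flips converge — and why convergence is exponentially slow without a common coin — here is a toy simulation in Python (not Ben-Or itself; it models only the worst case, where no majority ever forms and every process falls through to the coin flip each round):

```python
import random

def rounds_until_unanimous(n, seed=0):
    """Rounds until n independent fair coin flips all land the same way.

    This models Ben-Or's worst case: if no majority ever forms,
    termination waits on all n processes flipping identically by chance.
    """
    rng = random.Random(seed)
    rounds = 0
    while True:
        rounds += 1
        flips = [rng.random() < 0.5 for _ in range(n)]
        if all(flips) or not any(flips):
            return rounds

# Per round, all n coins agree with probability 2 * (1/2)^n,
# so the expected number of rounds grows like 2^(n-1).
print(rounds_until_unanimous(5, seed=1))
```

With a common coin, the same experiment would terminate in one round by construction — every process receives the identical flip.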

FLP Does Not Apply to Partially Synchronous Systems

In a partially synchronous system, the adversary cannot delay messages forever. After the (unknown) Global Stabilization Time, messages are delivered within a bounded time. This breaks the adversary’s ability to indefinitely delay deciding messages.

DLS (Dwork, Lynch, Stockmeyer, 1988) showed that consensus is solvable in the partially synchronous model. This is the theoretical basis for Paxos, PBFT, and every other practical consensus protocol.

FLP Does Not Mean “Give Up”

FLP says that no single protocol can guarantee all three consensus properties in all asynchronous executions. It does not say that consensus is impractical. It says you must give something up — but you get to choose what:

What You Give Up        What You Get                 Example
----------------        ------------                 -------
Termination guarantee   Always-safe, deterministic   Paxos (may livelock with competing proposers)
Determinism             Probabilistic termination    Ben-Or, randomized protocols
Full asynchrony         Termination after GST        Raft, PBFT (assume partial synchrony)
Fault tolerance         Termination with no faults   Trivial: just wait for everyone

How Real Protocols Work Around FLP

Every production consensus protocol works around FLP. Here is how.

Timeouts (The Partial Synchrony Escape Hatch)

The most common approach: use timeouts to detect failures and trigger leader changes. The protocol is always safe, regardless of whether the timeouts are accurate. But the protocol only makes progress when timeouts are “reasonable” — that is, when the system is in its synchronous phase.

// Raft's approach to FLP
procedure RAFT_FOLLOWER():
    while true:
        reset election_timer to random(150ms, 300ms)

        while election_timer has not expired:
            if receive AppendEntries from leader:
                reset election_timer
                process entries
            if receive RequestVote from candidate:
                if candidate's term > my term and candidate's log is up to date:
                    vote for candidate
                    reset election_timer

        // Timer expired: leader is presumed dead
        // Start election (might be wrong — leader could just be slow)
        become candidate
        increment term
        request votes from all nodes

The election timeout is the protocol’s concession to FLP. In a purely asynchronous system, no timeout value is correct: it might expire before a message from a live leader arrives. But in a partially synchronous system, there exists a timeout value that works — we just do not know what it is. So Raft randomizes the timeout, and eventually some node’s timeout is long enough to avoid false positives while short enough to detect real failures.

The randomization also prevents livelock: if two candidates start elections simultaneously, their randomized timeouts make it likely that one will complete before the other starts, breaking the tie.

Failure Detectors (The Theoretical Escape Hatch)

Chandra and Toueg (1996) showed that consensus can be solved with an unreliable failure detector — a module that sometimes suspects a process has failed. The failure detector need not be accurate; it only needs to satisfy two properties:

  • Completeness: Every crashed process is eventually suspected by every correct process.
  • Eventual accuracy: There exists a time after which no correct process is suspected by any correct process.

This is called an “eventually perfect” failure detector (denoted diamond-P). It captures the intuition behind timeouts: they are sometimes wrong, but eventually they stabilize.

// An eventually perfect failure detector
procedure FAILURE_DETECTOR(node_j):
    timeout = INITIAL_TIMEOUT
    suspected = false

    while true:
        send HEARTBEAT_REQUEST to node_j
        start timer(timeout)

        if receive HEARTBEAT_RESPONSE from node_j:
            suspected = false
            // Optionally decrease timeout (we are in sync)
        else if timer expires:
            suspected = true
            timeout = timeout + DELTA  // increase timeout
            // Eventually, timeout > actual_delay
            // At that point, we stop falsely suspecting node_j
            // This is when "eventual accuracy" kicks in

        report suspected status to consensus module

The failure detector abstracts the timing assumptions out of the consensus protocol. The protocol itself is purely asynchronous (and therefore cannot violate safety due to timing issues). The failure detector handles liveness by providing hints about which nodes are alive. If the hints are eventually correct, the protocol terminates. If the hints are wrong, the protocol remains safe but may not make progress.

In practice, failure detectors are timeouts — the abstraction just makes the reasoning cleaner.

Randomization (The Probabilistic Escape Hatch)

As mentioned earlier, randomized protocols sidestep FLP entirely. The adversary controls the schedule but not the coin flips. With probability 1, the protocol terminates. The expected time to termination depends on the protocol:

  • Ben-Or (1983): O(2^n) expected rounds. Theoretically important but impractical.
  • Rabin (1983): O(1) expected rounds using a common coin (shared randomness). Requires a trusted dealer or a distributed coin-flipping protocol.
  • Cachin, Kursawe, Shoup (2000): O(1) expected rounds using threshold cryptography for the common coin. This is the basis for modern asynchronous BFT protocols.

// Common coin approach (simplified)
procedure COMMON_COIN_CONSENSUS(my_value):
    v = my_value

    for round = 1, 2, 3, ...:
        // Phase 1: Propose
        broadcast (PROPOSE, round, v)
        proposals = wait for N - f proposals

        if all proposals have the same value w:
            v = w
            broadcast (COMMIT, round, w)
            return w

        // Phase 2: Common coin
        coin = common_coin(round)  // all honest nodes get the same value
        // (implemented via threshold signatures or similar)

        if more than N/2 proposals had value w and w == coin:
            v = w  // converging
        else:
            v = coin  // reset to coin value

    // With constant probability per round, all nodes converge
    // Expected rounds to consensus: O(1)

The common coin ensures that when nodes are undecided, they all “reset” to the same random value with constant probability. Once they agree (even by accident), the agreement propagates in the next round.

Leader-Based Protocols (The Practical Approach)

Most production protocols (Paxos, Raft, Viewstamped Replication) use a stable leader to drive consensus. As long as the leader is alive and can communicate with a majority, decisions are made in two message delays (propose + acknowledge). The leader serializes proposals, eliminating the competing-proposer livelock that is the practical manifestation of FLP.

// Leader-driven consensus (common case)
procedure LEADER_CONSENSUS(value):
    // Assumes: I am the leader, no contention
    entry = create_log_entry(value, current_term)

    // Phase 1: Replicate
    send AppendEntries(entry) to all followers
    acks = wait for majority of followers to acknowledge
    // (In a 5-node cluster, need 2 follower acks)

    // Phase 2: Commit
    mark entry as committed
    apply entry to state machine
    send commitment notification to followers
    return result to client

    // Total: 2 message delays in the common case
    // No contention, no livelock, no FLP problems

The FLP adversary’s power is neutralized because there is only one proposer. The adversary cannot create competing proposals — there is nobody to compete with. The only way to trigger the FLP scenario is to crash the leader, forcing a leader election.

During leader election, FLP rears its head: competing candidates can split the vote indefinitely. Raft handles this with randomized election timeouts. Paxos handles it with random backoff. Both are probabilistic solutions, consistent with our understanding that deterministic termination is impossible.

The Practical Impact of FLP

What FLP Means for System Design

  1. Your consensus protocol will sometimes stall. There will be executions where no leader is elected for multiple timeout periods. This is not a bug — it is a fundamental consequence of FLP. Design your system to tolerate brief stalls.

  2. Timeouts are not correctness mechanisms. They are liveness mechanisms. If your system’s safety depends on a timeout being accurate (e.g., “if the leader has not responded in 500ms, it is definitely dead, so act accordingly”), your system has a bug. Real protocols treat timeouts as hints: “the leader might be dead, so let us try to elect a new one, but do not discard anything the old leader might have committed.”

  3. Test under adversarial scheduling. The FLP adversary is a useful mental model for testing. What happens if messages are delivered in the worst possible order? What if the node that would break a tie crashes at the worst moment? Jepsen and other testing frameworks explore these schedules.

  4. Embrace nondeterminism for liveness. Randomized election timeouts, randomized backoff, and randomized protocol choices are not hacks — they are theoretically motivated solutions to a fundamental impossibility. Do not try to remove the randomness in pursuit of “determinism.” Determinism is exactly what FLP says you cannot have.

A Taxonomy of FLP Workarounds in Production Systems

System               FLP Workaround                      Notes
------               --------------                      -----
etcd (Raft)          Randomized election timeouts        150-300ms default range
ZooKeeper (Zab)      Randomized election, TCP ordering   Leader election uses randomized timeouts
CockroachDB (Raft)   Same as etcd                        Raft as a library (etcd/raft)
Spanner (Paxos)      Multi-Paxos with leader leases      TrueTime for lease-based leader optimization
Tendermint           Randomized timeouts + gossip        Partially synchronous BFT; timeouts for liveness
HotStuff             Pacemaker module (timeouts)         Separates safety (protocol) from liveness (pacemaker)

Every single one of these uses timeouts as the liveness mechanism and quorum voting for safety. The difference is in the details: how long the timeouts are, how they adapt, and how the system recovers from a bad timeout decision.

The FLP Mindset

FLP teaches a way of thinking about distributed systems that is more valuable than the theorem itself:

Always separate safety from liveness. If you cannot clearly identify which parts of your protocol ensure safety (and work under all conditions) versus which parts ensure liveness (and require timing assumptions), your protocol is probably wrong. Safety mechanisms should never depend on timeouts. Liveness mechanisms can depend on timeouts, randomization, or other heuristics.

Impossibility results are not limitations — they are design guides. FLP does not say “give up.” It says “you must make an explicit tradeoff.” Knowing that the tradeoff is required prevents you from spending months trying to build the impossible, and instead directs your energy toward choosing the right tradeoff for your system.

The adversary is the network. In practice, the “FLP adversary” is not a malicious actor — it is your network, your kernel scheduler, your garbage collector, and your disk controller. They conspire (unintentionally) to deliver messages in the worst possible order at the worst possible time. The fact that this happens rarely is why consensus protocols work in practice. The fact that it can happen is why the protocols need to be correct under arbitrary scheduling.

A Concrete Example: FLP in Action

Let us trace through a specific scenario where FLP manifests in a Raft cluster.

Five nodes: A, B, C, D, E. Node A is the current leader in term 1.

Time    Event
----    -----
t0      A is leader (term 1). Everything is fine.
t1      Network partition: {A, B} | {C, D, E}
t2      C, D, E time out (no heartbeats from A).
t3      C starts election for term 2. Votes for itself.
t4      D starts election for term 2. Votes for itself.
        (C and D started at the same time due to similar timeouts)
t5      C receives vote from E. C has {C, E} = 2 votes. Needs 3.
        D has {D} = 1 vote. D needs 3.
        Neither wins. Term 2 election fails.
t6      C starts election for term 3. D starts election for term 3.
        Same problem. Split vote again.
t7      This could continue indefinitely.

This is FLP manifesting as livelock. The system is safe — no two leaders are elected, no data is corrupted. But the system is unavailable: no new entries can be committed because neither the {A, B} minority nor the leaderless {C, D, E} majority can make progress.

Raft’s solution: randomized election timeouts. After the failed term 2 election, C might wait 287ms and D might wait 412ms. C starts its term 3 election first, and D has not yet timed out, so D votes for C. C wins with {C, D, E} = 3 votes. The livelock is broken probabilistically.

But notice: there is no guarantee that the randomization works on any given attempt. C and D could, by extraordinary bad luck, keep choosing nearly identical timeouts. In theory, this could continue forever. In practice, the probability decreases exponentially with each attempt, so the expected time to elect a leader is quite small. But the guarantee is probabilistic, not deterministic. FLP says it has to be.
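A back-of-the-envelope simulation of that tie-breaking, with assumed numbers (the 150ms spread mirrors Raft's default window; the 10ms message delay is invented for illustration):

```python
import random

def attempts_until_leader(spread_ms=150.0, rtt_ms=10.0, seed=42):
    """Count split-vote rounds until one of two candidates wins.

    Each round, both candidates draw a timeout uniformly from a
    spread_ms-wide window. If the gap between the draws exceeds the
    one-way message delay rtt_ms, the earlier candidate's RequestVote
    arrives before the other's timer fires and it wins; otherwise
    both become candidates and the vote splits again.
    """
    rng = random.Random(seed)
    attempts = 0
    while True:
        attempts += 1
        c = rng.uniform(0, spread_ms)
        d = rng.uniform(0, spread_ms)
        if abs(c - d) > rtt_ms:
            return attempts

# P(split) per round is roughly 2 * rtt_ms / spread_ms (about 13% here),
# so the probability of k consecutive splits decays as 0.13^k.
print(attempts_until_leader())
```

Most runs elect a leader on the first or second attempt, but nothing rules out a long streak of near-identical draws — exactly the probabilistic-not-deterministic guarantee described above.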

FLP and the CAP Theorem

FLP and the CAP theorem are related but distinct results.

CAP (Brewer, 2000; Gilbert and Lynch, 2002): A distributed system can provide at most two of: Consistency, Availability, and Partition tolerance. During a partition, you must choose between C and A.

FLP (1985): Deterministic consensus is impossible in an asynchronous system with even one crash failure. This holds even without a partition.

FLP is, in a sense, stronger than CAP: FLP says you have a problem even without partitions. The mere possibility that a single node might crash is enough to prevent deterministic consensus. CAP says that partitions force a choice between consistency and availability. FLP says that asynchrony alone forces a choice between safety and liveness — no partition is required to trigger it.

In practice, both results point to the same conclusion: you cannot have everything. You must choose which properties to sacrifice and under what conditions. The protocols we study in the rest of this book are different choices in this tradeoff space.

Summary

FLP tells us that the universe imposes a tax on agreement. In an asynchronous system, you cannot deterministically guarantee that processes will agree on a value if even one of them might fail. This is not an engineering limitation — it is a mathematical fact about what distributed computation can achieve.

But FLP also tells us exactly where the impossibility lies, and therefore how to work around it:

  1. Add timing assumptions (partial synchrony) and the impossibility vanishes. This is what Paxos and Raft do.
  2. Add randomization and the impossibility vanishes (probabilistically). This is what Ben-Or and modern asynchronous BFT do.
  3. Give up termination and keep everything else. This is what some safety-critical systems do: they would rather stall than risk an incorrect decision.

The impossibility is real. The workarounds are also real. The art of consensus protocol design is choosing the right workaround for your system’s requirements and convincing yourself (and ideally proving) that the workaround does not introduce new problems.

Every consensus protocol you will encounter in the rest of this book exists in the shadow of FLP. When you see a timeout, it is there because of FLP. When you see a randomized election, it is there because of FLP. When you see a leader-based protocol, it is structured that way to minimize the window where FLP can cause livelock. Understanding FLP is not about understanding an impossibility — it is about understanding why the possible solutions look the way they do.

The Byzantine Generals Problem

In 1982, Leslie Lamport, Robert Shostak, and Marshall Pease published “The Byzantine Generals Problem,” a paper that gave distributed computing its most evocative metaphor and its most expensive failure model. The paper’s contribution is not the metaphor — it is the proof that tolerating arbitrary (Byzantine) failures requires fundamentally more resources than tolerating crash failures, and the precise characterization of exactly how much more.

The Byzantine failure model is named not because it was invented in Byzantium, but because Lamport wanted a metaphor involving treacherous generals and chose the Byzantine Empire for historical color. The name stuck, and now every distributed systems paper must explain that “Byzantine” means “arbitrarily faulty” and not “unnecessarily complicated” — though in practice, Byzantine fault tolerance is both.

The Problem

An army surrounds a city. The army is divided into divisions, each commanded by a general. The generals can communicate only by messenger. They must agree on a common plan of action: attack or retreat. Some generals may be traitors. The traitors can send different messages to different generals, lie about what messages they received, and generally do anything to prevent the loyal generals from reaching agreement.

The requirements:

  1. All loyal generals decide on the same plan. (Agreement)
  2. If all loyal generals propose the same plan, that plan is the decision. (Validity)

Notice that we do not care what the traitors decide. We only care that the loyal generals agree with each other. Also notice that there is no requirement that the loyal generals detect who the traitors are — only that they reach agreement despite the traitors’ interference.

Translated to computer science: we have N processes, up to f of which may exhibit arbitrary (Byzantine) failures. The correct processes must agree on a value, despite the Byzantine processes sending contradictory, fabricated, or strategically timed messages.

Why 3f+1: The Impossibility with 3 Generals

The most important result in the paper is negative: with only 3 generals, consensus is impossible if even 1 is a traitor (without digital signatures). Let us see why.

Scenario Setup

Three generals: A (the commander, who proposes a value), B, and C. One of them is a traitor. We consider all cases.

Case 1: The Commander (A) Is the Traitor

        A (TRAITOR)
       / \
      /   \
  "attack" "retreat"
    /         \
   B           C

B receives: "attack" from A
C receives: "retreat" from A

Phase 2: B and C exchange what they heard from A.
B tells C: "A said attack"
C tells B: "A said retreat"

B sees: A said "attack", C says A said "retreat"
    -> B does not know who is lying: A or C
C sees: A said "retreat", B says A said "attack"
    -> C does not know who is lying: A or B

B and C cannot determine whether A sent inconsistent messages (A is the traitor) or whether the other lieutenant is lying about what A said (B or C is the traitor). They have symmetric, contradictory information.

Case 2: Lieutenant C Is the Traitor

        A (honest)
       / \
      /   \
  "attack" "attack"
    /         \
   B           C (TRAITOR)

B receives: "attack" from A
C receives: "attack" from A (but C is a traitor)

Phase 2: B and C exchange what they heard from A.
B tells C: "A said attack" (honest)
C tells B: "A said retreat" (LIE)

B sees: A said "attack", C says A said "retreat"
    -> This looks IDENTICAL to Case 1 from B's perspective

From B’s perspective, Cases 1 and 2 are indistinguishable. B received “attack” from A and heard C claim “retreat.” B cannot determine whether A is a traitorous commander who sent different messages, or C is a traitorous lieutenant who is lying.

Since B cannot distinguish the cases, B must take the same action in both. But in Case 2, the correct decision is “attack” (that is what the honest commander ordered), while in Case 1, the decision could go either way (the commander is the traitor, so there is no “correct” order, but B and C must still agree).

If B decides “attack” in both cases (following the commander), then in Case 1, C (who received “retreat” from the commander) would follow the commander and decide “retreat.” B and C disagree. Agreement violated.

If B decides based on majority vote (2 of the 3 received values), the traitor can always craft messages to prevent agreement by sending the value that creates a tie.

This is not a matter of finding a cleverer protocol. No protocol can solve the problem with 3 generals and 1 traitor when messages are unauthenticated. The proof generalizes: with N generals and f traitors, the problem is unsolvable unless N > 3f. You need 3f+1 generals to tolerate f traitors.

The Generalized Impossibility Argument

The 3-generals impossibility generalizes through a simulation argument. Suppose you have a protocol P that works with N = 3f generals tolerating f faults. Lamport, Shostak, and Pease show that you can use P to construct a protocol for 3 generals tolerating 1 fault (each of the 3 generals simulates f of the 3f generals). But we just showed 3-generals-1-fault is impossible. Contradiction. Therefore P cannot exist. You need N >= 3f + 1.

The intuition is this: with 3f generals and f traitors, the 2f honest generals need to “outvote” the f traitors. But the traitors can pretend to be on either side of a disagreement, effectively casting f votes for each side. The 2f honest generals split into two groups — those who heard “attack” and those who heard “retreat” — and each group might be as small as f. The traitors can always make the groups equal, preventing a majority.

With 3f+1 generals and f traitors, the 2f+1 honest generals form a strict majority. Even if the traitors conspire optimally, they cannot make two conflicting values both appear to have majority support among the honest generals.

The Oral Messages Algorithm: OM(m)

Lamport, Shostak, and Pease provided a constructive algorithm for the case N >= 3f+1. The Oral Messages algorithm OM(m) solves Byzantine agreement for m traitors using N >= 3m+1 generals.

“Oral messages” means messages are unauthenticated: the receiver knows who sent a message, but cannot prove to a third party what was sent. A traitor can claim to have received any message from any general.

OM(0): Base Case

When m = 0 (no traitors), the algorithm is trivial:

procedure OM(0, commander, value):
    // Commander sends value to all lieutenants
    commander sends value to all N-1 lieutenants
    each lieutenant decides on the value received

No traitors, no problems. If a lieutenant does not receive a value (should not happen with m=0), it defaults to a predetermined value, say RETREAT.

OM(1): One Traitor

This is the interesting case. We need N >= 4 generals to tolerate 1 traitor.

procedure OM(1, commander, value):
    // Step 1: Commander sends value to all lieutenants
    commander sends v_i to each lieutenant i
    // (If commander is honest, all v_i are the same)
    // (If commander is traitor, v_i may differ)

    // Step 2: Each lieutenant relays what it received
    for each lieutenant i:
        lieutenant i sends "commander told me v_i"
            to all other lieutenants
        // (If lieutenant i is honest, it sends the true v_i)
        // (If lieutenant i is a traitor, it can lie)

    // Step 3: Each lieutenant decides by majority vote
    for each lieutenant i:
        // i has its own value v_i from the commander
        // i has received reported values from all other lieutenants
        // Collect all values (own + reported)
        values = [v_i] + [reported values from each other lieutenant]
        decision = majority(values)  // or default if no majority

Let us trace through with 4 generals (A = commander, B, C, D = lieutenants) and 1 traitor.

Case: Commander A is the traitor.

A sends "attack" to B, "retreat" to C, "attack" to D

Step 2 (relaying):
B tells C: "A said attack"     B tells D: "A said attack"
C tells B: "A said retreat"    C tells D: "A said retreat"
D tells B: "A said attack"     D tells C: "A said attack"

Step 3 (voting):
B sees: attack(self), retreat(from C), attack(from D) -> majority: ATTACK
C sees: retreat(self), attack(from B), attack(from D) -> majority: ATTACK
D sees: attack(self), retreat(from C), attack(from B) -> majority: ATTACK

All loyal lieutenants decide ATTACK. Agreement holds!

Even though the commander sent different values, the relay step allows the honest lieutenants to reconstruct a majority. The key: there are 3 honest lieutenants and only 1 traitor, so the honest majority’s relay messages dominate.

Case: Lieutenant D is the traitor.

A sends "attack" to B, C, D (honest commander, all the same)

Step 2 (relaying):
B tells C: "A said attack"    B tells D: "A said attack"
C tells B: "A said attack"    C tells D: "A said attack"
D tells B: "A said retreat"   D tells C: "A said retreat"  (LIES!)

Step 3 (voting):
B sees: attack(self), attack(from C), retreat(from D) -> majority: ATTACK
C sees: attack(self), attack(from B), retreat(from D) -> majority: ATTACK
D: traitor, we don't care what it decides

Both honest lieutenants decide ATTACK. Agreement holds!

D’s lies are outvoted by the honest majority. This is why 3f+1 is needed: the honest lieutenants’ voices must outnumber the traitors’ lies.
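Both traces can be reproduced with a short simulation of OM(1)'s relay-and-vote step. This is a sketch under one simplifying assumption: a traitorous lieutenant always relays the opposite of what it heard, which is just one of the many lies available to it:

```python
from collections import Counter

def om1(commander_sends, liars):
    """One round of OM(1) among three lieutenants.

    commander_sends: dict lieutenant -> value the commander sent it
                     (a traitorous commander sends differing values)
    liars: set of traitorous lieutenants, who relay the flipped value
    Returns each honest lieutenant's majority-vote decision.
    """
    flip = {"attack": "retreat", "retreat": "attack"}
    decisions = {}
    for i in commander_sends:
        votes = [commander_sends[i]]  # own value from the commander
        for j in commander_sends:
            if j != i:
                heard = commander_sends[j]
                votes.append(flip[heard] if j in liars else heard)
        decisions[i] = Counter(votes).most_common(1)[0][0]
    return {i: d for i, d in decisions.items() if i not in liars}

# Traitorous commander A sends different orders to B, C, D:
print(om1({"B": "attack", "C": "retreat", "D": "attack"}, liars=set()))
# Honest commander sends "attack" to all; lieutenant D lies:
print(om1({"B": "attack", "C": "attack", "D": "attack"}, liars={"D"}))
```

In both runs every honest lieutenant settles on "attack", matching the traces above.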

OM(m): The General Case

For m traitors, the algorithm is recursive. Each level of recursion reduces the problem by one traitor, at the cost of N-1 sub-invocations.

procedure OM(m, commander, value):
    if m == 0:
        // Base case: trust the commander
        commander sends value v to all lieutenants
        each lieutenant uses v (or DEFAULT if not received)
        return

    // Step 1: Commander sends value to all lieutenants
    commander sends v_i to each lieutenant i

    // Step 2: Each lieutenant acts as commander for OM(m-1)
    for each lieutenant i:
        lieutenant i runs OM(m-1) as commander,
            sending its received value v_i
            to all other lieutenants (excluding original commander)

    // Step 3: Each lieutenant decides by majority
    for each lieutenant i:
        // For each other lieutenant j:
        //   Let w_j = the value that i decided from j's OM(m-1) sub-call
        // i's own value from commander = v_i
        values = {w_j for each lieutenant j != i} union {v_i}
        decision = majority(values)

The message complexity is exponential: O(N^(m+1)). For m=1 and N=4, the bound is on the order of 4^2 = 16 messages; for m=2 and N=7, on the order of 7^3 = 343. This is not a typo. The algorithm is theoretically important but impractical for more than a handful of traitors.

The Cost of OM(m)

m (traitors)   N (minimum nodes)   Messages       Rounds
------------   -----------------   --------       ------
1              4                   O(N^2)         2
2              7                   O(N^3)         3
3              10                  O(N^4)         4
f              3f+1                O(N^(f+1))     f+1

Nobody uses OM(m) in production. It exists to prove that the problem is solvable with 3f+1 nodes. Practical Byzantine protocols (PBFT, HotStuff, Tendermint) achieve polynomial message complexity.
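The exact message count follows the recursion directly — a commander broadcast plus N-1 sub-runs of OM(m-1) on N-1 generals. A few lines make the explosion concrete (exact counts land below the loose O(N^(m+1)) bound but grow just as fast):

```python
def om_messages(m, n):
    """Total messages sent by OM(m) among n generals.

    OM(0): the commander sends n - 1 messages.
    OM(m): n - 1 messages from the commander, then each of the
    n - 1 lieutenants runs OM(m - 1) as commander among the
    remaining n - 1 generals.
    """
    if m == 0:
        return n - 1
    return (n - 1) + (n - 1) * om_messages(m - 1, n - 1)

print(om_messages(1, 4))   # 3 + 3*2 = 9
print(om_messages(2, 7))   # 6 + 6*(5 + 5*4) = 156
print(om_messages(3, 10))  # 3609 already at f=3
```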

Signed Messages: Changing the Rules

The impossibility result (N >= 3f+1 for unauthenticated messages) relies on a critical assumption: messages are “oral,” meaning a traitor can claim to have received any message from anyone, and no one can prove otherwise.

Digital signatures change this. If every message is signed with the sender’s private key, a traitor cannot forge messages from honest generals. This changes the bounds dramatically.

With signed messages, Byzantine consensus is solvable with N >= f+2 nodes (rather than 3f+1). The Signed Messages algorithm SM(m) from the same paper achieves this.

procedure SM(m, commander, value):
    // Commander signs and sends value to all lieutenants
    commander sends (value, sign(commander, value)) to all

    // Each lieutenant that receives a valid message:
    //   - Adds it to its set of known values
    //   - If it hasn't relayed this message yet, signs and forwards it

    procedure ON_RECEIVE(msg, signatures):
        if all signatures are valid:
            add msg.value to known_values
            if |signatures| <= m:
                // Still need more relay steps
                add my_signature to signatures
                forward (msg.value, signatures) to all who haven't signed

    // After m+1 rounds:
    if |known_values| == 1:
        decide the single value
    else:
        decide DEFAULT  // commander sent different values (is a traitor)

Why does this work with fewer nodes? Because signatures prevent the core attack: a traitor cannot impersonate an honest general. When B tells C “A said attack,” B’s signature proves the message is from B, and A’s signature (forwarded by B) proves A actually sent “attack.” C can verify the chain of signatures without trusting B.
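The chain check is mechanical enough to sketch. The following toy uses a keyed hash as a stand-in for real signatures; a production system would use an asymmetric scheme such as Ed25519, and KEYS, sign, and verify_chain are illustrative names, not a real API:

```python
import hashlib

# Toy "signature" scheme for illustration only: sign(node, data) is a
# keyed hash, and verification looks the key up in a trusted registry.
KEYS = {"A": b"key-A", "B": b"key-B", "C": b"key-C"}

def sign(node, data):
    return hashlib.sha256(KEYS[node] + data).hexdigest()

def verify(node, data, sig):
    return sign(node, data) == sig

def verify_chain(value, signers, sigs):
    """Check that each general in `signers` really endorsed `value`.

    Signatures are layered: A signs the value, B signs (value + A's
    signature), and so on, so a relay cannot drop or reorder links.
    """
    data = value.encode()
    for node, sig in zip(signers, sigs):
        if not verify(node, data, sig):
            return False
        data += sig.encode()  # next signature covers the chain so far
    return True

# A (commander) signs "attack"; B relays it with its own signature.
sig_a = sign("A", b"attack")
sig_b = sign("B", b"attack" + sig_a.encode())
print(verify_chain("attack", ["A", "B"], [sig_a, sig_b]))   # True
print(verify_chain("retreat", ["A", "B"], [sig_a, sig_b]))  # False
```

C runs verify_chain and learns that A said "attack" without taking B's word for anything.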

The tradeoff: every message requires cryptographic operations. Signing and verifying digital signatures has non-trivial CPU cost. In a high-throughput system, this can be a bottleneck.

Practical Implications of Signatures

In the era of Lamport, Shostak, and Pease, digital signatures were expensive and their security assumptions were debatable. Today, Ed25519 signature verification takes about 70 microseconds on commodity hardware, and we have well-understood signature schemes.

Most modern Byzantine protocols use signatures liberally:

Protocol     Signature Usage                                 Benefit
PBFT         Signatures on all protocol messages             Prevents impersonation, enables view change proofs
HotStuff     Threshold signatures for quorum certificates    Reduces message complexity to O(N)
Tendermint   Signatures on votes                             Enables evidence-based slashing

The SM(m) algorithm’s f+2 bound is theoretically optimal for signed messages, but practical protocols still use 3f+1 nodes. Why? Because the SM(m) algorithm has exponential message complexity (like OM(m)), and the additional nodes in PBFT-style protocols enable polynomial-complexity algorithms. The extra nodes buy you efficiency, not just fault tolerance.

Crash vs. Byzantine: A Practical Comparison

The distinction between crash and Byzantine failures is not just academic. It has concrete implications for system design.

What Crash Failures Look Like

A crashed node stops responding. It does not send any messages. It does not corrupt data. It either participates correctly or not at all.

// Crash failure behavior
procedure CRASH_NODE():
    state = CORRECT
    while state == CORRECT:
        msg = receive()
        response = PROTOCOL_CORRECT_BEHAVIOR(msg)
        send(response)
        if random_hardware_event():
            state = DEAD  // stops executing
    // Silent forever after

Detection: eventually, other nodes notice the silence (missing heartbeats). The detection may be slow, but there is no ambiguity about the type of failure once it is detected.
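A minimal heartbeat-based detector looks something like this (an illustrative sketch; real detectors add adaptive timeouts and jitter tolerance):

```python
import time

class HeartbeatDetector:
    """Minimal crash-failure detector: a peer is suspected once it
    has been silent longer than `timeout` seconds."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Treat startup as a fresh heartbeat from everyone.
        self.last_seen = {p: time.monotonic() for p in peers}

    def on_heartbeat(self, peer):
        self.last_seen[peer] = time.monotonic()

    def suspected(self):
        now = time.monotonic()
        return {p for p, t in self.last_seen.items()
                if now - t > self.timeout}
```

Note the asymmetry this cannot resolve: a node that is merely slow or partitioned looks identical to one that has crashed, which is exactly the ambiguity FLP exploits.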

What Byzantine Failures Look Like

A Byzantine node can do anything. In practice, “anything” usually manifests as one of these patterns:

Software bugs causing inconsistent behavior:

// A bug that looks Byzantine
procedure BUGGY_SERIALIZE(value):
    if platform == "x86":
        return serialize_little_endian(value)
    else:
        return serialize_big_endian(value)
    // Different nodes running different architectures
    // send different byte representations of the same value
    // Recipients interpret them as different values

Hardware corruption:

// A bit flip in memory or on the wire
procedure CORRUPTED_RESPONSE(msg):
    response = correct_response(msg)
    if cosmic_ray_hits_memory():
        response.value = response.value XOR random_bit
    return response
    // The node THINKS it sent the right answer

Equivocation (the classic Byzantine behavior):

// A compromised node deliberately sends different values
procedure EQUIVOCATING_NODE(msg):
    if msg.sender == "Node A":
        send (VOTE, "X") to Node A
    else if msg.sender == "Node B":
        send (VOTE, "Y") to Node B
    // A and B each think this node voted for their preferred value

Equivocation is the hardest Byzantine behavior to detect and the one that most protocols are designed to prevent. It is also the behavior that requires 3f+1 nodes to overcome: with 2f+1 nodes and f Byzantine nodes equivocating, honest nodes cannot determine which of two conflicting claims is supported by a true majority.
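To make that concrete, here is the 2f+1 failure in miniature: three nodes, one equivocator, and each honest node tallies a "majority" for a different value (a toy illustration, not a protocol):

```python
def tally(votes):
    """Count votes as one honest node perceives them."""
    counts = {}
    for v in votes.values():
        counts[v] = counts.get(v, 0) + 1
    return counts

# N = 3 nodes (2f+1 with f = 1). The traitor T equivocates.
view_a = {"A": "X", "B": "Y", "T": "X"}  # what honest node A sees
view_b = {"A": "X", "B": "Y", "T": "Y"}  # what honest node B sees

print(tally(view_a))  # {'X': 2, 'Y': 1} -- A sees a majority for X
print(tally(view_b))  # {'X': 1, 'Y': 2} -- B sees a majority for Y
```

A and B each hold an apparently valid majority for conflicting values, and neither can prove the other wrong. With 3f+1 nodes, any two majorities must share an honest node, which breaks the symmetry.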

The Cost Comparison

Property                    Crash Tolerance                Byzantine Tolerance
Nodes for f faults          2f+1                           3f+1
Messages per decision       O(N)                           O(N^2) typical
Latency (rounds)            2 (common case)                3-4 (common case)
Crypto required             Optional (TLS for transport)   Essential (signatures on every message)
Implementation complexity   Moderate                       High
Typical throughput          10K-100K decisions/sec         1K-10K decisions/sec
Code complexity             ~1000 lines (Raft)             ~5000-15000 lines (PBFT)

The 3f+1 requirement alone is significant. To tolerate 1 fault, you need 4 nodes instead of 3. To tolerate 2, you need 7 instead of 5. In a geo-distributed system where each “node” is a datacenter, the infrastructure cost is substantial.

When Do You Actually Need BFT?

This is the question that separates pragmatists from purists. The answer depends on your threat model.

You Probably Need BFT If:

Your nodes are operated by different organizations that do not trust each other. This is the blockchain use case. If node operators have financial incentive to cheat, crash-fault tolerance is insufficient. A rational adversary will not just crash — it will equivocate, fork the chain, or double-spend.

Your system processes financial transactions across trust boundaries. Cross-organizational payment systems, clearing houses, and settlement systems operate in an environment where any participant might try to cheat. BFT ensures that no coalition of up to f participants can forge or alter transactions.

Regulatory or safety requirements mandate it. Some aviation and medical systems require Byzantine fault tolerance because the consequences of a single corrupted node are catastrophic. Aircraft flight control systems, for example, use Byzantine agreement among redundant flight computers.

Your system is exposed to active adversaries. If an attacker can compromise some (but not all) of your nodes, BFT prevents the compromised nodes from corrupting the system’s decisions. This is relevant for military systems, some critical infrastructure, and high-value targets.

You Probably Do Not Need BFT If:

All your nodes run the same software on hardware you control. If you are running a 5-node etcd cluster in your own datacenter, the primary threats are hardware failure and software bugs. Hardware failures manifest as crash failures. Software bugs affect all nodes simultaneously (because they all run the same code), which BFT does not help with anyway — if all nodes have the same bug, 3f+1 nodes with f+1 buggy ones all exhibit the same wrong behavior.

Your performance requirements are tight. BFT’s overhead — extra nodes, extra messages, cryptographic operations — is real. If you need sub-millisecond consensus latency or tens of thousands of decisions per second, BFT protocols may not meet your requirements.

Your failure model is actually crash-recovery. Most datacenter failures are crashes, network partitions, and transient errors — all of which are handled by crash-fault-tolerant protocols without BFT’s overhead. Using BFT “just in case” is like wearing a radiation suit to the grocery store: technically safer, but the cost-benefit analysis does not work out.

Your system already has a trusted component. If all nodes share a trusted hardware module (TPM, SGX enclave) or a trusted external service, you can use that trust to simplify the protocol. For example, if nodes can attest their software version via a TPM, you can rule out software-based Byzantine behavior and use a simpler crash-fault protocol.

The Gray Area: Non-Malicious Byzantine Faults

There is an awkward middle ground: what about Byzantine faults that are not caused by adversaries?

  • Bit flips from cosmic rays (real, but extremely rare: ~1 bit flip per GB of RAM per year in space, less at sea level)
  • Firmware bugs in disk controllers that silently corrupt data (real, and documented by studies from CERN and others)
  • Network equipment that corrupts packets in ways not caught by TCP checksums (rare but documented)
  • Software bugs that cause different behavior on different nodes (common during rolling upgrades)

These are technically Byzantine failures — the affected node sends incorrect data — but they are not adversarial. A flight computer with a flipped bit does not know it is sending wrong data.

For most systems, the practical defense against non-malicious Byzantine faults is not BFT. It is:

  • ECC memory (corrects single-bit errors)
  • Checksums at the application level (detects data corruption)
  • Crash on inconsistency (turn a Byzantine fault into a crash fault)
  • Testing and code review (prevent software bugs)

These defenses are cheaper and more effective than BFT for the non-adversarial case. Use BFT when the threat is adversarial, not when it is accidental.
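The "crash on inconsistency" defense is worth sketching, since it is the cheapest item on that list: checksum every record, and abort rather than serve data that fails verification (a minimal sketch; encode_record and decode_record are illustrative names):

```python
import binascii
import struct

def encode_record(payload: bytes) -> bytes:
    """Prefix a record with a CRC32 so corruption is detectable."""
    return struct.pack(">I", binascii.crc32(payload)) + payload

def decode_record(blob: bytes) -> bytes:
    """Verify the checksum; abort rather than serve corrupt data,
    turning a would-be Byzantine fault into a crash fault."""
    stored, payload = struct.unpack(">I", blob[:4])[0], blob[4:]
    if binascii.crc32(payload) != stored:
        raise SystemExit("checksum mismatch: crashing instead of lying")
    return payload
```

Raising SystemExit here stands in for whatever "stop participating" means in your system, such as os.abort() or fencing yourself out of the cluster. The point is that a node that refuses to answer is a crash fault, which your consensus protocol already handles.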

PBFT: The Practical Byzantine Protocol

The OM(m) algorithm is exponential and impractical. The protocol that made BFT feasible for real systems is PBFT (Practical Byzantine Fault Tolerance) by Castro and Liskov, published in 1999. We will cover PBFT in detail later in the book, but a brief sketch is relevant here.

PBFT achieves O(N^2) message complexity (compared to OM’s exponential) through a three-phase protocol:

// PBFT normal case (simplified)
// N = 3f + 1 nodes, one is the primary (leader)

procedure PBFT_NORMAL_CASE(client_request):
    // Phase 0: Client sends request to primary
    primary receives request R

    // Phase 1: PRE-PREPARE
    primary assigns sequence number n to R
    primary broadcasts <PRE-PREPARE, view, n, R> to all replicas
    // Signed by primary

    // Phase 2: PREPARE
    each replica i (that accepts the pre-prepare):
        broadcasts <PREPARE, view, n, digest(R), i> to all replicas
        // Signed by i

    // A replica has "prepared" when it has:
    //   - The pre-prepare
    //   - 2f matching PREPARE messages from different replicas
    // (2f prepares + 1 pre-prepare = 2f+1 out of 3f+1 nodes agree)

    // Phase 3: COMMIT
    each replica i (that has prepared):
        broadcasts <COMMIT, view, n, i> to all replicas
        // Signed by i

    // A replica has "committed" when it has:
    //   - 2f+1 matching COMMIT messages (including its own)
    // The request is now committed and can be executed

    // Phase 4: REPLY
    each replica sends result to client
    client accepts result when it has f+1 matching replies

The three phases ensure that:

  1. PRE-PREPARE: The primary assigns an order to requests.
  2. PREPARE: Replicas agree on the order (prevents a Byzantine primary from assigning different orders to different replicas).
  3. COMMIT: Replicas confirm that enough others have prepared (ensures the decision survives view changes).
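The quorum arithmetic behind these guarantees is a one-liner: any two quorums of size 2f+1 among 3f+1 nodes overlap in at least f+1 nodes, so with at most f Byzantine nodes, at least one honest node witnesses both quorums:

```python
def quorum_intersection(n, q):
    """Minimum overlap between any two quorums of size q out of n."""
    return max(0, 2 * q - n)

for f in (1, 2, 33):
    n, q = 3 * f + 1, 2 * f + 1
    overlap = quorum_intersection(n, q)
    # 2(2f+1) - (3f+1) = f+1: the overlap always exceeds the
    # number of possible Byzantine nodes by one honest witness.
    assert overlap == f + 1
    print(f"f={f}: n={n}, quorum={q}, guaranteed overlap={overlap}")
```

That single honest witness is what prevents two conflicting requests from both gathering valid quorums.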

The O(N^2) message complexity comes from the all-to-all broadcasts in PREPARE and COMMIT phases. Each of the N nodes sends a message to each of the N-1 others, twice (for PREPARE and COMMIT). Total: ~2N^2 messages per consensus decision.

For N=4 (f=1), that is 32 messages per decision. For N=7 (f=2), it is 98. For N=100 (f=33), it is 20,000. The N^2 scaling is why BFT systems typically run with small replica counts (4-7 nodes), not the hundreds or thousands common in crash-fault systems.

Modern BFT protocols (HotStuff, 2019; Narwhal/Bullshark, 2022) reduce this to O(N) messages using threshold signatures and leader-based communication patterns, at the cost of additional cryptographic assumptions.

The Real Cost of Byzantine Tolerance

Let us be concrete about what BFT costs in practice.

Throughput

A well-optimized Raft implementation (crash-fault tolerant) can achieve:

  • 50,000-100,000 operations per second
  • 1-5ms latency per operation
  • On a 5-node cluster with commodity hardware

A well-optimized PBFT implementation (Byzantine-fault tolerant) achieves:

  • 5,000-20,000 operations per second
  • 5-20ms latency per operation
  • On a 4-node cluster with commodity hardware

That is a 5-10x throughput reduction and a 4-5x latency increase. Modern BFT protocols (HotStuff, Tendermint) narrow the gap, but there is still a meaningful overhead.

Operational Complexity

BFT protocols require:

  • Key management for all nodes (private keys, certificates)
  • Secure channels between all pairs of nodes
  • View change protocols that are significantly more complex than Raft’s leader election
  • Client-side logic to collect and verify f+1 matching responses

Each of these is a source of operational overhead and potential bugs.

The Irony of Byzantine Tolerance

Here is the uncomfortable truth that BFT researchers acknowledge in private but rarely write in papers: the biggest threat to a BFT system is usually a bug in the BFT implementation itself.

A PBFT implementation is roughly 5-15x more code than a Raft implementation. More code means more bugs. A bug in the BFT protocol implementation can cause all nodes to fail in the same way (a correlated failure), which no amount of replication can tolerate. The protocol tolerates f arbitrary failures out of 3f+1 nodes, but a universal implementation bug is an f=N failure.

This is not hypothetical. Multiple blockchain systems have had consensus bugs that affected all validators simultaneously. The BFT protocol was working perfectly — it was the code implementing it that was wrong.

This does not mean BFT is useless. It means that BFT is a defense against heterogeneous failures — failures that affect some nodes differently than others. For homogeneous failures (same bug on all nodes), you need diverse implementations, N-version programming, or other orthogonal defenses.

Historical Context and Legacy

The 1982 paper by Lamport, Shostak, and Pease is one of the most cited papers in computer science. Its contributions are:

  1. Formalizing the problem. Before this paper, “fault tolerance” was vague. After it, we had precise definitions and provable bounds.

  2. The 3f+1 lower bound. The impossibility proof for N <= 3f (without signatures) is clean and surprising. Many people’s intuition says 2f+1 should suffice (it does for crash failures). The gap between 2f+1 and 3f+1 is the price of distrust.

  3. The signed messages result. Showing that digital signatures reduce the bound to f+2 anticipated the importance of cryptography in distributed systems by decades.

  4. The recursive algorithm. OM(m) is impractical but proves that the problem is solvable. This existence proof motivated decades of work on practical BFT protocols.

The paper also has its share of hand-waving, to be fair. The timing model is synchronous (messages are delivered within a known bound), which is unrealistic for most systems. The extension to asynchronous or partially synchronous models required another 17 years of research (culminating in PBFT in 1999). The message complexity bounds are exponential, which the paper acknowledges but does not resolve. And the treatment of digital signatures assumes a perfect signature scheme, ignoring key management, revocation, and the possibility of key compromise.

These are not criticisms — the paper solved the fundamental problem and left the engineering to others. But they are worth noting because the gap between the paper’s model and reality is where production bugs live.

Summary

The Byzantine Generals Problem establishes the fundamental costs of tolerating arbitrary failures:

  • Without signatures: You need 3f+1 nodes to tolerate f Byzantine faults. No fewer will suffice.
  • With signatures: You need f+2 nodes, but practical protocols still use 3f+1 for efficiency.
  • The cost is real: More nodes, more messages, more latency, more code, more operational complexity.
  • Most systems do not need it: If your nodes are trusted and your failure model is crash-recovery, crash-fault tolerance (2f+1 nodes) is sufficient and significantly cheaper.

The decision to use or not use BFT is an engineering tradeoff, not a religious one. Understand what Byzantine failures look like in your environment, assess the probability and cost of those failures, and compare that against the overhead of BFT. For most internal systems, the answer is “crash-fault tolerance is sufficient.” For cross-organizational systems, multi-tenant systems with untrusted participants, and high-value targets, BFT earns its keep.

In Part II of this book, we will study the protocols that solve consensus under both crash and Byzantine failure models. You will see that the theoretical constraints from this chapter — 2f+1 for crash, 3f+1 for Byzantine, the FLP impossibility, the role of timeouts and randomization — manifest directly in every design choice those protocols make. The theory is not separate from the practice; it is the practice, expressed in mathematical language.

Paxos: Lamport’s Beautiful Nightmare

Let us begin with the protocol that launched a thousand implementations, none of which agree with each other. Paxos is the foundational consensus algorithm in distributed systems, the way quicksort is foundational in algorithms courses — except that nobody has ever been confused about how quicksort works after reading the original paper.

Leslie Lamport first described Paxos in 1989 in “The Part-Time Parliament,” a paper written as an archaeological report about a fictional Greek island where legislators needed to agree on laws despite irregular attendance. The paper was rejected. It was submitted again. It was rejected again. It was finally published in 1998, nearly a decade later. Lamport has publicly stated that the reviewers simply didn’t get the joke. The reviewers have, diplomatically, suggested that perhaps a foundational result in distributed computing deserved a presentation style that didn’t require readers to first decode an extended metaphor about olive oil merchants.

In 2001, Lamport published “Paxos Made Simple,” a paper whose opening line — “The Paxos algorithm, when presented in plain English, is very simple” — remains the most audacious claim in the history of computer science.

Let us see if he was right.

The Problem: Single-Decree Consensus

Before we get to the protocol, let us be precise about what we are trying to solve. Single-decree Paxos solves the following problem:

  • A collection of processes may propose values.
  • A single value must be chosen.
  • Once a value is chosen, processes should be able to learn the chosen value.

And it must do so despite:

  • Processes may crash and restart.
  • Messages may be delayed, duplicated, reordered, or lost.
  • Messages are not corrupted (we assume non-Byzantine faults).

This is the consensus problem from Chapter 3, instantiated for crash-fault environments with asynchronous networks. FLP tells us we cannot solve this deterministically while guaranteeing liveness, so Paxos makes the pragmatic choice: safety is always guaranteed, liveness is guaranteed only when the system is “sufficiently synchronous” (we’ll define what that means later, or rather, we’ll wave our hands about it the same way the paper does).

The Three Roles

Paxos defines three roles:

Proposers — These are the processes that propose values. They initiate the protocol by suggesting a value they’d like the system to agree on. In practice, proposers are typically servers that have received a client request.

Acceptors — These are the processes that vote on proposals. They form the “memory” of the system. A value is chosen when a majority (quorum) of acceptors have accepted it. You need at least 2f+1 acceptors to tolerate f failures.

Learners — These are the processes that need to find out what value was chosen. In many implementations, learners are the same processes as acceptors, but conceptually they are distinct.

A single physical process may play all three roles simultaneously — and in most real deployments, it does. But separating the roles is essential for understanding the protocol. This is one of those things that seems like pedantic academic nonsense until you try to implement it, at which point you realize the role separation is doing real work in the correctness argument.

Proposal Numbers: The Unsung Hero

Before we dive into the phases, we need to talk about proposal numbers (also called ballot numbers). Every proposal in Paxos carries a unique number, and these numbers must be:

  1. Totally ordered — Any two proposal numbers can be compared.
  2. Unique — No two proposers ever use the same proposal number.
  3. Increasing — A proposer always uses a higher number than any it has seen before.

The standard trick is to use a pair (sequence_number, proposer_id) where the proposer_id breaks ties. So proposer 1 might use numbers 1.1, 2.1, 3.1 and proposer 2 might use 1.2, 2.2, 3.2, ordered lexicographically by the first component, then the second.

This seems trivial. It is not. In a real implementation, you need to ensure that proposal numbers survive crashes (otherwise a restarting proposer might reuse old numbers), that they increase monotonically (otherwise the proposer gets permanently ignored), and that the gap between a proposer’s numbers isn’t so large that it burns through the number space (which matters in long-running systems, though you’d need to be impressively careless).

function generate_proposal_number(proposer_id, last_seen_number):
    // Ensure we always go higher than anything we've seen
    new_sequence = last_seen_number.sequence + 1
    proposal = (new_sequence, proposer_id)

    // CRITICAL: persist this before using it
    persist_to_disk(proposal)

    return proposal
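In a language with lexicographic tuple comparison, the ordering rules fall out for free. A brief sketch in Python (next_proposal is an illustrative name; persistence is omitted here but remains mandatory):

```python
# Proposal numbers as (sequence, proposer_id) pairs. Python tuples
# compare lexicographically, which is exactly the total order the
# protocol needs: sequence first, proposer id to break ties.
def next_proposal(proposer_id, last_seen):
    seq, _ = last_seen
    return (seq + 1, proposer_id)

n1 = next_proposal(1, (0, 0))     # proposer 1 -> (1, 1)
n2 = next_proposal(2, (0, 0))     # proposer 2 -> (1, 2)
assert n1 < n2                    # tie on sequence, broken by id
assert next_proposal(1, n2) > n2  # always exceeds anything seen
```

Uniqueness comes from the proposer id, total order from the tuple comparison, and monotonicity from always incrementing past the highest sequence seen.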

Phase 1: Prepare and Promise

Phase 1 is where a proposer stakes its claim. Think of it as the proposer walking into a room and asking, “Is anyone already committed to something? And will you promise to listen to me?”

The Proposer’s Side (Prepare)

The proposer selects a proposal number n that is higher than any it has previously used, and sends a Prepare(n) message to a majority of acceptors.

Note: the proposer does NOT include a value in the Prepare message. This is crucial. The proposer is not yet proposing a value — it is asking for permission to propose and learning what, if anything, the acceptors have already accepted.

function proposer_start(value):
    n = generate_proposal_number(my_id, highest_seen)
    highest_seen = n

    // Send Prepare to a majority of acceptors
    for acceptor in choose_majority(all_acceptors):
        send(acceptor, Prepare(n))

    // Wait for responses (with timeout)
    promises = wait_for_majority_responses(timeout)

    if |promises| < majority_size:
        // Failed to get a majority — retry with higher number
        return proposer_start(value)

    // Phase 1 complete — move to Phase 2
    return proposer_phase2(n, value, promises)

The Acceptor’s Side (Promise)

When an acceptor receives a Prepare(n) message, it does one of two things:

If n is greater than any proposal number it has already responded to: The acceptor promises not to accept any proposal numbered less than n, and responds with Promise(n, accepted_proposal, accepted_value) — where accepted_proposal and accepted_value are the highest-numbered proposal it has already accepted (if any).

If n is less than or equal to a proposal number it has already responded to: The acceptor ignores the request (or sends a Nack to be polite — the paper doesn’t require this, but implementations always do it because otherwise the proposer just sits there waiting for a response that will never come).

// Acceptor state (must be persistent across crashes!)
state:
    highest_promised = 0       // highest proposal number we've promised
    accepted_proposal = null   // highest proposal number we've accepted
    accepted_value = null      // the value we accepted

function acceptor_on_prepare(n, from_proposer):
    if n > highest_promised:
        highest_promised = n

        // CRITICAL: persist state before responding
        persist_to_disk(highest_promised, accepted_proposal, accepted_value)

        send(from_proposer, Promise(n, accepted_proposal, accepted_value))
    else:
        // Optional but strongly recommended
        send(from_proposer, Nack(n, highest_promised))

That persist_to_disk call is doing enormous work. If the acceptor crashes after sending a Promise but before persisting, and then restarts and accepts a lower-numbered proposal, the safety of the entire protocol is violated. This is the kind of thing that “Paxos Made Simple” mentions in one sentence and that takes a month to get right in production.
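For the curious, the usual shape of that persist_to_disk call is write-to-temp, fsync, atomic rename. A minimal sketch, assuming POSIX semantics (for full durability you would also fsync the containing directory):

```python
import json
import os
import tempfile

def persist_state(path, state):
    """Durably persist acceptor state: write to a temp file, fsync
    it, then atomically rename over the old file. Only after this
    returns may the acceptor send its Promise/Accepted response."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to stable storage
        os.replace(tmp, path)     # atomic on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

The rename ensures a crash mid-write leaves the old state intact rather than a torn file, and the fsync ensures the Promise is never sent before the promise is durable.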

Phase 2: Accept and Accepted

Phase 2 is where the actual value gets proposed and (hopefully) chosen.

The Proposer’s Side (Accept)

The proposer examines the promises it received. If any acceptor reported having already accepted a value, the proposer must propose that value — specifically, the value from the highest-numbered accepted proposal among all promises. If no acceptor has accepted anything, the proposer is free to propose its own value.

This is the key insight of Paxos, the thing that makes the whole protocol work: a proposer does not get to freely choose its value. It must defer to any value that might already be chosen. This constraint is what preserves safety.

function proposer_phase2(n, my_value, promises):
    // Find the highest-numbered already-accepted value
    highest_accepted = null
    for promise in promises:
        if promise.accepted_proposal != null:
            if highest_accepted == null or
               promise.accepted_proposal > highest_accepted.proposal:
                highest_accepted = {
                    proposal: promise.accepted_proposal,
                    value: promise.accepted_value
                }

    // Choose value: defer to any already-accepted value
    if highest_accepted != null:
        value = highest_accepted.value
        // Our original value is lost. This is fine. This is consensus.
    else:
        value = my_value

    // Send Accept request to a majority of acceptors
    for acceptor in choose_majority(all_acceptors):
        send(acceptor, Accept(n, value))

    // Wait for Accepted responses
    accepted_responses = wait_for_majority_responses(timeout)

    if |accepted_responses| >= majority_size:
        // Value is chosen!
        notify_learners(value)
        return value
    else:
        // Failed — some acceptor must have promised a higher number
        return proposer_start(my_value)  // Retry from scratch

The Acceptor’s Side (Accepted)

When an acceptor receives an Accept(n, value) message:

If n >= highest_promised: The acceptor accepts the proposal — it updates its state to record (n, value) as accepted, and responds with Accepted(n, value).

If n < highest_promised: The acceptor has promised not to accept proposals below highest_promised, so it ignores the request (or sends a Nack).

function acceptor_on_accept(n, value, from_proposer):
    if n >= highest_promised:
        highest_promised = n
        accepted_proposal = n
        accepted_value = value

        // CRITICAL: persist before responding
        persist_to_disk(highest_promised, accepted_proposal, accepted_value)

        send(from_proposer, Accepted(n, value))

        // Also notify learners (discussed below)
        for learner in all_learners:
            send(learner, Accepted(n, value))
    else:
        send(from_proposer, Nack(n, highest_promised))

Learning a Chosen Value

A value is chosen when a majority of acceptors have accepted the same proposal number. Learners need to find out about this. There are a few strategies:

  1. Each acceptor sends Accepted to all learners. This is O(acceptors * learners) messages but is the fastest — learners find out as soon as possible.

  2. Each acceptor sends Accepted to a distinguished learner, who tells the others. This reduces messages but adds latency and a single point of failure.

  3. Each acceptor sends Accepted to a set of distinguished learners. A compromise.

In practice, the proposer typically acts as the distinguished learner: it counts Accepted responses, and once it has a majority, it knows the value is chosen and can notify clients.

function learner_on_accepted(n, value, from_acceptor):
    // Track which acceptors have accepted which proposals
    if n not in acceptance_counts:
        acceptance_counts[n] = {value: value, acceptors: {}}

    acceptance_counts[n].acceptors.add(from_acceptor)

    if |acceptance_counts[n].acceptors| >= majority_size:
        // Value is chosen!
        chosen_value = acceptance_counts[n].value
        deliver_to_application(chosen_value)

The Happy Path: A Walk-Through

Let’s trace through a successful run with three acceptors (A1, A2, A3) and one proposer (P1) proposing value “X”.

Proposer P1                 Acceptors A1, A2, A3
    |                            |    |    |
    |--- Prepare(1) ----------->|    |    |
    |--- Prepare(1) ---------------->|    |
    |--- Prepare(1) --------------------->|
    |                            |    |    |
    |<-- Promise(1, null, null) -|    |    |
    |<-- Promise(1, null, null) ------|    |
    |<-- Promise(1, null, null) -----------|
    |                            |    |    |
    |  (No prior accepted values, so P1 proposes "X")
    |                            |    |    |
    |--- Accept(1, "X") ------->|    |    |
    |--- Accept(1, "X") ------------>|    |
    |--- Accept(1, "X") ---------------->|
    |                            |    |    |
    |<-- Accepted(1, "X") ------|    |    |
    |<-- Accepted(1, "X") -----------|    |
    |<-- Accepted(1, "X") ----------------|
    |                            |    |    |
    |  (Majority accepted: "X" is chosen)

Two round trips. Four message types. Beautifully simple when nothing goes wrong.

Things always go wrong.

Failure Case 1: Competing Proposers

This is where Paxos earns its reputation. Two proposers, P1 and P2, both trying to get their value chosen.

P1 (value "X")   P2 (value "Y")    A1        A2        A3
    |                  |             |         |         |
    |--- Prepare(1) ------------------>       |         |
    |--- Prepare(1) ----------------------------->      |
    |                  |             |         |         |
    |                  |--- Prepare(2) -->     |         |
    |                  |--- Prepare(2) ------------>     |
    |                  |             |         |         |
    |<-- Promise(1, null) -------------|       |         |
    |<-- Promise(1, null) ----------------------|        |
    |                  |             |         |         |
    |                  |<-- Promise(2, null) ---|        |
    |                  |<-- Promise(2, null) ------------|
    |                  |             |         |         |
    |  (P1 got majority promises)   |         |         |
    |                  (P2 got majority promises)       |
    |                  |             |         |         |
    |--- Accept(1, "X") ----------->|         |         |
    |--- Accept(1, "X") -------------------->|          |
    |                  |             |         |         |
    |  A2 promised proposal 2 to P2, so it REJECTS Accept(1, "X")
    |                  |             |         |         |
    |<-- Accepted(1, "X") ----------|         |         |
    |<-- Nack(1, 2) ------------------------------|     |
    |                  |             |         |         |
    |  (P1 only got 1 Accepted, not a majority — fails) |

P1’s Accept was rejected by A2 because A2 had already promised proposal number 2 to P2. P1 did not get a majority of Accepted responses, so it must retry with a higher proposal number. Meanwhile, P2 can proceed with its Phase 2.

But here’s the nasty part: when P1 retries with proposal number 3, it might preempt P2’s Accept phase. Then P2 retries with 4, preempting P1. This is the dueling proposers problem, also known as livelock.

Failure Case 2: Dueling Proposers (Livelock)

P1 (value "X")   P2 (value "Y")    A1        A2        A3
    |                  |             |         |         |
    | Prepare(1) succeeds            |         |         |
    |                  | Prepare(2) succeeds              |
    | Accept(1) rejected             |         |         |
    |                  | Accept(2) -- about to send...    |
    | Prepare(3) succeeds            |         |         |
    |                  | Accept(2) rejected     |         |
    |                  | Prepare(4) succeeds              |
    | Accept(3) rejected             |         |         |
    |                  | Accept(4) -- about to send...    |
    | Prepare(5) succeeds            |         |         |
    ...forever...

This is not a theoretical concern. This happens in real systems, especially under high load when multiple clients are submitting requests simultaneously.

The standard mitigations:

  1. Randomized backoff. When a proposer is preempted, it waits a random amount of time before retrying. This is the “just add jitter” school of distributed systems design, and it works surprisingly well in practice.

  2. Leader election. Elect a single distinguished proposer. If only one proposer is active, there’s no contention. This is what Multi-Paxos does (Chapter 6), and it’s what every production system does.

  3. Exponential backoff. Like randomized backoff but more aggressive — double the wait time on each failure. Most implementations combine random jitter with exponential backoff.

function proposer_start_with_backoff(value, attempt = 0):
    n = generate_proposal_number(my_id, highest_seen)

    // ... run Phase 1 and Phase 2 ...

    if failed:
        // Back off before retrying
        max_delay = min(BASE_DELAY * 2^attempt, MAX_DELAY)
        delay = random(0, max_delay)
        sleep(delay)
        return proposer_start_with_backoff(value, attempt + 1)
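A runnable version of that backoff schedule, with full jitter (the constants are illustrative, not protocol-mandated):

```python
import random

# Illustrative constants; the protocol doesn't prescribe values.
BASE_DELAY_MS = 10
MAX_DELAY_MS = 1000

def backoff_delay_ms(attempt: int) -> float:
    """Full-jitter exponential backoff: uniform in [0, capped max]."""
    max_delay = min(BASE_DELAY_MS * (2 ** attempt), MAX_DELAY_MS)
    return random.uniform(0, max_delay)

# The cap matters: without it, a single high attempt number could mean
# minutes of waiting before the proposer tries again.
```

Note that `random.uniform(0, max_delay)` gives the full-jitter variant; some implementations instead jitter in a narrower band around the doubled delay.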

Failure Case 3: Acceptor Crash

Suppose acceptor A3 crashes permanently. With three acceptors (2f+1 = 3, so f = 1), we can tolerate one failure. The protocol proceeds normally with A1 and A2 forming the majority.

But what if A3 crashes after sending a Promise to P1 but before P1’s Accept reaches it? No problem — P1 only needs a majority of Accepted responses. If A1 and A2 both accept, the value is chosen.

What if A3 crashes after accepting a value but before anyone else has? This is more interesting. The accepted value is “stuck” on A3’s disk. If A3 comes back, that value will be reported in future Promise messages, and the protocol ensures it can still be chosen. If A3 never comes back, and no other acceptor accepted that value, then it was never chosen (because a majority never accepted it), and the system can safely choose a different value.

This is a subtle point that trips up many implementers: an individual acceptor accepting a value does not mean the value is chosen. The value is chosen only when a majority of acceptors accept the same proposal number.
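That rule is small enough to encode directly; a hypothetical helper (the shape of the acceptor-state map is my own convention, not from the chapter's pseudocode):

```python
def is_chosen(accepted, n, total_acceptors):
    """True iff a majority of acceptors accepted proposal number n.

    `accepted` maps acceptor id -> (proposal_number, value), or None if
    that acceptor has accepted nothing yet.
    """
    majority = total_acceptors // 2 + 1
    count = sum(1 for a in accepted.values() if a is not None and a[0] == n)
    return count >= majority

# A3 alone accepting (1, "X") does not make "X" chosen...
assert not is_chosen({"A1": None, "A2": None, "A3": (1, "X")}, 1, 3)
# ...but a majority accepting the same proposal number does.
assert is_chosen({"A1": (1, "X"), "A2": (1, "X"), "A3": None}, 1, 3)
```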

Failure Case 4: Message Loss

Paxos handles message loss by… not handling it. The protocol is correct regardless of which messages are lost. If a Prepare is lost, the proposer times out and retries. If a Promise is lost, the proposer might not get a majority and retries. If an Accept is lost, same thing. If an Accepted is lost, the learner might not find out the value is chosen, but the value IS still chosen, and a future round will discover it.

This is the beauty of Paxos: the safety argument does not depend on message delivery guarantees at all. Liveness requires that messages eventually get through (the “sufficient synchrony” assumption), but safety holds even in a fully asynchronous network with arbitrary message loss.

Why Paxos Is Correct: Intuition

The full proof is in Lamport’s papers, and I won’t reproduce it here. But the intuition is worth understanding.

The key invariant is: if a value v has been chosen (accepted by a majority) with proposal number n, then every higher-numbered proposal that is accepted by any acceptor also has value v.

Why does this hold? Because of the Phase 1 constraint:

  1. If value v was chosen with proposal n, then a majority of acceptors accepted (n, v).
  2. Any future proposer with proposal number m > n must do Phase 1 and get promises from a majority.
  3. Any two majorities overlap in at least one acceptor (this is the pigeonhole principle, doing heroic work).
  4. That overlapping acceptor has accepted (n, v) and will report it in its Promise.
  5. The Phase 2 rule forces the new proposer to propose v (since it must use the value from the highest-numbered accepted proposal among the promises it received).

The overlapping quorum is doing all the heavy lifting. This is why you need 2f+1 acceptors — not for availability (though that helps), but for the intersection property that makes the safety proof work.

Majority 1 (chose value v):    { A1, A2, A3 }     (3 out of 5)
Majority 2 (new proposer):     { A3, A4, A5 }     (3 out of 5)
                                  ^^
                          Overlap: A3 reports v
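For small clusters you can verify the intersection property by brute force; a throwaway sketch:

```python
from itertools import combinations

def majorities(n):
    """All majority quorums of an n-acceptor cluster (quorum = n // 2 + 1)."""
    q = n // 2 + 1
    return [set(c) for c in combinations(range(n), q)]

# Any two majorities of a 5-node (2f+1 with f=2) cluster share an acceptor:
# two 3-sets drawn from 5 elements must overlap, since 3 + 3 > 5.
for m1 in majorities(5):
    for m2 in majorities(5):
        assert m1 & m2, "two majorities failed to intersect"
```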

There is one subtlety here that the “simple” explanations gloss over: proposal numbers between n and m might have started Phase 1 but never completed Phase 2. The proof handles this by induction on proposal numbers, showing the invariant holds for each number in sequence. It’s not complicated, but it is fiddly, and getting the induction right is where most people’s eyes glaze over.

What the Paper Doesn’t Tell You

Now for the part that matters if you’re trying to build something. Lamport’s papers describe the protocol. They do not describe a system. The gap between the two is vast and full of tears.

1. State Machine Identity

Single-decree Paxos agrees on one value. You want to agree on a sequence of commands — a replicated log. The paper waves at this with “just run multiple instances.” Chapter 6 covers this in detail, but the short version is: the devil is in every single detail of how you index those instances, how you handle gaps, how you compact old entries, and how you bring up new replicas.

2. Disk Persistence

The protocol requires that acceptor state survives crashes. This means fsync(). Do you know how slow fsync() is? On a spinning disk, you’re looking at 5-10ms. On an SSD, maybe 0.1-0.5ms. Either way, you’re fsyncing on the critical path — before you can send a Promise or Accepted message. Batching helps. Group commit helps. But the paper doesn’t discuss any of this.
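A toy sketch of the group-commit idea, just to show the shape of the batching win; the class is illustrative, and the counter stands in for the real `fsync()` call (a real log would write the buffered records to disk before the single fsync):

```python
class GroupCommitLog:
    """Toy group commit: buffer records, then pay one fsync per flush
    instead of one fsync per record."""
    def __init__(self):
        self.buffer = []
        self.fsync_count = 0   # stand-in for actual fsync() calls

    def append(self, record):
        self.buffer.append(record)   # no disk cost yet

    def flush(self):
        # A real log would write self.buffer to disk here, then fsync once.
        self.buffer.clear()
        self.fsync_count += 1

log = GroupCommitLog()
for i in range(100):
    log.append(("promise", i))
log.flush()
assert log.fsync_count == 1   # 100 records, one fsync on the critical path
```

The tradeoff is latency for the first record in each batch, which waits for the batch to fill or a timer to fire.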

3. Network Timeouts

The paper describes no timeout mechanism. In practice, you need timeouts everywhere:

  • How long does a proposer wait for Promise responses?
  • How long does it wait for Accepted responses?
  • When does a learner give up and ask again?
  • How long before a client request is considered failed?

Getting these timeouts wrong doesn’t violate safety (Paxos is always safe!), but it destroys liveness. Set them too low and you get thrashing. Set them too high and the system is unresponsive to failures.

4. How Learners Actually Learn

The paper is remarkably vague about how learners discover chosen values. In practice, this requires either:

  • Having the proposer announce the chosen value (but what if the proposer crashes right after the value is chosen but before announcing it?)
  • Having each acceptor send Accepted messages to all learners (lots of messages)
  • Having learners periodically query acceptors (latency)

Most implementations punt on this by having the proposer tell the client directly and relying on the next Paxos instance to implicitly confirm the previous one.

5. Read Operations

Paxos agrees on a value. It says nothing about reading that value later. If you want linearizable reads (and you do, or why are you bothering with consensus?), you need one of the following:

  • Run a full Paxos round for every read (expensive)
  • Use leases (a majority promises not to accept proposals from anyone else for some time period, so reads served by the leader are safe within the lease)
  • Read from any majority and take the highest-numbered accepted value (which might not be chosen yet — subtle!)
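The lease-based read check itself is small; a sketch (the class, the clock-skew margin, and the elided quorum grant/renewal protocol are all illustrative):

```python
import time

class LeaderLease:
    """Lease-based reads, sketched: the leader serves a read locally only
    while its lease is valid, minus a margin for clock skew."""
    def __init__(self, clock_skew_margin=0.01):
        self.lease_expiry = 0.0
        self.margin = clock_skew_margin

    def grant(self, duration):
        # In a real system this happens only after a majority has promised
        # not to back another leader for `duration` seconds.
        self.lease_expiry = time.monotonic() + duration

    def can_serve_read(self):
        return time.monotonic() < self.lease_expiry - self.margin

lease = LeaderLease()
assert not lease.can_serve_read()   # no lease: must go through a quorum
lease.grant(duration=5.0)
assert lease.can_serve_read()       # within the lease: local read is safe
```

The clock-skew margin is the uncomfortable part: leases trade a timing assumption for read latency, which is exactly the kind of assumption pure Paxos avoids.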

6. Client Interaction

What happens when a client submits a request? The paper doesn’t say. You need to figure out:

  • Which proposer handles the request?
  • What if the proposer crashes after the value is chosen but before responding to the client?
  • What if the client retries and the request gets executed twice?
  • How does the client find the current leader?

All of these require idempotency tokens, client tables, leader redirect mechanisms, and other infrastructure that the protocol description simply assumes exists.
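One of those pieces, the client table, fits in a few lines; a sketch with hypothetical names (`ClientTable`, `submit`), assuming each client tags requests with a monotonically increasing sequence number:

```python
class ClientTable:
    """Deduplicate retried client requests: execute each (client_id, seq)
    at most once and replay the cached response on retry."""
    def __init__(self, apply_fn):
        self.apply_fn = apply_fn   # applies a command to the state machine
        self.results = {}          # (client_id, seq) -> cached response

    def submit(self, client_id, seq, command):
        key = (client_id, seq)
        if key not in self.results:          # first time: actually execute
            self.results[key] = self.apply_fn(command)
        return self.results[key]             # retries replay the cached response

executions = []
def apply_command(cmd):
    executions.append(cmd)
    return "ok:" + cmd

table = ClientTable(apply_command)
assert table.submit("c1", 1, "incr") == "ok:incr"
assert table.submit("c1", 1, "incr") == "ok:incr"  # retry: cached, not re-run
assert executions == ["incr"]                      # executed exactly once
```

In a real system the table itself must be replicated through the log, or a leader failover loses the deduplication state.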

The Full Acceptor: Putting It Together

Here is the complete acceptor pseudocode with all the details:

class PaxosAcceptor:
    // Persistent state (MUST survive crashes)
    persistent:
        highest_promised: ProposalNumber = 0
        accepted_proposal: ProposalNumber = null
        accepted_value: Value = null

    function on_prepare(n: ProposalNumber, from: ProposerId):
        if n > self.highest_promised:
            self.highest_promised = n
            self.persist()  // fsync!

            reply(from, Promise {
                proposal: n,
                accepted_proposal: self.accepted_proposal,
                accepted_value: self.accepted_value
            })
        else:
            reply(from, Nack {
                proposal: n,
                highest_promised: self.highest_promised
            })

    function on_accept(n: ProposalNumber, value: Value, from: ProposerId):
        if n >= self.highest_promised:
            self.highest_promised = n
            self.accepted_proposal = n
            self.accepted_value = value
            self.persist()  // fsync!

            reply(from, Accepted {
                proposal: n,
                value: value
            })

            // Notify learners
            for learner in self.known_learners:
                send(learner, Accepted {
                    proposal: n,
                    value: value
                })
        else:
            reply(from, Nack {
                proposal: n,
                highest_promised: self.highest_promised
            })

    function persist():
        write_to_disk(self.highest_promised,
                      self.accepted_proposal,
                      self.accepted_value)
        fsync()  // DO NOT SKIP THIS

    function recover():
        // Called on restart
        (self.highest_promised,
         self.accepted_proposal,
         self.accepted_value) = read_from_disk()
        // Resume normal operation — no special recovery needed!

The Full Proposer: Putting It Together

class PaxosProposer:
    state:
        my_id: ProposerId
        highest_seen: SequenceNumber = 0
        quorum_size: int  // majority of acceptors

    function propose(value: Value) -> Value:
        attempt = 0
        while true:
            result = try_propose(value, attempt)
            if result.success:
                return result.chosen_value
            attempt += 1

    function try_propose(value: Value, attempt: int) -> Result:
        // Generate proposal number
        n = (self.highest_seen + 1, self.my_id)
        self.highest_seen = n.sequence
        persist(self.highest_seen)  // Must survive crashes

        // Phase 1: Prepare
        promises = []
        nacks = []
        send_to_all_acceptors(Prepare(n))

        deadline = now() + phase1_timeout
        while |promises| < self.quorum_size and now() < deadline:
            msg = receive(deadline - now())
            if msg is Promise and msg.proposal == n:
                promises.append(msg)
            elif msg is Nack and msg.proposal == n:
                nacks.append(msg)
                // Update highest_seen from nack
                self.highest_seen = max(self.highest_seen,
                                        msg.highest_promised.sequence)

        if |promises| < self.quorum_size:
            backoff(attempt)
            return Result(success=false)

        // Determine the value to propose (the Phase 2 rule)
        highest_accepted = null
        for promise in promises:
            if promise.accepted_proposal != null:
                if highest_accepted == null or
                   promise.accepted_proposal > highest_accepted.accepted_proposal:
                    highest_accepted = promise

        if highest_accepted != null:
            value = highest_accepted.accepted_value
        // else: use our original value

        // Phase 2: Accept
        accepted_count = 0
        send_to_all_acceptors(Accept(n, value))

        deadline = now() + phase2_timeout
        while accepted_count < self.quorum_size and now() < deadline:
            msg = receive(deadline - now())
            if msg is Accepted and msg.proposal == n:
                accepted_count += 1
            elif msg is Nack and msg.proposal == n:
                self.highest_seen = max(self.highest_seen,
                                        msg.highest_promised.sequence)
                backoff(attempt)
                return Result(success=false)

        if accepted_count >= self.quorum_size:
            return Result(success=true, chosen_value=value)
        else:
            backoff(attempt)
            return Result(success=false)

    function backoff(attempt: int):
        max_delay = min(BASE_DELAY_MS * 2^attempt, MAX_DELAY_MS)
        sleep(random(0, max_delay))

Lamport’s Writing and Its Effect on the Field

It would be incomplete to discuss Paxos without discussing how it was communicated to the world, because the communication strategy — or perhaps the miscommunication strategy — materially affected how the field developed.

“The Part-Time Parliament” is a work of dry humor wrapped around a profound result. The fictional conceit — legislators on the island of Paxos agreeing on decrees — maps onto the protocol with remarkable precision. The roles of proposers, acceptors, and learners correspond to Paxon politicians, and the protocol phases correspond to parliamentary procedures.

The problem is that most readers are not trying to appreciate literary craft when reading a distributed systems paper. They are trying to understand how to make their database not lose data. The fictional frame adds cognitive overhead that, for many readers, obscures rather than illuminates.

“Paxos Made Simple,” for all its claims of simplicity, is dense. It presents the protocol in about five pages, which is admirably concise but leaves no room for the examples and failure-case analysis that most engineers need to build intuition. The paper’s brevity is itself a kind of opacity.

The result was that for years, Paxos was viewed as a mysterious, almost mystical algorithm — understood by a priesthood of distributed systems researchers and feared by everyone else. This reputation was not entirely undeserved (the protocol IS subtle), but it was amplified enormously by the way it was presented. The success of Raft (Chapter 8), which solves essentially the same problem but was designed explicitly for understandability, is Exhibit A in the case that Paxos’s difficulty was at least partly a communication failure.

Common Misconceptions

“Paxos requires three rounds of messages.” No. The happy path is two rounds: Prepare/Promise and Accept/Accepted. The learner notification is a third round but is not part of the consensus protocol per se.

“Paxos requires all acceptors to respond.” No. It requires a majority (quorum). That’s the whole point — it tolerates f failures out of 2f+1 acceptors.

“A proposer can choose any value.” Only if no acceptor reports a previously accepted value. Otherwise the proposer is constrained.

“Paxos guarantees liveness.” No. FLP says it can’t, and the dueling proposers scenario is a concrete demonstration. It guarantees safety always and liveness under reasonable timing assumptions.

“Once an acceptor accepts a value, that value is chosen.” No. The value is chosen when a majority accepts the same proposal number. An individual accept means nothing by itself.

“Paxos is just two-phase commit.” Absolutely not. 2PC requires all participants to vote yes. Paxos requires only a majority. 2PC blocks if the coordinator crashes. Paxos makes progress as long as a majority is reachable. They solve fundamentally different problems, though the phase structure looks superficially similar.

Performance Characteristics

In the steady state (single proposer, no failures):

  • Message complexity: 4n messages for n acceptors (n Prepares + n Promises + n Accepts + n Accepteds), not counting learner notification. Notifying a single distinguished learner adds n more messages; notifying m learners directly adds n*m.
  • Round trips: 2 (Prepare/Promise, then Accept/Accepted).
  • Latency: Dominated by the two sequential fsyncs on the acceptor side (one for Promise, one for Accepted) and network round trip time.
  • Throughput: One value chosen per two round trips. This is terrible if you’re choosing one value at a time. Multi-Paxos (Chapter 6) fixes this.

The Bridge to Multi-Paxos

Single-decree Paxos is a building block, not a system. To build a replicated state machine, you need to agree on a sequence of commands, not just one. The naive approach — run a separate instance of Paxos for each command — works but is absurdly expensive: two round trips per command, each requiring fsyncs.

Multi-Paxos optimizes this by electing a stable leader who can skip Phase 1 for most commands, reducing the steady-state cost to a single round trip. But as we’ll see in the next chapter, the distance between “just run multiple Paxos instances with a stable leader” and a working system is approximately the distance between Earth and Alpha Centauri.

Single-decree Paxos is beautiful. It is minimal, elegant, and provably correct. It is also, by itself, almost entirely useless for building real systems. That is the agony of Paxos: the core insight is perfect, and everything around it is pain.

Multi-Paxos and the Gap Between Paper and Production

Here is a dirty secret of distributed systems: there is no paper called “Multi-Paxos.” Lamport sketched the idea in a few paragraphs at the end of “The Part-Time Parliament” and elaborated slightly in “Paxos Made Simple.” The rest — the actual protocol that every production system implements — was reverse-engineered from those sketches, from folklore passed between implementers, from conference hallway conversations, and from the painful experience of people who tried to build real systems from a description that amounted to “just run Paxos repeatedly, with a stable leader, and optimize the obvious things.”

This is the gap between paper and production. It is not a gap. It is a chasm, and the bodies of failed implementations line the bottom.

Why Single-Decree Paxos Is Not Enough

Single-decree Paxos agrees on one value. A replicated state machine needs to agree on a sequence of commands — a log of operations that every replica executes in the same order. You need consensus not once but continuously, for every operation the system processes.

The naive approach is straightforward: for the i-th entry in the log, run an independent instance of single-decree Paxos. Instance 1 agrees on the first command. Instance 2 agrees on the second command. And so on.

This works. It is also catastrophically slow:

  • Each instance requires two round trips (Phase 1 and Phase 2).
  • Each round trip includes at least one fsync on each acceptor.
  • Instances must be sequential if commands can depend on each other.
  • Every instance might face contention from competing proposers.

At, say, 1ms per round trip and 0.2ms per fsync, you’re looking at roughly 2.4ms per command in the best case. That’s about 400 operations per second. Your average laptop’s SQLite database laughs at this throughput.
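Spelling out that arithmetic, using the same illustrative numbers:

```python
# Back-of-envelope cost of naive per-command Paxos, per the text:
# 1 ms per round trip, 0.2 ms per fsync, two phases per command.
round_trip_ms = 1.0
fsync_ms = 0.2
phases = 2   # Prepare/Promise and Accept/Accepted

latency_ms = phases * (round_trip_ms + fsync_ms)   # ~2.4 ms per command
throughput_per_sec = 1000 / latency_ms             # ~417 commands/sec

assert abs(latency_ms - 2.4) < 1e-9
assert 400 < throughput_per_sec < 420
```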

We need to do better.

The Multi-Paxos Optimization

The key insight is that Phase 1 is only necessary to establish the proposer’s right to propose. If a proposer has already completed Phase 1 for some proposal number n and nobody has tried to preempt it, then that proposer can skip Phase 1 for subsequent instances and go directly to Phase 2.

This is the stable leader optimization, and it transforms the performance of the protocol:

  • In steady state, the leader sends Accept messages and receives Accepted responses — one round trip per command.
  • The leader can pipeline — sending Accept for instance i+1 before receiving responses for instance i.
  • Phase 1 is only needed when the leader changes (which should be rare in a healthy system).

Here’s what steady-state Multi-Paxos looks like:

Leader                          Followers (Acceptors)
  |                               F1       F2       F3
  |                               |        |        |
  |--- Accept(n, slot=1, cmd_A)->|        |        |
  |--- Accept(n, slot=1, cmd_A)--------->|        |
  |--- Accept(n, slot=1, cmd_A)------------------>|
  |                               |        |        |
  |--- Accept(n, slot=2, cmd_B)->|        |        |
  |--- Accept(n, slot=2, cmd_B)--------->|        |
  |--- Accept(n, slot=2, cmd_B)------------------>|
  |                               |        |        |
  |<-- Accepted(n, slot=1) ------|        |        |
  |<-- Accepted(n, slot=1) --------------|        |
  |                               |        |        |
  |  (slot 1 committed — majority accepted)       |
  |                               |        |        |
  |<-- Accepted(n, slot=2) ------|        |        |
  |<-- Accepted(n, slot=2) --------------|        |
  |                               |        |        |
  |  (slot 2 committed — majority accepted)       |

One round trip. Pipelining. This is a protocol you can actually build a database on.

The Stable Leader

The term “leader” doesn’t appear in single-decree Paxos. In Multi-Paxos, it’s the entire game. The leader is a distinguished proposer that handles all client requests and drives the protocol. Everyone else is a follower.

But how do you elect the leader? Lamport’s papers are deliberately vague — the leader election mechanism is outside the consensus protocol itself. In practice, you need:

  1. Failure detection: Followers must detect when the leader has failed (usually via heartbeat timeouts).
  2. Leader election: When the leader fails, a new leader must be elected. This is itself a consensus problem, which creates a delightful circularity. In practice, most systems use the Paxos mechanism itself: a prospective leader runs Phase 1 with a higher proposal number, which implicitly establishes it as the new leader if it succeeds.
  3. Leader stability: You want the leader to be stable — frequent leader changes kill performance because each change requires Phase 1.

class MultiPaxosNode:
    state:
        role: Leader | Follower
        current_leader: NodeId = null
        leader_lease_expiry: Timestamp = 0
        heartbeat_interval: Duration = 100ms
        election_timeout: Duration = random(300ms, 500ms)
        last_heartbeat_received: Timestamp = 0

    function follower_loop():
        while self.role == Follower:
            if now() - self.last_heartbeat_received > self.election_timeout:
                // Leader seems dead — try to become leader
                self.start_election()
            sleep(self.heartbeat_interval)

    function start_election():
        new_proposal = generate_proposal_number(self.my_id, self.highest_seen)

        // Run Phase 1 for ALL active log slots
        promises = send_prepare_to_majority(new_proposal)

        if |promises| >= quorum_size:
            self.role = Leader
            self.current_leader = self.my_id
            self.proposal_number = new_proposal

            // Process any already-accepted values from promises
            self.reconcile_log(promises)

            // Start sending heartbeats
            self.leader_loop()

    function leader_loop():
        while self.role == Leader:
            // Send heartbeats to maintain leadership
            broadcast(Heartbeat(self.proposal_number))

            // Process client requests
            for request in pending_requests:
                self.replicate(request)

            sleep(self.heartbeat_interval)

Steady-State Leader Operation

Here is the pseudocode for the leader’s steady-state operation — accepting a client request and replicating it:

class MultiPaxosLeader:
    state:
        log: Map<SlotNumber, LogEntry>
        next_slot: SlotNumber = 1
        commit_index: SlotNumber = 0  // highest committed slot
        proposal_number: ProposalNumber
        match_index: Map<NodeId, SlotNumber>  // per-follower progress

    function handle_client_request(command: Command) -> Result:
        // Assign the command to the next log slot
        slot = self.next_slot
        self.next_slot += 1

        entry = LogEntry {
            slot: slot,
            proposal: self.proposal_number,
            command: command,
            accepted_by: {self.my_id}  // Leader accepts its own entries
        }
        self.log[slot] = entry
        persist(self.log[slot])

        // Send Accept to all followers
        for follower in self.followers:
            send(follower, Accept {
                proposal: self.proposal_number,
                slot: slot,
                command: command
            })

        // Wait for majority (could be async with pipelining)
        wait_until(|entry.accepted_by| >= self.quorum_size)

        // Committed! advance_commit_index() applies the command to the
        // state machine exactly once, in slot order.
        self.advance_commit_index()

        return Result(success=true, response=self.log[slot].result)

    function on_accepted(slot: SlotNumber, from: NodeId):
        if slot in self.log:
            self.log[slot].accepted_by.add(from)
            self.match_index[from] = max(self.match_index[from], slot)

            if |self.log[slot].accepted_by| >= self.quorum_size:
                self.advance_commit_index()

    function advance_commit_index():
        // Find the highest slot accepted by a majority
        while self.commit_index + 1 in self.log and
              |self.log[self.commit_index + 1].accepted_by| >= self.quorum_size:
            self.commit_index += 1
            // Record each result so handle_client_request can respond
            self.log[self.commit_index].result =
                apply_to_state_machine(self.log[self.commit_index].command)

        // Notify followers of new commit index
        // (often piggybacked on next Accept message)

The Log: Gaps, Holes, and Ordering

Here is where Multi-Paxos diverges significantly from Raft, and where many implementation bugs lurk.

In Multi-Paxos, the log can have gaps. If a leader assigns slot 5 but crashes before its Accept message reaches any follower, and a new leader takes over, the new leader might assign slot 5 to a different command — or might leave slot 5 empty and start from slot 6. This happens because Multi-Paxos runs independent Paxos instances per slot, and there’s no requirement that slots be filled contiguously.

This is theoretically fine but practically awful:

  • Your state machine expects commands in order. You can’t apply slot 6 until slot 5 is resolved.
  • A gap in the log means you need to run Paxos for that slot to determine if a value was previously chosen (maybe a no-op).
  • Clients might be waiting for a response to the command that was in the gap.

function new_leader_fill_gaps():
    // After winning election, fill any gaps in the log
    for slot in range(1, self.next_slot):
        if slot not in self.log or self.log[slot].status != COMMITTED:
            // Run Phase 2 for this slot
            // If promises reported an accepted value, use it
            // Otherwise, propose a no-op
            if slot in values_from_promises:
                value = values_from_promises[slot]
            else:
                value = NO_OP

            replicate_to_slot(slot, value)

Raft (Chapter 8) avoids this entirely by requiring the log to be contiguous — no holes allowed. This is one of Raft’s genuine simplifications, and it’s worth appreciating how much complexity it eliminates.

Reconfiguration: Changing the Membership

At some point, you need to add or remove servers from your cluster. A server’s disk fails, you want to replace it. You’re scaling from three nodes to five. You’re decommissioning a data center.

This is the membership change problem, and it is — to put it charitably — under-specified in the Paxos literature. Lamport’s papers mention that the state machine can include reconfiguration commands, but the details of how to do this safely are left as an exercise for the reader. An exercise that has kept the reader up at night for decades.

The fundamental danger is this: during a reconfiguration, the old configuration and the new configuration might independently form majorities that don’t overlap. If the old configuration is {A, B, C} and the new configuration is {B, C, D}, the old majority {A, B} and the new majority {C, D} don’t overlap. Two different values could be chosen for the same slot. Consensus is violated. Your database is corrupt. Your pager is going off.
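The example from that paragraph, checked directly:

```python
# Old config {A,B,C}, new config {B,C,D}: each side can form a legitimate
# majority of its own configuration with zero acceptors in common.
old_config = {"A", "B", "C"}
new_config = {"B", "C", "D"}

old_majority = {"A", "B"}    # 2 of 3: a valid majority of old_config
new_majority = {"C", "D"}    # 2 of 3: a valid majority of new_config

assert len(old_majority) > len(old_config) // 2
assert len(new_majority) > len(new_config) // 2
assert old_majority & new_majority == set()  # no overlap: two values can win
```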

Approach 1: Alpha-Based Reconfiguration

Lamport’s approach is to have the state machine command at slot i determine the configuration used for slot i + alpha, where alpha is a parameter large enough to ensure that the reconfiguration command has been committed and applied before the new configuration takes effect.

// Configuration is determined by state machine commands
function get_configuration_for_slot(slot):
    // Look at all committed reconfiguration commands
    // with slot numbers <= (slot - ALPHA)
    config = initial_configuration
    for s in range(1, slot - ALPHA + 1):
        if log[s].command is ReconfigCommand:
            config = log[s].command.new_config
    return config

This works but has a serious limitation: you can only process one reconfiguration command every alpha slots. If alpha is 100 (a common choice), and each slot takes 1ms, that’s one reconfiguration per 100ms. This is usually fine, but it means you can’t rapidly reconfigure in an emergency.

Approach 2: Joint Consensus

Raft’s joint consensus approach (described in detail in Chapter 8) can be adapted for Multi-Paxos. The idea is to transition through an intermediate configuration that requires majorities from BOTH the old and new configurations.

Approach 3: The “Just Use a Configuration Log” Approach

Many practical systems (including Chubby) use a separate Paxos instance to agree on configuration changes. The configuration log is itself replicated using Paxos with a fixed (or very rarely changing) set of members. This is the “turtles all the way down” approach — your configuration is managed by consensus, which requires a configuration, which is managed by consensus…

In practice, the base configuration is typically hardcoded or stored in a file that is updated manually. We pretend this is fine.

Log Compaction and Snapshotting

An append-only log grows without bound. In a long-running system, you’ll eventually run out of disk space, and replaying the entire log on startup takes longer and longer. You need to compact the log.

The standard approach is snapshotting: periodically take a snapshot of the state machine’s state, then discard all log entries up to the snapshot point.

class SnapshotManager:
    state:
        last_snapshot_slot: SlotNumber = 0
        snapshot_interval: int = 10000  // slots between snapshots

    function maybe_snapshot():
        if self.commit_index - self.last_snapshot_slot >= self.snapshot_interval:
            self.take_snapshot()

    function take_snapshot():
        // Must be atomic with respect to the state machine
        snapshot = serialize(self.state_machine.state)
        snapshot_metadata = {
            last_included_slot: self.commit_index,
            last_included_proposal: self.log[self.commit_index].proposal,
            configuration: self.current_config
        }

        write_snapshot_to_disk(snapshot, snapshot_metadata)
        fsync()

        // Now we can discard old log entries
        for slot in range(1, self.commit_index + 1):  // inclusive — the snapshot covers commit_index
            delete self.log[slot]

        self.last_snapshot_slot = self.commit_index

    function install_snapshot(snapshot, metadata):
        // Called when a follower is too far behind and the leader
        // sends its snapshot instead of individual log entries
        write_snapshot_to_disk(snapshot, metadata)
        self.state_machine.restore(deserialize(snapshot))
        self.commit_index = metadata.last_included_slot
        self.last_snapshot_slot = metadata.last_included_slot

        // Discard any log entries covered by the snapshot
        for slot in range(1, metadata.last_included_slot + 1):
            delete self.log[slot]

Snapshotting interacts with the rest of the system in several unpleasant ways:

Follower catchup. If a follower is too far behind (its next needed log entry has been discarded), the leader must send its snapshot. This is a large transfer that can take seconds or minutes, during which the follower is unavailable.

Snapshot consistency. The snapshot must represent a consistent state machine state at a specific log position. If the state machine is large (gigabytes of data), taking a consistent snapshot while continuing to process requests requires copy-on-write semantics or a quiescent period.

Snapshot transfer. Sending a multi-gigabyte snapshot over the network is not a trivial operation. You need flow control, checksumming, resumability (what if the transfer is interrupted?), and you need to do it without starving normal replication traffic.
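To make the transfer concerns concrete, here is a minimal sketch of chunked, checksummed, resumable snapshot transfer. This is an illustration of the general technique, not any particular system’s wire format; the helper names and the tiny chunk size are invented for the example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use megabytes

def make_chunks(snapshot: bytes):
    """Split a snapshot into checksummed chunks so a transfer can be
    verified piecewise and resumed from the last good offset."""
    chunks = []
    for off in range(0, len(snapshot), CHUNK_SIZE):
        data = snapshot[off:off + CHUNK_SIZE]
        chunks.append({
            "offset": off,
            "data": data,
            "checksum": hashlib.sha256(data).hexdigest(),
        })
    return chunks

def receive(chunks, state=None):
    """Follower side: verify each chunk; on mismatch, return the offset
    to resume from instead of restarting the whole transfer."""
    state = bytearray() if state is None else state
    for c in chunks:
        if hashlib.sha256(c["data"]).hexdigest() != c["checksum"]:
            return bytes(state), c["offset"]  # resume point
        state.extend(c["data"])
    return bytes(state), None  # transfer complete

snap = b"0123456789"
got, resume = receive(make_chunks(snap))
assert got == snap and resume is None
```

A production version would add flow control (pacing chunks so replication traffic isn’t starved), which is deliberately omitted here.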

What Google Actually Built

The most instructive documentation of the gap between Paxos-the-protocol and Paxos-the-system comes from Google, which built several systems on Multi-Paxos and had the good grace to publish papers about the experience.

Chubby (2006)

Chubby is Google’s distributed lock service, built on Multi-Paxos. The Chubby paper by Mike Burrows is remarkably candid about the implementation challenges:

  • They found it extremely difficult to implement Multi-Paxos from the paper alone.
  • They had to invent solutions for master leases, client session management, and event notifications — none of which are part of Paxos.
  • The system’s complexity was dominated by these “engineering” concerns, not by the consensus protocol.

Paxos Made Live (2007)

This paper by Chandra, Griesemer, and Redstone is perhaps the most valuable document ever written about implementing Paxos. It catalogs the challenges they encountered building Chubby’s Paxos implementation:

  1. Disk corruption. The paper assumes reliable persistent storage. Disks lie. Google had to add checksums to every on-disk data structure and handle corruption gracefully.

  2. Master leases. To serve reads without running Paxos, the master needs a lease — a time-bounded promise that no other node will become master. Implementing leases correctly requires carefully reasoned clock assumptions.

  3. Group membership. Changing the set of participants is, in their words, “one of the most complex and least well-understood aspects of Paxos.”

  4. Snapshots. They describe the snapshot mechanism as “the single hardest thing to get right” in their implementation.

  5. Testing. They built an elaborate testing framework including a simulated network, simulated disk, and simulated clock to test failure scenarios. Even with this, bugs slipped through.

The paper’s conclusion is worth quoting approximately: the algorithm’s correctness proof gave them confidence in the design, but the implementation challenges were enormous and mostly orthogonal to the consensus protocol itself.

Spanner (2012)

Google’s Spanner is a globally distributed database that uses Paxos for replication within each shard. Spanner’s contribution to the Paxos implementation landscape includes:

  • Long-lived leaders with 10-second lease periods, enabling efficient reads.
  • Pipelining of writes to achieve high throughput despite global latencies.
  • TrueTime — GPS and atomic clock-based time synchronization that enables externally consistent reads without running Paxos. This is not a Paxos optimization; it’s a way to avoid Paxos for reads entirely.

Why Every Implementation Is Different

Here is a non-exhaustive list of decisions you must make when implementing Multi-Paxos, none of which are specified by the protocol:

| Decision | Options | Tradeoffs |
|---|---|---|
| Leader election trigger | Heartbeat timeout / external failure detector / client-triggered | Latency vs. complexity vs. false positives |
| Heartbeat mechanism | Separate messages / piggybacked on Accept | Message overhead vs. implementation simplicity |
| Log indexing | Contiguous / sparse with gaps | Complexity vs. flexibility |
| Follower catchup | Send individual entries / send snapshot / hybrid | Network vs. latency vs. complexity |
| Read handling | Read through Paxos / leader leases / follower reads | Consistency vs. latency vs. throughput |
| Batching | Per-entry / batched / adaptive | Latency vs. throughput |
| Pipelining depth | Fixed / adaptive / unlimited | Throughput vs. memory vs. ordering complexity |
| Disk persistence | Write-ahead log / separate state files / embedded DB | Performance vs. simplicity vs. recovery time |
| Snapshot format | Full state / incremental / log-structured | Space vs. time vs. complexity |
| Membership changes | Alpha-based / joint consensus / separate config log | Safety vs. availability vs. complexity |
| Client interaction | Leader-only / redirect / proxy | Latency vs. load balancing vs. complexity |

That’s eleven fundamental design decisions, most with three or more reasonable options. The combinatorial space is enormous, and the interactions between these choices are subtle. This is why every Multi-Paxos implementation is different, and why the phrase “we use Paxos” tells you almost nothing about what the system actually does.

Steady-State Follower

For completeness, here is the follower’s steady-state logic:

class MultiPaxosFollower:
    state:
        log: Map<SlotNumber, LogEntry>
        commit_index: SlotNumber = 0
        highest_promised: ProposalNumber = 0
        current_leader: NodeId = null
        last_heartbeat: Timestamp = 0

    function on_accept(msg: AcceptMessage):
        if msg.proposal < self.highest_promised:
            // Stale leader — reject
            reply(msg.from, Nack {
                slot: msg.slot,
                highest_promised: self.highest_promised
            })
            return

        // Accept the entry
        self.highest_promised = msg.proposal
        self.log[msg.slot] = LogEntry {
            slot: msg.slot,
            proposal: msg.proposal,
            command: msg.command,
            status: ACCEPTED
        }
        persist(self.log[msg.slot], self.highest_promised)

        reply(msg.from, Accepted {
            slot: msg.slot,
            proposal: msg.proposal
        })

        // Update commit index if leader told us
        if msg.leader_commit > self.commit_index:
            old_commit = self.commit_index
            self.commit_index = min(msg.leader_commit,
                                     self.highest_contiguous_slot())
            for slot in range(old_commit + 1, self.commit_index + 1):
                apply_to_state_machine(self.log[slot].command)

    function on_heartbeat(msg: HeartbeatMessage):
        if msg.proposal >= self.highest_promised:
            self.current_leader = msg.from
            self.last_heartbeat = now()
            self.highest_promised = msg.proposal

            // Heartbeats often carry commit index updates
            if msg.commit_index > self.commit_index:
                self.advance_commit_index(msg.commit_index)

    function on_prepare(msg: PrepareMessage):
        // New leader election in progress
        if msg.proposal > self.highest_promised:
            self.highest_promised = msg.proposal
            persist(self.highest_promised)

            // Send all our accepted entries
            accepted_entries = {}
            for slot, entry in self.log:
                if entry.status == ACCEPTED:
                    accepted_entries[slot] = (entry.proposal, entry.command)

            reply(msg.from, Promise {
                proposal: msg.proposal,
                accepted_entries: accepted_entries
            })
        else:
            reply(msg.from, Nack {
                proposal: msg.proposal,
                highest_promised: self.highest_promised
            })

    function highest_contiguous_slot():
        slot = 0
        while (slot + 1) in self.log:
            slot += 1
        return slot

Leader Takeover: The Messy Part

When a leader fails and a new leader takes over, the new leader must reconcile the state of all the followers. This is the most complex part of Multi-Paxos and where most bugs hide.

function leader_takeover():
    new_proposal = generate_proposal_number(self.my_id, self.highest_seen)

    // Phase 1: Send Prepare to all acceptors
    promises = send_prepare_and_collect_majority(new_proposal)

    if |promises| < quorum_size:
        return  // Failed — someone else is leader or network partition

    // Merge all accepted entries from promises
    // For each slot, take the value with the highest proposal number
    merged_log = {}
    max_slot = 0

    for promise in promises:
        for slot, (proposal, command) in promise.accepted_entries:
            max_slot = max(max_slot, slot)
            if slot not in merged_log or proposal > merged_log[slot][0]:
                merged_log[slot] = (proposal, command)

    // Fill gaps with no-ops and re-replicate all uncertain entries
    for slot in range(self.commit_index + 1, max_slot + 1):
        if slot in merged_log:
            command = merged_log[slot][1]  // second element of the (proposal, command) pair
        else:
            command = NO_OP  // Fill the gap

        // Run Phase 2 for this slot with the new proposal number
        self.replicate_to_slot(slot, new_proposal, command)

    // Now we're caught up — start normal operation
    self.next_slot = max_slot + 1
    self.role = Leader
    self.start_heartbeats()

The no-op commands deserve special attention. When the new leader finds a gap in the log — a slot where no acceptor reports an accepted value — it must still fill that slot before it can advance the commit index past it. A no-op is a command that, when applied to the state machine, does nothing. But it must still go through the full Paxos Accept phase to be committed.

This means leader takeover latency scales with the number of uncommitted slots. If the old leader was pipelining aggressively and had 100 outstanding uncommitted entries when it failed, the new leader must re-replicate all 100 of them before it can process new client requests. This is one of the reasons that tuning the pipeline depth involves a tradeoff between throughput and recovery time.
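A back-of-envelope model makes the tradeoff visible. The assumptions here are mine, not from any paper: Phase 1 costs one round trip, and Phase 2 re-replicates the uncommitted slots some number at a time (the new leader may itself pipeline during recovery).

```python
def takeover_latency_ms(uncommitted, rtt_ms, recovery_pipeline=1):
    """Rough model: one round trip for Phase 1, then the uncommitted
    slots re-replicated recovery_pipeline at a time in Phase 2."""
    phase1 = rtt_ms
    phase2_rounds = -(-uncommitted // recovery_pipeline)  # ceiling division
    return phase1 + phase2_rounds * rtt_ms

# 100 outstanding entries, 1ms RTT, strictly sequential recovery:
print(takeover_latency_ms(100, 1.0))      # → 101.0
# Re-replicating 10 slots per round trip cuts it to ~11ms:
print(takeover_latency_ms(100, 1.0, 10))  # → 11.0
```

Deeper pipelines buy steady-state throughput at the price of a longer window like this one on every leader change.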

Pipelining in Detail

Pipelining is essential for throughput. Without it, the leader must wait for each Accept to be committed before sending the next one. With pipelining, the leader can have multiple outstanding uncommitted entries:

class PipelinedLeader:
    state:
        pipeline_window: int = 50  // max outstanding uncommitted entries
        pending_commits: Map<SlotNumber, PendingEntry>

    function handle_client_request(command: Command):
        // Check pipeline backpressure
        while |self.pending_commits| >= self.pipeline_window:
            wait_for_any_commit()

        slot = self.next_slot
        self.next_slot += 1

        entry = PendingEntry {
            slot: slot,
            command: command,
            accepted_by: {self.my_id},
            client_callback: current_client_callback
        }
        self.pending_commits[slot] = entry

        // Send Accept — don't wait for response
        broadcast_accept(self.proposal_number, slot, command)

    function on_accepted(slot, from):
        if slot in self.pending_commits:
            self.pending_commits[slot].accepted_by.add(from)

            if |self.pending_commits[slot].accepted_by| >= quorum_size:
                self.try_advance_commit_index()

    function try_advance_commit_index():
        // Commit entries in order
        while (self.commit_index + 1) in self.pending_commits and
              |self.pending_commits[self.commit_index + 1].accepted_by| >= quorum_size:
            self.commit_index += 1
            entry = self.pending_commits.remove(self.commit_index)
            result = apply_to_state_machine(entry.command)
            entry.client_callback(result)

Note the crucial detail: entries are committed in order. Even if slot 5 gets a majority before slot 4, you cannot commit slot 5 until slot 4 is committed. The client callback for slot 5 must wait. This preserves the linearizability of the replicated log.

Batching

In practice, you combine pipelining with batching: instead of one command per Accept message, you pack multiple commands into a single message.

function batch_and_send():
    // Collect commands for up to BATCH_TIMEOUT or BATCH_SIZE
    batch = []
    deadline = now() + BATCH_TIMEOUT  // e.g., 1ms

    while |batch| < MAX_BATCH_SIZE and now() < deadline:
        if command = try_dequeue_request(deadline - now()):
            batch.append(command)

    if |batch| > 0:
        slot = self.next_slot
        self.next_slot += 1
        broadcast_accept(self.proposal_number, slot, batch)

Batching introduces a tradeoff between latency and throughput. A 1ms batching window means the first command in each batch waits up to 1ms before being sent. But if you’re processing 10,000 commands per second, each batch contains about 10 commands, and you’ve reduced your message overhead by 10x.
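The arithmetic above generalizes to a simple model — a sketch that ignores queueing effects, with invented names:

```python
def batch_stats(rate_per_sec, window_ms, max_batch):
    """Expected batch size, message reduction factor, and worst-case
    added latency for a fixed batching window (naive arithmetic model)."""
    expected = min(rate_per_sec * window_ms / 1000.0, max_batch)
    reduction = max(expected, 1.0)  # vs. one message per command
    worst_case_added_latency_ms = window_ms
    return expected, reduction, worst_case_added_latency_ms

# 10,000 commands/sec with a 1ms window:
print(batch_stats(10_000, 1.0, 100))  # → (10.0, 10.0, 1.0)
```

Adaptive batching effectively tunes window_ms at runtime: shrink it when load is light (latency matters), grow it when load is heavy (throughput matters).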

Google’s Spanner uses batching extensively. Amazon’s DynamoDB Paxos implementation reportedly uses adaptive batching that adjusts the window based on current load.

Comparison: Multi-Paxos vs. What Google Described

| Aspect | Academic Multi-Paxos | Google’s Implementation (per “Paxos Made Live”) |
|---|---|---|
| Leader election | Unspecified | Lease-based with 10s lease period |
| Log gaps | Allowed | Filled with no-ops on leader change |
| Reads | Through Paxos | Leader leases for local reads |
| Persistence | “Write to stable storage” | Checksummed WAL with corruption detection |
| Snapshots | Not discussed | Custom snapshot format with incremental support |
| Membership | “Exercise for the reader” | Complex migration protocol, single most difficult part |
| Testing | Prove it correct | 2000+ failure injection tests |
| Performance | Analyze message complexity | Batching, pipelining, flow control, congestion avoidance |

The right column is approximately 10x more work than the left column. This is the gap.

Why Everyone Builds Something Different

In 2014, a group of researchers surveyed several Paxos implementations and found that they all differed in significant ways. The paper, “Paxos Quorum Leases,” noted that systems like Chubby, Spanner, Megastore, and ZooKeeper (which uses Zab, a Paxos variant) all made different choices about:

  • How to handle reads
  • How to manage leader leases
  • How to compact the log
  • How to handle membership changes
  • How to batch and pipeline requests

The reason is simple: Multi-Paxos, as described in the literature, is not a complete system design. It is a collection of optimization ideas applied to a theoretical protocol. To build a real system, you must make dozens of engineering decisions that the theory doesn’t address, and different systems face different requirements that lead to different decisions.

This is not necessarily a problem — different systems SHOULD make different tradeoffs. But it means that “we use Multi-Paxos” is not a specification. It’s a statement of philosophy, roughly equivalent to “we use a leader-based consensus protocol inspired by Lamport’s work.” The details — which is where all the bugs and performance characteristics live — are entirely implementation-specific.

The Legacy of Multi-Paxos

Multi-Paxos is the workhorse of distributed consensus. It powers (or powered) Google’s Chubby, Spanner, Megastore, and Bigtable. It powers (in spirit) numerous other systems that use the stable-leader optimization regardless of whether they call it “Multi-Paxos.”

Its legacy is also one of frustration. Generations of engineers have stared at Lamport’s papers, understood the single-decree protocol, and then discovered that building a system requires solving a hundred problems the papers don’t address. The popularity of Raft (Chapter 8) is largely a response to this frustration — not because Raft solves a fundamentally different problem, but because Raft’s specification is a system design, not just a protocol sketch.

But Multi-Paxos also left a positive legacy: it established the blueprint that all leader-based consensus protocols follow. Stable leader, log replication, snapshotting, membership changes — these concepts, first explored (however informally) in the Multi-Paxos context, became the standard vocabulary of the field. Every protocol in the rest of Part II owes a debt to this protocol that was never fully specified but somehow became the foundation of modern distributed systems.

The gap between paper and production is not unique to Paxos. It exists for every distributed systems paper. But Paxos is where the gap was first discovered, first documented, and first lamented. And for that, we should be grateful — even as we curse it.

Viewstamped Replication: The One Nobody Reads

In the pantheon of consensus algorithms, Viewstamped Replication occupies a peculiar position: it is older than Paxos (in some formulations), more practical than Paxos (in its original description), more complete than Paxos (as a system design), and less famous than Paxos by approximately three orders of magnitude.

Brian Oki and Barbara Liskov published “Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems” in 1988 — a year before Lamport’s first submission of “The Part-Time Parliament.” The protocol describes a complete replicated state machine: normal operation, leader failure, leader election, and recovery of crashed replicas. It does all of this in a single coherent paper, without fictional Greek islands, without separating the work into a dozen under-specified follow-up papers, and without leaving the reader to invent their own membership change protocol.

And yet, if you ask a room of distributed systems engineers what consensus protocol they know, ninety percent will say Paxos, nine percent will say Raft, and the remaining one percent will say “Viewstamped what?”

This chapter is for the one percent, and for the ninety-nine percent who should know better.

Historical Context: The Naming War

Why did Paxos win the naming war? Several reasons, none of them technical:

Lamport’s stature. By the time Paxos was published (1998, though written in 1989), Lamport was already a towering figure in distributed systems. His name carried weight. Oki and Liskov were respected, but Paxos became associated with Lamport’s broader body of work on distributed systems, including logical clocks and the Byzantine generals problem.

Paxos came to symbolize the problem. Through historical accident and persistent citation, “Paxos” became synonymous with “consensus protocol” in the way that “Kleenex” became synonymous with “tissue.” When people said “Paxos,” they often meant “any leader-based crash-fault-tolerant consensus protocol.”

Google. When Google published papers about Chubby, Spanner, and Megastore in the 2000s, they described their systems as using Paxos. This cemented Paxos as the industrial standard. If Google had happened to describe their systems as using Viewstamped Replication, the landscape would look very different.

The TCS/systems divide. Paxos was embraced by the theoretical computer science community, which values elegance and minimality. VR was designed by systems researchers, who value completeness and practicality. The TCS community writes more papers, and those papers get cited more, creating a self-reinforcing citation network.

The result is that Viewstamped Replication is the Betamax of consensus protocols: arguably better in several ways, but history doesn’t care about “arguably better.”

The VR System Model

VR assumes:

  • A set of 2f+1 replicas, tolerating f crash failures.
  • Asynchronous network with reliable delivery (messages may be delayed but not lost). In practice, VR implementations add retransmission, but the protocol description assumes eventual delivery.
  • Non-Byzantine faults only: replicas either follow the protocol or crash.
  • Replicas have persistent storage (for recovery).

The system provides a replicated state machine service to clients. Clients send requests to the primary (leader), and the primary coordinates replication before responding.

Core Concepts

VR organizes time into views. Each view has a designated primary (the leader). The primary for view v is typically replica v mod n, where n is the number of replicas. When the primary fails, the system moves to a new view with a new primary. This deterministic leader selection is a notable simplification — there’s no election campaign, no randomized timeout. Everyone knows who the leader should be for any given view number.

Each request is assigned a view-stamp: a pair (view-number, op-number). The view-number identifies which primary issued the request, and the op-number is the sequence number within that view. View-stamps are totally ordered: compare first by view-number, then by op-number.

struct ViewStamp:
    view_number: int
    op_number: int

    function compare(other: ViewStamp) -> int:
        if self.view_number != other.view_number:
            return self.view_number - other.view_number
        return self.op_number - other.op_number
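The struct above translates directly into runnable code. This sketch also includes the deterministic primary rule (replica v mod n), assuming replicas are numbered 0 through n-1:

```python
from functools import total_ordering

@total_ordering
class ViewStamp:
    """Totally ordered by (view_number, op_number)."""
    def __init__(self, view_number, op_number):
        self.view_number = view_number
        self.op_number = op_number

    def _key(self):
        return (self.view_number, self.op_number)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

def primary_for(view_number, num_replicas):
    """Deterministic leader selection: no election campaign needed."""
    return view_number % num_replicas

# Anything from view 3 orders after everything from view 2:
assert ViewStamp(2, 99) < ViewStamp(3, 1)
assert primary_for(7, 5) == 2  # view 7 with 5 replicas → replica 2
```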

Normal Operation

Normal operation in VR is clean and efficient. Here is the full protocol for processing a client request:

Client          Primary (P)         Backup 1 (B1)      Backup 2 (B2)
  |                 |                    |                   |
  |-- Request(op, client-id, req-num) ->|                   |
  |                 |                    |                   |
  |                 |-- Prepare(v, op-num, op, commit-num) ->|
  |                 |-- Prepare(v, op-num, op, commit-num) ---------->|
  |                 |                    |                   |
  |                 |<-- PrepareOK(v, op-num, replica-id) ---|
  |                 |<-- PrepareOK(v, op-num, replica-id) ------------|
  |                 |                    |                   |
  |                 |  (Got f PrepareOKs — can commit)      |
  |                 |                    |                   |
  |<-- Reply(v, req-num, result) --------|                   |
  |                 |                    |                   |

Let’s walk through each step in detail.

Step 1: Client Sends Request

The client sends Request(operation, client-id, request-number) to the primary. The request-number is a monotonically increasing number that the client assigns to each request. This is critical for exactly-once semantics — the primary uses the (client-id, request-number) pair to detect duplicate requests.

Step 2: Primary Prepares

The primary assigns the next op-number, appends the operation to its log, and sends Prepare(view-number, op-number, operation, commit-number) to all backups.

The commit-number is piggybacked on the Prepare message — it tells the backups the latest operation that has been committed (accepted by f+1 replicas). This is how backups learn about commits without a separate message.

Step 3: Backups Respond

Each backup, upon receiving the Prepare:

  1. Checks that the view-number matches its current view (otherwise discards the message).
  2. Checks that the op-number is the expected next op-number (VR requires contiguous logs — no gaps).
  3. Appends the operation to its log.
  4. Updates its commit-number based on the piggybacked value from the primary.
  5. Sends PrepareOK(view-number, op-number, replica-id) to the primary.

Step 4: Primary Commits

When the primary receives PrepareOK from f backups (plus itself, that’s f+1 total), the operation is committed. The primary:

  1. Applies the operation to the state machine.
  2. Sends Reply(view-number, request-number, result) to the client.
  3. Increments its commit-number (which will be sent with the next Prepare).

class VRPrimary:
    state:
        view_number: int
        op_number: int = 0
        commit_number: int = 0
        log: List<LogEntry> = []
        client_table: Map<ClientId, (RequestNumber, Reply)>

    function on_client_request(op, client_id, request_num):
        // Check for duplicate request
        if client_id in self.client_table:
            last_req, last_reply = self.client_table[client_id]
            if request_num <= last_req:
                // Duplicate — resend cached reply
                send_to_client(client_id, last_reply)
                return

        // Assign op-number and append to log
        self.op_number += 1
        entry = LogEntry {
            view: self.view_number,
            op_num: self.op_number,
            operation: op,
            client_id: client_id,
            request_num: request_num,
            acks: {}  // PrepareOKs from backups; self is counted separately
        }
        self.log.append(entry)

        // Send Prepare to all backups
        for backup in self.backups:
            send(backup, Prepare {
                view: self.view_number,
                op_num: self.op_number,
                operation: op,
                commit_num: self.commit_number
            })

    function on_prepare_ok(view, op_num, from_replica):
        if view != self.view_number:
            return  // Stale message

        self.log[op_num].acks.add(from_replica)

        // Check if we can commit this and any subsequent entries
        while self.commit_number < self.op_number:
            next = self.commit_number + 1
            if |self.log[next].acks| + 1 >= f + 1:  // +1 for self
                self.commit_number = next
                result = self.state_machine.apply(self.log[next].operation)

                // Update client table and respond
                entry = self.log[next]
                reply = Reply {
                    view: self.view_number,
                    request_num: entry.request_num,
                    result: result
                }
                self.client_table[entry.client_id] = (entry.request_num, reply)
                send_to_client(entry.client_id, reply)
            else:
                break  // Can't commit yet — waiting for acks

Backup Logic

class VRBackup:
    state:
        view_number: int
        op_number: int = 0
        commit_number: int = 0
        log: List<LogEntry> = []
        client_table: Map<ClientId, (RequestNumber, Reply)>

    function on_prepare(view, op_num, operation, commit_num):
        if view != self.view_number:
            return  // Wrong view — ignore

        if op_num != self.op_number + 1:
            // Gap in log — we missed messages
            // Request state transfer from primary
            self.request_state_transfer()
            return

        // Append to log
        self.op_number = op_num
        self.log.append(LogEntry {
            view: view,
            op_num: op_num,
            operation: operation
        })

        // Apply any newly committed operations
        while self.commit_number < commit_num:
            self.commit_number += 1
            result = self.state_machine.apply(self.log[self.commit_number].operation)
            // Update client table
            entry = self.log[self.commit_number]
            self.client_table[entry.client_id] = (entry.request_num, result)

        // Acknowledge
        send(primary, PrepareOK {
            view: self.view_number,
            op_num: op_num,
            replica_id: self.my_id
        })

View Changes: When the Primary Fails

This is where VR shines. The view change protocol is arguably the clearest leader-change protocol in the consensus literature. It’s specified completely, with no hand-waving about “use a failure detector” or “left as an exercise.”

Triggering a View Change

A backup suspects the primary has failed when it hasn’t heard from it for a timeout period. When this happens:

  1. The backup increments its view-number to v+1.
  2. It stops accepting Prepare messages from the old primary.
  3. It sends a StartViewChange(v+1, replica-id) message to all other replicas.

Phase 1: StartViewChange

When a replica receives f StartViewChange messages for view v+1 (from f different replicas), it sends a DoViewChange message to the new primary (replica (v+1) mod n) containing:

  • Its log
  • Its current view-number
  • The view-number of the last normal view it participated in (the “last normal view”)
  • Its op-number and commit-number

function on_start_view_change(new_view, from_replica):
    if new_view < self.view_number or
       (new_view == self.view_number and self.status == NORMAL):
        return  // Stale — we're already in this view or a newer one

    self.status = VIEW_CHANGE
    self.view_number = new_view
    self.start_view_change_count[new_view].add(from_replica)

    // Also count our own StartViewChange
    if self.my_id not in self.start_view_change_count[new_view]:
        self.start_view_change_count[new_view].add(self.my_id)
        broadcast(StartViewChange {
            view: new_view,
            replica_id: self.my_id
        })

    if |self.start_view_change_count[new_view]| >= f + 1:  // f from others, plus our own
        // Send DoViewChange to the new primary
        new_primary = new_view mod self.num_replicas
        send(new_primary, DoViewChange {
            view: new_view,
            log: self.log,
            last_normal_view: self.last_normal_view,
            op_number: self.op_number,
            commit_number: self.commit_number,
            replica_id: self.my_id
        })

Phase 2: DoViewChange

The new primary collects f+1 DoViewChange messages (including its own) and selects the “best” log — the one from the replica with the highest last-normal-view, breaking ties by highest op-number. This is the log that becomes the new primary’s log.

function on_do_view_change(msg):
    self.do_view_change_msgs[msg.view].append(msg)

    if |self.do_view_change_msgs[msg.view]| >= f + 1:
        // Select the best log
        best = null
        for dvc in self.do_view_change_msgs[msg.view]:
            if best == null or
               dvc.last_normal_view > best.last_normal_view or
               (dvc.last_normal_view == best.last_normal_view and
                dvc.op_number > best.op_number):
                best = dvc

        // Install the best log
        self.log = best.log
        self.op_number = best.op_number
        self.view_number = msg.view
        self.last_normal_view = msg.view
        self.status = NORMAL

        // Determine the highest commit number from all DoViewChange messages
        max_commit = max(dvc.commit_number for dvc in self.do_view_change_msgs[msg.view])

        // Commit any uncommitted operations up to max_commit
        while self.commit_number < max_commit:
            self.commit_number += 1
            self.state_machine.apply(self.log[self.commit_number].operation)

        // Announce the new view to all replicas
        broadcast(StartView {
            view: msg.view,
            log: self.log,
            op_number: self.op_number,
            commit_number: self.commit_number
        })

Phase 3: StartView

When backups receive a StartView message, they:

  1. Update their log to match the new primary’s log.
  2. Update their view-number, op-number, and commit-number.
  3. Apply any newly committed operations.
  4. Send PrepareOK messages for any uncommitted operations in the new log (so the new primary can commit them).
  5. Resume normal operation in the new view.

function on_start_view(msg):
    self.log = msg.log
    self.view_number = msg.view
    self.op_number = msg.op_number
    self.last_normal_view = msg.view
    self.status = NORMAL

    // Apply any newly committed operations
    while self.commit_number < msg.commit_number:
        self.commit_number += 1
        self.state_machine.apply(self.log[self.commit_number].operation)

    // Send PrepareOK for uncommitted operations
    // so new primary can commit them
    new_primary = msg.view mod self.num_replicas
    for i in range(msg.commit_number + 1, msg.op_number + 1):
        send(new_primary, PrepareOK {
            view: msg.view,
            op_num: i,
            replica_id: self.my_id
        })

The view change takes three message delays in the common case: StartViewChange messages fan out, DoViewChange messages converge on the new primary, and StartView fans out again. The total message count is O(n^2) in the worst case (because every replica sends StartViewChange to every other replica in Phase 1), but Phases 2 and 3 are O(n): the new primary collects the DoViewChange messages and then broadcasts a single StartView.

Recovery: Bringing Back a Crashed Replica

VR also specifies how a replica recovers after a crash. This is another area where VR is more complete than Paxos, which simply says “read your state from disk and resume.”

VR’s recovery protocol ensures that a recovering replica gets a consistent state even if its disk is unreliable (or even if it has no disk at all — the VR Revisited paper discusses diskless operation).

function recover():
    // Generate a unique nonce to prevent replay
    nonce = generate_random_nonce()

    // Ask all replicas for their current state
    broadcast(Recovery { replica_id: self.my_id, nonce: nonce })

    // Wait for f+1 responses, including one from the current primary
    responses = collect_recovery_responses(nonce)

    // The primary's response includes the full log and state
    primary_response = find_primary_response(responses)

    self.view_number = primary_response.view
    self.log = primary_response.log
    self.op_number = primary_response.op_number
    self.commit_number = primary_response.commit_number
    self.last_normal_view = primary_response.view

    // Replay committed operations to rebuild state machine
    for i in range(1, self.commit_number + 1):
        self.state_machine.apply(self.log[i].operation)

    self.status = NORMAL

The nonce is important: it prevents a recovering replica from accepting stale Recovery responses from a previous recovery attempt. Without it, a replica that crashes and recovers twice might accept a response from the first recovery during the second recovery, potentially getting stale state.
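To make the stale-response hazard concrete, here is a minimal Python sketch of the collection step. The function names mirror the pseudocode above, but the dict-based message shape and the way f is passed in are assumptions for illustration, not the paper's wire format:

```python
import secrets

def generate_random_nonce() -> int:
    # 128 bits of randomness: collisions across recovery attempts
    # are effectively impossible
    return secrets.randbits(128)

def collect_recovery_responses(responses, nonce, f):
    """Keep only responses that echo the nonce of THIS recovery
    attempt; responses left over from an earlier attempt are dropped."""
    fresh = [r for r in responses if r["nonce"] == nonce]
    if len(fresh) < f + 1:
        return None  # not enough fresh responses yet; keep waiting
    return fresh

# A replica recovering twice: responses to the first attempt must not
# count toward the second attempt's quorum.
old_nonce, new_nonce = 1, 2
stale = [{"nonce": old_nonce, "view": 3}, {"nonce": old_nonce, "view": 3}]
fresh = [{"nonce": new_nonce, "view": 7}, {"nonce": new_nonce, "view": 7}]
assert collect_recovery_responses(stale + fresh, new_nonce, f=1) == fresh
```

Without the nonce filter, the two stale responses alone would have satisfied the f+1 quorum and handed the recovering replica state from view 3 instead of view 7.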

VR vs. Multi-Paxos: A Detailed Comparison

Let us be direct: VR and Multi-Paxos are more similar than different. Both are leader-based protocols that replicate a log of commands. Both require a majority quorum for progress. Both handle leader failure through a view change / leader election mechanism.

The differences are in the details, and they matter:

| Aspect | Multi-Paxos | Viewstamped Replication |
|---|---|---|
| Leader selection | Unspecified (whoever wins Phase 1) | Deterministic: replica v mod n for view v |
| Log gaps | Allowed | Not allowed (contiguous log) |
| Normal operation messages | Accept / Accepted | Prepare / PrepareOK (different names, same semantics) |
| Leader change | Run Phase 1 with higher proposal number | View change protocol with explicit phases |
| Recovery | Read from disk and resume | Explicit recovery protocol with nonce |
| Specification completeness | Protocol only | Complete system (including client interaction) |
| Commit notification | Unspecified | Piggybacked on next Prepare |
| Membership changes | Exercise for reader | Discussed (briefly) in VR Revisited |

The Naming Confusion

The names are unfortunate. VR calls its normal-operation messages “Prepare” and “PrepareOK,” which in Paxos terminology means something completely different (Phase 1). VR’s “Prepare” is equivalent to Paxos Phase 2 “Accept.” This naming collision has confused every person who has tried to learn both protocols, and it will continue to confuse people until the heat death of the universe.

For clarity, here is the mapping:

| VR Term | Paxos Equivalent |
|---|---|
| Prepare (normal operation) | Accept (Phase 2) |
| PrepareOK | Accepted |
| StartViewChange + DoViewChange | Prepare + Promise (Phase 1) |
| StartView | (New leader announces itself — no direct equivalent) |
| View | Ballot / Proposal number |
| Primary | Leader / Distinguished proposer |

What VR Gets Right

Deterministic leader selection. VR’s “replica v mod n” approach eliminates the need for a separate leader election protocol. Everyone knows who the leader should be. If the leader fails, everyone knows who the next leader should be. This removes an entire class of bugs related to multiple nodes believing they are the leader simultaneously.

There is a downside: if the designated leader for view v+1 is also down, the system must quickly detect this and move to view v+2. In practice, this means the view change protocol needs good timeout tuning — too aggressive and you skip over healthy leaders, too conservative and you waste time trying to contact dead ones.
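The deterministic rule, and the need to skip past a dead designated leader, can be sketched in a few lines of Python. Note that next_live_view is purely illustrative: in the real protocol each skipped view is a full timeout-driven view change, not a local computation:

```python
def primary_for_view(view: int, num_replicas: int) -> int:
    # VR's rule: the primary of view v is replica v mod n.
    return view % num_replicas

def next_live_view(view: int, num_replicas: int, suspected_dead) -> int:
    """Advance past views whose designated primary is suspected dead.
    Illustrative only: in VR, every skip costs a timeout plus a
    view change round."""
    v = view + 1
    while primary_for_view(v, num_replicas) in suspected_dead:
        v += 1
    return v

# 5 replicas, currently in view 7: the primary is replica 7 mod 5 = 2.
assert primary_for_view(7, 5) == 2
# If replica 3 (the designated primary of view 8) is also down,
# the system settles on view 9, led by replica 4.
assert next_live_view(7, 5, suspected_dead={3}) == 9
```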

No log gaps. VR’s insistence on a contiguous log (like Raft, and unlike Multi-Paxos) simplifies everything downstream: commit tracking, state machine application, snapshotting, follower catchup. You never need to fill gaps with no-ops, because gaps don’t exist.

Integrated client interaction. VR specifies how clients interact with the system, including exactly-once semantics via the client table. This is a crucial part of any real system that Paxos papers don’t address.

Complete recovery protocol. VR specifies how to bring back a crashed replica. Paxos says “read from disk.” VR says “here is a protocol that works even if the disk is unreliable.”

The VR Revisited Paper

In 2012, Barbara Liskov and James Cowling published “Viewstamped Replication Revisited,” which updated the original protocol with several improvements:

  1. Cleaner presentation. The original 1988 paper was written in a different academic style and used different terminology. The revisited version is more accessible.

  2. Explicit state transfer. The revisited version clearly specifies how a lagging replica catches up, including snapshot-based state transfer.

  3. Reconfiguration. The revisited version includes a reconfiguration protocol for changing the replica set. This uses an “epoch” mechanism where the old configuration and new configuration coordinate through a special reconfiguration request.

  4. Optimizations. The revisited version discusses batching, pipelining, and other performance optimizations.

The VR Revisited paper is, in some ways, the paper that Multi-Paxos never got — a complete, self-contained description of a replicated state machine protocol with all the practical details included. It is 20 pages long and covers everything from normal operation to recovery to reconfiguration. Compare this with Multi-Paxos, which is spread across half a dozen papers, none of which tells the complete story.

Performance Analysis

In normal operation (stable leader, no failures), VR and Multi-Paxos have identical performance characteristics:

  • Message complexity per operation: 2(n-1) messages — the primary sends Prepare to n-1 backups and receives n-1 PrepareOK responses. (In practice, you only need f responses, so you might send to all but only wait for f.)
  • Round trips: 1 (Prepare/PrepareOK, equivalent to Accept/Accepted).
  • Fsyncs per operation: 1 on the primary, 1 on each backup that persists before responding.
  • Latency: Network RTT + fsync time.

View changes add latency:

  • StartViewChange phase: 1 message delay (fan-out to all replicas).
  • DoViewChange phase: 1 message delay (converge on new primary).
  • StartView phase: 1 message delay (fan-out from new primary).
  • Total view change latency: Approximately 3 message delays + processing time.

This is comparable to Raft’s leader election latency (1-2 election timeouts + message delays) and Multi-Paxos’s Phase 1 latency (1 round trip for Prepare/Promise).

Implementation Considerations

If you’re considering implementing VR, here are the things the paper glosses over:

Timeout Tuning

The view change is triggered by a timeout. How long should this timeout be? Too short and you get false positives — you change views when the primary is just slow, not dead. Too long and the system is unavailable for the entire timeout period after a real failure.

The standard approach is:

  • Heartbeat interval: 50-100ms
  • Election timeout: 3-10x the heartbeat interval
  • Randomize the election timeout to avoid synchronized view changes

But “3-10x” is a wide range, and the right value depends on your network characteristics, your latency requirements, and how much you trust your failure detector. In a LAN environment, heartbeats every 50ms and a 300ms election timeout work well. In a WAN environment, you might need 500ms heartbeats and 5-second election timeouts.
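As a sketch, picking a randomized timeout from that rule of thumb might look like the following. The 3x-10x multipliers are the heuristic above, not a value the protocol mandates:

```python
import random

def election_timeout(heartbeat_ms: float, lo_mult: float = 3.0,
                     hi_mult: float = 10.0, rng=random) -> float:
    """Pick a randomized election timeout as a multiple of the
    heartbeat interval. Randomization desynchronizes view changes
    across replicas."""
    return rng.uniform(lo_mult * heartbeat_ms, hi_mult * heartbeat_ms)

rng = random.Random(42)
t_lan = election_timeout(50, rng=rng)    # LAN-ish: 50ms heartbeats
assert 150 <= t_lan <= 500
t_wan = election_timeout(500, rng=rng)   # WAN-ish: 500ms heartbeats
assert 1500 <= t_wan <= 5000
```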

Exactly-Once Semantics

VR’s client table provides exactly-once semantics: if a client retries a request, the primary detects the duplicate and returns the cached response. But the client table grows without bound unless you have a mechanism to garbage-collect old entries.

The standard approach is to require clients to include the request number of their last completed request in each new request, allowing the server to discard table entries for older requests. This works, but it requires careful coordination between client and server, and it breaks down if clients crash and restart with a new identity.

function gc_client_table():
    for client_id, (req_num, reply) in self.client_table:
        // Client's latest request implicitly acknowledges all previous
        if req_num < client_latest_ack[client_id]:
            delete self.client_table[client_id]

State Transfer

When a new replica joins or a crashed replica recovers far behind the current state, it needs a full state transfer. VR Revisited describes this but doesn’t specify the transfer mechanism in detail.

In practice, you need:

  • A consistent snapshot of the state machine
  • The log suffix from the snapshot point to the current op-number
  • A way to send this over the network without blocking normal operation

This is the same problem as Raft’s InstallSnapshot and Multi-Paxos’s snapshot transfer. Everyone solves it, everyone finds it more annoying than expected.
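A minimal sketch of the payload implied by that list, assuming a 0-indexed in-memory log. VR Revisited does not pin down a wire format, so the names here are invented:

```python
from dataclasses import dataclass, field

@dataclass
class StateTransfer:
    """What a lagging replica needs: a snapshot plus the log suffix."""
    snapshot: bytes           # consistent snapshot of the state machine
    snapshot_op_number: int   # op-number the snapshot reflects
    log_suffix: list = field(default_factory=list)  # entries after the snapshot

def build_state_transfer(state_bytes, log, snapshot_op, current_op):
    # Send the snapshot plus only the log entries past the snapshot point,
    # rather than replaying the whole history.
    suffix = log[snapshot_op:current_op]  # log assumed 0-indexed
    return StateTransfer(state_bytes, snapshot_op, suffix)

xfer = build_state_transfer(b"...state...", ["op1", "op2", "op3", "op4"],
                            snapshot_op=2, current_op=4)
assert xfer.log_suffix == ["op3", "op4"]
```

Sending this without blocking normal operation is the hard part the sketch omits; in practice it means streaming from an immutable snapshot while the primary keeps accepting requests.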

Disk Persistence

VR allows replicas to operate without disk persistence, relying on the recovery protocol to rebuild state from other replicas (VR Revisited makes this diskless mode explicit). This is attractive — fsyncs are expensive and eliminating them from the critical path improves latency dramatically.

However, diskless operation has a major limitation: if f+1 replicas fail simultaneously (even briefly), the system can lose committed data. With disk persistence, replicas can recover independently by reading their state from disk. Without it, they depend on other replicas being available.

In practice, most VR implementations use disk persistence for the log (just like Multi-Paxos and Raft), and use the recovery protocol as a backup for cases where the disk is corrupted or the replica is being replaced entirely.

Why VR Deserves More Attention

It is fashionable to say that Raft made consensus understandable. And Raft did excellent work on pedagogy — the paper is well-written, the visualization tools are helpful, and the design was explicitly optimized for comprehensibility.

But VR Revisited, published two years before the Raft paper, is also comprehensible. It’s also a complete system design. It also avoids log gaps. It also has a clean leader change protocol. The main differences between VR and Raft are:

  1. Raft uses randomized election timeouts; VR uses deterministic leader selection.
  2. Raft’s leader completeness property (the leader always has the most complete log) is enforced during election; VR transfers the best log during view change.
  3. Raft was accompanied by an excellent pedagogy campaign including videos, visualizations, and a reference implementation.

Point 3 is probably the most important. VR Revisited is a good paper. Raft is a good paper with a marketing campaign. In academia, as in industry, marketing matters.

If you are building a new system and choosing a consensus protocol, VR is a legitimate option. It has a smaller community and fewer reference implementations than Raft, which is a real practical disadvantage. But the protocol itself is sound, well-specified, and battle-tested (it’s the ancestor of protocols used in several production systems, even if they don’t acknowledge the lineage).

A Note on the Broader VR Family

VR influenced several later protocols:

  • PBFT (1999). Castro and Liskov’s Practical Byzantine Fault Tolerance extends VR’s view change mechanism to handle Byzantine faults. The view change protocol in PBFT is recognizably a descendant of VR’s.
  • Zab (2011). ZooKeeper Atomic Broadcast (Chapter 9) shares several design choices with VR, including the primary-backup model with ordered broadcasts and a view-based leader change mechanism.
  • Raft (2014). Raft’s design is more similar to VR than to Paxos, particularly in its insistence on a contiguous log and its clean separation of leader election from normal operation.

The irony is rich: VR’s ideas live on in protocols that are far more famous than VR itself. It is the Velvet Underground of consensus protocols — not many people implemented it directly, but everyone who did went on to build something influential.

Summary

Viewstamped Replication is a complete, practical, well-specified consensus protocol that predates Paxos and anticipates Raft. Its obscurity is a historical accident, not a reflection of its technical merit. If you are studying consensus algorithms, reading VR Revisited alongside the Raft paper is illuminating — the similarities highlight what is fundamental about leader-based consensus, and the differences highlight design choices that are matters of taste rather than correctness.

The main lesson of VR’s story is that in distributed systems, as in life, being first and being right is no guarantee of being remembered. You also have to be named after a Greek island, apparently.

Raft: Paxos for Humans (Mostly)

In 2014, Diego Ongaro and John Ousterhout published “In Search of an Understandable Consensus Algorithm,” and the world of distributed systems let out a collective sigh of relief. Finally, someone had said the quiet part out loud: Paxos was too hard to understand, and this was a problem.

Raft was designed with a single overriding goal: understandability. Not performance. Not generality. Not minimality. Understandability. Ongaro and Ousterhout’s thesis was that if you cannot understand a consensus algorithm, you cannot implement it correctly, and if you cannot implement it correctly, it doesn’t matter how elegant the theory is. This is a radical claim in a field that prizes theoretical elegance, and the fact that it needed to be made tells you something about the state of the field in 2014.

The result is a protocol that is genuinely easier to understand than Paxos. It is also a protocol that is genuinely harder to understand than its proponents sometimes suggest. This chapter covers both halves of that truth.

Design Philosophy: Decomposition Over Minimality

Raft’s key design decision was to decompose consensus into three relatively independent subproblems:

  1. Leader election — How do you pick a leader?
  2. Log replication — How does the leader replicate its log to followers?
  3. Safety — How do you ensure that the log stays consistent?

Paxos, by contrast, interleaves these concerns in a way that is theoretically minimal but pedagogically opaque. Raft’s decomposition means you can understand each piece independently and then see how they fit together. This is not just a pedagogical trick — it also makes the implementation modular in a way that Paxos is not.

The other major design decision was to reduce the state space wherever possible. Raft eliminates log gaps (unlike Multi-Paxos), uses randomized timeouts instead of a separate leader election protocol, and enforces the leader completeness property — the elected leader always has the most complete log, so the leader never needs to learn about committed entries it doesn’t have. Each of these constraints reduces the number of cases the implementation must handle.

Terms and Roles

Raft divides time into terms, each identified by a monotonically increasing integer. Each term begins with an election and (if the election succeeds) is followed by normal operation under a single leader. Terms act as a logical clock — they tell you whether the information you’re looking at is current or stale.
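The “logical clock” role of terms boils down to a three-way comparison that every RPC handler performs. A tiny Python sketch, with the returned strings as shorthand for the reactions the rest of the chapter spells out:

```python
def classify(msg_term: int, current_term: int) -> str:
    """Compare an incoming message's term against our own.
    This single comparison tells a server how to react."""
    if msg_term < current_term:
        return "stale"     # reject; the reply carries our newer term
    if msg_term > current_term:
        return "newer"     # step down to follower and adopt the term
    return "current"       # same term: process normally

assert classify(3, 5) == "stale"
assert classify(7, 5) == "newer"
assert classify(5, 5) == "current"
```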

There are three roles:

  • Leader — Handles all client requests, replicates log entries, sends heartbeats.
  • Follower — Passive. Responds to requests from leaders and candidates. If it doesn’t hear from a leader for a while, it becomes a candidate.
  • Candidate — A follower that is trying to become the leader by running an election.

Every server starts as a follower. This is the steady state for most servers most of the time.

Leader Election

Raft’s leader election uses randomized timeouts, and it is one of the protocol’s genuine contributions to the field.

The Mechanism

Each follower maintains an election timer. When the timer expires without hearing from a leader (via heartbeat or AppendEntries), the follower:

  1. Increments its current term.
  2. Transitions to candidate state.
  3. Votes for itself.
  4. Sends RequestVote to all other servers.

class RaftNode:
    // Persistent state (MUST survive restarts)
    persistent:
        current_term: int = 0
        voted_for: NodeId = null
        log: List<LogEntry> = []

    // Volatile state
    volatile:
        commit_index: int = 0
        last_applied: int = 0
        role: Leader | Follower | Candidate = Follower
        election_timer: Timer

    function on_election_timeout():
        self.role = Candidate
        self.current_term += 1
        self.voted_for = self.my_id
        persist(self.current_term, self.voted_for)

        self.votes_received = {self.my_id}  // Vote for self

        last_log_index = len(self.log)
        last_log_term = self.log[last_log_index].term if self.log else 0

        for server in self.all_servers:
            if server != self.my_id:
                send(server, RequestVote {
                    term: self.current_term,
                    candidate_id: self.my_id,
                    last_log_index: last_log_index,
                    last_log_term: last_log_term
                })

        // Reset election timer with new random timeout
        self.reset_election_timer()

Voting Rules

A server grants its vote if and only if:

  1. The candidate’s term is at least as large as the voter’s current term.
  2. The voter hasn’t already voted for someone else in this term.
  3. The candidate’s log is at least as up-to-date as the voter’s log.

Rule 3 is the election restriction and is crucial for safety. “At least as up-to-date” means: the candidate’s last log entry has a higher term than the voter’s last log entry, OR the terms are equal and the candidate’s log is at least as long.

function on_request_vote(msg):
    if msg.term < self.current_term:
        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: false
        })
        return

    if msg.term > self.current_term:
        self.step_down(msg.term)  // Become follower

    // Check if we can grant the vote
    can_vote = (self.voted_for == null or self.voted_for == msg.candidate_id)

    // Check log up-to-date-ness
    my_last_term = self.log[-1].term if self.log else 0
    my_last_index = len(self.log)

    log_ok = (msg.last_log_term > my_last_term or
              (msg.last_log_term == my_last_term and
               msg.last_log_index >= my_last_index))

    if can_vote and log_ok:
        self.voted_for = msg.candidate_id
        persist(self.current_term, self.voted_for)
        self.reset_election_timer()  // Grant implies reset

        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: true
        })
    else:
        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: false
        })

function step_down(new_term):
    self.current_term = new_term
    self.role = Follower
    self.voted_for = null
    persist(self.current_term, self.voted_for)
    self.reset_election_timer()

Why Randomized Timeouts Work

The election timeout is chosen randomly from a range, typically [150ms, 300ms]. This randomization serves two purposes:

  1. Breaks symmetry. Without randomization, all followers would time out simultaneously and split the vote. With randomization, one follower typically times out first and wins the election before others even start.

  2. Avoids livelock. If two candidates keep splitting the vote, the random timeouts ensure that eventually one will start its election early enough to win before the other starts.

This is a much simpler solution to the leader election problem than Paxos’s approach (which doesn’t really have one) or VR’s deterministic rotation (which requires a separate mechanism to skip over failed leaders). The downside is that it’s probabilistic — in theory, you could get unlucky and have repeated split votes. In practice, this essentially never happens because the probability drops exponentially with each round.
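A quick Monte Carlo run makes the split-vote claim concrete. The win_window parameter (how long the first candidate needs to gather votes before a rival wakes up) is a made-up stand-in for real network and processing delays:

```python
import random

def split_vote_prob(n_candidates, timeout_lo=150, timeout_hi=300,
                    win_window=20, trials=20000, seed=1):
    """Estimate how often the earliest candidate fails to win outright
    because a second candidate times out within win_window ms of it.
    All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    splits = 0
    for _ in range(trials):
        timeouts = sorted(rng.uniform(timeout_lo, timeout_hi)
                          for _ in range(n_candidates))
        if timeouts[1] - timeouts[0] < win_window:
            splits += 1
    return splits / trials

# With the paper's [150ms, 300ms] range, a contested round is already
# a minority outcome, and each retry re-rolls the timeouts, so the
# chance of k consecutive split votes decays exponentially in k.
p = split_vote_prob(2)
assert 0 < p < 0.5
```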

Handling the Election Response

function on_vote_response(msg):
    if msg.term > self.current_term:
        self.step_down(msg.term)
        return

    if self.role != Candidate or msg.term != self.current_term:
        return  // Stale response

    if msg.vote_granted:
        self.votes_received.add(msg.from)

        if |self.votes_received| > len(self.all_servers) / 2:
            // Won the election!
            self.become_leader()

function become_leader():
    self.role = Leader

    // Initialize leader state
    for server in self.all_servers:
        self.next_index[server] = len(self.log) + 1
        self.match_index[server] = 0

    // Send initial empty AppendEntries (heartbeat) to assert leadership
    self.send_heartbeats()

    // Optionally: append a no-op entry to commit entries from previous terms
    // (This is an important optimization discussed later)

Log Replication

Once a leader is elected, it handles all client requests. Each request is appended to the leader’s log and replicated to followers via AppendEntries RPCs.

AppendEntries: The Workhorse

AppendEntries serves double duty: it replicates log entries AND serves as a heartbeat (when sent with no entries). The leader sends it periodically to all followers.

function leader_send_append_entries(follower_id):
    prev_log_index = self.next_index[follower_id] - 1
    prev_log_term = self.log[prev_log_index].term if prev_log_index > 0 else 0

    // Entries to send: everything from next_index onward
    entries = self.log[self.next_index[follower_id]:]

    send(follower_id, AppendEntries {
        term: self.current_term,
        leader_id: self.my_id,
        prev_log_index: prev_log_index,
        prev_log_term: prev_log_term,
        entries: entries,
        leader_commit: self.commit_index
    })

The Consistency Check

Each AppendEntries message includes prev_log_index and prev_log_term — the index and term of the log entry immediately preceding the new entries. The follower checks that its log matches at this point. If it doesn’t, the follower rejects the AppendEntries, and the leader decrements next_index and retries.

This is the log matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. It’s enforced inductively: the base case is the empty log (trivially matching), and each AppendEntries extends the induction by checking the predecessor.

function on_append_entries(msg):
    if msg.term < self.current_term:
        // Stale leader: reject WITHOUT resetting the election timer.
        // (Resetting it here is a classic bug: a deposed leader's
        // retries would suppress legitimate elections.)
        reply(msg.from, AppendEntriesResponse {
            term: self.current_term,
            success: false,
            match_index: 0
        })
        return

    if msg.term > self.current_term:
        self.step_down(msg.term)

    // Reset election timer — we heard from the current leader
    self.reset_election_timer()

    self.role = Follower
    self.current_leader = msg.leader_id

    // Consistency check
    if msg.prev_log_index > 0:
        if msg.prev_log_index > len(self.log):
            // We don't have the predecessor entry
            reply(msg.from, AppendEntriesResponse {
                term: self.current_term,
                success: false,
                // Optimization: tell leader our log length
                // so it can skip ahead
                match_index: len(self.log)
            })
            return

        if self.log[msg.prev_log_index].term != msg.prev_log_term:
            // Predecessor entry exists but has wrong term — conflict
            // Optimization: find the first index of the conflicting term
            // and tell the leader to skip back to there
            conflict_term = self.log[msg.prev_log_index].term
            first_index = msg.prev_log_index
            while first_index > 1 and self.log[first_index - 1].term == conflict_term:
                first_index -= 1

            // Delete the conflicting entries
            self.log = self.log[:msg.prev_log_index]

            reply(msg.from, AppendEntriesResponse {
                term: self.current_term,
                success: false,
                match_index: first_index - 1
            })
            return

    // Append new entries (overwriting any conflicting entries)
    for i, entry in enumerate(msg.entries):
        index = msg.prev_log_index + 1 + i
        if index <= len(self.log):
            if self.log[index].term != entry.term:
                // Conflict — truncate and append
                self.log = self.log[:index]
                self.log.append(entry)
            // else: already have this entry, skip
        else:
            self.log.append(entry)

    persist(self.log)

    // Update commit index
    if msg.leader_commit > self.commit_index:
        self.commit_index = min(msg.leader_commit, len(self.log))
        self.apply_committed_entries()

    reply(msg.from, AppendEntriesResponse {
        term: self.current_term,
        success: true,
        match_index: msg.prev_log_index + len(msg.entries)
    })

Leader Handling of Responses

function on_append_entries_response(msg, from):
    if msg.term > self.current_term:
        self.step_down(msg.term)
        return

    if self.role != Leader or msg.term != self.current_term:
        return  // Stale

    if msg.success:
        self.next_index[from] = msg.match_index + 1
        self.match_index[from] = msg.match_index
        self.maybe_advance_commit_index()
    else:
        // Follower's log was inconsistent — back up
        self.next_index[from] = max(1, msg.match_index + 1)
        // Retry immediately
        self.leader_send_append_entries(from)

The Commit Mechanism

A log entry is committed when the leader has replicated it to a majority of servers. But there’s a critical subtlety: a leader can only commit entries from its own term.

This is the “Figure 8 problem” from the Raft paper, and it catches almost everyone off guard. Consider this scenario:

Time:  T1    T2    T3    T4
S1:   [1]   [1,2] -- crashed -------------
S2:   [1]   [1,2] [1,2]  [1,2]   <- becomes leader in term 4
S3:   [1]   [1]   [1,3]  [1,3]   <- was leader in term 3
S4:   [1]   [1]   [1]    [1,2]
S5:   [1]   [1]   [1]    [1,2]

S1 was leader in term 2, replicated entry 2 to S2, then crashed. S3 became leader in term 3, appended entry 3 at the same index, replicated it to no one, and crashed. S2 then becomes leader in term 4 and continues replicating the old term-2 entry, getting it onto S4 and S5. Now a majority (S2, S4, S5) has entry 2. But if S2 crashes before replicating any entry from its own term 4, S3 can still win the next election (its last log term, 3, beats the term-2 entries on every other reachable server) and overwrite entry 2 with entry 3. Treating entry 2 as committed at that point would have violated safety.

The fix: a leader never commits entries from previous terms directly. It only commits them indirectly, by committing a new entry from its own term. Once a current-term entry is committed at a given index, all preceding entries are also committed (by the log matching property).

function maybe_advance_commit_index():
    // Find the highest index replicated to a majority
    for n in range(len(self.log), self.commit_index, -1):
        if self.log[n].term == self.current_term:  // CRITICAL: only current term
            count = 1  // Count self
            for server in self.all_servers:
                if server != self.my_id and self.match_index[server] >= n:
                    count += 1

            if count > len(self.all_servers) / 2:
                self.commit_index = n
                self.apply_committed_entries()
                return

function apply_committed_entries():
    while self.last_applied < self.commit_index:
        self.last_applied += 1
        result = self.state_machine.apply(self.log[self.last_applied].command)

        // If we're the leader, respond to the client
        if self.role == Leader:
            respond_to_client(self.log[self.last_applied].client_info, result)

Safety: Intuition for the Proof

Raft’s safety property is: if a log entry is committed at a given index, no other server will ever have a different entry at that index.

The proof relies on two properties:

Leader Completeness. If a log entry is committed in a given term, that entry is present in the logs of all leaders of higher-numbered terms.

Why? Because:

  1. A committed entry is on a majority of servers.
  2. A new leader must receive votes from a majority.
  3. Any two majorities overlap.
  4. The election restriction ensures the new leader’s log is at least as up-to-date as any voter’s.
  5. Therefore, the new leader has the committed entry.

Log Matching. If two logs contain an entry with the same index and term, the logs are identical in all preceding entries.

Why? Because AppendEntries checks the predecessor’s index and term before appending. This check creates an inductive chain: if entry i matches, then entry i-1 was checked when entry i was replicated, so entry i-1 matches, and so on back to the beginning.

Together, these two properties ensure that once a value is committed, it is permanent and agreed upon by all future leaders.

The Things That Make Raft NOT as Easy as Advertised

The core protocol — leader election and log replication — is genuinely simpler than Paxos. But building a production Raft system requires solving several problems that the paper either addresses briefly or punts on entirely.

Membership Changes

The original Raft paper proposed joint consensus for membership changes: when changing from configuration C_old to C_new, the system transitions through an intermediate configuration C_old,new that requires majorities from BOTH configurations.

Time:   ─────────────────────────────────────────>
Config: C_old ──> C_old,new ──> C_new
                  │              │
            Committed here   Committed here

Joint consensus is safe but complex. You need to handle the case where the leader is not in the new configuration, where the joint configuration spans a leader change, and where a new server needs to catch up before it can vote.

Because of this complexity, the Raft authors later proposed single-server changes: only add or remove one server at a time. This is simpler because any two majorities of configurations that differ by one server will overlap.
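The overlap claim is easy to verify by brute force for small clusters. This sketch enumerates minimal majorities directly; the set-of-strings representation is just for illustration:

```python
from itertools import combinations

def majorities(servers):
    """All minimal majority quorums of a server set."""
    need = len(servers) // 2 + 1
    return [set(c) for c in combinations(sorted(servers), need)]

def single_change_overlaps(old):
    """Check that every majority of C_old intersects every majority of
    any configuration reachable by adding or removing ONE server."""
    candidates = [old | {"added"}] + [old - {s} for s in old if len(old) > 1]
    return all(m1 & m2
               for new in candidates
               for m1 in majorities(old)
               for m2 in majorities(new))

# Holds for odd and even cluster sizes alike — this overlap is why
# single-server changes need no joint consensus phase.
assert single_change_overlaps({"a", "b", "c"})
assert single_change_overlaps({"a", "b", "c", "d"})
assert single_change_overlaps({"a", "b", "c", "d", "e"})

# By contrast, adding TWO servers at once can produce disjoint
# majorities: {x, y, a} is a majority of {a,b,c,x,y}, while {b, c}
# is a majority of {a,b,c}, and they share no server.
assert {"x", "y", "a"} & {"b", "c"} == set()
```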

function add_server(new_server):
    // First, bring the new server up to date
    while not is_caught_up(new_server):
        send_snapshot_or_entries(new_server)
        wait(CATCHUP_CHECK_INTERVAL)

    // Propose the configuration change as a regular log entry
    new_config = self.current_config + {new_server}
    self.replicate(ConfigChange { config: new_config })

    // Configuration takes effect as soon as the entry is appended
    // (not when it's committed!)

function remove_server(old_server):
    new_config = self.current_config - {old_server}
    self.replicate(ConfigChange { config: new_config })

    // If we're removing ourselves, step down after committing
    if old_server == self.my_id:
        self.step_down_after_commit()

The single-server approach seems simpler, but it has its own subtleties. You cannot have two pending configuration changes at the same time (you must wait for each to commit). And there’s a tricky edge case when removing a server that is the current leader.

Log Compaction

The log grows without bound. You need to compact it. Raft describes snapshotting: periodically, take a snapshot of the state machine and discard log entries up to the snapshot point.

function take_snapshot():
    if self.last_applied - self.last_snapshot_index < SNAPSHOT_THRESHOLD:
        return

    snapshot = Snapshot {
        last_included_index: self.last_applied,
        last_included_term: self.log[self.last_applied].term,
        state: serialize(self.state_machine),
        config: self.current_config
    }

    write_snapshot_to_disk(snapshot)

    // Discard log entries up to the snapshot
    self.log = self.log[self.last_applied + 1:]
    self.last_snapshot_index = snapshot.last_included_index
    self.last_snapshot_term = snapshot.last_included_term

When a leader needs to send entries to a follower that has fallen behind the snapshot point, it sends the snapshot instead:

function send_install_snapshot(follower_id):
    send(follower_id, InstallSnapshot {
        term: self.current_term,
        leader_id: self.my_id,
        last_included_index: self.last_snapshot_index,
        last_included_term: self.last_snapshot_term,
        data: self.snapshot_data,
        // May be chunked for large snapshots
        offset: 0,
        done: true
    })

The follower replaces its state with the snapshot and discards its entire log up to the snapshot point. This is simple in concept but complex in implementation — snapshotting a large state machine while continuing to process requests requires copy-on-write semantics or a quiescent period, and transferring a multi-gigabyte snapshot is a non-trivial network operation.
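One concrete consequence of that complexity: after compaction, a Raft index and a position in the in-memory log are no longer the same thing, and every log access must translate between them. A minimal sketch of that bookkeeping (Python; `CompactedLog` is a hypothetical helper, not from any real implementation):

```python
class CompactedLog:
    def __init__(self):
        self.snapshot_index = 0   # last Raft index covered by the snapshot
        self.entries = []         # entries after the snapshot, in order

    def append(self, entry):
        self.entries.append(entry)

    def get(self, raft_index):
        """Fetch the entry at an absolute Raft index, or None if it
        has been compacted away or doesn't exist yet."""
        offset = raft_index - self.snapshot_index - 1
        if offset < 0 or offset >= len(self.entries):
            return None
        return self.entries[offset]

    def compact(self, up_to_index):
        """Discard all entries up to and including up_to_index."""
        keep = up_to_index - self.snapshot_index
        self.entries = self.entries[keep:]
        self.snapshot_index = up_to_index
```

Off-by-one errors in exactly this translation layer are a classic source of Raft implementation bugs, especially around the boundary entry at the snapshot index.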

The Pre-Vote Protocol

Raft has a problem: a partitioned server keeps incrementing its term (because its election timer keeps firing and it keeps starting elections it can never win). When the partition heals, this server has a very high term number. When it contacts the cluster, other servers see the high term and step down from leadership, causing unnecessary leader changes.

The pre-vote protocol (proposed in Ongaro’s dissertation but not in the original paper) addresses this. Before starting a real election, a candidate sends a “pre-vote” request that doesn’t increment the term. Other servers respond based on whether they would vote for this candidate — but they don’t actually record a vote or change their term. Only if the pre-vote succeeds does the candidate proceed with a real election.

function on_election_timeout_with_prevote():
    // Phase 0: Pre-vote
    pre_vote_term = self.current_term + 1  // Hypothetical next term

    for server in self.all_servers:
        if server != self.my_id:
            send(server, PreVote {
                term: pre_vote_term,
                candidate_id: self.my_id,
                last_log_index: len(self.log),
                last_log_term: self.log[-1].term if self.log else 0
            })

    // Collect pre-votes
    pre_vote_responses = wait_for_responses(ELECTION_TIMEOUT)

    if count_granted(pre_vote_responses) + 1 > len(self.all_servers) / 2:  // +1 for our own pre-vote
        // Pre-vote succeeded — proceed with real election
        self.start_real_election()
    else:
        // Pre-vote failed — we're probably partitioned
        self.reset_election_timer()  // Try again later

function on_pre_vote(msg):
    // Respond based on whether we WOULD vote, but don't record anything
    would_vote = (msg.term >= self.current_term and
                  is_log_up_to_date(msg.last_log_index, msg.last_log_term) and
                  (self.current_leader == null or
                   now() - self.last_heartbeat > ELECTION_TIMEOUT))

    reply(msg.from, PreVoteResponse {
        term: msg.term,
        vote_granted: would_vote
    })
    // NOTE: we do NOT update current_term or voted_for

Pre-vote is now considered essential for production Raft implementations. etcd added it. CockroachDB uses it. TiKV uses it. Its absence from the original paper is one of those cases where the protocol works fine in the common case but has a sharp edge in a scenario that’s not uncommon in production (network partitions).

Learner Nodes (Non-Voting Members)

When adding a new server to a Raft cluster, the new server has an empty log and needs to catch up. During catchup, it can’t usefully participate in consensus — it would just slow things down. Worse, if you add it to the configuration immediately, it changes the majority requirement, potentially making the cluster unable to commit new entries.

Learner nodes (also called non-voting members or observers) are servers that receive log replication but don’t vote in elections or count toward the commit quorum. They’re used to stage new servers until they’re caught up.

function add_server_with_learner(new_server):
    // Step 1: Add as learner (non-voting)
    self.learners.add(new_server)

    // Step 2: Replicate log to learner (same as a follower)
    while not is_caught_up(new_server):
        send_append_entries(new_server)
        wait(CHECK_INTERVAL)

    // Step 3: Promote to voting member
    self.learners.remove(new_server)
    new_config = self.current_config + {new_server}
    self.replicate(ConfigChange { config: new_config })

Read-Only Operations

Linearizable reads are harder than they look. The naive approach — just read from the leader — is unsafe because the leader might be partitioned and a new leader might have been elected.

Raft offers two approaches:

ReadIndex. The leader records its commit index, confirms it’s still the leader by sending heartbeats to a majority, and then serves the read once the recorded commit index has been applied.

function linearizable_read(query):
    if self.role != Leader:
        redirect_to_leader(query)
        return

    read_index = self.commit_index

    // Confirm we're still leader
    heartbeat_acks = send_heartbeats_and_wait()
    if |heartbeat_acks| < majority:
        // We might not be leader anymore
        return Error("not leader")

    // Wait for state machine to catch up
    wait_until(self.last_applied >= read_index)

    return self.state_machine.query(query)

Lease-based reads. If the leader has received heartbeat responses from a majority within the last election timeout period, it assumes it’s still the leader and serves reads locally. This is faster but depends on bounded clock drift, which not everyone is comfortable assuming.
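The lease check itself is simple; the hard part is trusting the clock assumption. A sketch (Python; the timeout value and drift bound are illustrative assumptions):

```python
ELECTION_TIMEOUT = 1.0   # seconds (assumed value)
MAX_CLOCK_DRIFT = 0.1    # assumed bound on relative clock drift

def lease_valid(heartbeat_sent_at, now):
    """The lease is anchored at the time the leader SENT the heartbeats
    that a majority later acknowledged, and shrunk by the drift bound so
    that it expires before any other server could start an election."""
    lease_duration = ELECTION_TIMEOUT * (1 - MAX_CLOCK_DRIFT)
    return now - heartbeat_sent_at < lease_duration

assert lease_valid(10.0, 10.5)        # well inside the lease
assert not lease_valid(10.0, 10.95)   # past the drift-adjusted bound
```

Note the anchor point: measuring from when the acknowledgments arrived, rather than when the heartbeats were sent, silently extends the lease past what the majority actually promised.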

Real Implementations

etcd/raft

etcd’s Raft implementation (in Go) is probably the most widely used Raft library. It’s used by etcd itself, Kubernetes (via etcd), CockroachDB, and TiKV.

Key characteristics:

  • Implements the core Raft protocol faithfully.
  • Adds pre-vote, learner nodes, and leader transfer.
  • Does NOT implement the transport layer — it’s a library that produces messages, and the application is responsible for sending them. This is an excellent design decision that makes it adaptable to different network stacks.
  • Does NOT implement persistence — the application provides a storage interface. Same rationale.

CockroachDB

CockroachDB uses etcd/raft but adds significant extensions:

  • Range-level Raft. Each data range (a contiguous keyspace) is a separate Raft group. A single CockroachDB cluster might have tens of thousands of Raft groups.
  • Multi-Raft. To avoid the overhead of thousands of independent Raft groups each sending their own heartbeats, CockroachDB batches Raft messages between nodes.
  • Joint consensus. CockroachDB uses joint consensus for membership changes rather than single-server changes.
  • Epoch-based leases. Range leases are based on epochs rather than wall-clock time, avoiding clock-dependency issues.

TiKV

TiKV (the storage engine for TiDB) also uses etcd/raft with its own extensions:

  • Batching and pipelining. TiKV aggressively batches Raft messages and pipelines requests.
  • Async apply. The state machine application is asynchronous — committed entries are applied in a separate thread from the Raft protocol thread. This improves throughput but requires careful handling of read requests.
  • Multi-Raft with region-based partitioning, similar to CockroachDB.

Performance Comparison with Multi-Paxos

In steady state (stable leader, no failures):

| Metric                      | Raft                           | Multi-Paxos          |
|-----------------------------|--------------------------------|----------------------|
| Messages per write          | 2(n-1)                         | 2(n-1)               |
| Round trips per write       | 1                              | 1                    |
| Fsyncs per write            | 1 (leader) + f (followers)     | Same                 |
| Read latency (linearizable) | 1 RTT (ReadIndex) or 0 (lease) | Same (lease-based)   |
| Leader change latency       | ~election timeout (~300ms)     | ~Phase 1 RTT (~2ms)  |

The one notable difference is leader change latency. Raft’s randomized election timeout means there’s a 150-300ms delay before a new leader is elected after a failure. Multi-Paxos can elect a new leader in a single round trip (the Phase 1 Prepare/Promise), which might be only a few milliseconds in a LAN.

In practice, this difference rarely matters because leader failures are (should be) rare events. But in systems that are extremely sensitive to failover latency, Multi-Paxos has an advantage.

Why Raft Won the Mindshare War

Raft’s dominance in the consensus algorithm mindshare is not primarily a technical achievement. The protocol is good — clean, well-specified, and practical. But its success is primarily a pedagogical and community achievement.

The paper is readable. It’s 18 pages, clearly structured, with detailed examples and figures. The authors explicitly optimized for understandability and it shows.

The visualization. The Raft visualization (thesecretlivesofdata.com/raft) is one of the best algorithm visualizations ever created. It lets you interactively step through leader election, log replication, and failure scenarios. This single resource has probably done more for consensus algorithm education than any paper.

Reference implementations. The Raft paper was accompanied by reference implementations, and the clear specification encouraged many more. The Raft website lists over 60 implementations in various languages.

Timing. Raft arrived at a time when the industry desperately needed an understandable consensus algorithm. Docker and Kubernetes were emerging, etcd needed a consensus protocol, and the distributed database movement was accelerating. Raft was the right protocol at the right time.

Explicit system design. Unlike Multi-Paxos, Raft specifies a complete system: leader election, log replication, safety, membership changes, log compaction. You can implement Raft from the paper alone (with effort). You cannot implement Multi-Paxos from the papers alone (without also inventing significant parts of the system yourself).

The combination of these factors created a self-reinforcing cycle: more people understood Raft, so more people implemented it, so more production systems used it, so more blog posts were written about it, so more people learned it. Paxos, for all its theoretical depth, could not compete with this cycle.

Where Raft Falls Short

Raft is not perfect. The design choices that make it understandable also constrain it:

Strong leader. All writes go through the leader. In a geo-distributed deployment, this means all writes incur the latency to the leader’s region. Leaderless Paxos variants like EPaxos (Chapter 13) can commit writes at any replica.

No log gaps. The contiguous log simplifies reasoning but forces in-order commit: the commit index advances sequentially, so a later entry cannot be committed until every earlier entry is, even if the later entry has independently reached a majority. This is rarely a problem in practice, but it rules out the out-of-order commit that more flexible protocols exploit.

Leader bottleneck. In a large cluster (5+ nodes), the leader must send AppendEntries to all followers and process all responses. The leader’s network bandwidth and CPU become the bottleneck before the followers’.

Rigid term structure. Raft’s term-based reasoning is clean but inflexible. Certain optimizations that are natural in Multi-Paxos (like out-of-order commits or flexible quorums) don’t fit naturally into Raft’s model.

These limitations are real but usually acceptable. For most systems, the benefits of understandability and implementation quality outweigh the theoretical performance advantages of more flexible protocols.

The Honest Assessment

Raft is not “Paxos for humans.” It is a well-designed consensus protocol with excellent documentation that solves the same problem as Multi-Paxos with similar performance. It makes some design choices that simplify understanding at the cost of flexibility, and it was accompanied by an unprecedented pedagogical effort that made it accessible to a broad audience.

If you are building a new system that needs consensus, Raft is almost certainly the right choice. Not because it’s the best consensus algorithm (there is no “best”), but because it has the largest community, the most reference implementations, the most operational experience, and the most educational resources. In distributed systems, being well-understood is a feature that trumps almost every theoretical advantage.

Paxos is more general. VR is more complete. EPaxos is more flexible. But Raft is the one your team can implement correctly, debug effectively, and operate confidently. In the agony of consensus algorithms, that might be the thing that matters most.

Zab: What ZooKeeper Actually Uses

Every conversation about consensus protocols eventually arrives at ZooKeeper. Not because ZooKeeper is elegant — it is not — but because it is everywhere. ZooKeeper is the coordination service that half the distributed systems in the world depend on, and it does not use Paxos. It does not use Raft. It uses Zab: ZooKeeper Atomic Broadcast.

This fact surprises many people. ZooKeeper was built at Yahoo! in the late 2000s, when Paxos was the only consensus game in town (Raft wouldn’t arrive until 2014). The ZooKeeper team had the option to implement Paxos. They chose not to. Instead, they designed their own protocol, tailored to ZooKeeper’s specific requirements, and in doing so, they made a series of tradeoffs that illuminate the gap between academic consensus protocols and the systems that actually serve production traffic.

Zab is not as clean as Raft. It is not as theoretically minimal as Paxos. But it is what actually runs behind Kafka (historically), Hadoop, HBase, Solr, and hundreds of other systems that list ZooKeeper as a dependency. When your production system goes down at 3 AM, there is a non-trivial chance that Zab is involved somewhere in the dependency chain. You should understand what it does.

Why Not Paxos?

The ZooKeeper team’s decision to build their own protocol was not born of hubris. (Well, perhaps a little hubris. All good systems work starts with a little hubris.) It was born of a specific technical requirement that Paxos, as described in the literature, does not naturally provide: FIFO client ordering with prefix agreement.

ZooKeeper’s API requires:

  1. All updates from a given client are applied in the order the client issued them. If client C sends write A, then write B, every server that applies both must apply A before B.

  2. All updates are applied in a total order that is consistent with the per-client FIFO order. There is a single global order of operations, and it respects the causal ordering within each client’s session.

  3. A client that reads after a write must see the effect of that write (or a later state). This is session consistency — not full linearizability, but stronger than eventual consistency.

These properties are called causal ordering (or more precisely, FIFO ordering with respect to client sessions), and they map naturally to ZooKeeper’s use cases: distributed locks, leader election, configuration management, and service discovery.

Multi-Paxos can provide total ordering, but it does not inherently provide FIFO client ordering. You can build FIFO ordering on top of Multi-Paxos (by tracking per-client sequence numbers and enforcing ordering constraints), but it’s additional machinery. Zab provides it natively because the protocol was designed around it.

The other reason is more pragmatic: in 2007, when ZooKeeper was being built, Multi-Paxos was poorly documented (as we discussed in Chapter 6), and the gap between the academic description and a production system was enormous. Building a protocol from scratch, tailored to their exact requirements, seemed (and probably was) less risky than trying to fill in all the blanks that Multi-Paxos left unspecified.

Zab’s System Model

Zab assumes:

  • A set of servers, one of which is the leader (primary) and the rest are followers.
  • Crash-recovery fault model: servers can crash and restart.
  • Messages can be lost, duplicated, or reordered, but not corrupted.
  • The system tolerates f failures out of 2f+1 servers.

Zab provides atomic broadcast: the ability to deliver messages to all servers in the same order, with the guarantee that if any server delivers a message, all operational servers eventually deliver it.

The Two Modes of Zab

Zab operates in two distinct modes:

  1. Recovery mode — Used when the system starts up or when the leader fails. In this mode, the servers elect a new leader and synchronize their state.

  2. Broadcast mode — Used during normal operation. The leader receives client requests, broadcasts them to followers, and commits them when a majority acknowledges.

The transition between these modes is the heart of Zab, and getting it right is where most of the protocol’s complexity lives.

Zab Identifiers: Epochs and Zxids

Zab uses a two-part transaction identifier called a zxid (ZooKeeper transaction ID). A zxid is a 64-bit number with two 32-bit components:

  • Epoch (high 32 bits): Incremented each time a new leader is elected. Analogous to Raft’s “term” or VR’s “view number.”
  • Counter (low 32 bits): Incremented for each transaction within an epoch. Reset to 0 when a new epoch starts.
struct Zxid:
    epoch: uint32    // High 32 bits
    counter: uint32  // Low 32 bits

    function compare(other: Zxid) -> int:
        if self.epoch != other.epoch:
            return self.epoch - other.epoch
        return self.counter - other.counter

    function to_int64() -> int64:
        return (self.epoch << 32) | self.counter

The epoch serves the same purpose as Raft’s term: it identifies which leader issued a transaction. The counter provides ordering within a leader’s tenure. Together, they create a total order over all transactions, with a clean boundary between different leaders’ contributions.
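This layout means a plain integer comparison of packed zxids yields exactly the epoch-then-counter order described above. A runnable sketch (Python):

```python
def zxid(epoch, counter):
    """Pack epoch into the high 32 bits, counter into the low 32 bits."""
    return (epoch << 32) | counter

def zxid_epoch(z):
    return z >> 32

def zxid_counter(z):
    return z & 0xFFFFFFFF

# A transaction from a later epoch always orders after any transaction
# from an earlier epoch, regardless of counters.
assert zxid(2, 0) > zxid(1, 999_999)
assert zxid_epoch(zxid(3, 7)) == 3
assert zxid_counter(zxid(3, 7)) == 7
```

This is why ZooKeeper can sort, compare, and persist zxids as single 64-bit integers while still recovering the epoch boundary whenever recovery needs it.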

Phase 1: Leader Election (Discovery)

When the system starts or the current leader fails, Zab enters recovery mode, beginning with leader election. Zab’s leader election is conceptually simpler than you might expect:

  1. Each server broadcasts a vote containing its proposed leader and the zxid of its last committed transaction.
  2. Servers update their vote if they see a “better” vote (higher epoch, or same epoch with higher counter, or same zxid but higher server ID).
  3. When a server observes that a majority have voted for the same server, that server is the prospective leader.
class ZabElection:
    function elect_leader():
        // Initially vote for ourselves
        my_vote = Vote {
            proposed_leader: self.my_id,
            zxid: self.last_committed_zxid,
            election_epoch: self.election_epoch + 1
        }
        self.election_epoch += 1

        // Broadcast our vote
        broadcast(my_vote)

        received_votes = {self.my_id: my_vote}

        while true:
            msg = receive(timeout=ELECTION_TIMEOUT)

            if msg == null:
                // Timeout — re-broadcast our vote
                broadcast(my_vote)
                continue

            if msg.election_epoch < self.election_epoch:
                continue  // Stale

            if msg.election_epoch > self.election_epoch:
                // We're behind — adopt their epoch
                self.election_epoch = msg.election_epoch
                received_votes = {}
                my_vote = self.determine_better_vote(my_vote, msg)
                broadcast(my_vote)

            // Record the vote
            received_votes[msg.from] = msg

            // Update our vote if we see a better candidate
            if self.is_better_candidate(msg, my_vote):
                my_vote = Vote {
                    proposed_leader: msg.proposed_leader,
                    zxid: msg.zxid,
                    election_epoch: self.election_epoch
                }
                broadcast(my_vote)
                received_votes[self.my_id] = my_vote

            // Check if a majority has voted for the same leader
            for candidate in self.all_servers:
                votes_for = count(v for v in received_votes.values()
                                  if v.proposed_leader == candidate)
                if votes_for > len(self.all_servers) / 2:
                    // Wait a bit for any late-arriving better votes
                    while more_votes_available(SHORT_TIMEOUT):
                        msg = receive(SHORT_TIMEOUT)
                        if msg and self.is_better_candidate(msg, my_vote):
                            // Better candidate appeared — continue election
                            break
                    else:
                        // Election complete
                        if candidate == self.my_id:
                            return LEADING
                        else:
                            self.current_leader = candidate
                            return FOLLOWING

    function is_better_candidate(vote_a, vote_b) -> bool:
        // Higher epoch wins, then higher counter, then higher server ID
        if vote_a.zxid.epoch != vote_b.zxid.epoch:
            return vote_a.zxid.epoch > vote_b.zxid.epoch
        if vote_a.zxid.counter != vote_b.zxid.counter:
            return vote_a.zxid.counter > vote_b.zxid.counter
        return vote_a.proposed_leader > vote_b.proposed_leader

The “better candidate” heuristic (higher zxid, breaking ties by server ID) ensures that the elected leader has the most complete transaction history. This is analogous to Raft’s election restriction but done through the voting protocol rather than vote-granting rules.

Phase 2: Synchronization (Recovery)

After a leader is elected, it must synchronize the followers’ state before entering broadcast mode. This is the discovery and synchronization phase, and it handles the messy reality that different servers may have processed different transactions before the old leader failed.

Step 1: Follower Connects to Leader

Each follower sends the leader its last committed zxid and its epoch.

// Follower side
function connect_to_leader(leader_id):
    send(leader_id, FollowerInfo {
        last_zxid: self.last_committed_zxid,
        current_epoch: self.accepted_epoch
    })

    // Wait for leader's response
    new_epoch_msg = receive_from(leader_id, timeout=SYNC_TIMEOUT)

    if new_epoch_msg.new_epoch > self.accepted_epoch:
        self.accepted_epoch = new_epoch_msg.new_epoch
        persist(self.accepted_epoch)
        send(leader_id, AckEpoch {
            last_zxid: self.last_committed_zxid,
            current_epoch: self.accepted_epoch
        })

Step 2: Leader Establishes New Epoch

The leader collects FollowerInfo from a majority of followers and determines the new epoch number (one more than the highest epoch it has seen).

// Leader side
function synchronize_followers():
    new_epoch = self.accepted_epoch + 1

    follower_infos = collect_from_majority(FollowerInfo)

    // Find the highest epoch and zxid among followers
    for info in follower_infos:
        new_epoch = max(new_epoch, info.current_epoch + 1)

    self.accepted_epoch = new_epoch

    // Send new epoch to all connected followers
    for follower in self.connected_followers:
        send(follower, NewEpoch { new_epoch: new_epoch })

    // Wait for AckEpoch from a majority
    ack_epochs = collect_from_majority(AckEpoch)

    // Now synchronize each follower's history
    self.sync_followers(ack_epochs)

Step 3: History Synchronization

The leader must bring each follower’s transaction history in line with its own. There are several cases:

function sync_follower(follower_id, follower_last_zxid):
    if follower_last_zxid == self.last_committed_zxid:
        // Follower is up to date — just send DIFF (empty)
        send(follower_id, Sync { type: DIFF, transactions: [] })

    elif follower_last_zxid < self.last_committed_zxid and
         self.has_transactions_since(follower_last_zxid):
        // Follower is behind but we have the needed transactions
        txns = self.get_transactions_since(follower_last_zxid)
        send(follower_id, Sync { type: DIFF, transactions: txns })

    elif follower_last_zxid > self.last_committed_zxid:
        // Follower has transactions we don't!
        // These must be from a failed leader — truncate them
        send(follower_id, Sync {
            type: TRUNC,
            truncate_to: self.last_committed_zxid
        })

    elif follower_last_zxid.epoch < self.last_committed_zxid.epoch:
        // Follower is from a previous epoch — might need TRUNC + DIFF
        // First truncate any uncommitted transactions from old epoch
        // Then send missing committed transactions
        send(follower_id, Sync {
            type: TRUNC_AND_DIFF,
            truncate_to: last_common_zxid,
            transactions: self.get_transactions_since(last_common_zxid)
        })

    else:
        // Follower is too far behind — send full snapshot
        send(follower_id, Sync {
            type: SNAP,
            snapshot: self.take_snapshot()
        })

This synchronization handles the crucial case where a follower has transactions that the new leader doesn’t — transactions that were proposed by the old leader but never committed. These transactions must be truncated (rolled back) because they are not part of the committed history. This is analogous to Raft truncating a follower’s log when it conflicts with the leader’s.

Step 4: Enter Broadcast Mode

Once the leader has synchronized a majority of followers, the system enters broadcast mode and can start processing client requests.

function enter_broadcast_mode():
    // Wait for sync acknowledgments from a majority
    sync_acks = collect_from_majority(SyncAck)

    // Start the zxid sequence for the new epoch
    // (per the zxid definition, the counter resets to 0 in a new epoch)
    self.next_zxid = Zxid {
        epoch: self.accepted_epoch,
        counter: 0
    }

    self.mode = BROADCAST

    // Send UPTODATE to all synced followers
    for follower in self.synced_followers:
        send(follower, UpToDate {})

    // Now we can process client requests

Broadcast Mode: Normal Operation

Once in broadcast mode, Zab operates as a simple two-phase protocol that looks very much like two-phase commit (but with majority-based quorums, crucially):

Client          Leader              Follower 1        Follower 2
  |                |                    |                 |
  |-- Request ---->|                    |                 |
  |                |                    |                 |
  |                |-- PROPOSAL(zxid, txn) -->            |
  |                |-- PROPOSAL(zxid, txn) ------------>  |
  |                |                    |                 |
  |                |<-- ACK(zxid) ------|                 |
  |                |<-- ACK(zxid) ----------------------|
  |                |                    |                 |
  |                |  (Majority ACKed — can commit)      |
  |                |                    |                 |
  |                |-- COMMIT(zxid) --->|                 |
  |                |-- COMMIT(zxid) ----------------->   |
  |                |                    |                 |
  |<-- Response ---|                    |                 |
class ZabLeader:
    function handle_client_write(request):
        // Assign a zxid
        zxid = self.next_zxid
        self.next_zxid.counter += 1

        txn = Transaction {
            zxid: zxid,
            data: request.data,
            client_id: request.client_id,
            session_id: request.session_id
        }

        // Append to our own transaction log
        self.txn_log.append(txn)
        persist(txn)

        // Send PROPOSAL to all followers
        for follower in self.active_followers:
            send(follower, Proposal {
                zxid: zxid,
                transaction: txn
            })

        // Track acknowledgments
        self.pending_proposals[zxid] = PendingProposal {
            transaction: txn,
            acks: {self.my_id},  // Leader implicitly ACKs
            request: request
        }

    function on_ack(zxid, from_follower):
        if zxid not in self.pending_proposals:
            return  // Already committed or unknown

        proposal = self.pending_proposals[zxid]
        proposal.acks.add(from_follower)

        if |proposal.acks| > len(self.all_servers) / 2:
            self.commit(zxid)

    function commit(zxid):
        // IMPORTANT: Zab commits in FIFO order
        // We must commit all preceding transactions first. This is safe
        // because ACKs arrive in FIFO order per follower: any follower
        // that ACKed zxid has also ACKed every earlier zxid.
        while self.next_commit_zxid <= zxid:
            txn = self.pending_proposals[self.next_commit_zxid].transaction

            // Send COMMIT to all followers
            for follower in self.active_followers:
                send(follower, Commit { zxid: self.next_commit_zxid })

            // Apply to our state machine
            result = self.state_machine.apply(txn)

            // Respond to client
            respond_to_client(self.pending_proposals[self.next_commit_zxid].request, result)

            delete self.pending_proposals[self.next_commit_zxid]
            self.next_commit_zxid.counter += 1

Follower Side

class ZabFollower:
    function on_proposal(msg):
        // Verify this is from the current leader in the current epoch
        if msg.zxid.epoch != self.accepted_epoch:
            return  // Wrong epoch

        // Append to transaction log
        self.txn_log.append(msg.transaction)
        persist(msg.transaction)

        // Send ACK
        send(self.leader, Ack { zxid: msg.zxid })

    function on_commit(msg):
        // Apply transaction to state machine
        // Commits MUST be processed in zxid order
        if msg.zxid != self.next_expected_commit:
            // Queue for later — we might have received commits out of order
            self.commit_queue.add(msg.zxid)
            return

        self.apply_commit(msg.zxid)
        self.next_expected_commit.counter += 1

        // Apply any queued commits
        while self.next_expected_commit in self.commit_queue:
            self.commit_queue.remove(self.next_expected_commit)
            self.apply_commit(self.next_expected_commit)
            self.next_expected_commit.counter += 1

    function apply_commit(zxid):
        txn = self.txn_log.get(zxid)
        self.state_machine.apply(txn)
        self.last_committed_zxid = zxid

FIFO Ordering: Why It Matters

The FIFO ordering guarantee is what makes Zab distinct from generic consensus protocols. Zab guarantees:

  1. If a leader broadcasts proposal a before proposal b, every server that delivers b must deliver a first. This is the FIFO broadcast property.

  2. If a leader in epoch e commits transaction a, and a later leader in epoch e’ commits transaction b, then a is delivered before b. This is the causal ordering across epochs.

  3. A new leader must commit all transactions that were committed by previous leaders before it can broadcast new transactions. This is the prefix property.

Together, these guarantee that the transaction history forms a consistent prefix — every server’s history is a prefix of the same global sequence. There are no gaps, no reorderings, no orphaned transactions.

This matters for ZooKeeper because clients depend on ordering. Consider this sequence:

Client A: create /leader with value "node-1"
Client A: create /leader/config with value "settings"

If the second write could be applied without the first (due to a leader change between them), a reader might see /leader/config exist without /leader, which would be semantically nonsensical.

Zab’s FIFO guarantee ensures this cannot happen. All transactions from a given client session are applied in order, and the total order respects this per-session ordering.
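The zxid's epoch-then-counter structure gives the total order directly: comparing (epoch, counter) pairs lexicographically places every transaction from an earlier epoch before every transaction from a later one. A minimal sketch (field names assumed for illustration):

```python
from functools import total_ordering

@total_ordering
class Zxid:
    """A zxid orders transactions first by epoch, then by counter."""
    def __init__(self, epoch, counter):
        self.epoch, self.counter = epoch, counter

    def __eq__(self, other):
        return (self.epoch, self.counter) == (other.epoch, other.counter)

    def __lt__(self, other):
        # Any transaction from an earlier epoch precedes every transaction
        # from a later epoch — the causal-ordering-across-epochs guarantee.
        return (self.epoch, self.counter) < (other.epoch, other.counter)
```

Sorting a set of zxids therefore recovers the global sequence, which is exactly why a gap-free, sorted transaction log is a prefix of the same history on every server.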

Zab vs. Raft: A Detailed Comparison

Zab and Raft are both leader-based consensus protocols with similar high-level architectures. The differences are in the details:

Aspect              Zab                                   Raft
Transaction IDs     Epoch + counter (zxid)                Term + log index
Leader election     Vote-based with best-zxid heuristic   RequestVote with log up-to-date check
Synchronization     Explicit DIFF/TRUNC/SNAP phases       AppendEntries consistency check
Commit mechanism    Explicit COMMIT message               Piggybacked leader_commit in AppendEntries
Log structure       Transaction log (no gaps)             Contiguous log (no gaps)
FIFO guarantee      Built into protocol                   Achieved via log ordering
Read semantics      Session consistency by default        Linearizable with ReadIndex
Recovery            Epoch-based with sync phase           Term-based with log matching

Key Difference 1: Explicit COMMIT Messages

In Raft, commit information is piggybacked on AppendEntries messages — followers learn about commits from the leader_commit field in the next AppendEntries. This is efficient (no extra messages) but means followers might not learn about a commit immediately.

In Zab, the leader sends an explicit COMMIT message after receiving a majority of ACKs. This is an extra message but means followers learn about commits sooner, which is important for ZooKeeper’s read consistency model (followers serve reads, and they need to know what’s committed).

Key Difference 2: Read Model

This is arguably the biggest practical difference. In Raft, reads from followers are stale by default — only the leader can serve linearizable reads (and even that requires special handling). Raft is designed for linearizable systems.

In ZooKeeper/Zab, followers serve reads directly from their local state. This provides session consistency (a client’s reads reflect its own writes) but not linearizability across clients. A client connected to a slow follower might see stale data.

ZooKeeper provides a sync operation that forces the follower to catch up with the leader before serving the next read. This bridges the gap to linearizability when needed, but it’s opt-in.

// ZooKeeper follower handling a read
function handle_read(session, path):
    if session.pending_sync:
        // Wait for sync to complete
        wait_until(self.last_committed_zxid >= session.sync_zxid)
        session.pending_sync = false

    return self.state_machine.get(path)

// ZooKeeper sync operation
function handle_sync(session):
    // Ask leader for current committed zxid
    leader_zxid = query_leader_commit()
    session.sync_zxid = leader_zxid
    session.pending_sync = true

This read model is a pragmatic choice. In many ZooKeeper use cases (configuration management, service discovery), slightly stale reads are acceptable and the throughput benefit of reading from any server is significant. When freshness matters (distributed locks), the client uses sync.

Key Difference 3: Synchronization Complexity

Raft’s log synchronization is elegant: the leader sends AppendEntries, the follower rejects if the log doesn’t match, the leader backs up and retries. It’s simple and handles all cases uniformly.

Zab’s synchronization has four distinct modes (DIFF, TRUNC, SNAP, TRUNC_AND_DIFF), each handling a specific case. This is more complex to implement but potentially more efficient — sending a DIFF of the last few transactions is cheaper than resending the entire log from the divergence point, and the explicit TRUNC mode makes it clear what’s happening when uncommitted transactions need to be rolled back.
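The leader's choice among these modes can be sketched as a simple decision function. This is illustrative only — the thresholds and names are a simplification of what ZooKeeper's leader actually does, with zxids reduced to integers and `leader_min`/`leader_max` bounding the transactions the leader still holds in its in-memory log:

```python
def choose_sync_mode(follower_last, leader_min, leader_max):
    """Illustrative decision: how should the leader sync this follower?

    follower_last : last zxid in the follower's log
    leader_min    : oldest zxid the leader still holds in its log
    leader_max    : newest committed zxid on the leader
    """
    if follower_last == leader_max:
        return "NONE"    # follower is already current
    if follower_last > leader_max:
        return "TRUNC"   # follower has extra, uncommitted entries to roll back
    if follower_last >= leader_min:
        return "DIFF"    # send just the missing suffix of transactions
    return "SNAP"        # too far behind the retained log: ship a full snapshot
```

The payoff is in the common case: a follower that briefly fell behind lands in the `DIFF` branch and receives a handful of transactions instead of a snapshot.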

The Relationship Between Zab and Virtual Synchrony

Zab is sometimes described as being related to virtual synchrony, and the connection is worth understanding.

Virtual synchrony, developed by Ken Birman in the 1980s, provides ordered message delivery within “groups” of processes. When a group membership changes (a process joins or leaves), all surviving members receive the same set of messages from the old group before transitioning to the new group. This is called a “view change” — the same term used in VR (Chapter 7).

Zab’s epoch-based approach echoes virtual synchrony’s view-based approach:

  • An epoch in Zab corresponds to a view in virtual synchrony.
  • The synchronization phase (where the new leader ensures all followers have the same history) corresponds to the “flush” in virtual synchrony (where all pending messages are delivered before the view change completes).
  • The guarantee that transactions from previous epochs are committed before new transactions are accepted mirrors virtual synchrony’s guarantee that all messages from the old view are delivered before the new view begins.

The key difference is that Zab uses a centralized leader, while virtual synchrony is typically implemented with a distributed protocol (like total order broadcast based on sequencer or token-based approaches). Zab’s centralized leader is simpler to implement and reason about, at the cost of making the leader a bottleneck.

Practical Experience: What ZooKeeper Operators Actually Deal With

If you operate ZooKeeper in production, the consensus protocol is rarely your primary concern. The operational challenges are dominated by issues that sit above or beside Zab:

Session Management

ZooKeeper clients maintain a session with the server, and sessions have a timeout. If the session expires (because the client can’t reach any server for the timeout period), all ephemeral nodes created by that client are deleted. This is how distributed locks work in ZooKeeper — the lock is an ephemeral node, and if the lock holder dies, the node is automatically deleted.

But session management is fragile:

  • GC pauses on the client can cause session expiration even when the client is healthy. A 10-second GC pause with a 6-second session timeout means your locks just got released.
  • Network partitions between the client and the ZooKeeper cluster cause session expiration. The client can’t just reconnect and resume — it must re-establish all its ephemeral nodes.
  • ZooKeeper server failures during a client’s session require the client to reconnect to a different server. The session is maintained (because session state is replicated), but there’s a window where the client might not know what state its session is in.
// The ZooKeeper session state machine (simplified)
enum SessionState:
    CONNECTING      // Trying to establish connection
    CONNECTED       // Normal operation
    RECONNECTING    // Lost connection, trying to reconnect
    EXPIRED         // Session timed out — all ephemeral nodes deleted
    CLOSED          // Client explicitly closed

// What operators deal with at 3 AM:
// 1. Client GC pause -> session expired -> ephemeral nodes deleted
//    -> distributed lock released -> two processes think they hold the lock
//    -> data corruption

Ephemeral Nodes and the Herd Effect

ZooKeeper’s watch mechanism notifies clients when a node changes. A common anti-pattern is to have all clients watch the same node (e.g., the lock node). When the lock is released, ALL clients are notified simultaneously, and they all try to acquire the lock at the same time. This is the “herd effect” or “thundering herd,” and it can overwhelm the ZooKeeper cluster.

The solution is the “sequential node” pattern: clients create sequential ephemeral nodes and watch only the node immediately before theirs. When the lock holder releases, only the next client in line is notified.
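The core of the pattern is deciding, from the lock directory's children, whether a client holds the lock or which single predecessor node it should watch. A sketch of that decision (the helper name is hypothetical; a real implementation would use a ZooKeeper client library to create the nodes and set the watch):

```python
def lock_status(my_node, children):
    """Return ("held", None) if my_node has the lowest sequence number,
    else ("waiting", predecessor) where predecessor is the ONE node to watch.

    Node names are assumed to end in a zero-padded sequence number,
    e.g. "lock-0000000042", as ZooKeeper sequential nodes do.
    """
    ordered = sorted(children, key=lambda name: int(name.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return ("held", None)
    # Watch only the node immediately ahead of us — no herd effect:
    # when it is deleted, exactly one client wakes up.
    return ("waiting", ordered[idx - 1])
```

When the predecessor's ephemeral node disappears (release or crash), only the next client in line is notified and re-checks its status.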

Disk Latency

Zab requires an fsync on every write (the leader fsyncs the proposal, and each follower fsyncs the proposal before sending its ACK). If the disk is slow (a saturated SSD, a virtual disk on a noisy neighbor cloud instance), write latency increases, heartbeats are delayed, and the leader might be declared dead — causing an unnecessary leader election.

ZooKeeper operators learn quickly that dedicated, fast disks for the ZooKeeper transaction log are not optional. A separate disk from the snapshots. A separate disk from the operating system. SSDs, not spinning disks. Seriously.

The Four-Letter Words

ZooKeeper exposes monitoring via “four-letter word” commands (literally: you telnet to the ZooKeeper port and type four letters like stat, ruok, mntr). These are crude but effective monitoring tools:

  • ruok — “Are you OK?” Responds with imok. If it doesn’t, you have a problem.
  • stat — Shows connected clients and outstanding requests.
  • mntr — Shows detailed metrics including proposal latency, sync latency, and outstanding requests.

Experienced operators watch the avg_proposal_latency and avg_sync_latency metrics. When these spike, something is wrong with the disk, the network, or both. The consensus protocol itself is almost never the problem — it’s the infrastructure underneath it.

Zab’s Strengths and Weaknesses

Strengths

Native FIFO ordering. Zab’s ordering guarantees are exactly what ZooKeeper needs. Building the same guarantees on top of Raft or Paxos is possible but requires additional infrastructure.

Explicit synchronization. The DIFF/TRUNC/SNAP synchronization modes handle recovery efficiently. A follower that’s only a few transactions behind gets a quick DIFF rather than a full log replay.

Battle-tested. Zab has been running in production at massive scale for over fifteen years. The bugs have been found and fixed. The edge cases have been discovered and handled. This is not a theoretical protocol — it’s a production-hardened one.

Weaknesses

Tight coupling to ZooKeeper. Zab was designed for ZooKeeper and it shows. The protocol is not easily extracted and used as a general-purpose consensus library. If you want a reusable consensus implementation, etcd/raft or a Paxos library is a better choice.

Complex recovery. The four synchronization modes (DIFF, TRUNC, SNAP, TRUNC_AND_DIFF) are more complex than Raft’s uniform AppendEntries-based catch-up. More modes means more code paths means more potential for bugs.

Leader bottleneck. Like all leader-based protocols, the leader is a throughput bottleneck. All writes must go through the leader, and the leader must communicate with all followers for every write. ZooKeeper mitigates this by serving reads from followers, but writes are still centralized.

Documentation. The original Zab paper is reasonable, but the actual ZooKeeper implementation has diverged from the paper in various ways over the years. Understanding what ZooKeeper actually does requires reading the source code, not just the paper. This is a common problem in long-lived systems, but it’s particularly acute for Zab because there’s no “Zab Made Simple” equivalent — no accessible secondary description to complement the original paper.

Snapshotting in ZooKeeper

ZooKeeper’s snapshotting mechanism is worth discussing because it takes an unusual approach: fuzzy snapshots.

Most systems require a consistent snapshot — a snapshot that represents the state at a specific point in time. Taking a consistent snapshot of a large data structure while continuing to process requests is expensive (it requires either copy-on-write or a pause).

ZooKeeper takes a different approach: it takes a “fuzzy” snapshot by iterating over the in-memory data tree and writing it to disk without acquiring a global lock. This means the snapshot might include some transactions that were applied after the snapshot started but not others — it’s an inconsistent view of the state.

This is fine because ZooKeeper replays the transaction log from the snapshot point forward when recovering. The fuzzy snapshot gives a (potentially inconsistent) starting state, and the transaction log replay brings it to a consistent state. As long as the transactions in the log are idempotent (which ZooKeeper’s are), replaying them on top of a fuzzy snapshot produces the correct result.

function take_fuzzy_snapshot():
    // Record the current last-committed zxid
    snapshot_zxid = self.last_committed_zxid

    // Iterate over the data tree WITHOUT a global lock
    // This means the snapshot might include transactions
    // committed after snapshot_zxid — that's OK
    snapshot_data = {}
    for path, node in self.data_tree.iterate():
        snapshot_data[path] = serialize(node)

    write_to_disk(snapshot_data, snapshot_zxid)

    // On recovery:
    // 1. Load fuzzy snapshot
    // 2. Replay transaction log from snapshot_zxid forward
    // 3. Result is consistent state

This is a clever optimization. Consistent snapshots require either a pause (bad for latency) or copy-on-write (complex and memory-expensive). Fuzzy snapshots avoid both by relying on the idempotency of the transaction log replay. It’s the kind of practical engineering trick that you develop only by building and operating real systems.
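The safety argument — fuzzy snapshot plus idempotent replay yields a consistent state — can be demonstrated with a toy key-value store (a simulation for illustration, not ZooKeeper code):

```python
def recover(snapshot, log, snapshot_zxid):
    """Rebuild state by replaying the log from snapshot_zxid forward.

    Each log entry is (zxid, key, value). Setting a key is idempotent,
    so re-applying entries the fuzzy snapshot already absorbed is harmless.
    """
    state = dict(snapshot)
    for zxid, key, value in log:
        if zxid > snapshot_zxid:
            state[key] = value   # idempotent: safe to apply twice
    return state

# Transaction log, and a fuzzy snapshot recorded "at" zxid 2 that
# nevertheless absorbed txn 3 because the tree kept changing during
# the (lock-free) iteration.
log = [(1, "a", "1"), (2, "b", "2"), (3, "a", "9"), (4, "c", "4")]
fuzzy = {"a": "9", "b": "2"}
```

Replaying from zxid 2 over the fuzzy snapshot re-applies txn 3 (a no-op) and applies txn 4, landing on exactly the state a full replay from scratch produces.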

The ZooKeeper Ecosystem

Zab’s influence extends beyond ZooKeeper itself. Systems that depend on ZooKeeper (and thus transitively on Zab) include:

  • Apache Kafka (historically, for broker coordination and topic metadata — Kafka has since introduced KRaft to remove this dependency)
  • Apache HBase (for region server coordination)
  • Apache Solr (for cluster state management)
  • Apache Hadoop (HDFS NameNode high availability)
  • Kubernetes (via etcd, though etcd uses Raft — some older Kubernetes-adjacent systems used ZooKeeper)

The trend in recent years has been away from ZooKeeper as a dependency. Kafka’s KRaft mode eliminates the ZooKeeper dependency entirely. New systems tend to embed Raft rather than depend on an external ZooKeeper cluster. But the installed base is enormous, and ZooKeeper (and thus Zab) will be running in production data centers for many years to come.

Summary

Zab is a purpose-built consensus protocol for ZooKeeper. It provides FIFO-ordered atomic broadcast with epoch-based recovery, designed specifically for the coordination primitives that ZooKeeper exposes (locks, configuration, service discovery). It is not the most elegant consensus protocol, nor the most general, but it is among the most battle-tested.

The decision to build a custom protocol rather than using Paxos reflects a pragmatic reality: generic consensus protocols are building blocks, not systems. The gap between the protocol and the system must be filled, and sometimes it’s easier to design a protocol that fills the gap from the start than to retrofit a generic protocol to your specific requirements.

Whether this was the right decision is debatable — the ZooKeeper team has certainly spent more time maintaining and debugging Zab than they would have spent adapting Multi-Paxos. But the result works, it has worked for fifteen years, and it has handled the coordination needs of some of the largest distributed systems ever built. In the agony of consensus algorithms, that is a success by any measure.

PBFT: When You Can’t Trust Anyone

In 1999, Miguel Castro and Barbara Liskov published a paper that changed the landscape of Byzantine fault tolerance. Before PBFT, Byzantine agreement protocols were largely academic curiosities — theoretically interesting, practically useless. The existing protocols had exponential message complexity or required synchronous networks, which is another way of saying they required networks that don’t exist. Castro and Liskov’s contribution was making BFT practical. The word “practical” is doing an enormous amount of heavy lifting in that title, as we’re about to discover.

PBFT can tolerate up to f Byzantine (arbitrarily faulty) nodes out of a total of n = 3f + 1 replicas. It guarantees safety in an asynchronous network and liveness when the network is eventually synchronous. The protocol operates in views, each with a designated primary (leader), and uses a three-phase commit protocol for normal operation. If the primary misbehaves, a view-change protocol replaces it.

This all sounds straightforward. It isn’t.

The System Model

Before we dive into the protocol, let’s be precise about what PBFT assumes:

  • n = 3f + 1 replicas. To tolerate f Byzantine faults, you need at least 3f + 1 total nodes. This comes from the fundamental impossibility result: with fewer nodes, a Byzantine adversary can create ambiguity that prevents correct nodes from distinguishing valid states. If you want to tolerate 1 faulty node, you need 4 replicas. Tolerate 2? Seven replicas. The overhead adds up.

  • Asynchronous network with eventual synchrony. PBFT does not assume messages arrive within bounded time for safety. A Byzantine node can delay messages, reorder them, duplicate them — the protocol remains safe. However, liveness requires eventual synchrony: messages must eventually get through within some (unknown) bound. Without this, FLP impossibility kicks in and nobody makes progress.

  • Cryptographic assumptions. Every message is signed. PBFT uses public-key cryptography to authenticate messages and MACs (message authentication codes) for efficiency. The adversary cannot forge signatures or break the cryptographic primitives. This assumption is doing real work — the entire protocol falls apart without it.

  • Independent failures. The Byzantine nodes are assumed to fail independently, not in a coordinated attack by a single adversary controlling multiple nodes. In practice, if someone compromises your deployment infrastructure and controls multiple replicas simultaneously, the 3f + 1 bound may not hold. This is the kind of thing the paper mentions in passing and practitioners lose sleep over.
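The replica-count arithmetic in this model is worth internalizing. A small helper contrasting crash fault tolerance (Raft/Paxos-style, 2f + 1) with Byzantine fault tolerance (3f + 1), plus the quorum size PBFT's certificates use:

```python
def replicas_needed(f, byzantine):
    """Minimum cluster size to tolerate f simultaneous faults."""
    return 3 * f + 1 if byzantine else 2 * f + 1

def bft_quorum(f):
    """Certificate size used in PBFT's prepare/commit phases."""
    return 2 * f + 1
```

The overhead is concrete: tolerating one crash fault takes 3 nodes; tolerating one Byzantine fault takes 4, and two Byzantine faults already require 7.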

The Three-Phase Protocol: Normal Case Operation

PBFT’s normal case protocol has three phases: pre-prepare, prepare, and commit. The client sends a request to the primary, and the primary initiates the three-phase protocol to replicate the request across all replicas.

Message Flow: Normal Case

Here’s the complete message flow for a single client request with n = 4 (f = 1):

Client    Primary(0)  Replica 1   Replica 2   Replica 3
  |           |           |           |           |
  |--REQUEST->|           |           |           |
  |           |           |           |           |
  |           |--PRE-PREPARE-------->|           |
  |           |--PRE-PREPARE---------------->|   |
  |           |--PRE-PREPARE---------------------->|
  |           |           |           |           |
  |           |<--PREPARE-|           |           |
  |           |<--PREPARE------------|           |
  |           |<--PREPARE-------------------------|
  |           |           |<--PREPARE(0)-------->|
  |           |           |<--PREPARE(2)-------->|
  |           |           |           |<-PREPARE(0)
  |           |           |           |<-PREPARE(1)
  |           |           |           |           |
  |           | (each replica collects 2f PREPARE |
  |           |  messages matching PRE-PREPARE)   |
  |           |           |           |           |
  |           |--COMMIT-->|           |           |
  |           |--COMMIT------------>|           |
  |           |--COMMIT--------------------->|   |
  |           |<--COMMIT--|           |           |
  |           |<--COMMIT------------|           |
  |           |<--COMMIT--------------------------|
  |           |           |<--COMMIT(0)--------->|
  |           |           |<--COMMIT(2)--------->|
  |           |           |           |<-COMMIT(0)
  |           |           |           |<-COMMIT(1)
  |           |           |           |           |
  |<--REPLY---|           |           |           |
  |<--REPLY---------------|           |           |
  |<--REPLY--------------------------|           |
  |<--REPLY---------------------------------------|

The client waits for f + 1 matching replies (2 in this case). Since at most f replicas can be Byzantine, f + 1 matching replies guarantee at least one came from a correct replica. The client doesn’t need to know which replicas are faulty.

Notice the all-to-all communication in the prepare and commit phases. Every replica sends to every other replica. This is the O(n^2) per phase that will haunt PBFT’s scalability story.
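Counting the messages for a single request makes the quadratic growth concrete. This is a rough tally assuming point-to-point links, each broadcast sent individually to every other replica, and all n replicas replying to the client — real deployments change the constants with batching and MAC vectors, but not the shape:

```python
def pbft_messages_per_request(n):
    """Rough message count for one request in PBFT's normal case."""
    pre_prepare = n - 1            # primary -> each backup
    prepare = (n - 1) * (n - 1)    # each backup -> every other replica
    commit = n * (n - 1)           # every replica -> every other replica
    reply = n                      # every replica -> client
    return 1 + pre_prepare + prepare + commit + reply  # +1 for the request
```

For n = 4 that is 29 messages for one request; at n = 7 it is already 92. The two all-to-all phases dominate, which is why later protocols (HotStuff among them) work to linearize this communication.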

Phase 1: Pre-Prepare

The primary assigns a sequence number to the client request and broadcasts a PRE-PREPARE message to all backups.

// Primary replica p in view v
function on_client_request(request):
    if i_am_not_primary(v):
        forward request to primary
        return

    seq_num = next_sequence_number()

    // Check watermarks
    if seq_num < low_watermark or seq_num > high_watermark:
        drop request  // Outside our window
        return

    msg = PrePrepare{
        view:     v,
        seq:      seq_num,
        digest:   hash(request),
        request:  request
    }

    sign(msg)
    log.append(msg)
    broadcast_to_all_backups(msg)
// Backup replica i receives PRE-PREPARE from primary
function on_pre_prepare(msg):
    // Validate the message
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, primary_of(msg.view)):
        discard(msg)
        return

    if msg.seq < low_watermark or msg.seq > high_watermark:
        discard(msg)
        return

    // Critical check: have we already accepted a different
    // PRE-PREPARE for this view and sequence number?
    if exists_pre_prepare(msg.view, msg.seq) and
       existing.digest != msg.digest:
        discard(msg)  // Primary is equivocating!
        return

    // Accept the PRE-PREPARE
    log.append(msg)

    // Enter PREPARE phase
    prepare_msg = Prepare{
        view:   msg.view,
        seq:    msg.seq,
        digest: msg.digest,
        sender: my_id
    }
    sign(prepare_msg)
    log.append(prepare_msg)
    broadcast_to_all_replicas(prepare_msg)

A backup accepts the PRE-PREPARE if:

  1. It’s in the correct view.
  2. The signature is valid.
  3. The sequence number is within the watermark window.
  4. It hasn’t accepted a different PRE-PREPARE for the same view and sequence number. This last check prevents a Byzantine primary from assigning the same sequence number to different requests for different backups — a form of equivocation.

The primary does not send a PREPARE message. It already demonstrated its intent via the PRE-PREPARE. This is a minor detail the paper handles carefully and many implementations get wrong initially.

Phase 2: Prepare

Each backup that accepts the PRE-PREPARE multicasts a PREPARE message to all replicas. A replica collects PREPARE messages until it has a prepared certificate: the PRE-PREPARE plus 2f matching PREPARE messages from different replicas (including itself, excluding the primary).

function on_prepare(msg):
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, msg.sender):
        discard(msg)
        return

    if msg.seq < low_watermark or msg.seq > high_watermark:
        discard(msg)
        return

    log.append(msg)
    check_prepared(msg.view, msg.seq, msg.digest)

function check_prepared(view, seq, digest):
    // Do we have a matching PRE-PREPARE?
    if not has_pre_prepare(view, seq, digest):
        return

    // Count matching PREPAREs from distinct replicas
    prepare_count = count_matching_prepares(view, seq, digest)

    if prepare_count >= 2 * f:
        // We are PREPARED for this request
        mark_as_prepared(view, seq, digest)

        // Enter COMMIT phase
        commit_msg = Commit{
            view:   view,
            seq:    seq,
            digest: digest,
            sender: my_id
        }
        sign(commit_msg)
        log.append(commit_msg)
        broadcast_to_all_replicas(commit_msg)

Why 2f? Because out of n = 3f + 1 replicas, at most f are Byzantine. The primary sent the PRE-PREPARE (1 replica). We need 2f PREPAREs from distinct backups. Since at most f of those could be Byzantine, at least f + 1 correct replicas have prepared. This is enough to guarantee that no two different requests can both become prepared at the same sequence number in the same view. This is the quorum intersection argument: any two sets of 2f + 1 replicas out of 3f + 1 total must overlap in at least f + 1 replicas, and at least one of those must be correct.
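The quorum intersection argument can be checked exhaustively for small f (a brute-force verification of the bound, not part of the protocol):

```python
from itertools import combinations

def min_quorum_overlap(f):
    """Smallest intersection between any two quorums of size 2f+1
    drawn from n = 3f+1 replicas, found by brute force."""
    n, q = 3 * f + 1, 2 * f + 1
    return min(len(set(a) & set(b))
               for a in combinations(range(n), q)
               for b in combinations(range(n), q))
```

Every pair of quorums overlaps in at least f + 1 replicas (2 for f = 1, 3 for f = 2), matching the counting argument: 2(2f + 1) − (3f + 1) = f + 1, so at least one replica in the overlap is correct.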

The prepared predicate is the critical invariant: if a correct replica is prepared for request m at sequence number n in view v, then no correct replica can be prepared for a different request m’ at the same sequence number in the same view. This is what prevents the Byzantine primary from causing inconsistency.

Phase 3: Commit

Once prepared, a replica broadcasts a COMMIT message. A replica collects COMMIT messages until it has a committed certificate: 2f + 1 matching COMMIT messages from different replicas (including itself).

function on_commit(msg):
    if msg.view != current_view:
        discard(msg)
        return

    if not verify_signature(msg, msg.sender):
        discard(msg)
        return

    log.append(msg)
    check_committed(msg.view, msg.seq, msg.digest)

function check_committed(view, seq, digest):
    // Do we have a prepared certificate?
    if not is_prepared(view, seq, digest):
        return

    // Count matching COMMITs from distinct replicas
    commit_count = count_matching_commits(view, seq, digest)

    if commit_count >= 2 * f + 1:
        // We are COMMITTED-LOCAL for this request
        mark_as_committed(view, seq, digest)

        // Execute requests in sequence number order
        execute_pending_requests()

function execute_pending_requests():
    while has_committed(last_executed + 1):
        last_executed += 1
        request = get_request(last_executed)
        result = execute(request)

        // Send reply to client
        reply = Reply{
            view:      current_view,
            timestamp: request.timestamp,
            client:    request.client_id,
            replica:   my_id,
            result:    result
        }
        sign(reply)
        send(request.client_id, reply)

The commit phase is necessary because the prepared predicate alone isn’t sufficient across view changes. A replica might be prepared, but if the view changes before it commits, other correct replicas might not know about it. The commit phase ensures that enough correct replicas know the request is prepared, so the information survives a view change. Without the commit phase, a view change could cause a prepared request to be “forgotten,” violating safety.

This is one of those subtleties that looks like over-engineering until you try removing it and watch your protocol break in testing.

Why Three Phases? Can’t We Do Two?

A natural question: Raft and Paxos work with essentially two phases (propose and accept). Why does PBFT need three?

The answer lies in the difference between crash fault tolerance and Byzantine fault tolerance. In CFT protocols, a node either works correctly or crashes — it never lies. So when a leader says “commit this entry,” replicas can trust it. In BFT, the leader might be lying. The three-phase structure ensures:

  1. Pre-prepare: The primary proposes an ordering (and might be lying about it).
  2. Prepare: Replicas verify they all received the same proposal. This catches equivocation by the primary — if the primary sent different proposals to different replicas, they’ll discover the inconsistency. The prepared certificate proves that 2f + 1 replicas agree on what the primary proposed.
  3. Commit: Replicas verify that enough of them are prepared. This ensures the decision survives view changes even if the primary is replaced.

You can think of it this way: in CFT, you trust the leader and just need to ensure durability. In BFT, you first need to establish what the leader actually said (prepare), and then ensure enough replicas know what the leader actually said (commit).

View Changes: The Hard Part

The view-change protocol is, without exaggeration, the most complex part of PBFT. It handles leader replacement when the primary is suspected of being faulty. If you’ve implemented Raft, think of view changes as leader election, except the leader might be actively trying to sabotage the process.

A replica triggers a view change when:

  • It suspects the primary is faulty (e.g., timeout on expected PRE-PREPARE).
  • It receives f + 1 VIEW-CHANGE messages for a higher view (indicating other replicas also suspect the primary).

View Change Protocol

// Replica i suspects the primary is faulty
function start_view_change(new_view):
    if new_view <= current_view:
        return  // Don't go backwards

    // Stop accepting messages for current view
    stop_timers()

    // Construct the VIEW-CHANGE message
    // P is the set of prepared certificates for sequence numbers
    // between low and high watermarks
    P = collect_prepared_certificates()

    msg = ViewChange{
        new_view: new_view,
        last_stable_checkpoint: last_checkpoint_seq,
        checkpoint_proof: checkpoint_certificate,
        prepared_set: P,     // Set of (seq, digest, prepare_certificate)
        sender: my_id
    }
    sign(msg)
    send_to_new_primary(msg)
    // Also keep a copy for ourselves
    log_view_change(msg)

The new primary (determined by new_view mod n) collects VIEW-CHANGE messages:

// New primary for view v collects VIEW-CHANGE messages
function on_view_change(msg):
    if msg.new_view != pending_view:
        return

    if not verify_signature(msg, msg.sender):
        return

    if not verify_checkpoint_proof(msg.checkpoint_proof):
        return

    if not verify_prepared_certificates(msg.prepared_set):
        return

    view_change_messages[msg.sender] = msg

    if count(view_change_messages) >= 2 * f + 1:
        construct_new_view(msg.new_view)

function construct_new_view(v):
    // Determine the starting sequence number:
    // the highest stable checkpoint across all VIEW-CHANGE messages
    min_s = max(vc.last_stable_checkpoint
                for vc in view_change_messages.values())

    // Determine the highest sequence number mentioned in any
    // prepared certificate across all VIEW-CHANGE messages
    max_s = max(seq for vc in view_change_messages.values()
                    for (seq, _, _) in vc.prepared_set)

    // For each sequence number from min_s+1 to max_s,
    // determine what request should be assigned
    O = {}  // The set of PRE-PREPARE messages for the new view
    for seq in range(min_s + 1, max_s + 1):
        // Find if any VIEW-CHANGE message has a prepared certificate
        // for this sequence number
        prepared = find_highest_view_prepared(seq, view_change_messages)

        if prepared != null:
            // Re-propose the request that was prepared
            O[seq] = PrePrepare{
                view: v, seq: seq, digest: prepared.digest
            }
        else:
            // No replica had this prepared — assign a null request
            O[seq] = PrePrepare{
                view: v, seq: seq, digest: NULL_DIGEST
            }

    // Broadcast NEW-VIEW message
    new_view_msg = NewView{
        view: v,
        view_changes: view_change_messages,  // The 2f+1 VIEW-CHANGE msgs
        pre_prepares: O                      // The re-proposed requests
    }
    sign(new_view_msg)
    broadcast_to_all(new_view_msg)

    // Enter the new view
    current_view = v
    // Start processing the pre-prepares in O
    for pp in O.values():
        process_as_pre_prepare(pp)

// Backup receives NEW-VIEW message
function on_new_view(msg):
    if msg.view <= current_view:
        return

    // Verify the NEW-VIEW message
    if not verify_signature(msg, primary_of(msg.view)):
        return

    // Verify that the VIEW-CHANGE messages are valid
    if count(msg.view_changes) < 2 * f + 1:
        return

    for vc in msg.view_changes.values():
        if not verify_view_change(vc):
            return

    // Independently recompute what O should be
    // using the same deterministic algorithm
    expected_O = recompute_pre_prepares(msg.view, msg.view_changes)

    if expected_O != msg.pre_prepares:
        // New primary is lying about what should be re-proposed!
        return  // Don't enter new view; may trigger another view change

    // Accept the new view
    current_view = msg.view
    for pp in msg.pre_prepares.values():
        process_as_pre_prepare(pp)

Message Flow: View Change

Here’s the view change flow when replica 0 (primary of view 0) is suspected faulty. Replica 1 becomes the new primary for view 1:

Replica 0     Replica 1     Replica 2     Replica 3
(old primary)  (new primary)
   |              |              |              |
   | (timeout)    | (timeout)    | (timeout)    |
   |              |              |              |
   |              |<-VIEW-CHANGE-|              |
   |              |<-VIEW-CHANGE----------------|
   |              |              |              |
   |              | (has 2f+1 = 3 VIEW-CHANGE   |
   |              |  msgs, including its own)   |
   |              |              |              |
   |<--NEW-VIEW---|              |              |
   |              |--NEW-VIEW--->|              |
   |              |--NEW-VIEW------------------>|
   |              |              |              |
   |              | (replicas verify NEW-VIEW,  |
   |              |  recompute O, enter view 1) |
   |              |              |              |
   |              | (normal protocol resumes    |
   |              |  in view 1)                 |

Why View Changes Are Nightmarishly Complex

Let me enumerate the things that make view changes the hardest part of PBFT:

  1. State transfer. The new primary needs to reconstruct the state from VIEW-CHANGE messages. Each VIEW-CHANGE carries prepared certificates, checkpoint proofs, and sequence number information. Assembling this into a consistent view is non-trivial, especially when some of those certificates may have been constructed by Byzantine nodes.

  2. The O set computation. The new primary must correctly determine, for every sequence number in the window, whether a request was prepared in a previous view and re-propose it. Get this wrong and you violate safety. The paper describes this as a deterministic computation, which it is, but implementing it involves iterating over potentially thousands of prepared certificates across multiple VIEW-CHANGE messages and handling edge cases around null requests, gaps, and partially-prepared requests.

  3. Verification. Backups must independently verify the NEW-VIEW message by recomputing the O set themselves. This means every backup performs the same complex computation and checks it against what the new primary claims. A Byzantine new primary could try to subtly alter the O set to drop a committed request.

  4. Concurrent view changes. What happens if the view change to view 2 fails because the new primary for view 2 is also Byzantine? You need another view change to view 3. Multiple concurrent view changes can interact in subtle ways, especially if messages from different views are in flight simultaneously.

  5. Liveness during view changes. While a view change is in progress, the system makes no progress on client requests. Byzantine nodes can trigger unnecessary view changes to degrade performance (a denial-of-service vector). The paper addresses this with exponentially increasing timeouts, but tuning these timeouts in practice is an art form.
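To make item 2 above concrete, here is a toy Python sketch of the O-set computation. The dict shapes and field names are stand-ins of my own, not the paper’s message format:

```python
NULL_DIGEST = "null"

def compute_O(new_view, view_changes):
    """Recompute the new view's PRE-PREPAREs from VIEW-CHANGE messages.

    Each view_change is a dict:
      "checkpoint" -- last stable checkpoint sequence number
      "prepared"   -- {seq: (view, digest)} prepared certificates
    """
    min_s = max(vc["checkpoint"] for vc in view_changes)
    prepared_seqs = [s for vc in view_changes for s in vc["prepared"]]
    max_s = max(prepared_seqs, default=min_s)

    O = {}
    for seq in range(min_s + 1, max_s + 1):
        certs = [vc["prepared"][seq] for vc in view_changes
                 if seq in vc["prepared"]]
        if certs:
            _, digest = max(certs)  # certificate from the highest view wins
            O[seq] = {"view": new_view, "seq": seq, "digest": digest}
        else:
            # Gap: nobody prepared this slot, so assign a null request
            O[seq] = {"view": new_view, "seq": seq, "digest": NULL_DIGEST}
    return O
```

A Byzantine-hardened version would also validate each certificate before trusting it; this sketch assumes that verification has already happened.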

I’ve seen production PBFT implementations where the normal-case protocol worked fine for months while the view-change code harbored critical bugs that were never triggered, because the primary never failed. Then the primary failed, and the system didn’t recover. The view-change protocol accounts for maybe 20% of the paper’s text but 80% of the implementation complexity.

Watermarks and Garbage Collection

PBFT uses a sliding window mechanism controlled by two watermarks:

  • Low watermark (h): The sequence number of the last stable checkpoint.
  • High watermark (H): h + k, where k is a configurable window size (the paper suggests making k a small multiple of the checkpoint interval, e.g., twice it).
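The acceptance check this implies is tiny: a replica rejects any PRE-PREPARE whose sequence number falls outside the window. A minimal Python sketch:

```python
def in_window(seq, h, k):
    """Accept a PRE-PREPARE only if h < seq <= H, where H = h + k."""
    return h < seq <= h + k
```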

// Checkpoint protocol
function maybe_checkpoint(seq):
    if seq mod CHECKPOINT_INTERVAL != 0:
        return

    state_digest = hash(application_state)

    msg = Checkpoint{
        seq:    seq,
        digest: state_digest,
        sender: my_id
    }
    sign(msg)
    broadcast_to_all(msg)

function on_checkpoint(msg):
    checkpoint_messages[msg.seq][msg.sender] = msg

    // Do we have 2f+1 matching checkpoints?
    matching = count(cp for cp in checkpoint_messages[msg.seq].values()
                     if cp.digest == msg.digest)

    if matching >= 2 * f + 1:
        // Stable checkpoint!
        stable_checkpoint = msg.seq
        low_watermark = msg.seq
        high_watermark = msg.seq + WINDOW_SIZE

        // Garbage collect: discard all messages with seq <= msg.seq
        garbage_collect(msg.seq)

The watermarks serve two purposes:

  1. Memory management. Without garbage collection, replicas would accumulate messages forever. The checkpoint protocol allows replicas to discard old messages once 2f + 1 replicas agree on the state at a sequence number.
  2. Flow control. The high watermark prevents a faulty primary from exhausting memory by assigning arbitrarily high sequence numbers.

The checkpoint interval is a tuning knob: too frequent and you waste bandwidth on checkpoint messages; too infrequent and replicas accumulate too much state. In practice, I’ve seen intervals ranging from 128 to 1024, depending on the application’s state size and message rate.

Message Complexity: The Scalability Wall

Let’s count the messages for a single client request:

Phase         Messages                           Complexity
Request       1 (client to primary)              O(1)
Pre-prepare   n - 1 (primary to backups)         O(n)
Prepare       (n - 1) * n (each backup to all)   O(n^2)
Commit        n * n (each replica to all)        O(n^2)
Reply         n (all replicas to client)         O(n)
Total         ~2n^2                              O(n^2)

For n = 4 (f = 1): approximately 32 messages per request. For n = 7 (f = 2): approximately 98 messages per request. For n = 13 (f = 4): approximately 338 messages per request. For n = 31 (f = 10): approximately 1,922 messages per request.
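Tallying the table’s rows gives the exact counts behind these approximations (a quick Python sanity check):

```python
def pbft_messages(n):
    """Exact per-request message count, summing the table's rows."""
    request     = 1            # client -> primary
    pre_prepare = n - 1        # primary -> backups
    prepare     = (n - 1) * n  # each backup -> all replicas
    commit      = n * n        # each replica -> all replicas
    reply       = n            # all replicas -> client
    return request + pre_prepare + prepare + commit + reply
```

For n = 4 this gives 36, close to the 2n^2 = 32 approximation; for n = 31 it gives 1,953.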

Each message must be authenticated (either MAC or digital signature). Each replica must process all messages it receives. The quadratic blowup means that PBFT hits a practical scalability wall around 20-30 replicas. Beyond that, the communication overhead dominates, and throughput collapses.

This is the fundamental challenge that motivated protocols like HotStuff (Chapter 11). The quadratic message complexity isn’t a bug in PBFT — it’s a consequence of the all-to-all communication needed to ensure that correct replicas can independently verify consensus without trusting the primary. Reducing this requires different cryptographic tools (threshold signatures) or different protocol structures.

Bandwidth Analysis

It’s not just message count — it’s message size. Each PREPARE and COMMIT message carries a view number, sequence number, digest, and signature. With RSA-2048 signatures, that’s at least 300+ bytes per message. With n = 31 replicas, a single request generates roughly 1,922 messages * 300 bytes = ~576 KB of authentication overhead alone, not counting the actual request data.

Using MACs instead of signatures (as the paper suggests for prepare and commit messages) reduces per-message overhead but introduces complexity in verification during view changes, where you need transferable proofs. Castro and Liskov’s paper handles this with an elaborate scheme of authenticators (vectors of MACs, one per receiving replica), which is yet another source of implementation headaches.

Optimizations

The base PBFT protocol is not fast enough for most practical use cases. Castro and Liskov’s paper includes several optimizations, and subsequent work added more.

Batching

Instead of running a three-phase protocol for every client request, the primary batches multiple requests into a single PRE-PREPARE. This amortizes the O(n^2) message overhead across many requests.

function batch_requests():
    // Accumulate requests until batch is full or timeout expires
    while batch.size() < MAX_BATCH_SIZE and not timeout_expired():
        if has_pending_request():
            batch.add(next_request())

    if batch.size() > 0:
        seq_num = next_sequence_number()
        msg = PrePrepare{
            view:     current_view,
            seq:      seq_num,
            digest:   hash(batch),
            requests: batch
        }
        broadcast_to_all_backups(msg)

Batching is the single most important optimization. In the original PBFT paper’s benchmarks, batching improved throughput by an order of magnitude. Without it, PBFT would be too slow for any serious workload.

The tradeoff is latency: individual requests wait in the batch buffer. With a batch timeout of, say, 10ms and a maximum batch size of 100, you add up to 10ms of latency but amortize the protocol overhead 100x.
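A size-based batcher is only a few lines. This Python sketch omits the timeout path and uses an assumed MAX_BATCH_SIZE of 100, matching the example above:

```python
from collections import deque

MAX_BATCH_SIZE = 100  # assumed knob, matching the example in the text

def drain_batches(pending):
    """Greedy size-based batching. A real primary would also cut a
    batch when a timeout expires, trading latency for amortization."""
    queue, batches = deque(pending), []
    while queue:
        batch = []
        while queue and len(batch) < MAX_BATCH_SIZE:
            batch.append(queue.popleft())
        batches.append(batch)
    return batches
```

With 250 pending requests this yields batches of 100, 100, and 50, so the quadratic agreement cost is paid three times instead of 250.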

Speculative Execution

Replicas can execute requests speculatively after the prepare phase, before collecting the full commit certificate. The speculative result is sent to the client immediately. The client accepts the speculative result if it receives 3f + 1 matching speculative replies (all replicas agree).

function on_prepared(seq, digest):
    // Speculatively execute before commit
    result = execute_speculatively(get_request(seq))

    spec_reply = SpeculativeReply{
        view:      current_view,
        seq:       seq,
        client:    request.client_id,
        result:    result,
        sender:    my_id
    }
    send_to_client(spec_reply)

    // Continue with normal commit phase
    broadcast_commit(seq, digest)

If the speculative execution turns out to be wrong (due to a view change or reordering), the replica rolls back and re-executes. This optimization reduces perceived latency for the common case but requires application-level support for rollback, which is not always feasible.

Read-Only Optimization

Read-only requests can be processed without the full three-phase protocol. A replica that is up-to-date can respond directly to a read request. The client collects 2f + 1 matching replies.

This is safe because read-only requests don’t modify state, so there’s no ordering concern. The client waits for 2f + 1 matching replies to ensure at least f + 1 came from correct replicas, guaranteeing the read reflects a committed state.
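The client-side collection logic can be sketched in Python (the reply shapes are my own; a real client would also verify each reply’s signature):

```python
from collections import Counter

def read_result(replies, f):
    """Return a value backed by 2f+1 matching replies, else None.

    `replies` is a list of (replica_id, value); only the first reply
    from each replica counts, so a Byzantine node can't vote twice.
    """
    first = {}
    for replica, value in replies:
        first.setdefault(replica, value)
    if not first:
        return None
    value, votes = Counter(first.values()).most_common(1)[0]
    return value if votes >= 2 * f + 1 else None
```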

In practice, this optimization provides significant throughput gains for read-heavy workloads, which is most workloads.

Tentative Execution

A variant where replicas execute requests tentatively after collecting the prepared certificate but don’t wait for the commit certificate. They send a tentative reply to the client. The client waits for 2f + 1 matching tentative replies. If the replies match, the client treats them as final.

This reduces the effective latency from three message delays (pre-prepare, prepare, commit) to two (pre-prepare, prepare), at the cost of requiring the client to collect more replies and handle the rare case where tentative execution needs to be rolled back.

Benchmarks and Real-World Performance

Castro and Liskov’s original benchmarks (1999, running on 200 MHz Pentium Pros connected by 100 Mbps Ethernet) showed:

  • BFS (Byzantine File System) throughput: approximately 50% of the unreplicated NFS server
  • Latency: approximately 3ms for small requests in a LAN setting

More modern benchmarks (circa 2010-2020, on contemporary hardware) show PBFT achieving:

Configuration   Throughput (ops/sec)   Latency (ms)   Notes
n=4, LAN        50,000 - 100,000       1-3            With batching
n=7, LAN        30,000 - 60,000        2-5            With batching
n=13, LAN       10,000 - 25,000        5-15           Throughput drops noticeably
n=31, LAN       1,000 - 5,000          20-100         Quadratic overhead dominates
n=4, WAN        1,000 - 5,000          100-500        Network latency dominates

These numbers depend heavily on request size, batch size, signing algorithm, and network characteristics. The general pattern is clear: PBFT performs well at small scale and degrades rapidly as you add replicas.

For comparison, a Raft cluster with n=5 on similar hardware typically achieves 100,000+ ops/sec with sub-millisecond latency. The cost of Byzantine tolerance is roughly a 2-5x throughput penalty at small scale, increasing to 10x or more at larger scale. This is the cost of not trusting your peers.

The Crypto Tax

Cryptographic operations are a significant portion of PBFT’s overhead. Each message requires signature verification. Let’s look at the costs:

Operation   RSA-2048   ECDSA P-256   Ed25519    HMAC-SHA256
Sign        ~1.5 ms    ~0.3 ms       ~0.05 ms   ~0.001 ms
Verify      ~0.05 ms   ~0.3 ms       ~0.1 ms    ~0.001 ms

With n = 31 replicas, each replica receives roughly n = 31 PREPARE messages and n = 31 COMMIT messages per request, about 62 verifications in total. If using RSA signatures, that’s 62 verifications * 0.05 ms ≈ 3.1 ms just for crypto. With Ed25519, it’s ~6.2 ms (Ed25519 verify is slower than RSA verify). With MACs, it’s ~0.06 ms.
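The budget is easy to recompute for other schemes and message counts. A small Python helper using the table’s (rough) figures; the scheme names are labels of my own, not library identifiers:

```python
# (sign_ms, verify_ms) taken from the table above; treat as rough figures
COSTS = {
    "rsa2048":     (1.5,   0.05),
    "ecdsa_p256":  (0.3,   0.3),
    "ed25519":     (0.05,  0.1),
    "hmac_sha256": (0.001, 0.001),
}

def verify_budget_ms(messages_per_request, scheme):
    """Crypto time one replica spends verifying a request's messages."""
    _, verify_ms = COSTS[scheme]
    return messages_per_request * verify_ms
```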

This is why the original PBFT paper uses MACs for the prepare and commit phases and reserves full signatures for view-change messages that need transferability. It’s also why modern BFT protocols increasingly use threshold signatures (more on this in Chapter 11).

When PBFT Makes Sense

PBFT is appropriate when:

  • You have a small, known set of replicas (4-20). The quadratic overhead is manageable at this scale.
  • You genuinely face Byzantine threats. If your replicas might be compromised, buggy in non-crash ways, or operated by mutually distrusting parties, CFT protocols are insufficient.
  • You need deterministic finality. PBFT provides immediate, irrevocable finality. Once a request is committed, it’s committed. No probabilistic guarantees, no confirmation delays.
  • You’re in a permissioned setting. PBFT requires knowing the identity and public key of every replica. It doesn’t work for open, permissionless networks.

PBFT is not appropriate when:

  • You trust your replicas to be non-Byzantine. If the only failure mode is crashes, use Raft or Multi-Paxos. They’re simpler, faster, and more scalable.
  • You need more than ~20 replicas. The O(n^2) overhead makes PBFT impractical for large replica sets. Use HotStuff or a protocol with linear complexity.
  • You’re building a public blockchain. PBFT can’t handle open membership. You need Nakamoto-style consensus or a protocol designed for permissionless settings.
  • You can’t afford the latency. Three network round trips (pre-prepare, prepare, commit) plus cryptographic overhead means PBFT adds significant latency compared to CFT protocols.

The Gap Between Paper and Implementation

Let me enumerate some things the paper handles gracefully in a paragraph that take weeks to implement correctly:

  1. Message retransmission. The paper assumes reliable point-to-point channels. In practice, you need retransmission logic. But how do you retransmit efficiently without flooding the network? How do you distinguish a lost message from a delayed one from a Byzantine node claiming it never received a message?

  2. Request deduplication. Clients may retransmit requests. Replicas must detect duplicates using timestamps or nonces. The paper mentions this; implementing it correctly, especially across view changes and state transfers, is surprisingly fiddly.

  3. State transfer. When a replica falls behind (crashed and recovered, or was partitioned), it needs to catch up. The paper describes a checkpoint-based state transfer mechanism, but implementing it efficiently — transferring gigabytes of application state while the system continues processing requests — is a significant engineering challenge.

  4. Key management. Every message is signed. In a long-running system, keys need rotation. How do you rotate keys without stopping the protocol? The paper doesn’t address this because it’s an operational concern, not a protocol concern. But it’s very much your concern.

  5. Performance isolation. A single slow replica can affect the whole system because the protocol waits for 2f + 1 responses. In practice, you need timeout tuning, adaptive batching, and careful resource management to prevent stragglers from dominating latency.

  6. Testing Byzantine behavior. How do you test that your implementation actually tolerates Byzantine faults? Unit tests are insufficient — you need a test harness that can inject arbitrary Byzantine behavior: equivocation, selective message dropping, message reordering, invalid signatures, and combinations thereof. Building this test infrastructure is almost as much work as building the protocol itself.

Historical Context and Legacy

PBFT was a breakthrough. Before it, Byzantine fault tolerance was considered impractical — the province of theoreticians, not systems builders. Castro and Liskov showed that BFT could be “only” 3x slower than unreplicated execution, which was a dramatic improvement over prior BFT protocols.

PBFT directly influenced:

  • Hyperledger Fabric v0.6, which used PBFT as its consensus protocol (later replaced with a pluggable model).
  • BFT-SMaRt, a well-known Java implementation of BFT state machine replication.
  • Every subsequent BFT protocol, including HotStuff, Tendermint, SBFT, and dozens of others that define themselves in relation to PBFT.

The paper has been cited over 10,000 times, making it one of the most influential distributed systems papers ever published. And yet, if you ask anyone who has implemented PBFT from the paper, they’ll tell you it was one of the most painful engineering experiences of their career.

That’s the legacy of PBFT: a protocol that proved BFT could be practical, while also demonstrating that “practical” is a relative term. The subsequent two decades of BFT research have been, in large part, attempts to make PBFT’s ideas work better at scale. We’ll look at the most successful of those attempts next.

HotStuff and the Linear BFT Revolution

PBFT gave us practical Byzantine fault tolerance. Then it gave us O(n^2) message complexity per consensus decision and a view-change protocol so complicated that getting it right became a rite of passage in systems research. For twenty years, the BFT community tried to do better. Many protocols improved specific aspects — SBFT reduced the common-case latency, Zyzzyva optimized for the optimistic case, Tendermint simplified the structure — but the quadratic communication bottleneck and the horrifying view-change complexity persisted like a load-bearing wall that nobody could figure out how to remove.

Then, in 2018, Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham published HotStuff. The key insight was disarmingly simple in retrospect: use threshold signatures to aggregate votes, reducing per-phase communication from O(n^2) to O(n), and unify the normal case and view change into a single protocol path. The result was a BFT protocol with linear message complexity per view, a view-change protocol that is literally the same as the normal protocol, and enough elegance that Facebook (now Meta) chose it as the basis for their Libra/Diem blockchain’s consensus layer.

Of course, “disarmingly simple” means “took twenty years and some of the best minds in distributed systems to figure out.”

The Core Problem HotStuff Solves

Let’s be precise about what was wrong with PBFT’s communication pattern. In PBFT’s prepare and commit phases, every replica sends a message to every other replica. Each replica independently collects messages and determines when a quorum has been reached. This all-to-all communication pattern has two consequences:

  1. Quadratic message count. Each phase generates O(n^2) messages. With n = 100 replicas, that’s ~10,000 messages per phase, two phases of quadratic communication per decision. The network saturates.

  2. Complex view changes. Because every replica independently collects its own view of the quorum, the state that must be transferred during a view change is complex. Each replica has its own set of collected certificates, and the new leader must reconcile these. This is why PBFT’s view-change protocol is so intricate.

HotStuff’s insight: what if the leader collected all the votes and produced a single, compact proof that a quorum was reached? Instead of every replica telling every other replica “I voted,” every replica tells the leader “I voted,” and the leader combines these votes into a single quorum certificate (QC) using a threshold signature. Then the leader broadcasts this single QC to all replicas. Each phase goes from O(n^2) to O(n).

Threshold Signatures: The Enabling Technology

Before diving into HotStuff’s protocol, we need to understand threshold signatures, because they’re the cryptographic primitive that makes linear complexity possible.

A (k, n) threshold signature scheme allows any k out of n parties to collaboratively produce a signature that can be verified with a single public key. No individual party can produce a valid signature alone. The critical properties are:

  • Aggregation. Individual signature shares from different parties can be combined into a single signature of constant size, regardless of how many parties contributed.
  • Verification. The combined signature can be verified with a single public key verification, regardless of n.
  • Threshold. At least k shares are needed to produce a valid combined signature.

For BFT with n = 3f + 1, we set k = 2f + 1. A quorum certificate is a threshold signature on a message, proving that at least 2f + 1 replicas signed it.

Common threshold signature schemes include BLS (Boneh-Lynn-Shacham) signatures, which have a natural aggregation property, and threshold RSA or ECDSA schemes that require more complex setup.
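To see the threshold property itself, here is a toy Python illustration using Shamir secret sharing over a prime field. This is not a signature scheme (real threshold signatures such as BLS combine signature shares without ever reconstructing a secret key), but it shows how any k of n shares suffice while fewer do not:

```python
import random

P = 2**61 - 1  # prime modulus for the toy field

def make_shares(secret, k, n):
    """Split `secret` so that any k of the n shares reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        # Evaluate the random degree-(k-1) polynomial at x, mod P
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    total = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total
```

With k = 2f + 1 shares of a degree-2f polynomial, any quorum reconstructs the same value; 2f or fewer shares reveal nothing about it.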

The cost is non-trivial: BLS signature verification is significantly more expensive than standard signature verification (pairing operations on elliptic curves), and the DKG (distributed key generation) setup requires its own protocol. HotStuff’s authors acknowledged this tradeoff — you’re trading communication complexity for cryptographic computation. Whether this is a net win depends on your deployment: in a WAN setting where network latency dominates, it almost always is. In a LAN setting with fast networks and many small messages, the crypto overhead might hurt.

The HotStuff Protocol

HotStuff operates in a sequence of views, each with a designated leader. The protocol proceeds in three phases: prepare, pre-commit, and commit, followed by a decide step. Each phase follows the same star-shaped communication pattern: replicas send votes to the leader, the leader aggregates them into a QC, and broadcasts the QC with the next proposal.

The Crucial Abstraction: Generic HotStuff

Before presenting the specific phases, let’s look at the generic framework. HotStuff’s elegance comes from recognizing that each phase has the same structure:

// Generic phase at the leader
function generic_phase(proposal, phase_name):
    // Step 1: Leader broadcasts proposal (with QC from previous phase)
    broadcast(Proposal{
        phase:    phase_name,
        node:     proposal,
        qc:       highest_qc,
        view:     current_view,
        leader:   my_id
    })

    // Step 2: Collect votes from replicas
    votes = {}
    while count(votes) < 2 * f + 1:
        vote = receive_vote()
        if verify_vote(vote) and vote.view == current_view:
            votes[vote.sender] = vote.partial_sig

    // Step 3: Aggregate into QC
    qc = threshold_combine(votes)
    return qc

// Generic phase at a replica
function generic_vote(proposal):
    // Verify the proposal
    if not verify_proposal(proposal):
        return

    if not safe_to_vote(proposal):
        return

    // Send vote (partial threshold signature) to leader
    partial_sig = threshold_sign(proposal.node)
    send_to_leader(Vote{
        view:        current_view,
        node:        proposal.node,
        partial_sig: partial_sig,
        sender:      my_id
    })

Every phase is: leader proposes with a QC, replicas vote, leader aggregates into a new QC. Three iterations of this pattern, and you have consensus. Compare this with PBFT’s three distinct phase structures with different message formats, different quorum rules, and a completely different view-change protocol.
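One round of this pattern can be simulated in Python, with a QC modeled as the set of signer ids — a stand-in for the constant-size threshold signature a real implementation would produce:

```python
F = 1
QUORUM = 2 * F + 1  # for n = 3f + 1 = 4 replicas

def form_qc(node, votes):
    """Stand-in for threshold_combine: a QC is valid iff it carries
    votes on `node` from at least 2f+1 distinct replicas."""
    signers = {sender for sender, voted_node in votes if voted_node == node}
    if len(signers) >= QUORUM:
        return {"node": node, "signers": frozenset(signers)}
    return None
```

With votes from replicas 0, 1, and 2 on proposal "b1" (replica 3 silent), the leader forms one O(1)-size certificate to broadcast; two votes, or duplicate votes from one replica, do not suffice.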

The Three Phases in Detail

Now let’s be specific about what each phase accomplishes and why three phases are necessary.

// ============ PREPARE PHASE ============
// Leader proposes a new block/command
function leader_prepare():
    // Create proposal extending the highest known QC
    proposal = create_node(
        parent:   highest_qc.node,
        command:  next_client_command(),
        qc:       highest_qc    // "justify" this proposal
    )

    msg = PrepareMsg{
        view:     current_view,
        node:     proposal,
        justify:  highest_qc
    }
    broadcast(msg)

function replica_on_prepare(msg):
    // Safety check: does this proposal extend from a safe branch?
    if not safe_node(msg.node, msg.justify):
        return  // Refuse to vote

    // Liveness check: is this from the current view's leader?
    if msg.view != current_view:
        return

    if leader_of(msg.view) != msg.sender:
        return

    // Vote
    partial_sig = threshold_sign(msg.node)
    send_to_leader(Vote{PREPARE, msg.view, msg.node, partial_sig})

// ============ PRE-COMMIT PHASE ============
// Leader has prepareQC, broadcasts it
function leader_pre_commit(prepareQC):
    msg = PreCommitMsg{
        view:     current_view,
        justify:  prepareQC    // Proof that 2f+1 voted to prepare
    }
    broadcast(msg)

function replica_on_pre_commit(msg):
    // Verify the QC
    if not verify_qc(msg.justify):
        return

    // Update locked QC — this is the safety-critical step
    // pre_commit_qc locks the replica on this branch
    locked_qc = msg.justify  // Lock on the prepareQC

    partial_sig = threshold_sign(msg.justify.node)
    send_to_leader(Vote{PRE_COMMIT, msg.view, msg.justify.node, partial_sig})

// ============ COMMIT PHASE ============
// Leader has precommitQC, broadcasts it
function leader_commit(precommitQC):
    msg = CommitMsg{
        view:     current_view,
        justify:  precommitQC
    }
    broadcast(msg)

function replica_on_commit(msg):
    if not verify_qc(msg.justify):
        return

    partial_sig = threshold_sign(msg.justify.node)
    send_to_leader(Vote{COMMIT, msg.view, msg.justify.node, partial_sig})

// ============ DECIDE ============
// Leader has commitQC — consensus reached!
function leader_decide(commitQC):
    msg = DecideMsg{
        view:     current_view,
        justify:  commitQC
    }
    broadcast(msg)

function replica_on_decide(msg):
    if not verify_qc(msg.justify):
        return

    // Execute the committed command
    execute(msg.justify.node.command)
    // Respond to client
    send_reply_to_client(msg.justify.node)

Message Flow: Normal Case

With n = 4 (f = 1):

Client      Leader       Replica 1    Replica 2    Replica 3
  |            |             |             |             |
  |--REQUEST-->|             |             |             |
  |            |             |             |             |
  |            |--PREPARE--->|             |             |
  |            |--PREPARE----------------->|             |
  |            |--PREPARE------------------------------->|
  |            |  (proposal + highest QC)                |
  |            |<--vote------|             |             |
  |            |<--vote--------------------|             |
  |            |<--vote----------------------------------|
  |            |             |             |             |
  |            | (aggregate into prepareQC)              |
  |            |             |             |             |
  |            |--PRE-COMMIT>|             |             |
  |            |--PRE-COMMIT-------------->|             |
  |            |--PRE-COMMIT---------------------------->|
  |            |  (carrying prepareQC)                   |
  |            |<--vote------|             |             |
  |            |<--vote--------------------|             |
  |            |<--vote----------------------------------|
  |            |             |             |             |
  |            | (aggregate into precommitQC)            |
  |            |             |             |             |
  |            |--COMMIT---->|             |             |
  |            |--COMMIT------------------>|             |
  |            |--COMMIT-------------------------------->|
  |            |  (carrying precommitQC)                 |
  |            |<--vote------|             |             |
  |            |<--vote--------------------|             |
  |            |<--vote----------------------------------|
  |            |             |             |             |
  |            | (aggregate into commitQC)               |
  |            |             |             |             |
  |            |--DECIDE---->|             |             |
  |            |--DECIDE------------------>|             |
  |            |--DECIDE-------------------------------->|
  |            |  (carrying commitQC)                    |
  |            |             |             |             |
  |<--REPLY----|             |             |             |

Total messages per decision: 3 * (n + n) + n = 7n. For n = 4, that’s 28 messages. Compare with PBFT’s ~32 messages for n = 4. The savings aren’t dramatic at small n. But at n = 100: HotStuff sends ~700 messages versus PBFT’s ~20,000. That’s where linear versus quadratic matters.
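The crossover is easy to check numerically, using the same counting as above:

```python
def hotstuff_messages(n):
    """3 phases of (leader broadcast n + replica votes n), plus DECIDE."""
    return 3 * (n + n) + n

def pbft_messages_approx(n):
    """PBFT's dominant prepare + commit all-to-all traffic, ~2n^2."""
    return 2 * n * n
```

For n = 4: 28 versus 32. For n = 100: 700 versus 20,000.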

The Safety Rule: When Is It Safe to Vote?

The safe_node function is the heart of HotStuff’s safety argument. A replica votes for a proposal if:

function safe_node(node, justify_qc):
    // Safety condition: the proposal must either extend the branch
    // we're locked on, OR the justify QC is from a higher view
    // than our locked QC (proving the system has moved on)

    extends_locked = is_ancestor(locked_qc.node, node)
    higher_qc = justify_qc.view > locked_qc.view

    return extends_locked or higher_qc

This is the locking mechanism. During the pre-commit phase, a replica “locks” on the prepareQC. It will only vote for future proposals that either:

  1. Extend the locked branch — the new proposal builds on the locked node, so there’s no conflict.
  2. Have a higher QC — someone else got a more recent quorum certificate, proving the system has progressed past what we locked on, so it’s safe to unlock.

This two-part rule is what allows HotStuff to be both safe and live. The lock prevents conflicting commits (safety), and the ability to unlock with a higher QC prevents deadlock (liveness).
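Here is a Python sketch of the rule over a toy block tree (a parent-pointer dict; the QC shapes are stand-ins of my own):

```python
def is_ancestor(parents, ancestor, node):
    """Walk parent pointers from `node` back toward genesis."""
    while node is not None:
        if node == ancestor:
            return True
        node = parents[node]
    return False

def safe_node(parents, locked_qc, node, justify_qc):
    extends_locked = is_ancestor(parents, locked_qc["node"], node)
    higher_qc = justify_qc["view"] > locked_qc["view"]
    return extends_locked or higher_qc
```

With a replica locked on block "a" at view 2: a child of "a" is safe; a conflicting branch is refused; the same conflicting branch becomes safe once justified by a view-3 QC.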

View Changes: The Beautiful Part

Here’s where HotStuff earns its elegance. In PBFT, the view-change protocol is a completely separate, complex protocol with its own message types, its own quorum logic, and its own verification procedures. In HotStuff, the view change is… the same protocol.

function on_view_timeout():
    // View change triggered by timeout
    // Send our highest QC to the new leader
    new_leader = leader_of(current_view + 1)

    msg = NewView{
        view:       current_view + 1,
        highest_qc: my_highest_qc,
        sender:     my_id
    }
    send(new_leader, msg)
    current_view += 1

// New leader collects NEW-VIEW messages
function leader_on_new_view():
    new_view_msgs = {}
    while count(new_view_msgs) < 2 * f + 1:
        msg = receive_new_view()
        if verify(msg) and msg.view == current_view:
            new_view_msgs[msg.sender] = msg

    // Pick the highest QC from the collected messages
    highest_qc = max(msg.highest_qc for msg in new_view_msgs.values(),
                     key=lambda qc: qc.view)

    // Now just run the normal protocol, extending from highest_qc
    // This IS the view change. That's it.
    leader_prepare()  // Uses highest_qc as the justify

That’s it. The view change consists of: (1) replicas send their highest QC to the new leader, (2) the new leader picks the highest one, and (3) the new leader runs the normal prepare phase using that QC.

Compare this with PBFT’s view change, which requires assembling prepared certificates from 2f + 1 replicas, computing the O set of re-proposals for every sequence number in the watermark window, broadcasting the entire set for verification, and having every backup independently recompute the O set to verify the new leader isn’t cheating.

The reason HotStuff can get away with this simpler view change is the three-phase structure with the locking mechanism. The locked QC carried in the NEW-VIEW messages provides enough information for the new leader to determine the safe point to extend from. The safe_node rule at replicas ensures they won’t vote for anything that conflicts with a committed decision.

This unification of normal case and view change is, in my opinion, HotStuff’s most important contribution. It doesn’t just reduce complexity — it eliminates an entire class of bugs. Every PBFT implementation I’ve seen has had view-change bugs that didn’t exist in the normal-case code, because the view-change code was tested less and was fundamentally more complex. With HotStuff, if the normal case works, the view change works.

Chained HotStuff: Pipelining Consensus

Basic HotStuff requires three phases (plus decide) for each consensus decision. Chained HotStuff observes that these phases are independent across different proposals and can be pipelined.

The key insight: instead of running prepare/pre-commit/commit sequentially for one proposal and then starting the next, each new proposal effectively advances the previous proposals through their phases.

// Chained HotStuff: each proposal serves double duty
function leader_propose_chained():
    // Create a new proposal
    node = create_node(
        parent:  highest_qc.node,
        command: next_command(),
        qc:      highest_qc
    )

    broadcast(Proposal{view: current_view, node: node})

    // This proposal simultaneously:
    // - Starts PREPARE for the new command
    // - Acts as PRE-COMMIT for the parent (1 phase ago)
    // - Acts as COMMIT for the grandparent (2 phases ago)
    // - Triggers DECIDE for the great-grandparent (3 phases ago)

function replica_on_proposal_chained(msg):
    if not safe_node(msg.node, msg.node.qc):
        return

    b_star = msg.node            // Current proposal
    b_double = b_star.qc.node    // Parent (1-chain)
    b_single = b_double.qc.node  // Grandparent (2-chain)
    b = b_single.qc.node         // Great-grandparent (3-chain)

    // Lock on the 2-chain: b_double's QC certifies b_single
    if b_double.qc.view > locked_qc.view:
        locked_qc = b_double.qc

    // Three-chain commit rule: if b, b_single, b_double form a
    // contiguous chain (each the direct parent of the next),
    // the great-grandparent b is committed
    if parent(b_double) == b_single and parent(b_single) == b:
        execute_up_to(b)

    // Vote for the current proposal
    partial_sig = threshold_sign(msg.node)
    send_to_leader(Vote{msg.view, msg.node, partial_sig})

Pipelining Visualization

View 1:   Leader proposes cmd1
           |
View 2:   Leader proposes cmd2 (carries QC for cmd1)
           cmd1 is now PREPARED (1-chain)
           |
View 3:   Leader proposes cmd3 (carries QC for cmd2)
           cmd2 is now PREPARED (1-chain)
           cmd1 is now PRE-COMMITTED (2-chain)
           |
View 4:   Leader proposes cmd4 (carries QC for cmd3)
           cmd3 is now PREPARED (1-chain)
           cmd2 is now PRE-COMMITTED (2-chain)
           cmd1 is now COMMITTED (3-chain) → execute cmd1!
           |
View 5:   Leader proposes cmd5 (carries QC for cmd4)
           cmd4 is PREPARED
           cmd3 is PRE-COMMITTED
           cmd2 is COMMITTED → execute cmd2!

In steady state, Chained HotStuff commits one command per view with only one round of communication per view (leader broadcasts, replicas respond). The latency to commit a specific command is still three views, but the throughput is one command per view.
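The view-number arithmetic behind the diagram can be sketched as a toy model (illustrative names, one proposal per view, commit at the 3-chain):

```python
# Toy model of the Chained HotStuff pipeline: the command proposed in
# view v commits when the view v+3 proposal completes its three-chain.
def commit_schedule(num_views):
    commits = []  # (view_at_commit, command) pairs
    for v in range(1, num_views + 1):
        committed = v - 3  # the command proposed three views earlier
        if committed >= 1:
            commits.append((v, f"cmd{committed}"))
    return commits

print(commit_schedule(5))
# [(4, 'cmd1'), (5, 'cmd2')] — one commit per view in steady state,
# but each command still waits three views
```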

This pipelining is elegant, but there’s a catch the paper mentions briefly: if a leader is faulty and doesn’t produce a valid QC, the pipeline stalls. A view change means the current view’s proposal doesn’t get a QC, which means the previous proposals don’t advance through their phases. After three consecutive leader failures, you’re three views behind on commits. In a network with Byzantine participants actively trying to disrupt progress, this can significantly impact throughput. The pacemaker (discussed below) is supposed to handle this, but doing so efficiently is harder than it sounds.

The Pacemaker: Liveness Without Synchrony Assumptions

HotStuff’s protocol provides safety regardless of timing. But liveness — actually making progress — requires some form of synchronization to ensure enough replicas are in the same view at the same time, talking to the same leader.

The pacemaker is the component responsible for this. It’s explicitly separated from the safety protocol, which is a clean design choice. The paper specifies properties the pacemaker must satisfy but leaves the implementation somewhat open. Here’s one common approach:

// Pacemaker: ensures replicas eventually synchronize on the same view
function pacemaker():
    // Start a timer for the current view
    timer = start_timer(timeout_for_view(current_view))

    while true:
        if received_valid_proposal(current_view):
            // Good — leader is alive, reset timer
            reset_timer(timer)

        if timer_expired():
            // Leader seems faulty, initiate view change
            broadcast_timeout_certificate()
            advance_view()

function timeout_for_view(v):
    // Exponential backoff to handle cascading failures;
    // consecutive_timeouts is a counter reset on each successful commit
    return BASE_TIMEOUT * 2^(consecutive_timeouts)

function advance_view():
    // Collect timeout certificates from 2f+1 replicas
    // proving the view should change
    tc = collect_timeout_certificates(current_view)

    if count(tc) >= 2 * f + 1:
        current_view += 1
        consecutive_timeouts += 1
        // Send NEW-VIEW to next leader with highest QC
        send_new_view(leader_of(current_view))

function on_successful_commit():
    // Reset backoff on successful progress
    consecutive_timeouts = 0

The pacemaker is where the “eventual synchrony” assumption lives. In a purely asynchronous network, the pacemaker might never synchronize replicas, and the system might never make progress. The assumption is that eventually, the network becomes synchronous enough for the pacemaker to align replicas on the same view with a correct leader.
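A minimal sketch of the backoff arithmetic, assuming an invented BASE_TIMEOUT of 500ms:

```python
BASE_TIMEOUT = 0.5  # seconds; illustrative value only

def timeout_for_view(consecutive_timeouts):
    # Doubles after each failed view, as in the pacemaker sketch above
    return BASE_TIMEOUT * 2 ** consecutive_timeouts

# Three consecutive leader failures stretch the timeout...
timeouts = [timeout_for_view(k) for k in range(4)]
print(timeouts)                       # [0.5, 1.0, 2.0, 4.0]

# ...and a successful commit resets the backoff counter
consecutive = 0
print(timeout_for_view(consecutive))  # 0.5
```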

The Responsiveness Property

HotStuff has a property called optimistic responsiveness: in the normal case (correct leader, no faults), the protocol proceeds at the speed of the network, not at the speed of a predetermined timeout. The leader waits for 2f + 1 votes and immediately proceeds — it doesn’t wait for a timer to expire.

This matters in practice because networks have variable latency. A protocol that proceeds at “actual network speed” will outperform one that waits for a conservative timeout. PBFT also has this property in its normal case, but PBFT’s view-change protocol does not — it relies on timeouts to detect a faulty primary, and during the view change, progress is gated by timeout expiry.

HotStuff’s unified view change inherits responsiveness: the new leader waits for 2f + 1 NEW-VIEW messages and immediately proceeds. No timeout-gated phases during recovery. This means recovery from a faulty leader is as fast as the network allows, not as slow as the most conservative timeout.

PBFT vs HotStuff: Direct Comparison

Let’s compare them head-to-head.

Message Complexity

| Metric | PBFT | HotStuff |
|---|---|---|
| Messages per phase (normal) | O(n^2) | O(n) |
| Phases per decision | 2 quadratic + 1 linear | 3 linear |
| Total messages per decision | O(n^2) | O(n) |
| View change messages | O(n^3) worst case | O(n) |
| Authenticator complexity | O(n^2) MACs or O(n) signatures | O(n) partial sigs + O(1) threshold sig |

Latency

| Metric | PBFT | HotStuff |
|---|---|---|
| Network round trips (normal) | 3 (pre-prepare, prepare, commit) | 3 (prepare, pre-commit, commit) + decide |
| Network round trips (with pipelining) | N/A in base protocol | 1 per decision (Chained HotStuff, steady state) |
| View change round trips | 2+ (VIEW-CHANGE, NEW-VIEW) | 1 (NEW-VIEW, then normal protocol) |
| Crypto operations per replica per decision | O(n) MAC or sig verifications | O(n) partial sig verifications + threshold combine |

Throughput

| Replicas (n) | PBFT (ops/sec, est.) | HotStuff (ops/sec, est.) | Ratio |
|---|---|---|---|
| 4 | 80,000 | 60,000 | 0.75x |
| 16 | 20,000 | 40,000 | 2.0x |
| 64 | 2,000 | 25,000 | 12.5x |
| 128 | < 500 | 15,000 | 30x+ |

Note: these numbers are approximate and depend heavily on implementation quality, hardware, network conditions, batch size, and signature scheme. The point is the trend: at small n, PBFT’s simpler crypto can win. At larger n, HotStuff’s linear communication dominates.
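One way to see why the crossover happens is a crude cost model: per-decision time is messages times per-message handling cost, plus per-replica crypto. Every constant below is invented for illustration; only the shape of the result matters.

```python
# Crude per-decision latency model (all constants invented):
# PBFT handles ~2n^2 messages with cheap MAC verification;
# HotStuff handles ~7n messages with expensive partial-sig verification.
MSG_COST = 0.01   # ms to handle one message
MAC_COST = 0.05   # ms per MAC verification
SIG_COST = 1.0    # ms per partial-signature verification

def pbft_time(n):
    return 2 * n * n * MSG_COST + n * MAC_COST

def hotstuff_time(n):
    return 7 * n * MSG_COST + n * SIG_COST

for n in (4, 16, 64, 128):
    print(n, round(pbft_time(n), 2), round(hotstuff_time(n), 2))
# PBFT wins at small n (cheap crypto); the quadratic message term
# eventually swamps it and HotStuff pulls ahead
```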

Complexity of Implementation

| Aspect | PBFT | HotStuff |
|---|---|---|
| Normal case protocol | Moderate | Simple |
| View change protocol | Very complex | Same as normal case |
| Cryptographic setup | Standard PKI | Threshold key setup (DKG) |
| State management | Complex watermark/checkpoint | Simpler chain-based |
| Lines of code (typical impl) | 5,000 - 15,000 | 3,000 - 8,000 |
| View change bugs in practice | Common | Rare (it’s the same code) |

The Tradeoff

HotStuff is not strictly better than PBFT. The tradeoffs:

  1. Crypto overhead. Threshold signatures (especially BLS) are computationally expensive. A BLS pairing operation takes ~1-2ms, compared to ~0.05ms for an Ed25519 verification. For small n where network overhead isn’t the bottleneck, PBFT with MACs can be faster.

  2. Leader bottleneck. In HotStuff, the leader processes all votes and aggregates them. It’s a star topology, and the leader does more work than any other replica. PBFT’s all-to-all communication distributes the work more evenly. A Byzantine leader in HotStuff can selectively delay aggregation to degrade performance, and replicas can’t easily detect this until the timeout fires.

  3. Latency. Both protocols have three-phase latency in the normal case. HotStuff’s decide step adds a fourth message delay compared to PBFT (which doesn’t have an explicit decide broadcast — replicas commit independently after collecting the commit certificate). Chained HotStuff amortizes this with pipelining, but individual request latency is still three views.

  4. DKG requirement. Setting up threshold signatures requires a distributed key generation protocol, which is itself a multi-round protocol that can be disrupted by Byzantine participants. PBFT just needs a standard PKI. This makes HotStuff harder to bootstrap and harder to handle key rotation.

LibraBFT / DiemBFT: HotStuff in Production

Facebook’s Libra project (later renamed Diem, later shut down) chose HotStuff as the basis for their consensus protocol. LibraBFT (later DiemBFT) made several practical modifications:

Key Modifications from Base HotStuff

  1. Explicit timeout certificates. DiemBFT added timeout certificates (TCs) as first-class objects. When 2f + 1 replicas timeout, they form a TC that proves the view should change. This gives a concrete mechanism for the pacemaker.

  2. Two-chain commit rule. DiemBFT v4 modified the commit rule to require only a two-chain (two consecutive QCs) instead of HotStuff’s three-chain. This reduces commit latency from 3 round trips to 2 at the cost of a more complex safety argument. The trick involves using the timeout certificates to prove safety — if a view times out, the TC provides evidence that allows safe unlocking.

  3. Decoupled execution. DiemBFT separates consensus (ordering) from execution. Blocks are ordered by consensus but executed asynchronously. This allows consensus to proceed at network speed while execution happens in the background.

  4. Reputation-based leader selection. Instead of round-robin leader rotation, DiemBFT uses a reputation mechanism: leaders who produce blocks and respond promptly get selected more often. Leaders who fail to produce blocks get deprioritized. This helps the pacemaker converge on good leaders faster.

// DiemBFT leader selection with reputation
function select_leader(view):
    // Build reputation scores from recent history
    scores = {}
    for replica in all_replicas:
        scores[replica] = base_score

        // Reward for producing blocks
        blocks_produced = count_blocks_by(replica, recent_window)
        scores[replica] += blocks_produced * PRODUCE_WEIGHT

        // Penalize for timeouts (failed to lead)
        timeouts_caused = count_timeouts_by(replica, recent_window)
        scores[replica] -= timeouts_caused * TIMEOUT_PENALTY

    // Deterministic selection based on scores and view number
    // (all replicas compute the same result)
    sorted_replicas = sort_by_score(scores)
    return sorted_replicas[view % len(sorted_replicas)]

Performance Results from DiemBFT

The DiemBFT team published benchmarks showing:

| Configuration | Throughput (TPS) | Latency (ms) | Network |
|---|---|---|---|
| n = 4, LAN | 160,000 | < 1 | 10 Gbps |
| n = 33, LAN | 80,000 | 2-5 | 10 Gbps |
| n = 100, LAN | 30,000 | 10-20 | 10 Gbps |
| n = 10, WAN | 5,000 | 300-500 | Global |

These numbers include batching and pipelining. Without batching, throughput drops by 10-50x, similar to PBFT.

The Diem project was shut down in 2022 for regulatory rather than technical reasons. The HotStuff-based consensus code lives on in the Aptos blockchain (which hired many of the Diem engineers) and in various other projects that adopted or adapted the protocol.

Why Three Phases and Not Two?

A question that comes up frequently: PBFT has three phases, HotStuff has three phases, but CFT protocols like Raft get by with essentially two phases (leader proposes, followers accept). Why does BFT always seem to need three?

The answer is the commit-availability dilemma. In BFT, there’s a fundamental tension:

  • Safety requires that once a value is committed, no conflicting value can ever be committed, even across view changes.
  • Liveness requires that after a view change, the new leader can propose a new value if the old leader’s proposal didn’t complete.

With two phases, you can have safety or liveness across view changes, but not both. Here’s the intuitive argument:

After the first phase (prepare), a quorum has voted for a value. After the second phase (commit), a quorum has confirmed they know about the first quorum.

  • If you commit after one quorum (two phases), the new leader after a view change might not know about the commitment (because only the committing replicas know, and they might not be in the new leader’s quorum). You can fix this by requiring the new leader to learn about it — but then the new leader is blocked until it hears from enough replicas, which kills liveness.

  • The third phase (which creates a QC certifying the second-phase QC) ensures that enough replicas are “locked” on the committed value that any quorum the new leader contacts will contain at least one locked replica. This locked replica will inform the new leader, who can then safely re-propose the committed value. Without the third phase, the lock isn’t strong enough.

Some protocols (including DiemBFT v4) achieve two-phase commits by exploiting additional information (like timeout certificates), but the fundamental tension remains and the safety argument becomes more subtle.
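The quorum-intersection arithmetic underlying all of these arguments is worth checking once: with n = 3f + 1 and quorums of 2f + 1, any two quorums share at least f + 1 replicas, so their intersection always contains at least one honest replica.

```python
# Minimum overlap of two quorums of size q drawn from n replicas.
# For n = 3f + 1 and q = 2f + 1 this is f + 1: strictly more than
# the f Byzantine replicas, so at least one honest replica is in
# every pairwise intersection.
def min_intersection(n, quorum):
    return 2 * quorum - n

for f in (1, 2, 10):
    n, q = 3 * f + 1, 2 * f + 1
    assert min_intersection(n, q) == f + 1
    print(f"f={f}: any two quorums share >= {f + 1} replicas > f={f} faulty")
```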

Criticisms and Limitations

HotStuff is a significant advance, but it’s not without limitations:

  1. Leader centrality. Every message goes through the leader. The leader is a single point of performance — if it’s slow, the whole system is slow. A Byzantine leader can selectively censor transactions by not including them in proposals. Detection is possible but delayed.

  2. Threshold signature setup. DKG is complex and requires its own fault tolerance. If the DKG is compromised, the threshold signature scheme fails, and HotStuff’s aggregation doesn’t work. This is a bootstrapping problem that the paper waves a hand at.

  3. Chaining rigidity. In Chained HotStuff, a leader failure doesn’t just lose one view’s proposal — it stalls the pipeline for the proposals that were in earlier phases. Three consecutive leader failures mean the pipeline is empty and three views of proposals are lost. Recovery involves filling the pipeline again, adding latency.

  4. Still O(n) per view. Linear is better than quadratic, but for very large n (thousands of nodes), O(n) per view is still significant. Some newer protocols aim for sub-linear communication using sampling or committee-based approaches, though they introduce additional assumptions.

  5. Practical crypto challenges. BLS signatures on common curves (BLS12-381) have verification times around 1-2ms. With batching, you can use aggregate verification (~2ms for verifying n signatures at once), but the leader’s aggregation step becomes a bottleneck. EdDSA-based threshold schemes are faster but less mature.

Where HotStuff Fits

HotStuff is the right choice when:

  • You need BFT with more than ~20 replicas.
  • You can invest in threshold signature infrastructure.
  • View-change correctness is a priority (it should always be, but here it comes for free).
  • You’re building a blockchain or permissioned network with known validators.

HotStuff is less ideal when:

  • You have fewer than 10 replicas and network bandwidth isn’t a concern. PBFT may be simpler to deploy (no DKG).
  • You need leaderless operation. HotStuff is inherently leader-based.
  • Threshold signature infrastructure is unavailable or too expensive to set up.
  • You need sub-second latency in a WAN setting. Three round trips across continents add up.

The protocol’s lasting contribution isn’t just the linear complexity — it’s the demonstration that normal-case operation and view changes can be unified into a single, clean protocol structure. Every BFT protocol designed after HotStuff has to explain why it’s not just using HotStuff’s framework. That’s the mark of a good idea.

Tendermint: BFT Meets the Real World

There’s a particular class of distributed systems protocol that lives primarily in academic papers, and another class that lives primarily in production. Tendermint has the unusual distinction of living in both, and the scars to prove it. Originally designed by Jae Kwon in 2014 and subsequently developed by Ethan Buchman, Tendermint (now rebranded as CometBFT) took the ideas from PBFT and DLS (the Dwork-Lynch-Stockmeyer partial synchrony framework) and forged them into something that actually runs in production across hundreds of blockchains in the Cosmos ecosystem.

The result is a protocol that makes pragmatic engineering decisions the academic papers never had to confront: What happens when your validator set changes every hour? How do you handle a state machine that takes 500ms to execute a block? What do light clients need to verify consensus without running the full protocol? These are the questions that separate a protocol from a system.

Tendermint’s consensus is often described as “PBFT-like,” which is accurate in the way that saying a house is “blueprint-like.” The general shape is there — three-phase BFT with 2f + 1 quorums out of 3f + 1 validators — but the engineering decisions, the locking mechanism, the round structure, and the integration with application logic are distinctly Tendermint’s own.

The Tendermint Consensus Protocol

Tendermint consensus operates in heights (block numbers) and rounds (attempts within a height). Each height produces one block. If the first round fails (proposer is faulty, network is slow), the protocol moves to the next round with a different proposer. Within each round, there are three steps: Propose, Prevote, and Precommit.

The protocol tolerates up to f Byzantine validators out of n = 3f + 1 total. More precisely, it requires that the total voting power of Byzantine validators be less than 1/3 of the total voting power (Tendermint uses weighted voting, so what matters is voting power, not validator count).
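A hypothetical sketch of what the weighted 2/3+ check looks like (names invented; real implementations track far more state):

```python
# Tendermint counts voting power, not heads. A 2/3+ check over
# weighted votes; `votes` maps validator -> block it voted for,
# `powers` maps validator -> voting power.
def has_two_thirds(votes, powers, block_id):
    total = sum(powers.values())
    in_favor = sum(powers[v] for v, b in votes.items() if b == block_id)
    return 3 * in_favor > 2 * total   # strictly more than 2/3 of power

powers = {"a": 10, "b": 10, "c": 10, "d": 70}
votes = {"a": "B1", "b": "B1", "c": "B1"}   # 3 of 4 validators...
print(has_two_thirds(votes, powers, "B1"))  # False: only 30% of power
```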

The Three Steps

// ============ HEIGHT h, ROUND r ============

// Step 1: PROPOSE
// The designated proposer for round r broadcasts a proposal
function proposer_step(h, r):
    if i_am_proposer(h, r):
        if has_valid_locked_block():
            // We locked on a block in a previous round — must re-propose it
            block = locked_block
            pol_round = locked_round  // Proof-of-lock round
        else:
            // Create a new block
            block = create_block(h)
            pol_round = -1

        broadcast(Proposal{
            height:    h,
            round:     r,
            block:     block,
            pol_round: pol_round  // -1 if new, or the round we locked
        })

    // Start timeout for proposal
    start_timer(TIMEOUT_PROPOSE + r * TIMEOUT_DELTA)
// Step 2: PREVOTE
// Upon receiving a proposal (or timing out), each validator prevotes
function prevote_step(h, r):
    proposal = get_proposal(h, r)

    if proposal == nil:
        // No proposal received — prevote nil
        broadcast(Prevote{h, r, nil})
        return

    block = proposal.block

    // Validate the block
    if not is_valid_block(block):
        broadcast(Prevote{h, r, nil})
        return

    // Safety check: locking rules
    if locked_round != -1 and locked_block != block:
        // We're locked on a different block
        if proposal.pol_round < locked_round:
            // Proposer's proof-of-lock is from before our lock
            // Don't vote for this — it might conflict
            broadcast(Prevote{h, r, nil})
            return
        else:
            // Proposer has a POL from a round >= our locked round
            // We need to see 2/3+ prevotes from that round for this block
            if has_polka(h, proposal.pol_round, block):
                // The evidence checks out — safe to prevote
                broadcast(Prevote{h, r, block_id(block)})
            else:
                broadcast(Prevote{h, r, nil})
            return

    // Not locked, or locked on the same block — vote for it
    broadcast(Prevote{h, r, block_id(block)})
// Step 3: PRECOMMIT
// Upon collecting 2/3+ prevotes, each validator precommits
function precommit_step(h, r):
    // Wait for 2/3+ prevotes
    prevotes = collect_prevotes(h, r)

    if has_two_thirds_prevotes_for(prevotes, some_block_id):
        // A "polka" — 2/3+ prevoted for this block
        block = get_block(some_block_id)

        // LOCK on this block
        locked_block = block
        locked_round = r

        broadcast(Precommit{h, r, block_id(block)})

    else if has_two_thirds_prevotes_for(prevotes, nil):
        // 2/3+ prevoted nil — no block this round
        // Unlock (unless we have a stronger lock from a later round,
        // but this is the latest round, so unlock)
        locked_block = nil
        locked_round = -1

        broadcast(Precommit{h, r, nil})

    else:
        // Neither — timeout expired without 2/3+ for anything
        broadcast(Precommit{h, r, nil})
// COMMIT DECISION
// Upon collecting 2/3+ precommits for a block
function commit_decision(h, r):
    precommits = collect_precommits(h, r)

    if has_two_thirds_precommits_for(precommits, some_block_id):
        block = get_block(some_block_id)

        // Commit the block!
        commit_block(h, block)

        // Save the commit (the set of 2/3+ precommit signatures)
        // This becomes the "commit" that light clients can verify
        save_commit(h, precommits)

        // Move to next height
        start_height(h + 1)

    else if has_two_thirds_precommits_for(precommits, nil):
        // Round failed — try next round
        start_round(h, r + 1)

    else:
        // Timeout without 2/3+ for anything
        start_round(h, r + 1)

Message Flow: Normal Case (Happy Path)

Proposer      Validator 1    Validator 2    Validator 3
   |               |              |              |
   |--PROPOSAL---->|              |              |
   |--PROPOSAL-------------------->|              |
   |--PROPOSAL------------------------------------>|
   |               |              |              |
   | (each validator validates block, checks lock) |
   |               |              |              |
   |<--PREVOTE-----|              |              |
   |<--PREVOTE-----|              |              |
   |   PREVOTE---->|<--PREVOTE----|              |
   |   PREVOTE---->|   PREVOTE--->|<--PREVOTE----|
   |   PREVOTE-------------------->|   PREVOTE-->|
   |               |              |              |
   | (all-to-all: each validator sends prevote   |
   |  to every other validator — O(n²)!)         |
   |               |              |              |
   | (each independently observes 2/3+ prevotes  |
   |  for the block — a "polka")                 |
   |               |              |              |
   | (each validator LOCKS on the block)         |
   |               |              |              |
   |<--PRECOMMIT---|              |              |
   |   PRECOMMIT-->|<--PRECOMMIT--|              |
   |   PRECOMMIT-->|   PRECOMMIT->|<--PRECOMMIT-|
   |   PRECOMMIT-->|   PRECOMMIT->|   PRECOMMIT>|
   |               |              |              |
   | (all-to-all again — O(n²))                  |
   |               |              |              |
   | (each independently observes 2/3+ precommits|
   |  → COMMIT the block, advance height)        |

Note that Tendermint’s prevote and precommit phases use all-to-all communication, just like PBFT’s prepare and commit phases. The message complexity is O(n^2) per round. Tendermint did not adopt HotStuff’s threshold-signature-based linear complexity — a deliberate engineering choice we’ll discuss later.

Why Two Voting Rounds?

Tendermint’s two voting rounds (prevote and precommit) serve the same purpose as PBFT’s prepare and commit:

  1. Prevote (like PBFT prepare): Establishes that 2/3+ validators agree on a specific block for this round. This creates a “polka” — proof that the network has converged on a block. The polka is used to justify locking.

  2. Precommit (like PBFT commit): Establishes that 2/3+ validators are locked on the block (they observed the polka and committed to it). Once 2/3+ precommits exist, the block is committed.

The two-round structure ensures that a committed block can always be recovered after a round change or crash, because enough validators are locked on it.

The Locking Mechanism: Tendermint’s Safety Heart

The locking mechanism is what prevents safety violations, and it’s both the most important and most subtle part of Tendermint. Let me walk through it carefully.

Locking Rules

  1. Lock on polka. When a validator sees 2/3+ prevotes for a block B in round r, it locks on (B, r). This means: “I have evidence that the network might commit B, so I’ll defend it.”

  2. Only prevote locked block. If a validator is locked on (B, r), it will only prevote for B in subsequent rounds, unless it sees evidence that it’s safe to unlock.

  3. Unlock on higher polka. If a validator sees a polka for a different block B’ in a round r’ > r, it can unlock from (B, r) and lock on (B’, r’). The higher round polka proves that the network has moved on.

  4. Proposer carries proof-of-lock. When a proposer re-proposes a locked block, it includes the pol_round — the round in which it observed the polka. Other validators check this against their own locks to decide if it’s safe to vote.

Why This Works: The Safety Argument

Suppose block B is committed at height h. This means 2/3+ precommits exist for B. Each precommitting validator was locked on B. For a different block B’ to be committed at the same height, 2/3+ validators would need to prevote for B’ — but the locked validators will only prevote for B (or unlock due to a higher polka for B’). Since 2/3+ are locked on B, at most 1/3 can prevote for B’ (the non-locked ones), which isn’t enough for a polka. So B’ can’t get a polka, can’t get precommits, and can’t be committed.

The unlock-on-higher-polka rule doesn’t violate this because: if there’s a higher polka for B’, then 2/3+ prevoted for B’ in a higher round. But if B was already committed (2/3+ precommitted), then 2/3+ were locked on B. For 2/3+ to prevote B’, the intersection (at least 1/3+) would need to have been locked on B but prevoted B’. They can only do this if they saw a polka for B’ in a round > their lock round — but B was committed, meaning 2/3+ precommitted B, meaning 2/3+ locked on B, meaning no polka for B’ is possible (insufficient unlocked validators). Contradiction.

This is one of those safety arguments that’s airtight on paper and terrifying to think about at 2 AM when you’re debugging a consensus failure. The engineering challenge is ensuring that the lock state is persisted correctly across crashes, that the polka evidence is validated rigorously, and that the timing of lock/unlock operations is exactly right.
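The counting at the core of the argument can be checked mechanically for the smallest configuration (f = 1, n = 4):

```python
# Worst-case counting for the Tendermint safety argument, f = 1.
f = 1
n = 3 * f + 1            # 4 validators
quorum = 2 * f + 1       # 3 = polka / commit threshold
locked = quorum          # B committed => at least 2f+1 precommitted,
                         # hence locked on B
honest_locked = locked - f   # worst case: f of the locked are Byzantine
# B' can gather prevotes from everyone except the honest locked
# validators (Byzantine ones may vote however they like):
max_for_b_prime = n - honest_locked
assert max_for_b_prime < quorum   # 2 < 3: no polka for B' is possible
print(max_for_b_prime, "<", quorum)
```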

Common Implementation Bug: The Lock Persistence Problem

Here’s a bug I’ve seen in multiple Tendermint-like implementations: a validator crashes after locking on a block but before writing the lock to stable storage. It restarts, doesn’t remember the lock, and prevotes for a different block. If enough validators do this simultaneously, safety can be violated.

The fix is simple in principle: write the lock to disk before sending the precommit. In practice, this adds disk I/O latency to the critical path and introduces questions about what happens if the write is partially completed when the crash occurs. Tendermint’s production code handles this with a write-ahead log (WAL) that records every state transition before it takes effect.
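A minimal sketch of the persist-before-send discipline, assuming an invented JSON-lines WAL format (real Tendermint WAL records look nothing like this):

```python
import json
import os
import tempfile

# Append the lock transition to a write-ahead log and fsync it
# BEFORE the precommit is broadcast; record format is illustrative.
def persist_lock(wal_path, height, round_, block_id):
    record = json.dumps({"h": height, "r": round_, "lock": block_id})
    with open(wal_path, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # durable before we precommit

def recover_lock(wal_path):
    # On restart, replay the WAL; the last lock record wins
    lock = None
    with open(wal_path) as wal:
        for line in wal:
            lock = json.loads(line)
    return lock

wal = os.path.join(tempfile.mkdtemp(), "consensus.wal")
persist_lock(wal, height=10, round_=0, block_id="B1")
persist_lock(wal, height=10, round_=2, block_id="B2")
print(recover_lock(wal))   # the round-2 lock on B2 survives a crash
```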

ABCI: Separating Consensus from Application

One of Tendermint’s most significant architectural decisions is the Application Blockchain Interface (ABCI). ABCI is a socket-based interface that separates the consensus engine from the application logic. The consensus engine (Tendermint Core) handles peer discovery, block propagation, voting, and finality. The application (running as a separate process) handles transaction validation, state transitions, and queries.

// ABCI interface (simplified)
interface Application:
    // Called when a new transaction is received in the mempool
    // Returns: accept/reject for mempool inclusion
    function CheckTx(tx) -> ResponseCheckTx

    // Called at the start of block processing
    function BeginBlock(header) -> ResponseBeginBlock

    // Called for each transaction in the block, in order
    function DeliverTx(tx) -> ResponseDeliverTx

    // Called at the end of block processing
    function EndBlock(height) -> ResponseEndBlock

    // Called to persist the state changes
    function Commit() -> ResponseCommit  // Returns app state hash

    // Called to query application state
    function Query(path, data) -> ResponseQuery

    // Called to get application info (including latest block height)
    function Info() -> ResponseInfo

The flow during block commitment:

Tendermint Core                  Application (via ABCI)
     |                                |
     | (consensus commits block B     |
     |  at height h)                  |
     |                                |
     |---BeginBlock(B.header)-------->|
     |<--ResponseBeginBlock-----------|
     |                                |
     |---DeliverTx(tx1)-------------->|
     |<--ResponseDeliverTx------------|
     |---DeliverTx(tx2)-------------->|
     |<--ResponseDeliverTx------------|
     |   ... (for each tx in block)   |
     |                                |
     |---EndBlock(h)----------------->|
     |<--ResponseEndBlock-------------|
     |   (may include validator set   |
     |    updates for next height!)   |
     |                                |
     |---Commit()-------------------->|
     |<--ResponseCommit(app_hash)-----|
     |                                |
     | (app_hash is included in the   |
     |  next block's header, creating |
     |  a commitment to app state)    |

Why ABCI Matters

ABCI’s separation of concerns has profound implications:

  1. Language independence. The application can be written in any language that speaks ABCI (socket protocol or gRPC). The Cosmos SDK uses Go, but applications have been written in Rust, JavaScript, and others.

  2. Deterministic replay. The consensus engine guarantees that all validators deliver the same blocks in the same order. The application just needs to be deterministic: given the same sequence of blocks, produce the same state. This is the state machine replication guarantee.

  3. Validator set changes. The application can change the validator set via EndBlock responses. This is how proof-of-stake systems work on Tendermint: the application logic determines who the validators are, and Tendermint adjusts the consensus participant set accordingly. This is elegant in theory but adds complexity in practice — the consensus engine needs to handle validator set changes between heights while maintaining safety guarantees.

  4. Application-level validation. CheckTx allows the application to reject invalid transactions before they enter the mempool, and DeliverTx allows per-transaction processing. This keeps garbage out of blocks without the consensus engine needing to understand application semantics.
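
The deterministic-replay contract is easy to see in a toy application. Here is a hypothetical Python sketch of a key-value app behind an ABCI-shaped interface; the method names mirror ABCI, but the state layout and hashing are invented for illustration and are not the Cosmos SDK:

```python
import hashlib
import json

class KVApp:
    def __init__(self):
        self.state = {}    # committed application state
        self.pending = {}  # changes for the block being executed

    def check_tx(self, tx):
        # Mempool admission: accept only well-formed "key=value" strings
        return isinstance(tx, str) and tx.count("=") == 1

    def begin_block(self, header):
        self.pending = dict(self.state)

    def deliver_tx(self, tx):
        key, value = tx.split("=", 1)
        self.pending[key] = value

    def end_block(self, height):
        return []  # no validator set updates in this toy app

    def commit(self):
        # Persist the block's changes and return a deterministic
        # app-state hash over the canonicalized state
        self.state = self.pending
        encoded = json.dumps(self.state, sort_keys=True).encode()
        return hashlib.sha256(encoded).hexdigest()
```

Two replicas fed the same blocks in the same order return the same hash from commit, which is exactly what allows the app hash to be embedded in the next block's header.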

The downside of ABCI is latency. Every block requires multiple cross-process calls (or cross-machine calls if the application runs on a different host). For high-throughput applications, this overhead is significant. ABCI++ (introduced in CometBFT v0.38) addresses some of this by adding hooks earlier in the consensus process, allowing the application to participate in block proposal (via PrepareProposal and ProcessProposal).

Round Structure and Timeouts

Tendermint’s round structure is deterministic, with configurable timeouts:

// Timeout configuration
TIMEOUT_PROPOSE   = 3000ms  // Wait for proposal
TIMEOUT_PREVOTE   = 1000ms  // Wait for 2/3+ prevotes after seeing any
TIMEOUT_PRECOMMIT = 1000ms  // Wait for 2/3+ precommits after seeing any
TIMEOUT_DELTA     = 500ms   // Increment per round (for backoff)

function round_timeout(base_timeout, round):
    return base_timeout + round * TIMEOUT_DELTA

// State machine for a single height
function run_height(h):
    round = 0
    while true:
        // Propose step
        if i_am_proposer(h, round):
            propose(h, round)

        wait_for(
            received_proposal(h, round),
            timeout: round_timeout(TIMEOUT_PROPOSE, round)
        )

        // Prevote step
        do_prevote(h, round)

        wait_for(
            has_two_thirds_plus_prevotes(h, round),
            timeout: round_timeout(TIMEOUT_PREVOTE, round)
        )

        // Precommit step
        do_precommit(h, round)

        wait_for(
            has_two_thirds_plus_precommits(h, round),
            timeout: round_timeout(TIMEOUT_PRECOMMIT, round)
        )

        if committed(h):
            return  // Move to next height

        // Round failed — increment and try again
        round += 1

The increasing timeouts (via TIMEOUT_DELTA) serve as the eventual synchrony mechanism. If the network is temporarily partitioned or a proposer is slow, subsequent rounds give more time for messages to arrive. This is the same exponential-backoff-flavored approach that PBFT uses, but Tendermint makes it more explicit and tunable.

Proposer Selection

Tendermint uses a deterministic, weighted round-robin proposer selection:

function select_proposer(validators, round):
    // Each validator has a "priority" that accumulates
    // based on their voting power
    for v in validators:
        v.priority += v.voting_power

    // Select the validator with highest priority
    proposer = max(validators, key=lambda v: v.priority)

    // Decrease selected proposer's priority
    proposer.priority -= total_voting_power

    return proposer

This ensures that validators propose blocks proportional to their voting power. A validator with 10% of the stake proposes approximately 10% of blocks. The algorithm is deterministic — all validators compute the same proposer for each round — which is essential for consensus.
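
As a sanity check, the algorithm above can be transcribed into runnable Python with a toy two-validator set (the names and powers are made up):

```python
def select_proposer(validators):
    # Accumulate priority in proportion to voting power...
    for v in validators:
        v["priority"] += v["power"]
    # ...pick the highest-priority validator as proposer...
    proposer = max(validators, key=lambda v: v["priority"])
    # ...and charge it the total power so the others catch up
    proposer["priority"] -= sum(v["power"] for v in validators)
    return proposer

validators = [
    {"name": "a", "power": 1, "priority": 0},
    {"name": "b", "power": 2, "priority": 0},
]
counts = {"a": 0, "b": 0}
for _ in range(6):
    counts[select_proposer(validators)["name"]] += 1
# Over six rounds: b (2/3 of the power) proposes 4 times, a proposes 2
```

The "charge the winner the total power" step is what makes selection both proportional and deterministic: every validator runs the same arithmetic and arrives at the same proposer.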

Tendermint vs PBFT vs HotStuff

Let’s compare the three BFT protocols we’ve covered.

Protocol Structure

Aspect                | PBFT                         | HotStuff                            | Tendermint
----------------------|------------------------------|-------------------------------------|---------------------------------
Phases                | Pre-prepare, Prepare, Commit | Prepare, Pre-commit, Commit, Decide | Propose, Prevote, Precommit
Communication pattern | All-to-all (prepare, commit) | Star (through leader)               | All-to-all (prevote, precommit)
Message complexity    | O(n^2) per decision          | O(n) per view                       | O(n^2) per round
View/round change     | Separate complex protocol    | Same as normal case                 | Same as normal case (next round)
Crypto                | MACs + signatures            | Threshold signatures                | Standard signatures
Pipelining            | No (in base protocol)        | Yes (Chained HotStuff)              | No (one block per height)

Locking Mechanism

Aspect           | PBFT                                | HotStuff                     | Tendermint
-----------------|-------------------------------------|------------------------------|--------------------------------------
When locked      | After prepared certificate          | After pre-commit QC          | After observing polka (2/3+ prevotes)
Lock scope       | Per sequence number                 | Per node in chain            | Per height (across rounds)
Unlock condition | View change with higher certificate | Higher QC via safe_node rule | Higher polka in a later round
Lock persistence | Must survive crashes                | Must survive crashes         | WAL-based persistence

Performance Characteristics

Metric             | PBFT                | HotStuff         | Tendermint
-------------------|---------------------|------------------|--------------------
Typical block time | N/A (request-based) | 0.5-2s (chained) | 1-7s (configurable)
Throughput (n=4)   | 80K ops/s           | 60K ops/s        | 1K-10K TPS
Throughput (n=100) | <5K ops/s           | 25K ops/s        | 100-1K TPS
Finality           | Immediate           | Immediate        | Immediate
Latency (LAN)      | 3-10ms              | 5-20ms           | 1-7s (block time)
Latency (WAN)      | 100-500ms           | 300-500ms        | 5-15s

Tendermint’s throughput numbers are lower in part because they measure different things: transactions per second through the full application stack (consensus + ABCI + application execution), not just consensus operations. The ABCI overhead and application execution time are significant factors. Raw consensus throughput (without the application) would be higher.

Why Tendermint Didn’t Adopt Linear Complexity

A natural question: if HotStuff achieves O(n) message complexity, why does Tendermint stick with O(n^2)?

Several reasons:

  1. Simplicity. Tendermint uses standard digital signatures, not threshold signatures. No DKG ceremony, no complex cryptographic setup. Any validator can join with a standard key pair. This dramatically simplifies deployment and key management.

  2. Practical validator sets. Most Cosmos chains run with 50-175 validators. At this scale, O(n^2) is manageable — 175^2 = 30,625 messages per round, which is high but feasible with modern networking. The chains that need 1000+ validators are rare and typically use delegated staking to keep the active set small.

  3. Gossip-based communication. Tendermint doesn’t actually send n^2 direct messages. It uses a gossip protocol: each validator sends its vote to a subset of peers, who relay it further. This doesn’t change the theoretical complexity, but it spreads the load and works well in real networks.

  4. No leader bottleneck. With all-to-all communication, no single node is overloaded. In HotStuff, the leader processes all n votes and aggregates them — it does more work than any other node. In Tendermint, work is distributed evenly.

  5. Historical timing. Tendermint’s core protocol was designed in 2014, years before HotStuff (2018). By the time HotStuff was published, Tendermint had a large production ecosystem. Switching consensus protocols for a running network with billions of dollars at stake is… not done casually.

Light Client Verification

One of Tendermint’s most practical features is its support for light clients — clients that verify consensus without running the full protocol or storing the full state.

A Tendermint light client needs:

  1. A trusted block header (from genesis or a trusted source).
  2. The current validator set.
  3. Block headers and commit signatures for blocks it wants to verify.

// Light client verification
function verify_block(header, commit, trusted_validators):
    // Check that the commit contains 2/3+ voting power
    // of signatures from the validator set
    total_power = sum(v.voting_power for v in trusted_validators)
    signed_power = 0

    for sig in commit.signatures:
        validator = trusted_validators.get(sig.validator_id)
        if validator == nil:
            continue  // Unknown validator, skip

        if not verify_signature(sig, validator.public_key, header):
            continue  // Invalid signature, skip

        signed_power += validator.voting_power

    if signed_power * 3 <= total_power * 2:
        return error("insufficient voting power: need >2/3")

    return ok(header)

// Verifying a header at height h, given a trusted header at height t
function verify_header_at_height(h, t, trusted_header_at_t):
    if h == t + 1:
        // Sequential verification: next block's header
        // contains the hash of the validator set that signed it
        header_h = fetch_header(h)
        commit_h = fetch_commit(h)
        validators_h = get_validators_from_header(trusted_header_at_t)
        return verify_block(header_h, commit_h, validators_h)

    else:
        // Skipping verification: can skip ahead if the validator set
        // hasn't changed too much (1/3 overlap rule)
        header_h = fetch_header(h)
        commit_h = fetch_commit(h)
        validators_h = fetch_validators(h)

        // Check that 1/3+ of the trusted validator set's voting power
        // signed header_h (this prevents long-range attacks)
        total_trusted_power = sum(
            v.voting_power for v in trusted_header_at_t.validators)
        trusted_power = verify_overlap(
            commit_h, trusted_header_at_t.validators)

        if trusted_power * 3 <= total_trusted_power:
            // Not enough overlap — can't skip, must verify sequentially
            return verify_sequential(t, h)

        return verify_block(header_h, commit_h, validators_h)
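
The core of verify_block is the voting-power tally. A minimal runnable sketch of that check, with signature verification stubbed out as a boolean flag (a real light client would check each signature against the validator's public key):

```python
def enough_voting_power(signatures, validators):
    # validators maps validator_id -> voting_power
    total_power = sum(validators.values())
    signed_power = sum(
        validators.get(sig["validator_id"], 0)  # unknown signers count 0
        for sig in signatures
        if sig["valid"]  # stand-in for verify_signature(...)
    )
    # Integer comparison avoids floating point: require signed > 2/3 total
    return signed_power * 3 > total_power * 2

validators = {"v1": 40, "v2": 30, "v3": 20, "v4": 10}
commit_ok = [{"validator_id": v, "valid": True} for v in ("v1", "v2", "v3")]
commit_thin = [{"validator_id": v, "valid": True} for v in ("v3", "v4")]
# commit_ok carries 90/100 voting power (passes); commit_thin only 30/100
```

Note that the threshold is on voting power, not on a count of signatures: ten small validators can be outvoted by one large one.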

The light client protocol enables:

  • Mobile wallets that verify blockchain state without downloading the full chain.
  • IBC (Inter-Blockchain Communication) — Cosmos’s cross-chain protocol uses light client verification to prove state on one chain to another.
  • Bridges to other ecosystems.

This is something that neither PBFT nor HotStuff addresses directly. Their papers focus on the consensus protocol itself, not on how external observers verify its output. Tendermint’s light client design is a practical contribution that came from building a system that real users need to interact with.

Real-World Deployment: The Cosmos Ecosystem

As of 2025, Tendermint/CometBFT powers:

  • Cosmos Hub — the central hub of the Cosmos ecosystem, with ~175 validators and billions in staked value.
  • Osmosis — a decentralized exchange with ~150 validators.
  • Celestia — a modular data availability layer.
  • dYdX — a derivatives exchange that migrated from Ethereum to its own Cosmos chain.
  • Hundreds of other chains in the Cosmos ecosystem, each with its own validator set.

Production Lessons

Here are things we’ve learned from Tendermint’s production deployments that aren’t in any paper:

  1. Block time tuning is an art. The default 5-7 second block time balances finality latency against network propagation time. Chains have experimented with 1-second block times and found that it works in good network conditions but leads to frequent empty blocks and increased round failures when latency spikes. The right block time depends on your geographic distribution of validators and your tolerance for empty blocks.

  2. Validator infrastructure is heterogeneous. Some validators run on bare metal in data centers; others run on cloud VMs in different regions. The fastest validator might have 1ms network latency to its peers; the slowest might have 300ms. Timeout tuning must accommodate the slowest honest validator without giving Byzantine validators too much time to misbehave.

  3. Mempool management matters more than you think. The consensus protocol assumes transactions are available — but getting the right transactions into blocks, deduplicating across the gossip network, and handling transaction validity that changes as state changes is complex. Tendermint’s mempool has been rewritten multiple times.

  4. State sync is essential. A new validator joining the network can’t replay blocks from genesis (that would take weeks for a mature chain). Tendermint supports state sync: downloading a recent state snapshot and only replaying recent blocks. This requires trust in the snapshot provider, which somewhat undermines the BFT model. In practice, validators use snapshots from multiple sources and verify against the light client protocol.

  5. Evidence of misbehavior. Tendermint collects evidence of Byzantine behavior (double-signing, specifically) and includes it in blocks. The application can then punish (slash) the misbehaving validator. This economic incentive layer is not part of the consensus protocol per se, but it’s essential for the system’s security in a proof-of-stake setting.

  6. Upgrades are the hardest problem. Upgrading the consensus protocol on a running network with 150+ independent validators requires coordination that no paper describes. Cosmos chains use “governance proposals” where validators vote on an upgrade block height, and at that height, all validators simultaneously switch to the new software. When this works, it’s elegant. When a validator misses the memo, it forks off and needs to catch up.

ABCI++: The Evolution

CometBFT v0.38 introduced ABCI++ (also called ABCI 2.0), which extends the interface with new hooks:

// New ABCI++ methods
interface Application_v2 extends Application:
    // Called when a proposer is preparing a block
    // Allows the application to reorder, add, or remove transactions
    function PrepareProposal(txs, max_bytes) -> ResponsePrepareProposal

    // Called by non-proposer validators to validate a proposed block
    // Can reject the entire block (vote nil) or accept
    function ProcessProposal(block) -> ResponseProcessProposal

    // Called to extend the precommit vote with application data
    function ExtendVote(block) -> ResponseExtendVote

    // Called to verify another validator's vote extension
    function VerifyVoteExtension(extension) -> ResponseVerifyVoteExtension

    // Replaces BeginBlock + DeliverTx + EndBlock with a single call
    function FinalizeBlock(block) -> ResponseFinalizeBlock

These additions address real problems:

  • PrepareProposal/ProcessProposal: Gives the application control over block contents. Applications can implement MEV (Maximal Extractable Value) protection, transaction ordering policies, or custom validity rules that go beyond CheckTx.

  • ExtendVote/VerifyVoteExtension: Allows validators to attach application-specific data to their votes. Use cases include oracle price feeds (validators attest to off-chain data during consensus), threshold decryption (validators contribute decryption shares), and more.

  • FinalizeBlock: Replaces the multi-call block execution with a single atomic call, reducing ABCI overhead.

The Practical Engineering Decisions

Let me enumerate the decisions Tendermint made that you won’t find in BFT papers but that matter enormously in production:

1. Gossip Over Direct Communication

Tendermint doesn’t maintain n^2 direct connections. Instead, each node maintains connections to a subset of peers and uses gossip to disseminate messages. The gossip protocol adds latency (messages take multiple hops) but dramatically reduces the number of connections each node must maintain.

For 150 validators, maintaining 149 direct TCP connections is feasible but adds memory and CPU overhead per connection (TLS, keepalives, etc.). Gossip with 20-40 peers is more practical and more resilient to network topology changes.
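
As a back-of-envelope estimate (idealized: every hop reaches fresh nodes, with no duplicate deliveries), gossip with fanout k reaches roughly k^r nodes after r hops, so full dissemination takes about log base k of n hops:

```python
import math

def gossip_hops(n, fanout):
    # Hops needed for an idealized gossip round to cover n nodes
    # when each node forwards to `fanout` fresh peers per hop
    return math.ceil(math.log(n) / math.log(fanout))

# 150 validators with fanout 20: about 2 hops
# 10,000 nodes with fanout 20: about 4 hops
```

A couple of extra hops of propagation latency in exchange for 20-40 connections instead of 149 is the tradeoff being described here.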

2. WAL-Based Crash Recovery

Every state transition — receiving a proposal, prevoting, locking, precommitting — is written to a write-ahead log before the action takes effect. On recovery, the WAL is replayed to restore the validator to its pre-crash state. This is conceptually simple but the details matter: the WAL must be fsynced before proceeding, which adds ~1-5ms of latency per consensus step on typical SSDs.
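
The pattern is simple enough to sketch in a few lines of Python. The record format below is invented for illustration; Tendermint's actual WAL format differs:

```python
import json
import os

class WAL:
    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def write(self, record):
        # Persist the state transition BEFORE acting on it
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # the fsync is where the latency goes

    def replay(self):
        # On restart, re-apply every logged transition in order
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

The invariant is that no prevote or precommit leaves the node unless the corresponding record has hit stable storage, so a crash can never "forget" a lock.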

3. Evidence Handling

When a validator detects equivocation (a peer signed two different blocks or votes at the same height/round), it collects the conflicting signatures as evidence and broadcasts them. The evidence is included in future blocks, and the application can slash the misbehaving validator.

This creates an incentive layer that exists outside the consensus protocol. The protocol itself doesn’t need slashing to be safe — safety comes from the 2/3+ honest assumption. But slashing makes it economically irrational to be Byzantine, which is the practical argument for why the 2/3+ honest assumption holds.
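
Equivocation detection itself is mechanical: remember the first vote seen for each (height, round, step, validator) slot and flag any later vote for a different block. A sketch with an invented vote format:

```python
def record_vote(seen, vote):
    """Return the two conflicting votes as evidence on equivocation,
    or None if the vote is consistent with what we've seen."""
    key = (vote["height"], vote["round"], vote["step"], vote["validator"])
    prior = seen.get(key)
    if prior is not None and prior["block_hash"] != vote["block_hash"]:
        return (prior, vote)  # conflicting signatures = slashable evidence
    seen[key] = vote
    return None

seen = {}
v1 = {"height": 5, "round": 0, "step": "precommit",
      "validator": "v1", "block_hash": "AA"}
v2 = dict(v1, block_hash="BB")  # same slot, different block
```

The pair of signed, conflicting votes is self-authenticating: any node (or block) can carry it, and the application can verify both signatures before slashing.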

4. Proposer-Based Timestamps

Tendermint originally used BFT time (median of validator-reported timestamps) for block timestamps. This was replaced with proposer-based timestamps in later versions, where the proposer sets the block time and validators reject blocks with timestamps too far from their local clocks. This simplifies the protocol and removes a subtle attack vector where Byzantine validators could skew the median time.

5. Block Size and Gas Limits

Block size limits and gas limits (maximum computational work per block) are application-level parameters, not consensus parameters. But they profoundly affect consensus performance: a block that takes 5 seconds to execute means the effective minimum block time is 5+ seconds, regardless of what the consensus timeout is configured to. This coupling between application execution time and consensus latency is a source of constant tuning.

When Tendermint Is the Right Choice

Tendermint/CometBFT makes sense when:

  • You’re building an application-specific blockchain. The ABCI separation lets you write your application logic in any language while getting production-tested BFT consensus.
  • You need immediate finality. Unlike Nakamoto consensus, Tendermint blocks are final once committed. No waiting for 6 confirmations or worrying about chain reorganizations.
  • Your validator set is moderate-sized (10-200). Tendermint performs well in this range. Beyond 200, the O(n^2) message complexity starts to bite.
  • You want the Cosmos ecosystem. IBC (Inter-Blockchain Communication), the Cosmos SDK, and a large community of validators and developers are significant assets.
  • You need light client support. Tendermint’s light client protocol is mature and well-tested.

Tendermint is less ideal when:

  • You need thousands of consensus participants. Use something with linear complexity.
  • You’re not building a blockchain. If you just need a replicated state machine with BFT, Tendermint’s blockchain-specific features (blocks, heights, ABCI) may be unnecessary overhead. Consider a general-purpose BFT library.
  • You need sub-second finality. Tendermint’s block-based structure means latency is at least one block time (typically 1-7 seconds).
  • You don’t need BFT. If your replicas are trusted, Raft is simpler, faster, and more appropriate.

The Legacy

Tendermint’s lasting contribution isn’t just the consensus protocol — it’s the demonstration that BFT consensus can be productized, deployed at scale, and maintained by an ecosystem of independent operators. The academic BFT community produced brilliant protocols. Tendermint proved they could be turned into infrastructure that handles billions of dollars in value.

The protocol itself is a pragmatic blend of PBFT’s ideas with practical engineering: standard signatures instead of threshold signatures, gossip instead of direct communication, ABCI instead of tightly coupled state machines, WAL-based recovery instead of assumed reliability. These choices sacrifice theoretical optimality for operational simplicity, and the thriving Cosmos ecosystem suggests that’s the right tradeoff for many applications.

CometBFT continues to evolve, and the lessons learned from operating hundreds of chains inform each iteration. The gap between a BFT paper and a BFT production system is still enormous, but Tendermint has done more than any other project to bridge it.

BFT in Blockchains vs BFT Everywhere Else

Byzantine fault tolerance had a quiet three decades. From Lamport, Shostak, and Pease’s original 1982 paper through the early 2010s, BFT was a respected but niche area of distributed systems research. Conferences published papers. Graduate students wrote dissertations. A handful of production systems used it. Then Satoshi Nakamoto published a white paper, and suddenly everyone needed to know about Byzantine fault tolerance.

The irony is rich. Nakamoto consensus — proof of work — is arguably the least efficient BFT protocol ever deployed in production. It burns electricity equivalent to small countries, takes minutes to hours for probabilistic finality, and processes single-digit transactions per second. But it solved a problem that classical BFT protocols couldn’t: open membership. Anyone can join. No one needs permission. No distributed key generation ceremony, no known validator set, no upfront coordination. That property was so valuable that the world was willing to pay an enormous price for it.

This chapter examines the relationship between BFT and blockchains, the fundamental differences between permissioned and permissionless BFT, and the question that every non-blockchain engineer eventually asks: “Do I actually need Byzantine fault tolerance?”

The Fundamental Divide: Open vs Closed Membership

The single most important distinction in BFT protocols is membership: who gets to participate?

Closed Membership (Permissioned BFT)

Classical BFT protocols — PBFT, HotStuff, Tendermint — all assume a known, fixed set of participants. You know who the replicas are, you have their public keys, and you can count them to determine quorum sizes.

Properties of closed-membership BFT:

  • n is known. You can compute 3f + 1 and set quorum sizes precisely.
  • Identity is established. Every message is signed by a known party.
  • Sybil resistance is free. An attacker can’t create fake identities because membership is controlled.
  • Communication is bounded. You know who to send messages to and how many responses to expect.
  • Finality is deterministic. Once 2f + 1 replicas commit, the decision is final. No probabilistic hand-waving.

Open Membership (Permissionless BFT)

Nakamoto-style consensus and its descendants allow anyone to participate. You don’t know how many participants there are, you don’t know who they are, and you can’t trust any identity because identities are free.

Properties of open-membership BFT:

  • n is unknown. You can’t compute quorum sizes in the traditional sense.
  • Identity is cheap. Creating new identities (Sybil attack) is trivial without some costly resource.
  • Communication is unbounded. You can’t send messages to “all replicas” because you don’t know who they are.
  • Finality is probabilistic. Decisions become “more final” over time but are never absolutely irrevocable (in pure Nakamoto consensus).
  • Participation requires proof of resource. To prevent Sybil attacks, participants must prove they’ve expended some scarce resource: computation (proof of work), capital (proof of stake), storage (proof of space), etc.

This divide is not a spectrum — it’s a categorical difference that shapes every aspect of protocol design:

Aspect              | Permissioned BFT                     | Permissionless BFT
--------------------|--------------------------------------|----------------------------------------
Membership          | Known, fixed (or controlled changes) | Open, anyone can join/leave
Identity            | Established via PKI or out-of-band   | Pseudonymous or anonymous
Sybil resistance    | Membership control                   | Proof of resource expenditure
Quorum definition   | 2f + 1 out of 3f + 1 known nodes     | Longest chain / most accumulated weight
Finality            | Deterministic, immediate             | Probabilistic, grows over time
Throughput          | 1K-100K+ TPS                         | 3-20 TPS (PoW); higher with PoS
Latency to finality | Milliseconds to seconds              | Minutes to hours (PoW); seconds (PoS)
Fault tolerance     | Up to f < n/3 Byzantine              | Up to 50% hash power (PoW); varies
Energy cost         | Negligible                           | Enormous (PoW); negligible (PoS)

Nakamoto Consensus: BFT by Another Name

Let’s be precise about what Bitcoin’s proof-of-work consensus actually provides, because it’s surprisingly subtle.

The Protocol

// Nakamoto consensus (simplified)
function mine():
    while true:
        // Select transactions from mempool
        txs = select_transactions()

        // Build block extending the longest chain
        parent = tip_of_longest_chain()
        block = Block{
            parent_hash: hash(parent),
            transactions: txs,
            timestamp: now(),
            nonce: 0
        }

        // Find a nonce that makes the block hash below the target
        while hash(block) > difficulty_target:
            block.nonce += 1

        // Found a valid block!
        broadcast(block)
        add_to_chain(block)

function on_receive_block(block):
    if not validate_block(block):
        reject(block)
        return

    add_to_chain(block)

    // Fork choice rule: always follow the longest chain
    // (actually: chain with most accumulated proof of work)
    if accumulated_work(block.chain) > accumulated_work(current_tip.chain):
        switch_to_chain(block.chain)
        // This might REVERT previously accepted blocks!
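
The fork-choice rule in on_receive_block reduces to a comparison of accumulated work, which a toy sketch makes concrete (the block structure and work values are illustrative):

```python
def accumulated_work(chain):
    # Sum the work implied by each block's difficulty target
    return sum(block["work"] for block in chain)

def fork_choice(chains):
    # Follow the chain with the most accumulated proof of work,
    # not simply the one with the most blocks
    return max(chains, key=accumulated_work)

short_heavy = [{"work": 10}, {"work": 10}]              # 2 blocks, work 20
long_light = [{"work": 3}, {"work": 3}, {"work": 3}]    # 3 blocks, work 9
```

A shorter chain can win if its blocks required more work, which is why "longest chain" is shorthand rather than the literal rule.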

The BFT Properties

Nakamoto consensus provides:

  • Safety (probabilistic). The probability of a committed transaction being reversed decreases exponentially with the number of confirmation blocks. After 6 confirmations (~60 minutes for Bitcoin), reversal requires an attacker with >50% of the network’s hash power.

  • Liveness (probabilistic). As long as honest miners control >50% of hash power, new blocks will eventually be produced and transactions will eventually be included.

  • Censorship resistance. No single party can prevent a valid transaction from eventually being included, as long as honest miners will include it.

The fault tolerance bound is different from classical BFT: 50% of hash power rather than 33% of nodes. This is because Nakamoto consensus doesn’t require all honest parties to communicate — it uses the chain itself as the communication medium. The cost is probabilistic finality and much lower throughput.
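
Nakamoto's whitepaper quantifies the exponential decay with a catch-up calculation: model the attacker's progress as a Poisson random variable while the honest chain mines z blocks, then sum the gambler's-ruin probabilities. A direct Python transcription:

```python
import math

def attacker_success(q, z):
    """Probability an attacker with hash-power fraction q ever
    overtakes the honest chain after z confirmations."""
    p = 1.0 - q
    lam = z * (q / p)  # expected attacker blocks while honest chain mines z
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam**k / math.factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob
```

With q = 0.1, six confirmations push the reversal probability below 0.1%, which is where the folk rule of "wait for 6 confirmations" comes from.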

Why Computer Scientists Were Skeptical

When Bitcoin first appeared, many distributed systems researchers were dismissive. The throughput was laughable (7 TPS), the latency was absurd (60 minutes for reasonable safety), and the energy consumption was unconscionable. By the metrics that classical BFT cared about, Nakamoto consensus was a terrible protocol.

What the researchers initially missed was that Bitcoin optimized for a different metric: permissionless participation. The ability to join the network without anyone’s permission, to mine blocks without establishing identity, and to transact without trusting any specific party was novel and, for certain applications, worth the enormous cost.

The subsequent decade of blockchain research has been, in many ways, an attempt to get the permissionless property without paying the Nakamoto tax. Proof of stake, committee-based BFT, and various hybrid approaches all try to shrink the gap between permissioned BFT’s performance and permissionless BFT’s openness.

Why Most Non-Blockchain Systems Don’t Need BFT

Here’s the argument that should be your default: if you’re not building a blockchain or a system with mutually distrusting operators, you probably don’t need BFT. Let me justify this.

The Threat Model Argument

BFT protects against Byzantine faults: nodes that behave arbitrarily, including lying, equivocating, sending contradictory messages to different peers, and actively trying to sabotage the protocol. For BFT to be worth its cost, the Byzantine threat must be realistic.

In a typical enterprise deployment:

  • You control all the nodes. They run your software, on your infrastructure, managed by your team. A “Byzantine” node in this context means either a bug or a security compromise.
  • Bugs are usually crash faults. Most software bugs cause crashes, hangs, or incorrect output that’s detectable (wrong format, invalid values). Truly Byzantine bugs — where a node produces valid-looking but incorrect output that other nodes can’t distinguish from correct behavior — are rare.
  • Security compromises are all-or-nothing. If an attacker compromises one node in your cluster, they likely have (or will soon have) access to the others, because they share infrastructure, credentials, and access patterns. BFT with f = 1 doesn’t help if the attacker can compromise all nodes.
  • The cost is significant. BFT requires 3f + 1 nodes instead of 2f + 1 (33-50% more, depending on f). Message complexity is higher. Latency is higher. The implementation is more complex, meaning more bugs, meaning more operational burden.

For these reasons, the vast majority of production systems use crash fault tolerant (CFT) consensus: Raft, Multi-Paxos, Zab, or similar. Google’s Spanner, Amazon’s DynamoDB, Apache Kafka, etcd, CockroachDB, TiKV — all CFT.
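
The node-count arithmetic behind that choice is worth making explicit: crash tolerance needs majority quorums, Byzantine tolerance needs n = 3f + 1.

```python
def cft_min_nodes(f):
    # Raft/Multi-Paxos: survive f crashed nodes with a majority quorum
    return 2 * f + 1

def bft_min_nodes(f):
    # PBFT-style: survive f Byzantine nodes
    return 3 * f + 1
```

Tolerating one bad node costs 3 replicas under CFT but 4 under BFT; tolerating two costs 5 versus 7. The extra replicas exist so that quorums still intersect in at least one honest node even when f members lie.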

The Performance Gap

Let’s quantify what BFT costs compared to CFT:

Metric                    | CFT (Raft)   | BFT (PBFT, n=4)  | BFT (PBFT, n=7)  | Overhead
--------------------------|--------------|------------------|------------------|--------------------------------
Minimum nodes (f=1)       | 3            | 4                | –                | +33%
Minimum nodes (f=2)       | 5            | 7                | 7                | +40%
Messages per decision     | O(n)         | O(n^2)           | O(n^2)           | Quadratic
Throughput (typical, LAN) | 100K+ ops/s  | 50-80K ops/s     | 30-60K ops/s     | 2-5x lower
Latency (typical, LAN)    | <1ms         | 1-3ms            | 2-5ms            | 2-5x higher
Crypto overhead per msg   | None or HMAC | Signature verify | Signature verify | Significant
Implementation complexity | Moderate     | High             | High             | 2-3x more code
Testing difficulty        | Moderate     | Very high        | Very high        | Needs Byzantine fault injection

For most workloads, paying a 2-5x performance penalty and significantly higher complexity to protect against a threat that doesn’t materially apply is a poor engineering decision.

When People Think They Need BFT But Don’t

Common scenarios where teams consider BFT but probably shouldn’t:

  1. “Our nodes might have bugs.” Yes, but CFT already handles the most common bug manifestation (crashes). For non-crash bugs, invest in testing, monitoring, and detection rather than BFT. A system that detects and alerts on Byzantine behavior is cheaper and more practical than one that tolerates it.

  2. “We run in multiple clouds.” Multi-cloud deployment protects against cloud provider failures (a form of partition), not against Byzantine behavior. Use CFT with replicas spread across providers.

  3. “We don’t trust our partners’ software.” If you’re integrating with partners, the trust boundary is at the API level, not the consensus level. Use contract validation, cryptographic signatures on data, and audit logs rather than BFT consensus.

  4. “We need regulatory compliance.” Regulators care about auditability, data integrity, and availability — not the specific fault tolerance model of your consensus protocol. A CFT system with proper audit logging meets regulatory requirements.

When You Actually Need BFT

Having argued against BFT for most cases, let me now make the case for it. There are real scenarios where Byzantine fault tolerance is appropriate outside of public blockchains.

Multi-Party Computation with Untrusted Participants

When multiple organizations need to jointly compute something — a financial settlement, a supply chain verification, a collaborative analysis — and no single organization trusts the others to operate honestly, BFT consensus provides guarantees that CFT cannot.

Example: Multi-bank settlement system.

Bank A         Bank B         Bank C         Bank D
(runs node)    (runs node)    (runs node)    (runs node)
   |              |              |              |
   | (each bank submits transactions)           |
   | (BFT consensus orders them)                |
   | (all banks execute the same order)         |
   | (any bank can verify the computation)      |

In this scenario:

  • Each bank controls its own node. A compromise of Bank A’s node shouldn’t affect the system’s correctness.
  • Banks don’t trust each other not to submit conflicting transactions or attempt to double-spend.
  • The consensus protocol must be correct even if one bank’s node is actively trying to cheat.
  • CFT would be insufficient: if Bank A’s node crashes in Raft, it just loses its vote. But if Bank A’s node is Byzantine in Raft, it could equivocate and cause inconsistency that Raft can’t detect.

This is the permissioned blockchain use case, and it’s legitimate. Hyperledger Fabric, R3 Corda, and similar projects target this space.

Financial Trading Systems

High-frequency trading systems sometimes use replicated state machines for order matching. When the participants include potentially adversarial traders, BFT prevents a compromised matching engine replica from manipulating order execution.

Supply Chain with Untrusted Participants

Multiple companies in a supply chain — manufacturers, shippers, retailers — need to track goods. If any participant can unilaterally alter the shared record, the system is meaningless. BFT ensures that the shared record is correct even if some participants misbehave.

Critical Infrastructure with Defense-in-Depth

Some safety-critical systems (aviation, nuclear, medical) use BFT not because they expect adversarial behavior but as defense in depth. If a hardware fault causes a node to produce arbitrary outputs (e.g., a bit flip in memory that changes a control signal), BFT ensures the system continues correctly. This is the original motivation for Byzantine fault tolerance from the 1980s, predating blockchains by decades.

Multi-Cloud with Genuine Distrust

This is different from “we run in multiple clouds for availability.” This is: “we run in multiple clouds because we don’t trust any single cloud provider to not be compromised or compelled to tamper with our computation.” Government agencies, organizations handling classified data, and some financial institutions have this genuine concern. BFT across cloud providers ensures that a compromised provider can’t unilaterally affect the computation.

The Cost-Benefit Analysis

Here’s a framework for deciding whether BFT is warranted:

function should_use_bft():
    // Question 1: Is Byzantine behavior a realistic threat?
    byzantine_threat =
        operators_are_mutually_distrusting OR
        nodes_run_different_software_stacks OR
        nodes_are_in_different_security_domains OR
        compromise_of_one_node_is_independent_of_others

    if not byzantine_threat:
        return NO  // Use CFT

    // Question 2: Is the cost acceptable?
    n_required = 3 * f + 1  // vs 2 * f + 1 for CFT
    performance_overhead = 2x to 10x  // vs CFT
    implementation_complexity = HIGH
    operational_complexity = HIGH

    cost_acceptable =
        n_required is feasible AND
        performance_overhead is tolerable AND
        team_has_bft_expertise

    if not cost_acceptable:
        return MAYBE_USE_DETECTION_INSTEAD
        // Monitor for Byzantine behavior, alert, and handle manually

    // Question 3: Is there a simpler alternative?
    simpler_alternative =
        can_use_cryptographic_signatures_on_data OR
        can_use_audit_logs_with_detection OR
        can_use_trusted_hardware (SGX, etc.) OR
        can_restructure_to_avoid_shared_state

    if simpler_alternative:
        return PROBABLY_NO  // Use the simpler thing

    return YES  // BFT is warranted

Most paths through this decision tree lead to “no.” That’s intentional. BFT is expensive and complex; the bar for using it should be high.

Hybrid Approaches

The binary choice between CFT and BFT is a false dichotomy. Several hybrid approaches exist:

BFT for Ordering, CFT for Execution

Use BFT consensus to agree on the order of operations, then execute those operations on trusted infrastructure using simpler protocols. This is essentially what Hyperledger Fabric v2 does: an ordering service provides BFT ordering, but the peer nodes that execute transactions use simpler endorsement policies.

Detect-and-Recover Instead of Tolerate

Instead of tolerating Byzantine faults in real-time (which requires 3f + 1 nodes), detect them after the fact and recover:

// Byzantine detection approach
function detect_and_recover():
    // All nodes execute and sign their results
    for node in nodes:
        result[node] = node.execute(operation)
        sig[node] = node.sign(result[node])

    // Verify agreement
    if all_results_match(result):
        return result[0]  // All good

    // Disagreement detected — identify Byzantine node
    majority_result = find_majority(result)
    for node in nodes:
        if result[node] != majority_result:
            flag_as_byzantine(node)
            // Alert, investigate, replace

    return majority_result

This approach works when:

  • Immediate tolerance isn’t required (you can afford brief incorrect behavior).
  • Detection is sufficient deterrent (the Byzantine node faces consequences).
  • Recovery is feasible (you can replace or patch the faulty node).

Many practical systems use this approach: they run with CFT consensus but add cryptographic auditing to detect Byzantine behavior retroactively.
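The find_majority step in the sketch above is nothing exotic: a frequency count plus a sanity check that a majority actually exists. A minimal Python version (the node names and result format are hypothetical):

```python
from collections import Counter

def find_majority(results: dict):
    """Given {node: result}, return (majority_value, suspected_byzantine_nodes)."""
    value, count = Counter(results.values()).most_common(1)[0]
    if count <= len(results) // 2:
        # No strict majority: we can detect disagreement but not attribute blame.
        raise RuntimeError("no majority; cannot identify the faulty node")
    suspects = [node for node, r in results.items() if r != value]
    return value, suspects

value, suspects = find_majority({"n1": "0xabc", "n2": "0xabc", "n3": "0xdef"})
assert value == "0xabc" and suspects == ["n3"]
```

The no-majority branch is the important caveat: with only three nodes and two disagreeing, detection tells you something is wrong but not who to blame.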

Trusted Execution Environments (TEEs)

Hardware enclaves like Intel SGX, AMD SEV, or ARM TrustZone can provide integrity guarantees at the hardware level. If you trust the hardware, a node running in a TEE can’t produce Byzantine outputs (ignoring side-channel attacks and hardware bugs, which is a big “if”).

Using TEEs, you can potentially run CFT consensus with BFT-like guarantees:

| Approach | Nodes Required (f=1) | Performance | Assumptions |
|---|---|---|---|
| Pure CFT | 3 | Highest | Crash faults only |
| CFT + TEE | 3 | High (TEE overhead) | Trust hardware vendor |
| Pure BFT | 4 | Moderate | f < n/3 Byzantine |
| BFT + TEE | 3 | Moderate (TEE overhead) | Can reduce n; trust hardware |

The TEE approach has been used in several systems, notably Microsoft’s CCF (Confidential Consortium Framework), which uses SGX enclaves to provide BFT-like guarantees with CFT-like node counts.

The catch: trusting hardware is a strong assumption. SGX has had multiple vulnerability disclosures (Foreshadow, Plundervolt, AEPIC). Whether hardware trust is more or less reasonable than trusting your replicas to be non-Byzantine depends on your threat model.

Optimistic BFT

Run the system optimistically assuming no Byzantine faults (essentially CFT performance). If a Byzantine fault is detected, fall back to the full BFT protocol.

// Optimistic BFT (simplified Zyzzyva-style)
function optimistic_commit(request):
    // Fast path: all replicas respond with the same result
    results = broadcast_and_collect(request)

    if all_match(results) and count(results) == 3 * f + 1:
        // All replicas agree — commit immediately
        // One network round trip!
        return results[0]

    else:
        // Disagreement or missing responses
        // Fall back to full BFT protocol (PBFT-like)
        return slow_path_bft(request)

The optimistic path gives CFT-like performance (one round trip) when all nodes behave correctly. The slow path provides safety when Byzantine faults occur. The downside: the slow path is more complex and slower than standard BFT because it needs to handle the transition from optimistic to pessimistic mode.

BFT Protocol Selection Guide

Given the landscape, here’s a practical guide:

For Permissionless Blockchains

| Need | Recommendation | Why |
|---|---|---|
| Maximum decentralization | Nakamoto (PoW) | No identity required; proven at scale |
| Better performance | PoS with BFT finality | Ethereum 2.0 (Casper FFG), Cosmos (Tendermint) |
| Throughput focus | Committee-based BFT | Algorand, Solana (modified) |
| Simple smart contracts | Established L1 | Use existing ecosystem, don’t build consensus |

For Permissioned Blockchains

| Need | Recommendation | Why |
|---|---|---|
| < 20 validators | PBFT or variant | Simple crypto, well-understood |
| 20-200 validators | Tendermint/CometBFT | Production-tested, ecosystem |
| > 200 validators | HotStuff variant | Linear complexity necessary |
| Maximum throughput | HotStuff + pipelining | Chained HotStuff or DiemBFT |

For Non-Blockchain Systems

| Need | Recommendation | Why |
|---|---|---|
| Trusted operators | CFT (Raft, Paxos) | BFT overhead not warranted |
| Untrusted operators, < 20 nodes | PBFT | Simple, well-understood |
| Untrusted operators, > 20 nodes | HotStuff | Linear complexity |
| Hardware trust available | CFT + TEE | Fewer nodes, good performance |
| Detection sufficient | CFT + audit | Simplest solution with accountability |

Comparison Table: BFT Protocols Across Domains

| Protocol | Domain | Membership | Fault Tolerance | Finality | Throughput | Latency |
|---|---|---|---|---|---|---|
| PBFT | General BFT | Closed | f < n/3 | Deterministic | 10K-80K ops/s | 1-10ms |
| HotStuff | General BFT | Closed | f < n/3 | Deterministic | 10K-100K+ ops/s | 5-20ms |
| Tendermint | Blockchain | Closed* | f < n/3 | Deterministic | 1K-10K TPS | 1-7s |
| Raft | General CFT | Closed | f < n/2 | Deterministic | 100K+ ops/s | <1ms |
| Nakamoto PoW | Blockchain | Open | < 50% hashrate | Probabilistic | 3-7 TPS | ~60 min |
| Casper FFG | Blockchain | Open** | f < n/3 stake | Deterministic | Varies | ~15 min |
| Algorand | Blockchain | Open | f < n/3 stake | Deterministic | 1K+ TPS | ~4s |

*Tendermint’s validator set can change over time via governance, but at any given height, membership is known.

**Casper FFG uses economic bonding to establish a known validator set from an open pool of potential validators.

Case Study: The Same Problem, Three Solutions

To make the BFT versus CFT versus blockchain decision concrete, consider a real scenario: three banks want to run a shared settlement system. They don’t fully trust each other. Each bank processes approximately 10,000 transactions per day that need to be jointly ordered and settled.

Solution A: Central Trusted Party + CFT

Appoint one bank (or a neutral third party) as the operator. Run a 3-node Raft cluster under their control. The other banks submit transactions via API and trust the operator’s system to be correct.

// Architecture: Central operator with CFT
Operator runs: 3-node Raft cluster
Bank A: submits transactions via authenticated API
Bank B: submits transactions via authenticated API
Bank C: reads results, verifies against own records

Fault tolerance: Crash faults in operator's cluster
Trust model: All banks trust the operator
Performance: 100K+ ops/sec, <1ms latency
Cost: 3 servers, standard ops team

Pros: Simple, fast, well-understood technology. Cons: Requires trusting the operator. If the operator is compromised or malicious, all bets are off. The other banks have no way to verify the operator didn’t reorder, drop, or fabricate transactions.

Solution B: Permissioned BFT

Each bank runs one (or more) BFT replicas. Use PBFT or a similar protocol with n = 4 (one per bank plus a tie-breaker, or one per bank if there are four banks).

// Architecture: Permissioned BFT
Bank A runs: 1 PBFT replica
Bank B runs: 1 PBFT replica
Bank C runs: 1 PBFT replica
Neutral party runs: 1 PBFT replica (or a 4th bank)

Fault tolerance: 1 Byzantine fault (f=1, n=4)
Trust model: Any 1 party can be fully Byzantine
Performance: 50K-80K ops/sec, 1-5ms latency (LAN)
Cost: 4 servers across 4 organizations, BFT expertise needed

Pros: No single party needs to be trusted. Any one bank can be compromised without affecting correctness. Every bank can independently verify the settlement log. Cons: Requires BFT expertise. Higher latency if banks are geographically distributed. More complex operations (4 organizations coordinating software upgrades, key rotation, etc.).

Solution C: Blockchain

Deploy a permissioned blockchain (e.g., Hyperledger Fabric, or a Cosmos chain with the three banks as validators).

// Architecture: Permissioned blockchain
Bank A runs: Validator node + application
Bank B runs: Validator node + application
Bank C runs: Validator node + application

Consensus: Tendermint-based (3 validators, can tolerate 0 Byzantine!)
// Wait — with n=3, f must be 0 for 3f+1. That's not useful.
// Need n=4 for f=1. Add a 4th validator.

Fault tolerance: Same as Solution B
Trust model: Same as Solution B
Performance: 1K-10K TPS, 1-7s block time
Cost: 4 servers, blockchain expertise, smart contract development

Pros: Gets the BFT guarantees plus an ecosystem of tools (explorers, wallets, smart contracts). Audit trail is built in. Cons: Significantly lower throughput than raw BFT. Block-based latency. Requires blockchain-specific expertise in addition to distributed systems expertise. The “blockchain” label may help or hurt politically depending on your organization.

The Verdict

For this scenario, Solution B is likely the best fit: the genuine distrust between banks justifies BFT, but the closed membership and moderate transaction volume don’t require blockchain infrastructure. Solution A is appropriate if the banks can agree on a trusted operator (they often can’t). Solution C is appropriate if the banks want the broader ecosystem features or plan to expand to many participants.

The point isn’t that one solution is universally better — it’s that the choice depends entirely on the trust model and operational requirements, and getting the trust model wrong means either paying for security you don’t need or not getting the security you do.

The Future: Convergence or Divergence?

The blockchain world and the distributed systems world have been on converging paths:

  • Blockchain is adopting classical BFT. Ethereum’s move to proof of stake with Casper FFG incorporates BFT finality. Cosmos was built on BFT from the start. Many newer blockchains use committee-based BFT.

  • Classical systems are adopting blockchain ideas. The concept of a verifiable, append-only log — the blockchain’s core data structure — has influenced systems like AWS QLDB, Hyperledger, and various audit-log systems. Even teams not using BFT consensus are using Merkle trees and hash chains for data integrity.

  • The lines are blurring. Tendermint is a “blockchain consensus protocol” used for non-blockchain applications. PBFT is a “classical BFT protocol” used in blockchain systems. HotStuff was published as an academic protocol and deployed in a blockchain. The protocol doesn’t care what you call the system it runs in.

What remains different is the deployment context. A 3-node Raft cluster in a single data center and a 100-validator Tendermint network spanning the globe face fundamentally different engineering challenges, even though both are solving the consensus problem. The choice between CFT and BFT, between permissioned and permissionless, between immediate and probabilistic finality — these are driven by the trust model and operational context, not by the protocol’s intrinsic properties.

The honest conclusion: for most systems, most of the time, crash fault tolerance is sufficient. When it’s not, the reason is almost always that the operators don’t trust each other — and that’s a social problem that technology can help with but not fully solve. BFT consensus gives you the ability to cooperate with parties you don’t trust. That’s a remarkable capability. Just make sure you actually need it before paying the price.

EPaxos and Leaderless Consensus

The Tyranny of the Leader

Every consensus protocol we have examined so far shares a common architectural assumption: someone has to be in charge. Multi-Paxos has its distinguished proposer. Raft has its leader. ZAB has its primary. Even Viewstamped Replication, despite its name suggesting something more democratic, funnels all decisions through a single node.

This works. It simplifies reasoning, reduces conflicts, and makes implementation tractable. It also creates a bottleneck that, in geo-distributed deployments, becomes the kind of performance problem that makes you question your career choices.

Consider a five-node cluster spread across US-East, US-West, Europe, Asia, and South America. With Raft, the leader sits in one of these regions. Every write must travel to the leader, then the leader must replicate to a majority. If the leader is in US-East, a client in Asia pays the Asia-to-US-East round trip plus the replication latency. The theoretical minimum of two message delays becomes, in practice, a transcontinental odyssey.
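To put rough numbers on that odyssey, here is a back-of-the-envelope sketch in Python. The round-trip times are illustrative assumptions, not measurements, and the model is deliberately crude: it charges whole RTTs sequentially and assumes the client is colocated with its nearest replica on the fast path.

```python
# Illustrative inter-region round-trip times in ms (assumed, symmetric).
RTT = {
    ("us-east", "us-west"): 60,    ("us-east", "europe"): 80,
    ("us-east", "asia"): 180,      ("us-east", "s-america"): 120,
    ("us-west", "europe"): 140,    ("us-west", "asia"): 100,
    ("us-west", "s-america"): 160, ("europe", "asia"): 200,
    ("europe", "s-america"): 180,  ("asia", "s-america"): 280,
}
REGIONS = ["us-east", "us-west", "europe", "asia", "s-america"]

def rtt(a: str, b: str) -> int:
    return 0 if a == b else RTT.get((a, b), RTT.get((b, a), 0))

def leader_commit_latency(client: str, leader: str) -> int:
    # Client reaches the leader, then the leader waits for a majority:
    # with 5 nodes, the 2nd-fastest of its 4 replication round trips.
    repl = sorted(rtt(leader, r) for r in REGIONS if r != leader)
    return rtt(client, leader) + repl[1]

def fast_path_latency(client: str) -> int:
    # Leaderless fast path: the client's local replica proposes and waits
    # for 3 matching PreAccept replies out of 4, i.e. the 3rd-fastest RTT.
    repl = sorted(rtt(client, r) for r in REGIONS if r != client)
    return repl[2]

print(leader_commit_latency("asia", "us-east"))  # 260: 180 to the leader + 80 to replicate
print(fast_path_latency("asia"))                 # 200: third-closest region to asia
```

Note that the fast path is bounded by the third-closest region rather than the nearest one, so under these assumed numbers a client sitting next to the leader sees little or no benefit; the win is for clients far from it.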

The dream of leaderless consensus is simple: any node can propose a command, and the system will figure out a consistent ordering without routing everything through a single point. EPaxos — Egalitarian Paxos, published by Iulian Moraru, David Andersen, and Michael Kaminsky in 2013 — is the most ambitious attempt at realizing this dream.

It is also, as we shall see, a cautionary tale about the distance between a brilliant idea and a correct implementation.

Why Leaders Are a Bottleneck

Before we dive into EPaxos, let us be precise about what the leader bottleneck actually costs us.

Latency asymmetry. In a geo-distributed Multi-Paxos deployment with the leader in region A, clients in region A enjoy low-latency writes (one local round trip plus replication). Clients in region E endure the full cross-region penalty. This asymmetry is not just annoying — it can violate SLA requirements for globally distributed applications.

Throughput ceiling. The leader must process every proposal, serialize it into the log, and coordinate replication. A single node’s CPU, memory bandwidth, and network capacity bound the system’s throughput. You can shard, of course, but then you are no longer solving the same problem.

Failover latency. When the leader fails, the system goes through an election. During this period — which can range from hundreds of milliseconds to several seconds depending on timeout configuration — the system is unavailable for writes. In a leaderless protocol, a single node failure does not create a global availability gap.

Load imbalance. The leader does more work than followers. It must handle client requests, manage the log, send AppendEntries, process responses, and advance the commit index. Followers mostly just respond to RPCs. This asymmetry in resource utilization is wasteful.

The appeal of leaderless consensus, then, is a system where any replica can handle any client request with optimal latency — one round trip to a fast-path quorum — and where load is naturally distributed across all replicas.

EPaxos: The Core Idea

EPaxos begins with a deceptively simple insight: if two commands do not interfere with each other — that is, they operate on different keys or different state — then their relative order does not matter. Only conflicting commands need to be ordered consistently across replicas.

This is the key departure from leader-based protocols, which impose a total order on all commands regardless of whether ordering is necessary. EPaxos imposes a total order only on commands that must be ordered (those that conflict) and allows non-conflicting commands to be executed in any order.

The mechanism for achieving this is dependency tracking. When a replica proposes a command, it collects information about which other commands the new command depends on — that is, which existing commands it conflicts with. These dependencies form a directed graph, and the execution order is determined by topologically sorting this graph (with a specific tie-breaking rule for cycles).
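What “interfere” means is workload-specific; for a key-value store it is usually “same key, at least one write.” A minimal sketch of that predicate and the dependency scan it drives (Python; the field names and instance-ID format are mine, for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    key: str
    is_write: bool

def conflicts(a: Command, b: Command) -> bool:
    # Reads commute with reads; anything else on the same key must be ordered.
    return a.key == b.key and (a.is_write or b.is_write)

def compute_deps(new_cmd: Command, log: dict) -> set:
    """Instance IDs of every logged command that conflicts with new_cmd."""
    return {inst_id for inst_id, cmd in log.items() if conflicts(cmd, new_cmd)}

log = {("R1", 1): Command("x", True), ("R2", 1): Command("y", True)}
assert compute_deps(Command("x", False), log) == {("R1", 1)}  # read of x depends on the write to x
assert compute_deps(Command("z", True), log) == set()         # fresh key: no deps, fast path likely
```

The coarser this predicate, the more “conflicts” you report and the more often you fall off the fast path; the finer it is, the more work each PreAccept does.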

The Instance Space

EPaxos organizes commands into an instance space indexed by (replica, instance_number). Each replica R maintains its own monotonically increasing instance counter. When replica R wants to propose a command, it assigns it to instance (R, i) for the next available i.

This is different from Multi-Paxos, where there is a single global log with a single sequence of slots. In EPaxos, each replica has its own “column” of instances, and the execution order is determined by the dependency graph, not by slot position.

Structure Instance:
    command: Command
    deps: Set<(ReplicaId, InstanceNumber)>  // dependencies
    seq: Integer                             // sequence number for ordering
    status: {PreAccepted, Accepted, Committed, Executed}
    ballot: BallotNumber

// Each replica maintains:
Structure ReplicaState:
    id: ReplicaId
    instances: Map<(ReplicaId, InstanceNumber), Instance>
    next_instance: Map<ReplicaId, InstanceNumber>  // next available instance per replica
    committed_up_to: Map<ReplicaId, InstanceNumber>

The Fast Path

The fast path is the common case — when there are no conflicts or when all replicas in the fast quorum agree on the same set of dependencies. It completes in a single round trip.

Procedure ProposeCommand(command):
    // Step 1: Leader assigns instance, computes initial dependencies
    inst_num = next_instance[self.id]++
    deps = {}
    seq = 0

    // Find all instances in our log that conflict with this command
    for each (replica, inst) in instances:
        if instances[(replica, inst)].command conflicts_with command:
            deps.add((replica, inst))
            seq = max(seq, instances[(replica, inst)].seq + 1)

    instance = Instance{
        command: command,
        deps: deps,
        seq: seq,
        status: PreAccepted,
        ballot: current_ballot
    }
    instances[(self.id, inst_num)] = instance

    // Step 2: Send PreAccept to fast quorum
    // Fast quorum (basic EPaxos): 2F replicas out of N = 2F + 1,
    // counting the proposing replica itself. The optimized variant
    // shrinks this to F + floor((F + 1) / 2).
    // For N=5 (F=2): the proposer plus 3 matching replies from the 4 other replicas.
    replies = SendToFastQuorum(PreAcceptMessage{
        instance: (self.id, inst_num),
        command: command,
        deps: deps,
        seq: seq,
        ballot: current_ballot
    })

    // Step 3: Check if all replies agree on deps and seq
    all_agree = true
    for each reply in replies:
        if reply.deps != deps or reply.seq != seq:
            all_agree = false
            break

    if all_agree:
        // FAST PATH: commit directly
        instance.status = Committed
        SendToAll(CommitMessage{
            instance: (self.id, inst_num),
            command: command,
            deps: deps,
            seq: seq
        })
    else:
        // SLOW PATH: need another round
        GoToSlowPath(inst_num, replies)
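Since the quorum arithmetic trips people up, here is a tiny Python helper computing the sizes (the function names are mine):

```python
def cft_quorum(n: int) -> int:
    """Classic majority quorum (Paxos/Raft): floor(n/2) + 1."""
    return n // 2 + 1

def epaxos_fast_quorum(n: int, optimized: bool = False) -> int:
    """Fast-path quorum size for EPaxos, counting the proposing replica.

    Basic EPaxos uses 2F replicas; the optimized variant uses
    F + floor((F + 1) / 2), where N = 2F + 1.
    """
    f = (n - 1) // 2
    return f + (f + 1) // 2 if optimized else 2 * f

# For a 5-node cluster: majority is 3; the basic fast quorum is 4
# (the proposer plus 3 matching replies), the optimized one is 3.
print(cft_quorum(5), epaxos_fast_quorum(5), epaxos_fast_quorum(5, optimized=True))
```

The fast quorum being larger than a majority is the price of committing in one round trip: a later recovery must be able to deduce what the fast path decided from any majority it can reach.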

Handling PreAccept on a Replica

When a replica receives a PreAccept message, it must do its own dependency computation. This is where the subtlety begins.

Procedure HandlePreAccept(msg):
    // Compute our own view of dependencies
    local_deps = copy(msg.deps)  // copy the leader's deps so we don't mutate them
    local_seq = msg.seq

    for each (replica, inst) in instances:
        if instances[(replica, inst)].command conflicts_with msg.command:
            local_deps.add((replica, inst))
            local_seq = max(local_seq, instances[(replica, inst)].seq + 1)

    // Store the instance
    instances[(msg.sender, msg.inst_num)] = Instance{
        command: msg.command,
        deps: local_deps,
        seq: local_seq,
        status: PreAccepted,
        ballot: msg.ballot
    }

    // Reply with our computed deps and seq
    Reply(PreAcceptReply{
        deps: local_deps,
        seq: local_seq,
        instance: (msg.sender, msg.inst_num)
    })

The critical point: if a replica has seen commands that the proposing replica has not, it will include additional dependencies. If all replicas in the fast quorum agree on the same dependencies (including any additions), the fast path succeeds. If they disagree — because different replicas have seen different sets of commands — the slow path is needed.

The Slow Path

The slow path adds one more round of communication. The proposing replica takes the union of all dependencies reported by the fast quorum replicas and runs a Paxos-like Accept phase.

Procedure GoToSlowPath(inst_num, preaccept_replies):
    // Union all dependencies from all replies and our own
    merged_deps = instances[(self.id, inst_num)].deps
    merged_seq = instances[(self.id, inst_num)].seq

    for each reply in preaccept_replies:
        merged_deps = merged_deps UNION reply.deps
        merged_seq = max(merged_seq, reply.seq)

    // Update our instance
    instances[(self.id, inst_num)].deps = merged_deps
    instances[(self.id, inst_num)].seq = merged_seq
    instances[(self.id, inst_num)].status = Accepted

    // Phase 2: Accept (classic Paxos majority quorum)
    replies = SendToMajority(AcceptMessage{
        instance: (self.id, inst_num),
        command: instances[(self.id, inst_num)].command,
        deps: merged_deps,
        seq: merged_seq,
        ballot: current_ballot
    })

    // If majority accepts, commit
    if MajorityAccepted(replies):
        instances[(self.id, inst_num)].status = Committed
        SendToAll(CommitMessage{
            instance: (self.id, inst_num),
            command: instances[(self.id, inst_num)].command,
            deps: merged_deps,
            seq: merged_seq
        })

The slow path requires two round trips total (PreAccept + Accept), which is the same as standard Paxos. So in the worst case, EPaxos is no worse than Multi-Paxos. In the common case (no conflicts), it completes in one round trip from any replica — a genuine improvement for geo-distributed systems.

Execution Ordering: Where the Fun Really Begins

Getting commands committed is only half the battle. The other half — and this is where EPaxos gets genuinely tricky — is determining the execution order.

Each committed instance has a set of dependencies. These form a directed graph. To execute commands consistently across all replicas, every replica must compute the same execution order from this graph.

The algorithm proceeds as follows:

  1. Build the dependency graph for committed instances.
  2. Find strongly connected components (SCCs) using Tarjan’s algorithm.
  3. Execute SCCs in reverse topological order.
  4. Within each SCC, break ties using the sequence number (seq) and instance ID.

Tarjan’s Algorithm in EPaxos

The use of Tarjan’s algorithm is not arbitrary — it is necessary because the dependency graph can contain cycles. Command A might depend on command B (because replica 1 saw B before A’s PreAccept), while command B depends on A (because replica 2 saw A before B’s PreAccept). When dependencies are unioned in the slow path, both dependency edges survive, creating a cycle.

Procedure ExecuteCommands():
    // Build dependency graph from committed instances
    graph = BuildDependencyGraph()

    // Find SCCs using Tarjan's algorithm
    sccs = TarjanSCC(graph)

    // sccs is returned in reverse topological order by Tarjan's
    for each scc in sccs:
        // Sort instances within SCC by (seq, replica_id, instance_number)
        sorted_instances = Sort(scc, key = (inst.seq, inst.replica_id, inst.inst_num))

        for each instance in sorted_instances:
            if instance.status != Executed:
                // Must wait until all dependencies outside this SCC are executed
                for each dep in instance.deps:
                    if dep not in scc:
                        WaitUntilExecuted(dep)

                Execute(instance.command)
                instance.status = Executed

Structure TarjanState:
    index: Integer = 0
    stack: Stack<InstanceId>
    on_stack: Set<InstanceId>
    indices: Map<InstanceId, Integer>
    lowlinks: Map<InstanceId, Integer>
    sccs: List<List<InstanceId>>

Procedure TarjanSCC(graph):
    state = TarjanState{}

    for each node in graph:
        if node not in state.indices:
            StrongConnect(state, graph, node)

    return state.sccs

Procedure StrongConnect(state, graph, v):
    state.indices[v] = state.index
    state.lowlinks[v] = state.index
    state.index++
    state.stack.push(v)
    state.on_stack.add(v)

    for each w in graph.successors(v):
        if w not in state.indices:
            // w has not been visited
            StrongConnect(state, graph, w)
            state.lowlinks[v] = min(state.lowlinks[v], state.lowlinks[w])
        else if w in state.on_stack:
            state.lowlinks[v] = min(state.lowlinks[v], state.indices[w])

    if state.lowlinks[v] == state.indices[v]:
        // v is the root of an SCC
        scc = []
        repeat:
            w = state.stack.pop()
            state.on_stack.remove(w)
            scc.append(w)
        until w == v
        state.sccs.append(scc)
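As a sanity check on the pseudocode above, here is a runnable miniature in Python: two mutually dependent instances collapse into a single SCC, which is then ordered by seq (the graph and seq values are invented for illustration):

```python
def tarjan_sccs(graph: dict) -> list:
    """Return strongly connected components in reverse topological order."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:   # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# A depends on B and B depends on A (each replica saw the other's
# command first); seq breaks the tie inside the resulting SCC.
deps = {"A": ["B"], "B": ["A"]}
seq = {"A": 2, "B": 1}
[scc] = tarjan_sccs(deps)                    # one SCC containing both
order = sorted(scc, key=lambda v: seq[v])    # execute B, then A
assert set(scc) == {"A", "B"} and order == ["B", "A"]
```

Every replica that commits the same (deps, seq) pairs runs this same deterministic procedure, which is what makes the execution order consistent despite there being no leader to dictate it.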

The Execution Blocking Problem

Here is a practical headache the paper does not dwell on: to execute an instance, you must first know the status of all its dependencies. If a dependency is not yet committed — perhaps the proposing replica is slow, or the PreAccept messages have not arrived — you are stuck. You cannot execute the instance, and you cannot execute anything that depends on it.

In the worst case, a single slow instance can block a cascade of executions. The solution is explicit commit: if you need to know whether instance (R, i) is committed and you have not heard, you must run a Paxos round to force the decision. This recovery protocol adds significant implementation complexity.

Procedure WaitUntilCommitted(instance_id):
    if instances[instance_id].status >= Committed:
        return  // already committed

    // Try to learn the committed value
    // Option 1: ask the owner replica
    reply = Ask(instance_id.replica, StatusRequest{instance: instance_id})
    if reply.status >= Committed:
        instances[instance_id] = reply.instance
        return

    // Option 2: run explicit Prepare (Paxos Phase 1) to recover
    RunExplicitPrepare(instance_id)

This recovery path is essentially a full Paxos round, which means that in the presence of failures or slow replicas, EPaxos execution can be significantly delayed. The paper presents this as a straightforward extension; implementers describe it as a major source of bugs.

The Correctness Story: A Humbling Episode

In 2020 — seven years after publication — Pierre Sutra published a paper titled “On the correctness of Egalitarian Paxos” that identified a bug in the EPaxos execution algorithm. The specific issue was in how dependencies were handled during recovery after a replica failure.

The problem was subtle. Consider this scenario:

  1. Replica 1 proposes command A at instance (1, 1).
  2. Replica 2 proposes command B at instance (2, 1), with B conflicting with A.
  3. A’s PreAccept reaches some replicas but not others before replica 1 fails.
  4. A recovery procedure is initiated for instance (1, 1).

The bug manifested when the recovery procedure could commit instance (1, 1) with a set of dependencies that was inconsistent with what had been committed at other instances. This could lead to different replicas computing different execution orders — a violation of the fundamental safety property.

The fix required modifications to the recovery protocol, adding additional checks to ensure dependency consistency. The corrected protocol, sometimes called EPaxos*, is what practitioners should implement.

This episode is worth dwelling on for a moment. EPaxos was published at SOSP 2013, one of the most prestigious systems conferences. It was peer-reviewed, formally described, and accompanied by a proof sketch. And it still had a correctness bug that went undetected for seven years.

This is not an indictment of the authors — it is an indictment of the inherent difficulty of getting leaderless consensus right. The state space explosion that comes from allowing any replica to propose, combined with the dependency tracking mechanism, creates a protocol whose correctness is extraordinarily hard to verify by inspection.

Performance: The Geo-Distributed Sweet Spot

EPaxos was designed for one specific scenario: geo-distributed deployments where the leader bottleneck creates unacceptable latency asymmetry.

Non-conflicting commands, fast path (common case):

Client    Replica_R    Replica_A    Replica_B    Replica_C    Replica_D
  |           |            |            |            |            |
  |--Cmd----->|            |            |            |            |
  |           |--PreAcc--->|            |            |            |
  |           |--PreAcc---------------->|            |            |
  |           |--PreAcc----------------------------->|            |
  |           |--PreAcc------------------------------------------>|
  |           |<--OK-------|            |            |            |
  |           |<--OK--------------------|            |            |
  |           |<--OK---------------------------------|            |
  |           |  (3 matching replies = fast quorum for N=5)       |
  |<--Done----|            |            |            |            |
  |           |--Commit--->|            |            |            |
  |           |--Commit---------------->|            |            |
  |           |--Commit----------------------------->|            |
  |           |--Commit------------------------------------------>|

Total latency: one round trip to the nearest fast-quorum replicas. For a 5-node cluster, we need 3 out of 4 replicas to respond identically. In a geo-distributed deployment, this means the latency is determined by the third-closest replica, not by the leader’s location.
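
The "third-closest replica" claim can be made concrete with a toy calculation. A minimal sketch, assuming made-up per-peer RTTs and the simple (non-optimized) fast quorum of 3 replies for N=5; `fast_path_latency` is a name invented for illustration:

```python
# Toy model: the command leader must collect identical PreAccept replies
# from a fast quorum. For N=5 that is 3 of the 4 other replicas, so the
# fast-path latency is the RTT to the 3rd-closest peer.
# (Illustrative RTTs in ms, not from any real deployment.)

def fast_path_latency(rtts_to_peers, fast_quorum_replies=3):
    """Latency = RTT to the k-th closest peer (k = replies needed)."""
    return sorted(rtts_to_peers)[fast_quorum_replies - 1]

# A replica in Virginia proposing to peers in Oregon, Ireland, Tokyo, Sydney:
rtts = [70, 80, 150, 200]
print(fast_path_latency(rtts))  # 150: the third-closest peer dominates
```

The leader's own location drops out of the calculation entirely, which is exactly the point of the leaderless design.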

Conflicting commands, slow path:

Client    Replica_R    Replica_A    Replica_B
  |           |            |            |
  |--Cmd----->|            |            |
  |           |--PreAcc--->|            |
  |           |--PreAcc---------------->|
  |           |<-PreAccOK--|  (different deps!)
  |           |<-PreAccOK---------------|
  |           |            |            |
  |           |  (deps disagree, go to slow path)
  |           |            |            |
  |           |--Accept--->|            |
  |           |--Accept---------------->|
  |           |<--AccOK----|            |
  |           |<--AccOK-----------------|
  |<--Done----|            |            |
  |           |--Commit--->|            |
  |           |--Commit---------------->|

Total latency: two round trips. Same as Multi-Paxos.

Comparison with Multi-Paxos

| Metric | Multi-Paxos | EPaxos (no conflict) | EPaxos (conflict) |
|---|---|---|---|
| Round trips | 1 (from leader) | 1 (from any replica) | 2 (from any replica) |
| Optimal client latency | Near-zero (if co-located with leader) | Near-zero (always, via nearest replica) | One RTT to farthest quorum member |
| Worst-case client latency | 2x cross-region RTT | 1x cross-region RTT | 2x cross-region RTT |
| Throughput bottleneck | Leader node | None (distributed) | Conflict-heavy keys |
| Message complexity | O(N) per command | O(N) per command | O(N) per command |
| Implementation complexity | Moderate | Very High | Very High |

The latency advantage is most dramatic for the common case of non-conflicting commands in geo-distributed settings. If your workload is 95% non-conflicting and your nodes are spread across continents, EPaxos can roughly halve your average latency compared to Multi-Paxos.

If your workload is heavily conflicting, or if your nodes are in the same datacenter (where cross-node latency is microseconds, not milliseconds), the advantage largely disappears and you are left paying the complexity tax for no benefit.
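
To see where "roughly halve" comes from, here is a back-of-envelope model of expected round trips per command, assuming a 5% conflict rate and treating Multi-Paxos from a remote client as roughly two cross-region round trips (both numbers are illustrative assumptions, not measurements):

```python
# Expected cross-region round trips per command under EPaxos,
# given a conflict rate (fraction of commands hitting the slow path).
def expected_rtts(conflict_rate):
    # EPaxos: 1 RTT on the fast path, 2 on the slow path.
    return (1 - conflict_rate) * 1 + conflict_rate * 2

epaxos = expected_rtts(0.05)  # ~1.05 RTTs, from the nearest replicas
multi_paxos = 2.0             # remote client -> leader -> quorum, ~2 RTTs
print(epaxos, multi_paxos)
```

At a 50% conflict rate the expected cost rises to 1.5 RTTs, and the gap over Multi-Paxos narrows accordingly.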

The Conflict Rate Problem

EPaxos’s fast path depends on low conflict rates. But what counts as a conflict?

In the original formulation, two commands conflict if they access the same key and at least one is a write. This means that a hot key — one that receives a disproportionate share of writes — will push commands to the slow path frequently.
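
The conflict rule in this formulation is small enough to state as code. A sketch, with the `(key, is_write)` command shape assumed purely for illustration:

```python
# Two commands conflict if they touch the same key and at least
# one of them is a write. Reads on the same key commute.
def conflicts(cmd_a, cmd_b):
    (key_a, write_a), (key_b, write_b) = cmd_a, cmd_b
    return key_a == key_b and (write_a or write_b)

print(conflicts(("x", True), ("x", False)))   # True: write vs read, same key
print(conflicts(("x", False), ("x", False)))  # False: two reads commute
print(conflicts(("x", True), ("y", True)))    # False: different keys
```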

Worse, the conflict check happens at the instance level, not the command level. If replica A has seen 1000 commands since the last time it synchronized with replica B, then any new command from B that conflicts with any of those 1000 commands will generate different dependency sets. The more out-of-sync replicas are, the more conflicts occur, even if the actual command conflict rate is low.

This creates an unfortunate feedback loop: high load leads to more in-flight commands, which leads to more dependency divergence, which leads to more slow-path executions, which increases latency, which increases the number of in-flight commands.

In practice, workloads with even moderate contention on popular keys spend a surprising amount of time on the slow path.

Atlas, Caesar, and the Leaderless Zoo

EPaxos inspired a family of leaderless consensus protocols, each trying to address different limitations.

Atlas (2020) simplifies EPaxos’s dependency tracking by using a different approach to ordering. Instead of tracking per-command dependencies, Atlas uses a “timestamp” approach where each replica assigns a timestamp to each command, and the final timestamp is the maximum across a quorum. This eliminates the complex dependency graph and Tarjan’s algorithm but requires all replicas in the quorum to respond (not just a majority within the quorum agreeing on dependencies).

Caesar (2017) introduces a technique for handling conflicts more gracefully. When conflicts are detected, Caesar uses a “wait-free” mechanism that avoids the slow path in more cases. The key insight is that if the conflicting commands can be ordered by their timestamps, no additional round trip is needed — the timestamp ordering is sufficient. Caesar only falls back to the slow path when commands have identical timestamps (rare in practice).

Mencius (2008, predating EPaxos) takes a different approach entirely. Instead of dependency tracking, Mencius pre-assigns log slots to replicas in a round-robin fashion. Replica 0 owns slots 0, 3, 6, …; replica 1 owns slots 1, 4, 7, …; and so on. Each replica runs Paxos independently for its own slots. This provides load balancing without the complexity of dependency tracking, but it means every replica must participate in every “round” — a slow replica slows everyone down.
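
Mencius's slot pre-assignment is just modular arithmetic. A sketch (real Mencius also needs a "skip" mechanism so an idle replica does not stall the slots it owns):

```python
# Mencius pre-assigns log slots round-robin: slot s belongs to
# replica s mod N.
def slot_owner(slot, n_replicas):
    return slot % n_replicas

assigned = [slot_owner(s, 3) for s in range(9)]
print(assigned)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```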

Tempo (2021) attempts to achieve the best of both worlds: leaderless operation with simpler execution ordering. Tempo uses a clock-based approach where each command is assigned a timestamp, and execution order follows timestamp order. The protocol ensures that conflicting commands always receive ordered timestamps by having replicas propose timestamps and taking the maximum.

Comparative Summary

| Protocol | Leader? | Fast path RTTs | Handles conflicts? | Execution ordering |
|---|---|---|---|---|
| Multi-Paxos | Yes | 1 (from leader) | N/A (total order) | Log order |
| EPaxos | No | 1 | Slow path (2 RTTs) | Dependency graph + Tarjan |
| Atlas | No | 1 | Better than EPaxos | Timestamp-based |
| Caesar | No | 1 | Timestamp ordering | Timestamp-based |
| Mencius | No | 1 | N/A (pre-assigned slots) | Slot order |

When Leaderless Consensus Actually Helps

After all this complexity, let us be honest about when leaderless consensus is worth the trouble.

It helps when:

  • Nodes are geo-distributed across multiple regions (cross-region latency >> intra-region latency).
  • The workload has low to moderate conflict rates.
  • Latency symmetry matters — all clients should see similar latency regardless of their region.
  • You have a team capable of implementing, testing, and debugging a protocol of this complexity.

It does not help when:

  • All nodes are in the same datacenter (the leader bottleneck is negligible).
  • The workload is heavily conflicting (you will spend most of your time on the slow path).
  • You need simplicity and auditability (Raft is dramatically easier to understand and verify).
  • Your team is not prepared for the implementation complexity.

It is complexity for complexity’s sake when:

  • You could achieve the same result with Multi-Paxos and client-side routing to the nearest leader in a multi-group setup.
  • Your actual bottleneck is not consensus latency but application logic, storage I/O, or network bandwidth.
  • You are building a prototype or a system that will be maintained by a small team.

Implementation Reality Check

The distance between the EPaxos paper and a production implementation is vast. Here are the things the paper does not adequately address:

Garbage collection. The instance space grows without bound. You need a mechanism to prune old instances once they have been executed by all replicas. This requires its own protocol — essentially a distributed garbage collection scheme with its own consistency requirements.

Snapshotting and recovery. When a new replica joins or a failed replica recovers, it needs to reconstruct the current state. With Multi-Paxos, this is relatively straightforward: transfer the log and snapshot. With EPaxos, you need to transfer the dependency graph, the execution state, and all un-garbage-collected instances across all replica columns.

Configuration changes. Adding or removing replicas changes the quorum sizes, including the fast-path quorum. EPaxos’s quorum requirements are more complex than Raft’s joint consensus, and the paper does not provide a complete configuration change protocol.

Read leases and linearizable reads. In Multi-Paxos, the leader can serve linearizable reads locally (with a lease or by confirming leadership). In EPaxos, there is no leader. Linearizable reads require either running a full consensus round or implementing a read protocol that checks with a quorum, which partly negates the latency advantage.

Testing. The state space of EPaxos is enormous. Every possible interleaving of PreAccept, Accept, and Commit messages across all replicas, combined with the dependency tracking, creates a combinatorial explosion. Jepsen-style testing is essential but not sufficient. Model checking (TLA+ or similar) is practically mandatory.

A Final Assessment

EPaxos is a genuinely brilliant protocol. The insight that non-conflicting commands can be fast-path committed from any replica in a single round trip is both correct and useful. The dependency tracking mechanism is elegant.

But brilliance and practicality are different things. The protocol’s complexity — particularly the execution ordering algorithm, the recovery protocol, and the correctness bugs found years after publication — means that for most practitioners, the safer choice is a well-implemented Multi-Paxos or Raft with thoughtful deployment topology.

If you are operating at a scale where geo-distributed consensus latency is your actual bottleneck, and you have a team with deep distributed systems expertise, EPaxos (or one of its descendants like Tempo or Atlas) may be worth the investment. For everyone else, the agony is not worth the optimization.

The leaderless dream is real. The leaderless implementation is a nightmare. Choose your suffering accordingly.

Flexible Paxos and Quorum Relaxation

The Insight That Was Hiding in Plain Sight

In 2016, Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman published a paper that made the distributed systems community collectively facepalm. The paper, “Flexible Paxos: Quorum Intersection Revisited,” presented an insight so simple, so obvious in retrospect, that you have to wonder how it escaped notice for nearly three decades.

Here it is: in Paxos, Phase 1 (Prepare) quorums and Phase 2 (Accept) quorums do not need to be the same size. They do not even need to be majorities. The only requirement is that every Phase 1 quorum intersects with every Phase 2 quorum.

That is it. That is the whole insight.

Classic Paxos uses majority quorums for both phases because majorities trivially intersect — any two majorities of the same set must share at least one member. But majority-majority is merely the most symmetric configuration that satisfies the intersection requirement. It is not the only one, and it is not always the best one.

The implications are significant. By adjusting the quorum sizes for each phase, you can trade Phase 1 availability for Phase 2 availability, or optimize for read-heavy versus write-heavy workloads, or reduce the number of replicas that need to participate in the common-case operation (Phase 2, since Phase 1 only runs during leader election).

How this remained unformalized for decades is a question that says something uncomfortable about how we read our own foundational papers.

The Mathematics of Quorum Intersection

Let us be precise. Consider a system with N replicas. Define:

  • Q1 = the set of valid Phase 1 (Prepare) quorums
  • Q2 = the set of valid Phase 2 (Accept) quorums

Classic Paxos requirement: |q1| > N/2 and |q2| > N/2 for all q1 in Q1 and q2 in Q2.

Flexible Paxos requirement: For all q1 in Q1 and for all q2 in Q2, q1 intersection q2 is non-empty.

That is: q1 ∩ q2 ≠ ∅ for every valid pair (q1, q2).

Why does this work? Recall what the quorum intersection actually provides in Paxos:

  1. Phase 1 (Prepare): A proposer contacts a Phase 1 quorum to discover any previously accepted values. The proposer must learn about any value that might have been chosen in Phase 2.

  2. Phase 2 (Accept): A proposer contacts a Phase 2 quorum to accept its proposed value. If a Phase 2 quorum accepts a value, that value is potentially chosen.

The intersection ensures that if a value v was accepted by a Phase 2 quorum q2, then any future Phase 1 quorum q1 will contain at least one member of q2 — and that member will report that it accepted v. This is what prevents conflicting values from being chosen.

The size of the quorums does not matter. What matters is intersection.

The Simple Quorum Arithmetic

For the simple case where all Phase 1 quorums have size Q1 and all Phase 2 quorums have size Q2, the intersection requirement reduces to:

Q1 + Q2 > N

This is because if Q1 + Q2 > N, then by the pigeonhole principle, any Q1-sized subset and any Q2-sized subset of N elements must overlap.
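
The pigeonhole claim is easy to verify exhaustively for a small N. A brute-force sketch, not a proof:

```python
# Exhaustive check: for N replicas, every Q1-subset intersects every
# Q2-subset exactly when Q1 + Q2 > N.
from itertools import combinations

def always_intersect(n, q1, q2):
    replicas = range(n)
    return all(set(a) & set(b)
               for a in combinations(replicas, q1)
               for b in combinations(replicas, q2))

n = 5
for q1 in range(1, n + 1):
    for q2 in range(1, n + 1):
        assert always_intersect(n, q1, q2) == (q1 + q2 > n)
print("intersection <=> Q1 + Q2 > N, verified for N =", n)
```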

Classic Paxos sets Q1 = Q2 = floor(N/2) + 1, giving Q1 + Q2 = N + 1 for odd N and N + 2 for even N, comfortably above N either way. But consider the alternatives for N = 10:

| Configuration | Q1 (Phase 1) | Q2 (Phase 2) | Q1 + Q2 | Phase 1 fault tolerance | Phase 2 fault tolerance |
|---|---|---|---|---|---|
| Classic | 6 | 6 | 12 | 4 failures | 4 failures |
| Write-optimized | 9 | 2 | 11 | 1 failure | 8 failures |
| Read-optimized | 2 | 9 | 11 | 8 failures | 1 failure |
| Balanced-asymmetric | 7 | 4 | 11 | 3 failures | 6 failures |
| Extreme write | 10 | 1 | 11 | 0 failures | 9 failures |

The “Write-optimized” configuration is remarkable: Phase 2 (the common case during normal operation) requires only 2 out of 10 replicas. This means you can tolerate 8 replica failures during steady-state operation. The cost is that Phase 1 (leader election) requires 9 out of 10 replicas — you can tolerate at most 1 failure during a leader election. But leader elections are rare, so this may be an excellent trade.

The “Extreme write” row is degenerate but instructive: if Phase 1 requires all 10 replicas, Phase 2 requires only 1. This means the leader can commit decisions unilaterally — no replication needed during normal operation. Of course, any single failure makes leader election impossible, so this is only useful in very specific scenarios.

Why This Took Decades to Formalize

Lamport’s original Paxos paper uses majority quorums. The “Paxos Made Simple” paper uses majority quorums. Every textbook uses majority quorums. The majority requirement is stated so confidently and so consistently that it reads as fundamental rather than as a design choice.

But go back and read the proof carefully. The proof never actually requires majorities. It requires intersection. Majorities are presented as the way to achieve intersection, not as the requirement itself. The conflation of mechanism (majorities) with requirement (intersection) persisted for decades.

There are a few reasons this happened:

Symmetry is natural. When you are designing a protocol, having the same quorum size for both phases is the obvious default. It is symmetric, easy to reason about, and easy to explain. Asymmetric quorums feel like they need justification.

The proof sketches are brief. Lamport’s proofs are famously concise. The majority requirement is stated as a lemma, and the lemma is proved in a sentence or two. Nobody stopped to ask whether a weaker condition would suffice because the stronger condition was easy to verify and clearly correct.

Practice preceded theory. Systems like ZooKeeper, etcd, and Consul all use majority quorums because that is what the papers said to use. By the time someone thought to question the requirement, an entire ecosystem was built on the assumption.

It did not matter enough. For most practical deployments (3 or 5 nodes in a single datacenter), the distinction between majority quorums and flexible quorums is irrelevant. The optimization only matters at larger scales or in geo-distributed settings — exactly the scenarios that became more common in the 2010s.

Flexible Paxos: Modified Protocol

The modifications to standard Paxos are minimal. The protocol logic is identical; only the quorum sizes change.

// Configuration
Structure FlexiblePaxosConfig:
    N: Integer           // total number of replicas
    Q1_size: Integer     // Phase 1 quorum size
    Q2_size: Integer     // Phase 2 quorum size
    // INVARIANT: Q1_size + Q2_size > N

Procedure Propose(config, value):
    // Phase 1: Prepare — contact Q1_size replicas
    ballot = NextBallot()
    promises = {}

    // Send Prepare to ALL replicas, wait for Q1_size responses
    SendToAll(Prepare{ballot: ballot})
    promises = WaitForResponses(count = config.Q1_size)

    // Find the highest-ballot accepted value, if any
    highest_accepted = None
    for each promise in promises:
        if promise.accepted_value is not None:
            if highest_accepted is None
               or promise.accepted_ballot > highest_accepted.accepted_ballot:
                highest_accepted = promise

    if highest_accepted is not None:
        value = highest_accepted.accepted_value  // must propose this value

    // Phase 2: Accept — contact Q2_size replicas
    SendToAll(Accept{ballot: ballot, value: value})
    accepts = WaitForResponses(count = config.Q2_size)

    if |accepts| >= config.Q2_size:
        // Value is chosen
        SendToAll(Decide{value: value})

Procedure HandlePrepare(msg):
    if msg.ballot > promised_ballot:
        promised_ballot = msg.ballot
        Reply(Promise{
            ballot: msg.ballot,
            accepted_ballot: accepted_ballot,
            accepted_value: accepted_value
        })
    else:
        Reply(Nack{ballot: promised_ballot})

Procedure HandleAccept(msg):
    if msg.ballot >= promised_ballot:
        promised_ballot = msg.ballot
        accepted_ballot = msg.ballot
        accepted_value = msg.value
        Reply(Accepted{ballot: msg.ballot})
    else:
        Reply(Nack{ballot: promised_ballot})

The code is virtually identical to standard Paxos. The only difference is in the WaitForResponses calls: Phase 1 waits for Q1_size responses, and Phase 2 waits for Q2_size responses.

Multi-Paxos with Flexible Quorums

The real payoff comes when you apply Flexible Paxos to Multi-Paxos, where Phase 1 is run once during leader election and Phase 2 is run for every command.

Procedure MultiPaxosLeaderElection(config):
    // Phase 1 for all future slots — requires Q1 quorum
    ballot = NextBallot()
    SendToAll(Prepare{ballot: ballot})
    promises = WaitForResponses(count = config.Q1_size)

    // Process promises for all slots that have accepted values
    for each promise in promises:
        for each (slot, accepted_ballot, accepted_value) in promise.accepted_slots:
            UpdateSlot(slot, accepted_ballot, accepted_value)

    // Now we are the leader — Phase 2 uses Q2 quorum
    self.is_leader = true
    self.leader_ballot = ballot

Procedure MultiPaxosReplicate(config, slot, command):
    // Only Phase 2 — runs for every command
    assert self.is_leader

    SendToAll(Accept{
        ballot: self.leader_ballot,
        slot: slot,
        value: command
    })
    accepts = WaitForResponses(count = config.Q2_size)

    if |accepts| >= config.Q2_size:
        CommitSlot(slot, command)

With a write-optimized configuration (say N=10, Q1=9, Q2=2), every command only needs one additional replica’s acknowledgment to commit. Leader election is painful (requires 9 out of 10), but it is rare. The steady-state performance is dramatically better.

Practical Applications

WAN-Optimized Deployments

Consider five datacenters: New York, London, Tokyo, Sydney, and Sao Paulo. With classic Paxos (majority = 3), every write must wait for acknowledgment from the leader’s datacenter plus two others. If the leader is in New York, a write must traverse at least to London and one other DC.

With Flexible Paxos (Q1 = 4, Q2 = 2), steady-state writes only need the leader plus one other replica. The leader in New York needs only London’s acknowledgment (the closest replica) for each write. Leader election requires four out of five replicas, but that is acceptable for a rare operation.

The latency improvement for the common case:

Classic (N=5, Q2=3):
    Write latency = RTT to 2nd-closest replica from leader
    Example: Leader in NYC, must reach London AND Tokyo
    Latency ~= max(NYC->London, NYC->Tokyo) ~= 180ms

Flexible (N=5, Q1=4, Q2=2):
    Write latency = RTT to closest replica from leader
    Example: Leader in NYC, must reach London only
    Latency ~= NYC->London ~= 75ms

Improvement: 2.4x for this configuration

Read-Heavy Workloads

For read-heavy workloads where reads must be linearizable (via Paxos-based reads or read leases), you might want the opposite trade: small Phase 1 quorums (for fast reads that require a prepare-like check) and large Phase 2 quorums.

Actually, let me be more careful here. The relationship between Flexible Paxos quorums and read optimizations depends on the specific read protocol. If reads go through the leader (which holds a lease), the Flexible Paxos configuration does not directly affect read performance. But if you implement reads as separate Paxos instances (Phase 1 to discover the latest committed value), then a small Q1 directly speeds up reads.

Tuning for Failure Probability

In practice, simultaneous failure of multiple replicas is rare. If your failure model says “at most 1 replica fails simultaneously with high probability,” you can set Q1 = N and Q2 = 1 (or close to it). The system operates with minimal replication overhead in the common case and relies on the assumption that catastrophic multi-failure scenarios are rare enough to accept the risk.

This is a legitimate engineering trade. The question is not “is this safe?” but “is the failure probability acceptable for my use case?” A system that commits with Q2 = 2 out of 10 can tolerate 8 simultaneous failures during steady state but only 1 failure during leader election. If your risk analysis says multi-failure during the brief leader election window is acceptable, this is a valid configuration.
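
The risk analysis can be made quantitative with a simple independence model. A sketch, assuming each replica fails independently with probability p (real failures are correlated, so treat the numbers as optimistic):

```python
# P(a phase needing Q of N replicas can proceed) = P(at least Q are up),
# with each replica up independently with probability 1 - p.
from math import comb

def quorum_availability(n, q, p):
    """Probability that at least q of n replicas are up."""
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
               for k in range(q, n + 1))

n, p = 10, 0.01
print(quorum_availability(n, 2, p))  # steady state (Q2 = 2): essentially 1
print(quorum_availability(n, 9, p))  # leader election (Q1 = 9): high, but
                                     # every additional failure hurts
```

The asymmetry is the whole trade: the common-case phase is nearly immune to failures, while the rare phase absorbs almost all of the risk.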

Generalized Quorum Systems

Flexible Paxos uses simple quorum systems (any subset of the right size). But the intersection requirement enables much richer quorum structures.

Grid Quorums

Arrange N = r x c replicas in a grid with r rows and c columns. Define:

  • Phase 1 quorum: one full column plus one replica from every other column
  • Phase 2 quorum: one full row

For a 3x3 grid (9 replicas):

  • Phase 1 quorum size: 3 + 2 = 5
  • Phase 2 quorum size: 3

Intersection is guaranteed because a full row and a full column in a grid always share exactly one cell. Any set containing a full column intersects with any set containing a full row.

Grid arrangement (9 replicas):

    Col1  Col2  Col3
Row1: A     B     C
Row2: D     E     F
Row3: G     H     I

Phase 2 quorum (Row 1): {A, B, C}
Phase 1 quorum (Col 2 + one from each other): {B, E, H, A, I}
Intersection: {A, B} (B is the guaranteed cell, Row 1 ∩ Col 2; A overlaps by chance)

Grid quorums are useful when replicas have different latency characteristics. You can arrange the grid so that Phase 2 quorums (rows) correspond to geographically close replicas, minimizing common-case latency.
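
The intersection argument for the 3x3 grid can be checked exhaustively. A sketch over the nine-replica example above:

```python
# Verify: every full row (Phase 2 quorum) intersects every
# "full column + one replica from each other column" Phase 1 quorum.
from itertools import product

grid = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]

rows = [set(r) for r in grid]
cols = [set(c) for c in zip(*grid)]

def phase1_quorums():
    # one full column, plus one replica chosen from each other column
    for i, col in enumerate(cols):
        others = [sorted(c) for j, c in enumerate(cols) if j != i]
        for extras in product(*others):
            yield col | set(extras)

assert all(row & q1 for row in rows for q1 in phase1_quorums())
print("every Phase 2 row intersects every Phase 1 quorum")
```

The check succeeds for the simpler reason the text gives: each Phase 1 quorum contains a full column, and a full row meets every column.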

Hierarchical Quorums

Organize replicas into groups (e.g., by datacenter). Define quorums as “a majority of groups, with a majority within each chosen group.”

For 3 datacenters with 3 replicas each (9 total):

  • Phase 2 quorum: 2 out of 3 DCs, with 2 out of 3 replicas in each chosen DC = 4 replicas
  • Phase 1 quorum: must intersect with every such Phase 2 quorum

This structure lets you reason about failures at the datacenter level (losing an entire DC) rather than the individual replica level.

Structure HierarchicalQuorumConfig:
    groups: List<List<ReplicaId>>    // replicas grouped by datacenter
    group_quorum: Integer             // how many groups needed
    intra_group_quorum: Integer       // how many replicas within a group

Function IsPhase2Quorum(replicas, config):
    groups_satisfied = 0
    for each group in config.groups:
        replicas_in_group = |replicas INTERSECT group|
        if replicas_in_group >= config.intra_group_quorum:
            groups_satisfied++
    return groups_satisfied >= config.group_quorum

// Phase 1 quorum must intersect with every possible Phase 2 quorum
// This requires careful calculation based on the specific configuration

Weighted Quorums

Assign weights to replicas (e.g., based on reliability, proximity, or capacity). Define quorums as any set whose total weight exceeds a threshold.

  • Phase 2 threshold: W2
  • Phase 1 threshold: W1
  • Requirement: W1 + W2 > total weight

This is useful when replicas are not equally valuable. A replica in the same datacenter as most clients might receive a higher weight, reducing the number of remote replicas needed for a quorum.

Structure WeightedQuorumConfig:
    weights: Map<ReplicaId, Integer>
    total_weight: Integer              // sum of all weights
    phase1_threshold: Integer          // W1
    phase2_threshold: Integer          // W2
    // INVARIANT: phase1_threshold + phase2_threshold > total_weight

Function IsQuorum(replicas, threshold, config):
    weight_sum = 0
    for each r in replicas:
        weight_sum += config.weights[r]
    return weight_sum >= threshold

The Raft Connection

Raft, as originally specified, uses majority quorums for both leader election (RequestVote, analogous to Phase 1) and log replication (AppendEntries, analogous to Phase 2). Can we apply Flexible Paxos to Raft?

Yes, with caveats.

Raft’s leader election is not exactly Paxos Phase 1. In Paxos, Phase 1 discovers previously accepted values. In Raft, RequestVote serves a dual purpose: it elects a leader and ensures the leader has the most up-to-date log (via the log completeness check). Applying flexible quorums to Raft requires ensuring that the election quorum (Q1) intersects with every replication quorum (Q2) to maintain the Log Matching Property.

The arithmetic is the same: Q1_election + Q2_replication > N. But the implementation requires modifying Raft’s election and commitment logic, which is not trivial given that Raft’s design prioritizes simplicity. Adding flexible quorums to Raft somewhat defeats the purpose of choosing Raft in the first place.

That said, some Raft implementations (notably those used in CockroachDB’s evaluations) have experimented with flexible quorums. The results confirm the theoretical predictions: steady-state latency improves at the cost of longer or less fault-tolerant elections.

Why Adoption Has Been Slow

Given that Flexible Paxos is a strict generalization that subsumes classic Paxos as a special case, why is it not universally adopted?

Configuration complexity. With classic Paxos, the user specifies N and the system computes the quorum size automatically (floor(N/2) + 1). With Flexible Paxos, the user must choose Q1 and Q2, which requires understanding the performance and availability implications. Most operators do not want to think about this.

Existing implementations. Changing the quorum logic in a production consensus system is a high-risk modification. The intersection invariant must be maintained during configuration changes, including during the configuration change itself. This is subtle enough that most teams decide the risk is not worth the benefit.

Marginal benefit for small clusters. For N = 3 (the most common deployment), the options are:

  • Classic: Q1 = Q2 = 2
  • Write-optimized: Q1 = 3, Q2 = 1

Q2 = 1 means no replication during normal operation — a single replica failure loses data. Q1 = 3 means all replicas must participate in leader election. For most teams running 3-node clusters, neither alternative is attractive.

For N = 5:

  • Classic: Q1 = Q2 = 3
  • Write-optimized: Q1 = 4, Q2 = 2
  • Read-optimized: Q1 = 2, Q2 = 4

Here the trade-offs become more interesting, but N = 5 is already less common in practice.

Mental model mismatch. Engineers reason about consensus in terms of “majority.” The majority rule is simple, intuitive, and easy to check. Flexible quorums require reasoning about two different quorum sizes and their intersection, which is harder to hold in your head during an incident at 3 AM.

Safety during reconfiguration. Changing quorum sizes at runtime — say, moving from (Q1=4, Q2=2) to (Q1=3, Q2=3) — requires ensuring that the old and new configurations’ quorums intersect correctly during the transition. This is a generalization of Raft’s joint consensus, but more complex because you have two quorum sizes to change instead of one.

// Safe reconfiguration requires:
// Old Q1 intersects Old Q2 (existing invariant)
// New Q1 intersects New Q2 (new invariant)
// During transition:
//   Phase 1 quorums must satisfy BOTH old Q1 and new Q1
//   Phase 2 quorums must satisfy BOTH old Q2 and new Q2
// This ensures safety regardless of which configuration is "active"

Procedure ReconfigureQuorums(old_config, new_config):
    // Use joint quorums during transition
    joint_q1 = max(old_config.Q1_size, new_config.Q1_size)
    joint_q2 = max(old_config.Q2_size, new_config.Q2_size)

    // This is conservative — joint quorums satisfy both configurations
    // but may temporarily reduce availability
    assert joint_q1 + old_config.Q2_size > old_config.N
    assert joint_q1 + new_config.Q2_size > new_config.N
    assert old_config.Q1_size + joint_q2 > old_config.N
    assert new_config.Q1_size + joint_q2 > new_config.N

    // Step 1: Switch to joint quorums
    ApplyConfig(joint_q1, joint_q2)

    // Step 2: Commit a marker entry under joint quorums
    CommitMarker("reconfiguration in progress")

    // Step 3: Switch to new quorums
    ApplyConfig(new_config.Q1_size, new_config.Q2_size)

    // Step 4: Commit a marker entry under new quorums
    CommitMarker("reconfiguration complete")

Connection to FPaxos in Practice

Despite slow direct adoption, the ideas from Flexible Paxos have influenced several systems:

CockroachDB has explored flexible quorums in its Raft-based replication layer, particularly for geo-distributed deployments where minimizing Phase 2 quorum size can significantly reduce write latency.

Windows Azure Storage uses a witness-based replication scheme that is essentially flexible quorums in disguise: a write is durable once it reaches the primary plus one secondary, but reconfiguration requires all replicas.

Bookkeeper (used in Apache Pulsar) uses a quorum system where writes require Qw acknowledgments and reads require Qr, with Qw + Qr > N. This is exactly the Flexible Paxos insight applied to a storage system, though it was developed independently.

The pattern is clear: the idea of flexible quorums is spreading, even if the specific Flexible Paxos protocol is not being adopted wholesale.

The Deeper Lesson

Flexible Paxos teaches us something important about distributed systems research: sometimes the biggest insights are the simplest. The quorum intersection requirement was always the fundamental invariant. Majority quorums were always just one way to achieve it. But the field was so focused on building on top of the majority assumption that nobody thought to question it.

This should make us wonder: what other simplifying assumptions in our foundational protocols are unnecessarily restrictive? What other degrees of freedom are we leaving on the table because “that is how it has always been done”?

The answer, almost certainly, is “several.” The history of distributed systems is a history of slowly relaxing unnecessary constraints while preserving essential safety properties. Flexible Paxos is one such relaxation. There will be more.

In the meantime, the next time someone tells you that consensus requires a majority, you can smile and say, “Well, actually…” Just be prepared for the blank stares. Some insights, no matter how simple, take decades to sink in.

ISR: Kafka’s Approach to Not Quite Consensus

A Different Kind of Agreement

Apache Kafka is one of the most successful distributed systems ever built. It handles trillions of messages per day at companies like LinkedIn, Netflix, and Uber. It provides durability, ordering, and replication guarantees that, in practice, are good enough for an enormous range of use cases.

And it does not use consensus.

Or rather, it did not, for the first decade of its existence. Kafka’s replication mechanism — the In-Sync Replica (ISR) protocol — is a carefully designed system that provides guarantees similar to consensus without actually implementing a consensus protocol. It is a masterclass in engineering pragmatism: rather than solving the general problem, Kafka solves the specific problem of replicated log appending, and it does so with a mechanism that is simpler, faster, and more operationally transparent than Paxos or Raft.

Understanding the ISR protocol — what it guarantees, what it does not, and why the distinction matters — is essential for anyone who operates Kafka or who wants to understand the design space between “full consensus” and “no coordination at all.”

The ISR Model

Kafka’s replication model is built around a few key concepts:

Partition Leader. Each Kafka partition has a single leader replica. All reads and writes go through the leader. This is similar to Multi-Paxos or Raft — there is a designated leader for each unit of replication.

Followers. Other replicas for the partition are followers. Followers pull data from the leader by issuing fetch requests (similar to Raft’s AppendEntries, but reversed in direction — followers pull rather than the leader pushing).

In-Sync Replicas (ISR). The ISR is the set of replicas that are “caught up” with the leader. A replica is in-sync if it has fetched all messages up to the leader’s log end offset within a configurable time window (replica.lag.time.max.ms, default 30 seconds).

High Water Mark (HW). The high water mark is the offset up to which all ISR members have replicated. Messages below the HW are considered committed. Messages above the HW but below the log end offset (LEO) are written but not yet committed.

Structure PartitionState:
    leader: ReplicaId
    replicas: Set<ReplicaId>           // all assigned replicas
    isr: Set<ReplicaId>                // in-sync replicas (subset of replicas)
    leo: Map<ReplicaId, Offset>        // log end offset per replica
    hw: Offset                          // high water mark

    // INVARIANT: leader IN isr
    // INVARIANT: isr SUBSET_OF replicas
    // INVARIANT: hw = min(leo[r] for r in isr)

The Produce Path

When a producer sends a message to Kafka, the following happens:

Procedure HandleProduce(partition, message, required_acks):
    assert self.id == partition.leader

    // Step 1: Append to leader's local log
    offset = AppendToLog(partition, message)
    partition.leo[self.id] = offset

    // Step 2: Behavior depends on acks setting
    if required_acks == 0:
        // "Fire and forget": the producer does not wait for a response,
        // so the broker sends none
        return

    else if required_acks == 1:
        // Wait for local write only
        Reply(Success{offset: offset})

    else if required_acks == -1:  // acks=all
        // Wait until ALL replicas in ISR have fetched this offset
        WaitUntil(
            for all r in partition.isr:
                partition.leo[r] >= offset
        )
        // Update high water mark
        partition.hw = min(partition.leo[r] for r in partition.isr)
        Reply(Success{offset: offset})

The Fetch Path

Followers continuously fetch from the leader:

Procedure FollowerFetchLoop(partition):
    while true:
        // Fetch messages from leader starting at our current LEO
        response = FetchFromLeader(partition, start_offset = partition.leo[self.id])

        if response.messages is not empty:
            // Append fetched messages to local log
            for each message in response.messages:
                AppendToLog(partition, message)
            partition.leo[self.id] = last_appended_offset + 1

        // Update our local high water mark: take the leader's HW, but never
        // past what we have actually replicated locally
        partition.hw = min(partition.leo[self.id], response.leader_hw)

        // Brief pause if no new data
        if response.messages is empty:
            Sleep(replica.fetch.wait.max.ms)
Procedure HandleFetchFromFollower(partition, follower_id, start_offset):
    // Update our knowledge of the follower's LEO
    partition.leo[follower_id] = start_offset

    // Check if follower should be in ISR
    UpdateISRMembership(partition, follower_id)

    // Advance high water mark if possible
    new_hw = min(partition.leo[r] for r in partition.isr)
    if new_hw > partition.hw:
        partition.hw = new_hw

    // Return messages from start_offset to our LEO
    messages = ReadLog(partition, start_offset, partition.leo[self.id])

    Reply(FetchResponse{
        messages: messages,
        leader_hw: partition.hw
    })
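The HW bookkeeping in HandleFetchFromFollower can be exercised with a small simulation. This is a sketch in Python, not Kafka's code; the class and method names are ours:

```python
# Leader-side bookkeeping: the high water mark advances to the minimum
# log end offset (LEO) across the ISR as followers fetch.

class PartitionState:
    def __init__(self, leader, replicas):
        self.leader = leader
        self.isr = set(replicas)             # assume all replicas start in sync
        self.leo = {r: 0 for r in replicas}  # log end offset per replica
        self.hw = 0                          # high water mark

    def append(self, n=1):
        """Leader appends n messages to its local log."""
        self.leo[self.leader] += n

    def on_fetch(self, follower, start_offset):
        """A fetch at start_offset implies the follower has everything below it."""
        self.leo[follower] = start_offset
        # INVARIANT from the text: hw = min(leo[r] for r in isr)
        new_hw = min(self.leo[r] for r in self.isr)
        if new_hw > self.hw:
            self.hw = new_hw

p = PartitionState(leader="r1", replicas=["r1", "r2", "r3"])
p.append(5)              # leader LEO = 5, HW still 0
p.on_fetch("r2", 5)      # r2 caught up, but r3 still at 0: HW stays 0
assert p.hw == 0
p.on_fetch("r3", 3)      # r3 partway there: HW = min(5, 5, 3) = 3
assert p.hw == 3
p.on_fetch("r3", 5)      # everyone at 5: HW = 5, offsets 0..4 committed
assert p.hw == 5
```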

ISR Maintenance

The ISR is dynamic. Replicas can be added and removed based on their replication lag:

Procedure UpdateISRMembership(partition, follower_id):
    current_time = Now()

    if follower_id in partition.isr:
        // Check if follower has fallen behind
        if partition.leo[follower_id] < partition.leo[self.id]:
            if current_time - last_caught_up_time[follower_id] > replica.lag.time.max.ms:
                // Follower is too far behind — remove from ISR
                partition.isr.remove(follower_id)
                // Persist ISR change to controller/ZooKeeper
                PersistISRChange(partition)
                Log("Removed replica {} from ISR for partition {}",
                    follower_id, partition.id)
        else:
            // Follower is caught up
            last_caught_up_time[follower_id] = current_time

    else:
        // Follower is not in ISR — check if it has caught up
        if partition.leo[follower_id] >= partition.leo[self.id]:
            // Follower has caught up — add to ISR
            partition.isr.add(follower_id)
            PersistISRChange(partition)
            Log("Added replica {} to ISR for partition {}",
                follower_id, partition.id)

The min.insync.replicas Setting

The ISR can shrink. In the worst case, it can shrink to just the leader. If the leader is the only ISR member, then acks=all degenerates to acks=1 — the leader acknowledges writes that have not been replicated anywhere.

The min.insync.replicas setting addresses this. When set to M, the leader will reject produce requests with acks=all if the ISR has fewer than M members.

Procedure HandleProduceWithMinISR(partition, message, required_acks):
    if required_acks == -1:  // acks=all
        if |partition.isr| < min.insync.replicas:
            Reply(NotEnoughReplicasException)
            return

    // ... proceed with normal produce path

The typical production configuration is:

  • Replication factor: 3
  • min.insync.replicas: 2
  • acks: all

This means: writes are acknowledged only when at least 2 out of 3 replicas (the leader plus at least one follower) have the data. If two replicas fail, the partition becomes unavailable for writes (but may still serve reads from the remaining replica).
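As a concrete sketch, the typical configuration above maps onto real Kafka settings like this (the topic name and broker address are placeholders):

```shell
# Create a topic with replication factor 3 and a minimum ISR of 2
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic events \
  --replication-factor 3 \
  --config min.insync.replicas=2

# Producer side (client config, not a broker setting):
# acks=all
```

With this combination, a write is acknowledged only once it is on at least two replicas, so any single broker failure cannot lose acknowledged data.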

Why This Is Not Consensus

The ISR mechanism provides strong guarantees in practice, but it is not consensus in the formal sense. Here is why:

No Voting on Values

In Paxos or Raft, replicas vote on proposed values. A value is chosen only if a quorum agrees. The voting mechanism ensures that no two conflicting values can both be chosen.

In Kafka, the leader unilaterally decides the log contents. Followers do not vote — they replicate. If the leader appends message X at offset 100, followers will eventually have message X at offset 100 or they will be removed from the ISR. There is no mechanism for followers to reject or propose alternatives.

The Unclean Leader Election Problem

The most important difference between ISR and consensus is what happens when the leader fails.

In Raft, a new leader must have the most up-to-date log. The election protocol guarantees this: a candidate must receive votes from a majority, and replicas will not vote for a candidate with a shorter log. This means the new leader is guaranteed to have all committed entries.

In Kafka, when the leader fails, the controller selects a new leader from the ISR. If all ISR members are available, this is safe — any ISR member has all committed data (all data up to the high water mark).

But what if all ISR members fail? Kafka has a configuration called unclean.leader.election.enable. If set to true (the default in early Kafka versions, now false by default), Kafka will elect a non-ISR replica as leader. This replica may be missing committed messages. This results in data loss.

Timeline of the unclean leader election problem:

Time 1: Leader=R1, ISR={R1, R2, R3}
    R1 log: [A, B, C, D, E]    (LEO=5)
    R2 log: [A, B, C, D, E]    (LEO=5, in ISR)
    R3 log: [A, B, C, D]       (LEO=4, in ISR — slightly behind but within lag threshold)

Time 2: R1 fails, R2 becomes leader, ISR={R2, R3}
    R2 log: [A, B, C, D, E]
    R3 log: [A, B, C, D, E]    (catches up)
    OK so far — no data loss.

Time 3: R2 AND R3 fail. R1 recovers (but was already removed from ISR).

    If unclean.leader.election.enable = true:
        R1 becomes leader with log [A, B, C, D, E]  — actually this is fine here.

    More problematic scenario:

Time 1: Leader=R1, ISR={R1, R2}
    R1 log: [A, B, C, D, E]    (LEO=5, HW=5)
    R2 log: [A, B, C, D, E]    (LEO=5)
    R3 log: [A, B, C]          (LEO=3, NOT in ISR — fell behind)

Time 2: R1 and R2 fail simultaneously.
    If unclean.leader.election.enable = true:
        R3 becomes leader with log [A, B, C]
        Messages D and E are LOST — even though they were committed (HW=5)

With unclean.leader.election.enable = false (the recommended setting), Kafka will wait for an ISR member to recover rather than electing a stale replica. This trades availability for consistency — the partition is unavailable until an ISR member returns.

No Log Reconciliation Protocol

In Raft, when a new leader is elected, there is an explicit log reconciliation protocol. The leader overwrites inconsistent entries on followers. This ensures all replicas converge to the same log.

Kafka handles this more informally. When a new leader is elected, it truncates its log to the high water mark (discarding any uncommitted entries). Followers then truncate their logs to match the new leader’s log. This works correctly in the common case but depends on the HW being propagated correctly — which introduces subtle edge cases.

The HW propagation delay is one of the trickier aspects of Kafka’s replication. The HW is updated on the leader when all ISR members fetch up to a certain offset. But followers learn the HW from the next fetch response. This means there is always a window where a follower’s local HW is behind the leader’s HW. If a failure occurs during this window, the follower may truncate committed messages during leader transition.

Kafka addressed this with KIP-101 (leader epoch), which adds a mechanism for followers to verify their log consistency with the new leader using epoch numbers rather than relying solely on the HW.

ISR vs. Raft: A Detailed Comparison

| Property | Raft | Kafka ISR |
|---|---|---|
| Leader election | Voting with log completeness check | Controller selects from ISR |
| Replication direction | Leader pushes (AppendEntries) | Followers pull (Fetch) |
| Commit condition | Majority of ALL replicas | All replicas in ISR |
| Quorum membership | Fixed (all replicas) | Dynamic (ISR changes) |
| Handling slow replicas | Slow replica = slow commit | Slow replica removed from ISR |
| Data loss on leader failure | Impossible (if majority survives) | Impossible (if ISR member survives, unclean election disabled) |
| Availability during partitions | Requires majority | Requires min.insync.replicas ISR members |
| Log divergence resolution | Leader overwrites followers | Truncate to HW, sync from new leader |

The most important difference is how they handle slow replicas. In Raft, a slow replica is still part of the quorum. If you have 5 replicas and one is on a slow disk, every commit must still wait for 3 replicas — and the slow one might be the third. You cannot kick it out of the quorum.

In Kafka, a slow replica drops out of the ISR. The system continues committing with the remaining ISR members. This is operationally superior for Kafka’s use case (high-throughput log appending) because it prevents a single slow replica from degrading the entire system.

The trade-off is that the ISR can shrink to the point where a single failure causes data loss (if only one ISR member remains and it fails). Raft’s fixed majority quorum provides a stronger guarantee: as long as a majority of all replicas survive, no data is lost.

KRaft: Kafka’s Move Toward Real Consensus

For the first decade of its existence, Kafka relied on Apache ZooKeeper for cluster metadata management: topic configurations, partition assignments, ISR membership, controller election. ZooKeeper itself implements a consensus protocol (ZAB), so Kafka was already using consensus — just not for data replication.

KRaft (Kafka Raft), introduced in KIP-500 and generally available since Kafka 3.3, replaces ZooKeeper with a built-in Raft implementation for metadata management. This is significant for several reasons:

  1. Operational simplicity. No more managing a separate ZooKeeper cluster. One system instead of two.
  2. Scalability. ZooKeeper was a bottleneck for clusters with many partitions (hundreds of thousands). KRaft’s metadata log scales better.
  3. Faster controller failover. The Raft-based controller election is faster than ZooKeeper-based election.

Importantly, KRaft is used for metadata only. Data replication still uses the ISR protocol. This is a pragmatic choice: the ISR protocol’s performance characteristics (pull-based, dynamic quorum, optimized for throughput) are well-suited for data, while Raft’s stronger guarantees are appropriate for metadata (which is lower volume but requires stricter consistency).

// KRaft metadata log — uses Raft
Structure MetadataLog:
    // Topics, partitions, ISR membership, broker registrations
    // All metadata changes go through Raft consensus
    entries: List<MetadataEntry>
    committed_offset: Offset

// Data log — uses ISR
Structure DataPartitionLog:
    // Actual user messages
    // Replicated via pull-based ISR protocol
    entries: List<ProducerMessage>
    hw: Offset
    isr: Set<ReplicaId>  // ISR itself is managed via MetadataLog

The architecture is a clean separation of concerns: Raft handles the coordination plane (what should be where), and ISR handles the data plane (actually moving bytes around).

When ISR Works Brilliantly

Kafka’s ISR protocol is not a general-purpose consensus protocol, and it does not try to be. It is specifically designed for one thing: high-throughput, ordered, durable log appending. For this use case, it has several advantages:

Throughput. Pull-based replication lets the leader batch writes without waiting for per-message acknowledgment from followers. The leader appends messages to its log at full speed; followers fetch in large batches at their own pace. This is dramatically more efficient than Raft’s per-entry acknowledgment for high-throughput workloads.

Elastic fault tolerance. The dynamic ISR means the system degrades gracefully. Lose one replica? The ISR shrinks but the system continues. Lose two? Still running (if min.insync.replicas allows). In Raft, you cannot dynamically adjust the quorum size — you must reconfigure the cluster.

Operational simplicity. The ISR is observable. Operators can see exactly which replicas are in sync, how far behind each replica is, and when replicas fall out of the ISR. This operational transparency is invaluable. In contrast, Raft’s internals (commit index, match index, next index) are harder to monitor and interpret.

Natural back-pressure. When a follower cannot keep up, it naturally falls out of the ISR. The system does not slow down to wait for it. This is the right behavior for a messaging system where throughput is paramount.

When ISR Falls Short

Exactly-once semantics. Kafka supports “exactly-once” delivery with idempotent producers and transactions. But the transactional protocol is built on top of ISR replication, which creates interesting challenges. The transaction coordinator must ensure that transaction markers are committed (in the ISR sense) before reporting success. If the coordinator fails mid-transaction, recovery depends on the ISR protocol having replicated the transaction state correctly.

Strong consistency for reads. Kafka consumers, by default, read only up to the high water mark. But the HW advances only after every ISR member has fetched a message, and followers learn the new HW one fetch response later, so a consumer might not see the latest committed messages immediately. Use cases that require strict read-your-writes consistency must account for this propagation window.

State machine replication. ISR is designed for log appending, not for general state machine replication. You cannot use Kafka’s ISR protocol to replicate a key-value store or a database. For that, you need actual consensus. This is why systems like CockroachDB and etcd use Raft, not ISR.

Split-brain prevention. Kafka relies on the controller (and previously ZooKeeper) to prevent split-brain scenarios. The controller is the single source of truth for who the leader is. If the controller itself has a split-brain problem, Kafka’s ISR protocol provides no protection. This is why moving to KRaft (which uses Raft for the controller) was important.

The Broader Lesson

Kafka’s ISR protocol demonstrates that you do not always need to solve the general problem. Consensus — in its full generality — solves state machine replication for arbitrary deterministic state machines. But if your state machine is specifically a log (append-only, sequential writes, batch reads), you can design a simpler, faster protocol that provides “good enough” guarantees.

The engineering insight is this: the ISR protocol is not a weaker version of consensus. It is a different tool designed for a different problem. Comparing ISR to Raft is like comparing a socket wrench to a Swiss army knife — the socket wrench does fewer things but does its one thing very well.

The lesson for practitioners is to understand what guarantees you actually need before reaching for a consensus protocol. If you need to replicate a log with high throughput and you can tolerate the ISR’s edge cases (which most messaging use cases can), Kafka’s approach may be a better fit than bolting Raft onto everything.

If you need strict linearizability, arbitrary state machine replication, or protection against all possible failure modes without operational intervention, use a real consensus protocol. Just do not assume you always need one. Kafka’s trillion-message-per-day success is proof that sometimes “not quite consensus” is exactly right.

CRDTs: Avoiding Consensus Entirely

The Nuclear Option

Every chapter in this book so far has been about getting nodes to agree. Paxos, Raft, ZAB, EPaxos — all of them are mechanisms for forcing agreement on a single value, a single order, a single truth. The protocols differ in elegance, performance, and how much suffering they inflict on implementers, but they share a common assumption: agreement is necessary.

CRDTs ask a heretical question: what if it is not?

Conflict-Free Replicated Data Types are data structures designed so that concurrent updates by different replicas always converge to the same state, without any coordination. No leader. No quorums. No message ordering. No consensus. Each replica processes updates locally, syncs with other replicas whenever it can, and everyone eventually ends up in the same state.

This sounds too good to be true. It is — for general computation. But for a specific and surprisingly useful class of problems, CRDTs deliver on their promise. The trick is understanding exactly how narrow that class is, and how much design contortion is required to stay within it.

The Mathematical Foundation

CRDTs are built on a simple algebraic structure: the join-semilattice.

A join-semilattice is a set S with a binary operation (called join, written as ⊔ or merge) that is:

  1. Commutative: a ⊔ b = b ⊔ a
  2. Associative: (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)
  3. Idempotent: a ⊔ a = a

These three properties are what make CRDTs work:

  • Commutativity means the order in which updates arrive does not matter. Replica A can receive update X before Y, while replica B receives Y before X, and they will reach the same state.
  • Associativity means grouping does not matter. You can merge updates pairwise in any order.
  • Idempotency means receiving the same update twice is harmless. No need for exactly-once delivery — at-least-once is sufficient.

If your state and merge operation form a join-semilattice, convergence is guaranteed by mathematics, not by protocol. This is not a probabilistic guarantee or a best-effort one — it is a theorem.

The catch — and there is always a catch — is that your state must only ever move “up” in the lattice. The lattice has a partial order, and every merge operation produces a result that is ≥ both inputs. This means CRDTs can only represent monotonically growing information.

You can count up. You can add to a set. You can record that an event happened. What you cannot do — at least not directly — is count down, remove from a set, or un-happen an event. Every CRDT that appears to support these operations does so through a clever encoding that transforms “removal” into “addition of a tombstone” or similar tricks. We will see the consequences.
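These three laws are easy to check for a concrete lattice. A quick Python illustration (ours), using set union as the join:

```python
# Set union is a join-semilattice: commutative, associative, idempotent.
a, b, c = {1, 2}, {2, 3}, {4}

assert a | b == b | a                  # commutative
assert (a | b) | c == a | (b | c)      # associative
assert a | a == a                      # idempotent

# The partial order is subset inclusion, and merge only moves "up":
assert a <= (a | b) and b <= (a | b)
```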

State-Based vs. Operation-Based CRDTs

There are two flavors of CRDTs, corresponding to two different ways of achieving convergence:

State-based CRDTs (CvRDTs) synchronize by shipping their entire state to other replicas. The receiving replica merges the received state with its local state using the lattice join. As long as every replica eventually communicates its state to every other replica (directly or transitively), convergence is guaranteed.

Operation-based CRDTs (CmRDTs) synchronize by shipping individual operations. Each operation is broadcast to all replicas, and each replica applies it locally. For this to work, concurrent operations must commute (so the delivery order of concurrent operations does not matter), and the delivery layer must provide reliable, exactly-once, causally ordered delivery, because operations, unlike state merges, are generally not idempotent.

In theory, the two approaches are equivalent — anything expressible as a CvRDT can be expressed as a CmRDT and vice versa. In practice, they have different trade-offs:

| Property | State-based (CvRDT) | Operation-based (CmRDT) |
|---|---|---|
| Network payload | Entire state (can be large) | Individual operation (small) |
| Delivery requirement | At-least-once, any order | Exactly-once, causally ordered |
| Merge complexity | Must implement merge function | Must ensure commutativity |
| Metadata overhead | State carries all metadata | Operations may be smaller |
| Network efficiency | Poor for large states | Good (small messages) |

Most practical systems use state-based CRDTs with delta-state optimizations (shipping only the changes since last sync rather than the full state). This gives the small message size of operation-based CRDTs with the simpler delivery requirements of state-based ones.
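The delta idea is easiest to see with a counter keyed by replica: instead of shipping the whole per-replica map, ship only the entries changed since the last sync. A sketch (ours, in Python; the numbers are made up):

```python
# Delta-state sync for a G-Counter-style map: the receiver merges a delta
# (only the changed entries) exactly as it would merge a full state.

def merge(state, incoming):
    # Pointwise max over the union of keys: the G-Counter join
    return {r: max(state.get(r, 0), incoming.get(r, 0))
            for r in set(state) | set(incoming)}

full = {"A": 101, "B": 250, "C": 7}   # replica A's full state after one increment
delta = {"A": 101}                    # only A's own entry changed since last sync

remote = {"A": 100, "B": 250, "C": 7}  # the peer's current state

# Merging the delta gives the same result as merging the full state,
# at a fraction of the payload size:
assert merge(remote, delta) == merge(remote, full)
assert merge(remote, delta) == {"A": 101, "B": 250, "C": 7}
```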

The CRDT Zoo: Concrete Examples

G-Counter (Grow-only Counter)

The simplest useful CRDT. Each replica maintains a separate counter, and the global count is the sum of all replica counters.

Structure GCounter:
    counts: Map<ReplicaId, Integer>  // one counter per replica

Function Increment(gcounter, replica_id):
    gcounter.counts[replica_id] += 1

Function Value(gcounter):
    return Sum(gcounter.counts.values())

Function Merge(a: GCounter, b: GCounter):
    result = GCounter{}
    for each replica_id in Union(a.counts.keys(), b.counts.keys()):
        result.counts[replica_id] = max(
            a.counts.get(replica_id, 0),
            b.counts.get(replica_id, 0)
        )
    return result

Why does this work? Each replica only increments its own entry. The merge takes the maximum of each entry. Since each entry only grows, max produces the correct count of increments from each replica. The sum of maxima gives the total count.

This is a legitimate join-semilattice: merge is commutative (max is), associative, and idempotent (max(x, x) = x).

The limitation: You can only count up. A “page view counter” works perfectly. A “users currently online” counter does not.
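The G-Counter pseudocode above translates almost line for line into runnable Python. A sketch (class and method names are ours), with a convergence check at the end:

```python
# G-Counter: one entry per replica, merge by pointwise max.

class GCounter:
    def __init__(self):
        self.counts = {}  # replica_id -> number of local increments

    def increment(self, replica_id):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        merged = GCounter()
        for r in set(self.counts) | set(other.counts):
            merged.counts[r] = max(self.counts.get(r, 0),
                                   other.counts.get(r, 0))
        return merged

# Two replicas increment independently, then sync in either order.
a, b = GCounter(), GCounter()
a.increment("A"); a.increment("A")   # replica A counted 2
b.increment("B")                     # replica B counted 1

assert a.merge(b).value() == 3
assert b.merge(a).value() == 3                  # same result either way
assert a.merge(b).counts == b.merge(a).counts   # merge is commutative
```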

PN-Counter (Positive-Negative Counter)

To support both increment and decrement, use two G-Counters: one for increments (P) and one for decrements (N). The value is P - N.

Structure PNCounter:
    p: GCounter  // positive counts
    n: GCounter  // negative counts

Function Increment(pncounter, replica_id):
    Increment(pncounter.p, replica_id)

Function Decrement(pncounter, replica_id):
    Increment(pncounter.n, replica_id)

Function Value(pncounter):
    return Value(pncounter.p) - Value(pncounter.n)

Function Merge(a: PNCounter, b: PNCounter):
    return PNCounter{
        p: Merge(a.p, b.p),
        n: Merge(a.n, b.n)
    }

This works because both P and N are G-Counters (monotonically increasing), so the overall structure is still a join-semilattice. The value can go down, but the underlying state only grows.

The catch: The counter can go negative. If two replicas concurrently decrement a counter that is at zero, the result is -2. There is no way to enforce a “minimum value of zero” constraint without coordination. This is fundamental — enforcing invariants across replicas requires consensus.
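The negative-value catch is easy to reproduce. In this Python sketch (ours), two replicas both see a counter at zero and concurrently decrement; after merging, the value is -2:

```python
# PN-Counter: two G-Counter maps (p for increments, n for decrements),
# value = sum(p) - sum(n). Merge each map by pointwise max.

def merge_maps(x, y):
    return {r: max(x.get(r, 0), y.get(r, 0)) for r in set(x) | set(y)}

class PNCounter:
    def __init__(self):
        self.p = {}  # increments per replica
        self.n = {}  # decrements per replica

    def increment(self, replica_id):
        self.p[replica_id] = self.p.get(replica_id, 0) + 1

    def decrement(self, replica_id):
        self.n[replica_id] = self.n.get(replica_id, 0) + 1

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        out = PNCounter()
        out.p = merge_maps(self.p, other.p)
        out.n = merge_maps(self.n, other.n)
        return out

a, b = PNCounter(), PNCounter()   # both replicas see value 0
a.decrement("A")                  # A decrements, locally at 0
b.decrement("B")                  # B does the same, concurrently

assert a.merge(b).value() == -2   # the "never below zero" invariant is gone
```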

LWW-Register (Last-Writer-Wins Register)

A register that holds a single value. Concurrent writes are resolved by timestamp: the write with the highest timestamp wins.

Structure LWWRegister:
    value: Any
    timestamp: Timestamp

Function Write(register, value, timestamp):
    if timestamp > register.timestamp:
        register.value = value
        register.timestamp = timestamp

Function Read(register):
    return register.value

Function Merge(a: LWWRegister, b: LWWRegister):
    if a.timestamp > b.timestamp:
        return a
    else if b.timestamp > a.timestamp:
        return b
    else:
        // Tie-breaking: use some deterministic rule
        // (e.g., lexicographic comparison of values, or replica ID)
        return DeterministicTieBreak(a, b)

LWW-Register is the most widely used CRDT and also the most philosophically questionable. It “resolves” conflicts by throwing away all but one of the conflicting writes. Whether the “last” write is actually the one the user intended to keep depends on clock accuracy, which in a distributed system is… well, let us just say it is an area of active prayer.

The uncomfortable truth: LWW-Register does not resolve conflicts. It hides them. If two users concurrently edit a document — one writing “yes” and the other writing “no” — LWW will silently pick one and discard the other. No user is notified. No merge is attempted. The loser’s write simply vanishes.

For some use cases (caching, session storage, “last configuration wins”), this is fine. For others (collaborative editing, financial records), it is a disaster wearing a formal proof.
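The silent-discard behavior takes only a few lines to demonstrate. A Python sketch (ours; the timestamps are made up):

```python
# LWW merge: the write with the higher timestamp wins; ties are broken
# deterministically so every replica picks the same winner.

def lww_merge(a, b):
    # a, b are (value, timestamp) pairs
    if a[1] != b[1]:
        return a if a[1] > b[1] else b
    return max(a, b)  # deterministic tie-break on the value itself

user1 = ("yes", 1000)   # user 1 writes "yes" at t=1000
user2 = ("no", 1001)    # user 2 writes "no" 1ms later, per user 2's clock

assert lww_merge(user1, user2) == ("no", 1001)
assert lww_merge(user2, user1) == ("no", 1001)  # order-independent
# user 1's write is gone; neither user learns a conflict ever happened
```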

OR-Set (Observed-Remove Set)

The OR-Set is where CRDTs start getting clever — and where the metadata overhead starts getting real.

The problem with sets is removal. Adding to a set is monotonic (the set grows). Removing is not (the set shrinks). And if one replica adds an element while another concurrently removes it, which operation wins?

The OR-Set’s answer: add wins over concurrent remove. More precisely, a remove only removes copies of the element that the removing replica has observed. If another replica concurrently adds the same element, that add is not removed because the removing replica had not observed it.

Structure ORSet:
    // Each element is tagged with a unique identifier
    // The set contains (element, unique_tag) pairs
    entries: Set<(Element, UniqueTag)>

Function Add(orset, element, replica_id):
    tag = GenerateUniqueTag(replica_id)  // e.g., (replica_id, counter++)
    orset.entries.add((element, tag))

Function Remove(orset, element):
    // Remove all entries with this element that we can currently see
    to_remove = {(e, tag) in orset.entries where e == element}
    orset.entries = orset.entries - to_remove

Function Contains(orset, element):
    return exists (e, tag) in orset.entries where e == element

Function Elements(orset):
    return {e for (e, tag) in orset.entries}

Function Merge(a: ORSet, b: ORSet):
    // Union of all entries
    // Entries that were removed on one side but not added on the other
    // stay removed. Entries that were independently added survive.
    return ORSet{entries: a.entries UNION b.entries}
    // NOTE: this simplified version doesn't handle removes correctly
    // The actual implementation needs to track removed tags separately

Wait — the merge function above is wrong, or at least incomplete. This is the part that papers tend to gloss over. The actual implementation needs to handle the case where one replica has removed an entry (e, tag) while another replica still has it. A simple union would resurrect removed elements.

The correct implementation typically uses one of these approaches:

  1. Tombstones: Keep a separate set of removed (element, tag) pairs. An element is in the set if it has at least one tag that is not in the tombstone set.

  2. Causal context tracking: Each replica tracks a causal context (a vector clock or similar), and the merge operation uses this context to determine whether an absence on one side means “never added” or “was added and then removed.”

// More realistic OR-Set with tombstones
Structure ORSetWithTombstones:
    entries: Set<(Element, UniqueTag)>     // added entries
    tombstones: Set<(Element, UniqueTag)>  // removed entries

Function Add(orset, element, replica_id):
    tag = GenerateUniqueTag(replica_id)
    orset.entries.add((element, tag))

Function Remove(orset, element):
    observed = {(e, tag) in orset.entries where e == element}
    orset.entries = orset.entries - observed
    orset.tombstones = orset.tombstones UNION observed

Function Contains(orset, element):
    active = orset.entries - orset.tombstones
    return exists (e, tag) in active where e == element

Function Merge(a: ORSetWithTombstones, b: ORSetWithTombstones):
    return ORSetWithTombstones{
        entries: (a.entries UNION b.entries),
        tombstones: (a.tombstones UNION b.tombstones)
    }

Function Elements(orset):
    active = orset.entries - orset.tombstones
    return {e for (e, tag) in active}

Now you can see the metadata problem: the tombstone set grows without bound. Every element that is ever removed leaves a tombstone that must be retained forever (or at least until all replicas have observed the removal and a garbage collection protocol has run). For a set with high churn — elements frequently added and removed — the tombstone set can dwarf the actual data.
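The add-wins semantics and the permanent tombstone are both visible in a small experiment. This Python sketch (ours) mirrors the tombstone pseudocode above: one replica removes an element while another concurrently re-adds it with a fresh tag:

```python
# Tombstone OR-Set: entries are (element, tag) pairs; a remove moves the
# observed pairs into the tombstone set; merge is union on both sets.

class ORSet:
    def __init__(self):
        self.entries = set()      # added (element, tag) pairs
        self.tombstones = set()   # removed (element, tag) pairs
        self._ctr = 0             # per-replica counter for unique tags

    def add(self, element, replica_id):
        self._ctr += 1
        self.entries.add((element, (replica_id, self._ctr)))

    def remove(self, element):
        observed = {(e, t) for (e, t) in self.entries if e == element}
        self.entries -= observed
        self.tombstones |= observed

    def merge(self, other):
        out = ORSet()
        out.entries = self.entries | other.entries
        out.tombstones = self.tombstones | other.tombstones
        return out

    def elements(self):
        return {e for (e, t) in self.entries - self.tombstones}

a = ORSet()
a.add("x", "A")          # x added with tag ("A", 1)
b = a.merge(ORSet())     # replica B syncs and sees that copy of x

a.remove("x")            # A removes the copy it observed
b.add("x", "B")          # B concurrently adds x again, fresh tag ("B", 1)

merged = a.merge(b)
assert merged.elements() == {"x"}    # the concurrent add wins
assert len(merged.tombstones) == 1   # ...and the tombstone is retained forever
```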

What Can You Actually Build With CRDTs?

This is the question that separates CRDT enthusiasts from CRDT practitioners. The theory is beautiful. The implementations are… educational. The question is what real systems you can build.

Things CRDTs Handle Well

Counters and accumulators. Page views, like counts, vote tallies (where the total only matters, not per-user votes), distributed metrics collection. G-Counters and PN-Counters are simple, efficient, and genuinely useful.

Grow-only sets. Event logs, “users who have visited this page,” sets of tags where removal is not needed. These are trivial CRDTs (set union is a join) and are widely used.

Last-writer-wins registers. User profile fields, configuration values, session data — anything where “most recent write wins” is an acceptable conflict resolution policy.
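An LWW register fits in a few lines. In this sketch the merge keeps the largest (timestamp, replica_id) pair; the replica ID tie-break is what guarantees convergence when two replicas write at the same timestamp:

```python
# LWW register sketch: a (timestamp, replica_id, value) triple where the
# lexicographically largest (timestamp, replica_id) wins on merge.

class LWWRegister:
    def __init__(self, value=None, timestamp=0, replica_id=""):
        self.timestamp, self.replica_id, self.value = timestamp, replica_id, value

    def set(self, value, timestamp, replica_id):
        # Only accept the write if it is "newer" under the total order
        if (timestamp, replica_id) > (self.timestamp, self.replica_id):
            self.timestamp, self.replica_id, self.value = timestamp, replica_id, value

    def merge(self, other):
        self.set(other.value, other.timestamp, other.replica_id)

# Two replicas write concurrently at the same timestamp:
x, y = LWWRegister(), LWWRegister()
x.set("blue", timestamp=5, replica_id="A")
y.set("green", timestamp=5, replica_id="B")
x.merge(y); y.merge(x)
print(x.value, y.value)  # both converge to "green" ("B" > "A" breaks the tie)
```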

Collaborative text editing. This is the marquee use case. CRDTs like RGA (Replicated Growable Array), LSEQ, and the data structures behind Automerge and Yjs provide real-time collaborative editing without a central server. Each user’s edits are represented as CRDT operations (insert character at position, delete character at position), and all users converge to the same document.

This is genuinely impressive and practically useful. Google Docs uses OT (Operational Transformation), which requires a central server. CRDT-based editors like those built on Automerge or Yjs can work peer-to-peer. The trade-off is metadata overhead and complexity.

Things CRDTs Handle Poorly

Anything with constraints. “The counter must not go below zero.” “The set must contain at most 10 elements.” “These two fields must be consistent with each other.” CRDTs cannot enforce cross-replica invariants because enforcing invariants requires coordination — which is exactly what CRDTs are designed to avoid.

Total ordering of events. CRDTs provide causal ordering at best. If you need a total order (a global sequence of events that all replicas agree on), you need consensus. CRDTs and total order are fundamentally incompatible.

Transactions. “Transfer $100 from account A to account B” requires both the debit and credit to be atomic. This is a coordination problem. CRDTs for individual accounts (PN-Counters) will let the debit happen without the credit, or vice versa.

Authorization and access control. “User X is no longer an admin” needs to be enforced consistently across all replicas. If one replica has processed the revocation and another has not, the non-updated replica will still grant admin access. This is a consistency problem that CRDTs cannot solve.

Real-World Usage

Riak

Riak was one of the first production databases to offer CRDT data types. It supports counters, sets, maps, registers, and flags as first-class data types. Under the hood, it uses state-based CRDTs with delta propagation.

Riak’s experience validated the theory but also exposed the practical challenges. The metadata overhead for OR-Sets and maps was significant. Garbage collection of tombstones required careful coordination. And users frequently tried to use CRDTs for things they were not designed for, leading to subtle data integrity issues.

Riak is largely defunct now, but its CRDT implementation was a valuable proof-of-concept for the industry.

Redis CRDT Types

Redis Enterprise (the commercial version) includes CRDT-based data types for active-active geo-replication. Counters, strings (LWW), sets, sorted sets, and other Redis data types are replicated across datacenters using CRDT semantics.

The Redis implementation is pragmatic: it uses LWW for most things and provides “add wins” semantics for sets. It does not expose the full theoretical CRDT taxonomy to users — it just makes Redis data types work across datacenters without explicit conflict resolution.

Automerge and Yjs

Automerge and Yjs are JavaScript libraries for building collaborative applications. They use CRDTs (or CRDT-like structures) to enable real-time collaborative editing of JSON documents, text, and other data structures.

These libraries represent the state of the art in CRDT-based collaborative editing. They handle text insertion and deletion, rich text formatting, and nested data structures. The performance has improved dramatically in recent years (Automerge 2.0 was a ground-up rewrite focused on performance).

The metadata overhead is still significant. A document that is a few kilobytes of text can have megabytes of CRDT metadata, especially if it has a long editing history. Compaction helps but does not eliminate the problem.

// Simplified collaborative text editing with a CRDT
// Based on RGA (Replicated Growable Array) concepts

Structure TextCRDT:
    // Each character has a unique ID and a reference to the character it was inserted after
    characters: List<CharEntry>

Structure CharEntry:
    id: (ReplicaId, SequenceNumber)  // globally unique
    value: Character                  // the actual character (or TOMBSTONE)
    after: CharEntryId               // inserted after this character
    visible: Boolean                  // false if deleted

Function InsertAfter(text, position_id, character, replica_id):
    new_id = (replica_id, next_sequence_number++)
    entry = CharEntry{
        id: new_id,
        value: character,
        after: position_id,
        visible: true
    }
    // Insert into list at correct position
    // (after position_id, using ID ordering to break ties with concurrent inserts)
    InsertInOrder(text.characters, entry)

Function Delete(text, char_id):
    // Don't actually remove — mark as tombstone
    entry = FindEntry(text.characters, char_id)
    entry.visible = false

Function Render(text):
    return Concatenate(entry.value for entry in text.characters if entry.visible)

Function Merge(a: TextCRDT, b: TextCRDT):
    // Union of all character entries
    // Entries present in both: take visible = a.visible AND b.visible
    //   (visible only ever goes true -> false, so AND means "delete wins":
    //   once either side deletes a character, it stays deleted)
    // Entries present in only one: include them
    // Position determined by 'after' references and tie-breaking
    result = TextCRDT{}
    all_entries = Union(a.characters, b.characters) by id
    for each id in all_entries:
        if id in a AND id in b:
            result.add(CharEntry{
                id: id,
                value: a[id].value,  // same in both
                after: a[id].after,  // same in both
                visible: a[id].visible AND b[id].visible
            })
        else if id in a:
            result.add(a[id])
        else:
            result.add(b[id])
    // Re-sort by insertion order using 'after' references
    TopologicalSort(result.characters)
    return result

The Garbage Collection Problem

Every CRDT implementation eventually confronts the garbage collection problem. CRDTs are monotonically growing data structures — that is how they achieve convergence. But monotonically growing state eventually exhausts memory.

Tombstones in OR-Sets, deleted characters in text CRDTs, old counter values that have been superseded — all of this metadata accumulates over time and must eventually be cleaned up.

The irony is profound: garbage collecting a CRDT requires coordination. You need all replicas to agree that a tombstone is no longer needed (because all replicas have observed the corresponding deletion). This agreement is, itself, a form of consensus.

// CRDT garbage collection requires coordination — the irony
Procedure GarbageCollect(orset):
    // Find tombstones that ALL replicas have observed
    // This requires knowing each replica's "state version"
    min_version = Min(replica_versions for all replicas)

    for each (element, tag) in orset.tombstones:
        if tag.version <= min_version:
            // All replicas have seen this tombstone
            // Safe to remove the tombstone AND the corresponding entry
            orset.tombstones.remove((element, tag))
            orset.entries.remove((element, tag))

    // BUT: how do we know min_version without consensus?
    // Option 1: Use a background consensus protocol (defeats the purpose?)
    // Option 2: Use vector clocks and gossip (eventually consistent GC)
    // Option 3: Accept unbounded growth (not practical)
    // Option 4: "Epoch-based" GC with a coordinator (practical but impure)

Most practical CRDT systems use option 4: a coordinator periodically snapshots the state, all replicas sync to the snapshot, and old metadata is discarded. This works well in practice but means your “coordination-free” data structure actually requires periodic coordination for maintenance.

When CRDTs Are Moving the Problem, Not Solving It

The most important criticism of CRDTs is not that they do not work — they do, for what they are designed for. The criticism is that they are sometimes used to avoid solving a coordination problem that actually needs to be solved.

Example: Shopping cart. A user adds item A on their phone and removes item B on their laptop. With an OR-Set CRDT, both operations succeed locally and merge correctly. But what if item B is a pre-order with a price guarantee? The removal might need to trigger a refund, which requires coordination with the payment system. The CRDT handles the data structure, but the business logic still requires coordination.

Example: Distributed counter for inventory. You use a PN-Counter to track inventory across warehouses. Warehouse A decrements (sold an item), warehouse B decrements (sold an item). The counter goes to -1. Now what? You have oversold. The CRDT correctly tracked the decrements, but the invariant you needed (inventory >= 0) was violated because CRDTs cannot enforce it.
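The inventory scenario is easy to reproduce. This sketch (a minimal PN-Counter, illustrative names) shows both warehouses passing their local check and the merged counter landing at -1:

```python
# Two warehouses each sell the "last" item concurrently. Each local check
# passes, but the merged PN-Counter violates the inventory >= 0 invariant.

class PNCounter:
    def __init__(self):
        self.p = {}  # replica -> total increments
        self.n = {}  # replica -> total decrements

    def increment(self, rid, k=1): self.p[rid] = self.p.get(rid, 0) + k
    def decrement(self, rid, k=1): self.n[rid] = self.n.get(rid, 0) + k
    def value(self): return sum(self.p.values()) - sum(self.n.values())
    def merge(self, other):
        for src, dst in ((other.p, self.p), (other.n, self.n)):
            for rid, c in src.items():
                dst[rid] = max(dst.get(rid, 0), c)

stock_a = PNCounter(); stock_a.increment("origin", 1)  # one item in stock
stock_b = PNCounter(); stock_b.merge(stock_a)          # replicated to B

# Both warehouses check locally (stock is 1) and sell:
assert stock_a.value() == 1 and stock_b.value() == 1
stock_a.decrement("A")
stock_b.decrement("B")

stock_a.merge(stock_b)
print(stock_a.value())  # -1 -- both sales "succeeded"; you have oversold
```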

Example: User permissions. You use an OR-Set for a user’s roles. Admin A revokes the “admin” role. Concurrently, admin B adds the “super-admin” role. The OR-Set merges with “add wins” semantics: the user ends up with “super-admin” but not “admin.” Whether this is the correct outcome depends on your security model, and it is almost certainly not what either admin intended.

In all of these cases, the CRDT correctly implements its specification. The problem is that the specification does not capture the actual requirements. The coordination problem has not been eliminated — it has been pushed from the data structure layer to the application layer, where it is harder to see and harder to handle correctly.

CRDTs vs. Consensus: An Honest Comparison

| Requirement | CRDTs | Consensus (Paxos/Raft) |
|---|---|---|
| Availability during partition | Full (reads and writes) | Majority side only |
| Consistency model | Strong eventual consistency | Linearizability |
| Conflict resolution | Automatic (by data structure design) | Prevention (by serialization) |
| Invariant enforcement | Not possible | Possible |
| Total ordering | Not possible | Guaranteed |
| Latency | Local (no coordination) | Round-trip to quorum |
| Throughput | Unlimited (local operations) | Bounded by leader/quorum |
| Metadata overhead | Significant (grows over time) | Minimal (log compaction) |
| Implementation complexity | Moderate (basic types) to very high (rich types) | High |
| Correctness reasoning | Algebraic (lattice properties) | Protocol-based (message ordering) |

The fundamental trade-off is: CRDTs give you availability and low latency at the cost of giving up strong consistency and invariant enforcement. Consensus gives you strong consistency and invariant enforcement at the cost of availability during partitions and higher latency.

This is not a matter of one being “better.” It is the CAP theorem in action. CRDTs choose AP (availability and partition tolerance). Consensus chooses CP (consistency and partition tolerance). You cannot have all three, and which two you choose depends on your application.

When to Use CRDTs (For Real)

After all the caveats, here is when CRDTs genuinely shine:

  1. Multi-datacenter replication with independent operation. If each datacenter must continue operating during a network partition, and eventual convergence after the partition heals is acceptable, CRDTs are the right tool.

  2. Collaborative editing. Real-time collaborative editing on documents, whiteboards, or other creative tools. The “add wins” semantics of CRDTs is usually the right conflict resolution for human-generated content.

  3. Metrics and counters. Distributed counters, histograms, and other aggregation metrics where approximate real-time values are useful and exact consistency is not required.

  4. Caching with automatic conflict resolution. Distributed caches that replicate across nodes, where LWW or similar semantics are acceptable for stale data.

  5. Offline-first applications. Mobile or edge applications that must work without connectivity and sync when a connection is available. CRDTs let the application function fully offline with guaranteed convergence on reconnection.

And here is when you should not use CRDTs, no matter how appealing they seem:

  1. Financial transactions. Money requires invariants (no negative balances, atomic transfers). CRDTs cannot help.

  2. User authentication and authorization. Security-critical operations require strong consistency. A revoked permission must be revoked everywhere, not “eventually.”

  3. Coordination problems. Leader election, distributed locking, total ordering — these are fundamentally coordination problems. CRDTs are coordination-free by design.

  4. Anything where “both concurrent operations succeed” is the wrong answer. If two users try to book the same hotel room, exactly one should succeed. CRDTs will happily let both succeed and leave you with an overbooked hotel.

A Final Perspective

CRDTs are not a replacement for consensus. They are an alternative for situations where consensus is unnecessary, unavailable, or too expensive. The mathematical elegance is real, and the practical applications — particularly in collaborative editing and geo-replicated systems — are significant.

But the hype around CRDTs has sometimes obscured a fundamental truth: most interesting distributed systems problems require coordination. CRDTs are a way to carve out the subset of problems that do not, and solve those problems well. For everything else, you still need the agony of consensus.

The most dangerous CRDT deployment is one where someone has convinced themselves that they do not need coordination, when in fact they do. The data structure will work perfectly. The application will be subtly, silently wrong. And unlike a consensus protocol that fails loud (timeout, leader election, unavailability), a CRDT that cannot enforce your invariants fails quiet.

Quiet failures are the worst kind.

Virtual Consensus and Log-Based Architectures

Just Use a Log

There is a certain kind of distributed systems architect who, when confronted with any problem, will say: “Just use a log.” Need to replicate state? Write changes to a log and replay them. Need to coordinate services? Put coordination messages in a log. Need to build a database? It is a log with indexes on top. Need to make breakfast? Write the recipe to a log and have your toaster subscribe.

This sounds glib, but there is a deep truth underneath the meme. A log — specifically, a durable, replicated, totally ordered log — is equivalent to consensus. If you have a replicated log, you can build any replicated state machine on top of it. If you have consensus, you can build a replicated log. The two are formally equivalent.

Virtual consensus takes this observation to its logical conclusion: factor out the consensus into a shared log service, and build everything else — state machines, coordination services, databases — as consumers of that log. The consensus happens once, at the log level, and everything above it is “just” deterministic replay.

The idea is elegant, practical, and — in the right contexts — genuinely transformative. It is also, as with most things in distributed systems, rather harder than it sounds.

The Shared Log Abstraction

A shared log provides a simple API:

Interface SharedLog:
    // Append an entry to the log. Returns the position (offset) assigned.
    // This is the ONLY operation that requires consensus.
    Function Append(entry: bytes) -> LogPosition

    // Read the entry at a given position.
    // This is a simple read — no consensus needed.
    Function Read(position: LogPosition) -> bytes

    // Get the current tail of the log (the next position to be written).
    Function GetTail() -> LogPosition

    // Read a range of entries.
    Function ReadRange(start: LogPosition, end: LogPosition) -> List<bytes>

The critical insight: only Append requires consensus (to establish the total order). Read, ReadRange, and GetTail are simple storage operations that can be served from any replica that has the data.

This separation is powerful because it means the consensus protocol — the hard, expensive, latency-sensitive part — is invoked only when new data enters the system. Reading existing data is cheap and can be parallelized.
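A single-process sketch makes the asymmetry visible. Here a lock stands in for the consensus round that a real Append would run; Read and GetTail remain plain lookups (illustrative only, not a distributed implementation):

```python
# Single-process sketch of the SharedLog interface. Only append is
# "coordinated" (a lock stands in for consensus); reads are plain lookups.
import threading

class SharedLog:
    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()  # stand-in for the consensus protocol

    def append(self, entry: bytes) -> int:
        with self._lock:               # the only coordinated operation
            self._entries.append(entry)
            return len(self._entries) - 1   # position assigned

    def read(self, position: int) -> bytes:
        return self._entries[position]      # no coordination needed

    def get_tail(self) -> int:
        return len(self._entries)           # next position to be written

    def read_range(self, start: int, end: int) -> list:
        return self._entries[start:end]

log = SharedLog()
p0 = log.append(b"set x=1")
p1 = log.append(b"set x=2")
print(p0, p1, log.get_tail())   # 0 1 2
print(log.read_range(0, 2))     # [b'set x=1', b'set x=2']
```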

Corfu: The Original Shared Log

Corfu (published by Balakrishnan et al. at NSDI 2012) was the first system to seriously explore the shared log abstraction as a building block for distributed systems.

Corfu’s architecture separates the log into two planes:

The sequencer is a single node responsible for issuing log positions. When a client wants to append, it first asks the sequencer for the next available position. The sequencer is essentially a counter — it returns the next position and increments.

The storage nodes store the actual log entries. The log is striped across multiple storage nodes for parallelism. Each log position maps to a specific storage node (using a simple hash or a configurable mapping).

// Corfu architecture

Structure CorfuClient:
    sequencer: SequencerClient
    storage_nodes: List<StorageNodeClient>
    layout: LayoutMap  // maps log positions to storage nodes

Procedure Append(client, entry):
    // Step 1: Get next log position from sequencer
    position = client.sequencer.GetToken()

    // Step 2: Write to the appropriate storage node
    storage_node = client.layout.GetNode(position)
    success = storage_node.Write(position, entry)

    if not success:
        // Position was already written (race condition with another client)
        // Back off and retry — the retry obtains a fresh position
        return Retry(client, entry)

    return position

Procedure Read(client, position):
    storage_node = client.layout.GetNode(position)
    return storage_node.Read(position)

Wait — where is the consensus? The sequencer is a single node. If it fails, the system is unavailable for appends. Is this not the leader bottleneck all over again?

Yes and no. The sequencer is not a consensus protocol. It is a simple counter. Its failure handling is straightforward: a new sequencer is elected (using a separate consensus protocol, like Paxos, for the election), the new sequencer reads the storage nodes to find the current tail of the log, and it resumes issuing positions.

The key design choice is that the sequencer does not store any durable state — it is a soft-state cache of the current log position. This means failover is fast (discover the tail from storage nodes, start issuing positions) and the sequencer is not a durability bottleneck.

The consensus for durability happens implicitly: once a client writes an entry to a storage node at a position, that entry is durable. The total order is established by the sequencer’s position assignment. As long as the sequencer assigns unique, monotonically increasing positions, the log is totally ordered.

Corfu’s Storage Layer

Each storage node manages a simple write-once address space:

Structure StorageNode:
    log: Map<LogPosition, bytes>
    trimmed_positions: Set<LogPosition>

Procedure HandleWrite(position, entry):
    if position in log:
        Reply(ErrorAlreadyWritten)
    else if position in trimmed_positions:
        Reply(ErrorTrimmed)
    else:
        log[position] = entry
        Reply(OK)

Procedure HandleRead(position):
    if position in log:
        Reply(OK, log[position])
    else if position in trimmed_positions:
        Reply(ErrorTrimmed)
    else:
        Reply(ErrorNotWritten)

The write-once semantics are crucial. If two clients somehow get the same position from the sequencer (which should not happen but might during sequencer failover), only one will succeed. The other will see ErrorAlreadyWritten and must retry with a new position.
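Write-once semantics in miniature: a storage cell that accepts exactly one write, so whichever client's write lands first wins and the loser must fetch a fresh position. This is a sketch of the rule, not of Corfu's wire protocol:

```python
# Write-once address space sketch: the first write to a position wins
# permanently; later writers see ALREADY_WRITTEN and must retry elsewhere.

ALREADY_WRITTEN, OK, NOT_WRITTEN = "already_written", "ok", "not_written"

class WriteOnceStore:
    def __init__(self):
        self.cells = {}  # position -> bytes

    def write(self, position, entry):
        if position in self.cells:
            return ALREADY_WRITTEN        # first writer wins, forever
        self.cells[position] = entry
        return OK

    def read(self, position):
        if position in self.cells:
            return OK, self.cells[position]
        return NOT_WRITTEN, None

store = WriteOnceStore()
print(store.write(7, b"client-1 entry"))  # ok
print(store.write(7, b"client-2 entry"))  # already_written -- client 2 retries
print(store.read(7))                      # (ok, b'client-1 entry')
```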

For durability, Corfu replicates each log entry to multiple storage nodes using chain replication. The client writes to the head of the chain, which forwards to the next node, and so on. The write is considered durable when the tail of the chain acknowledges.

// Chain replication for durability
Procedure AppendWithChainReplication(client, entry):
    position = client.sequencer.GetToken()

    // Get the chain for this position
    chain = client.layout.GetChain(position)

    // Write to head of chain — it propagates to tail
    head = chain[0]
    success = head.ChainWrite(position, entry, chain)

    if not success:
        return Retry(client, entry)  // retry obtains a fresh position

    return position

Tango: Virtual State Machines on a Shared Log

Tango (Balakrishnan et al., SOSP 2013) builds on Corfu to create virtual replicated state machines. The idea: instead of running a separate consensus protocol for each replicated object, put all operations into the shared log and have each object replay the log to update its state.

Structure TangoObject:
    oid: ObjectId
    state: Any  // the application-level state
    version: LogPosition  // last log position applied to this object
    log_client: CorfuClient

Procedure Apply(tango_obj, operation):
    // Step 1: Append the operation to the shared log
    entry = LogEntry{
        object_id: tango_obj.oid,
        operation: operation
    }
    position = tango_obj.log_client.Append(Serialize(entry))

    // Step 2: Replay the log up to the new position to update state
    Sync(tango_obj, position)

    return tango_obj.state

Procedure Sync(tango_obj, up_to_position):
    // Read and apply all log entries from our current version to up_to_position
    for pos in range(tango_obj.version + 1, up_to_position + 1):
        entry = tango_obj.log_client.Read(pos)
        log_entry = Deserialize(entry)

        if log_entry.object_id == tango_obj.oid:
            // This entry is for our object — apply it
            tango_obj.state = ApplyOperation(tango_obj.state, log_entry.operation)

        // Skip entries for other objects

    tango_obj.version = up_to_position

This is remarkable. Multiple independent objects — a key-value store, a counter, a queue — can all be replicated by sharing a single log. The objects do not know about each other. They do not need their own consensus protocols. They just read the log.

Even more remarkably, this gives you cross-object transactions for free. If you want to atomically update objects A and B, you append a single log entry that contains both operations. When each object replays the log, they both apply the transaction. Because the log is totally ordered, all replicas see the transaction at the same position and apply it atomically.

// Cross-object transaction using shared log
Procedure TransactionalUpdate(log_client, operations: List<(ObjectId, Operation)>):
    entry = TransactionEntry{
        operations: operations
    }
    position = log_client.Append(Serialize(entry))

    // Each object, when it replays the log, will see this entry
    // and apply the operation intended for it
    return position

// Modified Sync to handle transactions
Procedure SyncWithTransactions(tango_obj, up_to_position):
    for pos in range(tango_obj.version + 1, up_to_position + 1):
        entry = tango_obj.log_client.Read(pos)
        parsed = Deserialize(entry)

        if parsed is TransactionEntry:
            for (oid, operation) in parsed.operations:
                if oid == tango_obj.oid:
                    tango_obj.state = ApplyOperation(tango_obj.state, operation)
        else if parsed is LogEntry:
            if parsed.object_id == tango_obj.oid:
                tango_obj.state = ApplyOperation(tango_obj.state, parsed.operation)

    tango_obj.version = up_to_position

The Catch: Replay Cost

The elegance of Tango comes with a cost: every object must replay the entire log, filtering for entries relevant to it. If the log has a million entries and only 1% are relevant to your object, you are reading 990,000 irrelevant entries.

Tango addresses this with stream multiplexing: the log supports multiple named streams, and each object is assigned to a stream. The log maintains per-stream indexes, so an object can efficiently read only the entries in its stream.

// Stream-aware shared log
Interface StreamAwareLog:
    Function Append(stream_id: StreamId, entry: bytes) -> LogPosition
    Function ReadStream(stream_id: StreamId, from: LogPosition) -> List<(LogPosition, bytes)>

    // For cross-stream transactions: append to multiple streams atomically
    Function AppendMultiStream(stream_ids: List<StreamId>, entry: bytes) -> LogPosition

This works, but it introduces complexity. The stream index must be maintained consistently. Cross-stream transactions must be visible to all involved streams. And the space overhead of the index itself can be significant.

The Relationship to Kafka

If this all sounds familiar, it should. Kafka is, at its core, a shared log. Producers append to topics (which are logs). Consumers read from topics. The ordering within a partition is total. Multiple applications can consume the same topic independently.

The differences between Kafka and Corfu are architectural, not conceptual:

| Property | Corfu | Kafka |
|---|---|---|
| Ordering | Global total order (single log) | Per-partition order |
| Sequencing | Dedicated sequencer node | Partition leader |
| Replication | Chain replication | ISR (pull-based) |
| Storage | Write-once address space | Append-only segments |
| Primary use case | Building replicated state machines | Message passing and stream processing |
| Client model | Library (in-process) | Service (over network) |

Kafka trades global total order for partitioned parallelism. This makes it horizontally scalable (different partitions can be on different brokers) but means you cannot do cross-partition transactions without external coordination. Corfu provides global order but concentrates sequencing in a single node, which limits throughput.

The log-centric philosophy is the same. The trade-offs reflect different priorities.

Total Order Broadcast and Its Equivalence to Consensus

At this point, we should make explicit what has been implicit throughout this chapter: total order broadcast (TOB) and consensus are equivalent problems. Given a solution to one, you can build a solution to the other.

Consensus from TOB: To run consensus, broadcast your proposed value via TOB. All nodes receive all proposals in the same order. The first proposal received is the decided value.

TOB from Consensus: To broadcast a message with total order, use consensus to agree on the next message to deliver. Run a consensus instance for each position in the delivery sequence.
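The first direction is almost trivially short in code. In this sketch the list stands in for a real total-order broadcast layer; because every node receives the identical delivery sequence, applying "first proposal wins" independently at each node yields the same decision everywhere:

```python
# "Consensus from TOB" in miniature: each node independently decides the
# first proposal in the (shared, totally ordered) delivery sequence.

def decide_via_tob(proposals_in_delivery_order):
    """Each node applies this rule on its own -- no further messages."""
    return proposals_in_delivery_order[0]   # first delivered proposal wins

# Three nodes propose; TOB delivers one agreed order to everyone:
delivery_order = ["B: value=42", "A: value=7", "C: value=9"]
decisions = [decide_via_tob(delivery_order) for _node in range(3)]
print(decisions)  # all three nodes decide 'B: value=42'
```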

A shared log is total order broadcast. Append broadcasts a message. The log position is the total order. Every consumer sees the same messages in the same order.

This equivalence is why “just use a log” is both profound and hand-wavy. It is profound because it reduces every distributed coordination problem to log appending. It is hand-wavy because log appending is consensus — you have not eliminated the hard problem, you have just given it a different name and a cleaner interface.

The value of the log abstraction is not that it makes consensus easy. It is that it makes consensus reusable. Solve consensus once, in the log, and every application built on top gets consensus for free.

CorfuDB and VMware’s NSX

CorfuDB is the open-source descendant of Corfu, developed within VMware. It was used in production as the metadata store for VMware NSX (the network virtualization platform).

The architecture is a practical realization of the Tango vision: CorfuDB provides a replicated object framework where application objects are backed by a shared log. The framework handles serialization, log replay, snapshotting, and transaction support.

// CorfuDB-style replicated object
@CorfuObject
Structure NetworkPolicy:
    rules: List<FirewallRule>
    version: Integer

    @Mutator
    Procedure AddRule(rule):
        // This method is automatically logged to the shared log
        self.rules.append(rule)
        self.version++

    @Accessor
    Function GetRules():
        // This method triggers a log sync before reading
        return self.rules

    @TransactionalMutator
    Procedure ReplaceAllRules(new_rules):
        // This runs within a transaction
        self.rules = new_rules
        self.version++

The VMware NSX use case is illustrative: network policies need to be replicated across multiple controllers consistently. Using a shared log means all controllers see the same sequence of policy changes and apply them in the same order. The log provides both the replication mechanism and the consistency guarantee.

In practice, CorfuDB encountered the usual challenges of log-based systems: garbage collection (old log entries need to be trimmed), checkpointing (to avoid replaying the entire log on startup), and handling of slow consumers (who fall too far behind the log tail).

Delos: Virtual Consensus at Meta

Delos, developed at Meta (formerly Facebook), takes the virtual consensus idea one step further. It asks: what if the consensus implementation itself could be swapped out behind the log API?

The architecture has three layers:

  1. VirtualLog: A log API that applications program against.
  2. Loglet: A pluggable implementation of a log segment. Different loglets can use different consensus protocols (or no consensus at all).
  3. MetaStore: A small, reliable store that tracks which loglet is responsible for which range of the log.

// Delos architecture

Interface Loglet:
    Function Append(entry: bytes) -> LogPosition
    Function Read(position: LogPosition) -> bytes
    Function Seal() -> LogPosition  // prevent further appends, return final position

Structure VirtualLog:
    metastore: MetaStore
    active_loglet: Loglet
    sealed_loglets: List<(LogPositionRange, Loglet)>

Procedure Append(vlog, entry):
    // Append to the currently active loglet
    try:
        position = vlog.active_loglet.Append(entry)
        return vlog.TranslateToGlobalPosition(position)
    catch LogletSealedException:
        // Active loglet was sealed — switch to new one
        vlog.SwitchLoglet()
        return Append(vlog, entry)  // retry with new loglet

Procedure Read(vlog, global_position):
    // Find which loglet contains this position
    loglet = vlog.FindLoglet(global_position)
    local_position = vlog.TranslateToLocalPosition(global_position)
    return loglet.Read(local_position)

Procedure SwitchLoglet(vlog):
    // Seal the current loglet
    old_loglet = vlog.active_loglet
    final_pos = old_loglet.Seal()

    // Record the sealed loglet's range in the metastore
    range = (vlog.active_loglet_start_position, final_pos)
    vlog.metastore.RecordSealedLoglet(old_loglet, position_range = range)

    // Create a new loglet (possibly using a different implementation!)
    new_loglet = CreateLoglet(vlog.metastore.GetActiveLogletConfig())
    vlog.active_loglet = new_loglet
    vlog.sealed_loglets.append((range, old_loglet))

The brilliance of Delos is the Seal() operation. When a loglet is sealed, no more entries can be appended to it. The loglet becomes read-only. A new loglet is created for future appends. This allows:

  1. Live migration between consensus protocols. Seal the old loglet (backed by, say, Paxos), create a new one (backed by Raft), and continue. Applications see a seamless log.

  2. Heterogeneous storage. Old loglets can be backed by cheap, slow storage. The active loglet uses fast, expensive storage. Historical reads go to the cold tier; current appends go to the hot tier.

  3. Testing and experimentation. You can run a new consensus implementation for a subset of the log (a single loglet), compare its behavior with the production implementation, and switch back if problems arise.

  4. Reconfiguration without downtime. Changing the replica set? Seal the old loglet, create a new one with the new replica set, done.

Delos in Practice

Meta uses Delos for several internal control plane services. The initial deployment used a simple loglet backed by a ZooKeeper ensemble. Later deployments used custom loglet implementations optimized for Meta’s infrastructure.

The practical benefits reported by Meta:

  • Faster iteration. New consensus implementations can be deployed behind the VirtualLog API without changing application code.
  • Easier testing. The loglet interface is small and well-defined, making it easier to test new implementations in isolation.
  • Graceful degradation. If a loglet implementation has a bug, it can be sealed and replaced, limiting the blast radius.
  • Separation of concerns. Application developers think in terms of log operations. Infrastructure engineers think in terms of loglet implementations. The VirtualLog bridges the gap.

Why “Just Use a Log” Is Both Profound and Hand-Wavy

Let us be honest about the limitations of the log-based approach.

The Profundity

The shared log abstraction genuinely simplifies distributed system design. Instead of every application implementing its own replication protocol, they all share a single, well-tested, well-understood log. The benefits are real:

  • Correctness by construction. If the log is correct (totally ordered, durable, replicated), then any deterministic state machine built on top is correct. You prove the log correct once; applications are correct by construction.
  • Composability. Multiple objects can share a log, and cross-object transactions are straightforward. This is much harder with per-object consensus protocols.
  • Operational simplicity. One system to monitor, tune, and debug, rather than N separate consensus implementations.

The Hand-Waviness

The log is a bottleneck. A single totally ordered log has a throughput limit determined by the consensus protocol and the sequencer. For write-heavy workloads, this bottleneck is real. Partitioning the log (like Kafka does) restores throughput but sacrifices global total order — the very thing the log was supposed to provide.

Replay is not free. Building state by replaying a log means that startup time grows with log length. Snapshotting mitigates this but adds complexity. A consumer that falls behind must catch up, which can take significant time for high-volume logs.

The log grows. An append-only log grows without bound. Garbage collection, compaction, and trimming are necessary and non-trivial. When can you safely trim old entries? Only when all consumers have processed them — which requires tracking consumer progress, which requires coordination.
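The trimming rule above reduces to a minimum over consumer positions. A sketch, with hypothetical names:

```python
# Sketch of safe log trimming: only positions below every consumer's
# progress marker may be discarded. Names are hypothetical.

def safe_trim_point(consumer_positions):
    """Return the highest position (exclusive) that can be trimmed.

    consumer_positions maps consumer id -> next position that consumer
    will read. With no registered consumers, nothing is provably safe.
    """
    if not consumer_positions:
        return 0
    return min(consumer_positions.values())

progress = {"billing": 120, "search-indexer": 95, "audit": 400}
trim_to = safe_trim_point(progress)  # the slowest consumer pins the log
```

This is exactly the coordination the paragraph warns about: the log service must track every consumer, and one stalled consumer pins the entire log.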

Determinism is hard. Log-based state machine replication requires that all replicas execute the same operations and produce the same results. This means operations must be deterministic. No random numbers, no system clock reads, no external I/O during replay. Enforcing determinism in application code is a constant battle.
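To make the determinism requirement concrete, here is a toy replay in Python. Recording nondeterministic inputs (here, a timestamp) inside the log entry at append time is the standard state-machine-replication remedy; the function names are invented:

```python
import time

# Toy illustration of the determinism requirement. A replica that reads
# the wall clock during apply diverges across replays; recording the
# timestamp in the log entry at append time keeps every replay identical.

def apply_broken(state, entry):
    # Nondeterministic: each replica (and each replay) sees its own clock.
    state[entry["key"]] = (entry["value"], time.time())

def apply_deterministic(state, entry):
    # Deterministic: the timestamp was decided once, before the entry
    # was appended, and travels inside the entry.
    state[entry["key"]] = (entry["value"], entry["ts"])

log = [
    {"key": "x", "value": 1, "ts": 1700000000.0},
    {"key": "y", "value": 2, "ts": 1700000001.5},
]

replica_a, replica_b = {}, {}
for entry in log:
    apply_deterministic(replica_a, entry)
    apply_deterministic(replica_b, entry)
# replica_a and replica_b agree on every replay; the broken version
# would disagree in the timestamp component.
```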

Latency characteristics are not always great. Every operation goes through the log: append, wait for commit, sync, apply. For read-heavy workloads, this adds unnecessary latency. Optimizations like read leases or local read paths are necessary but add complexity.

// The hidden costs of log-based replication

Procedure HandleClientRequest(request):
    // Step 1: Append to log (consensus latency)
    position = log.Append(Serialize(request))
    // Latency: network RTT to quorum + disk fsync

    // Step 2: Wait for local replica to catch up to this position
    WaitUntil(local_state.version >= position)
    // Latency: depends on how far behind we are

    // Step 3: Read result from local state
    result = local_state.GetResult(position)

    return result

// Total latency = consensus + replay catch-up + local read
// vs. direct state machine replication where latency = consensus + apply
// The catch-up step is the extra cost of indirection through the log

Why This Approach Has Not Taken Over the World

Despite the elegance of virtual consensus and log-based architectures, they remain niche. Most distributed systems use embedded consensus (Raft in etcd, Paxos in Spanner, ZAB in ZooKeeper) rather than building on a shared log. Why?

Coupling vs. decoupling. A shared log introduces a dependency between all services that use it. If the log is down, everything is down. If the log is slow, everything is slow. Embedded consensus has failure isolation — a problem with etcd does not affect your database.

The abstraction is leaky. Applications need to understand log positions, replay, snapshotting, and trimming. The log API is simple, but using it correctly requires understanding the distributed systems underneath. The abstraction saves implementation effort but not understanding effort.

Existing ecosystem. etcd, ZooKeeper, and Consul provide battle-tested coordination services with rich APIs (watches, ephemeral nodes, transactions). A shared log provides a lower-level primitive that requires more work to turn into a usable coordination service.

Performance trade-offs. A general-purpose shared log is optimized for no specific workload. A purpose-built consensus protocol (like Raft in CockroachDB, tuned for the database’s access patterns) can outperform a general-purpose log for its specific use case.

Organizational factors. In most organizations, each team manages its own infrastructure. Sharing a log across teams creates coordination overhead (who manages it? who pays for it? whose SLA applies?). This is an organizational problem, not a technical one, but organizational problems kill more architectures than technical ones.

The Spectrum of Log Usage

In practice, systems exist on a spectrum from “embedded consensus” to “fully virtual consensus”:

Embedded Consensus          Hybrid                    Virtual Consensus
     |                        |                             |
   etcd                    Kafka                         Delos
   CockroachDB             Pulsar                        Corfu/Tango
   ZooKeeper               (log for data,                (everything via log)
   (consensus tightly       consensus for
    coupled to              metadata)
    application)

Most production systems are in the hybrid zone. Kafka uses a log for data but consensus (KRaft) for metadata. Pulsar uses BookKeeper (a log) for data but ZooKeeper (consensus) for coordination. Even CockroachDB, which uses Raft for replication, structures its storage as a log (the Raft log) that is replayed into a state machine (RocksDB/Pebble).

The fully virtual approach (Delos, Corfu) is most viable for control plane services where the workload is metadata-heavy, the throughput requirements are moderate, and the benefits of swappable consensus implementations are high.

A Practical Assessment

The log-based architecture is a genuine contribution to how we think about distributed systems. The insight that consensus can be factored out into a reusable service, and that the log is the right abstraction for that service, has influenced the design of many systems even if they do not use a shared log directly.

For practitioners, the takeaways are:

  1. Think in terms of logs. Even if you are using Raft or Paxos directly, understanding your system as a replicated log with state machines on top clarifies the architecture.

  2. Consider the shared log when building control planes. For metadata management, configuration storage, and coordination, a shared log can simplify the architecture significantly. Delos’s success at Meta is evidence that this works at scale.

  3. Do not use a shared log for everything. High-throughput data paths are better served by purpose-built protocols (like Kafka’s ISR for message streaming, or CockroachDB’s Raft for database replication). The generality of the shared log comes at a performance cost.

  4. The VirtualLog pattern is powerful. Even if you do not use Delos specifically, the idea of separating the log abstraction from the consensus implementation — allowing the implementation to be swapped, upgraded, or reconfigured without affecting applications — is a design pattern worth adopting.

The shared log is not the answer to every distributed systems problem. But it is a remarkably good answer to the question “how do I factor consensus out of my application logic?” — and that question comes up more often than you might think.

In the end, the log is not magic. It is consensus with a better API. And sometimes, a better API is exactly what you need.

Tradeoff Matrix: Latency, Throughput, Fault Tolerance

Every few months, someone publishes a blog post titled something like “Consensus Algorithm Comparison” that contains a table with five columns and six rows, each cell filled with a confident one-word summary. “Fast.” “Slow.” “Complex.” These tables are worse than useless — they give you enough information to feel informed while leaving out everything that actually matters.

This chapter is our attempt to build the comparison table we wish we’d had. It will be large. It will have footnotes. Some cells will contain uncomfortable phrases like “it depends” and “the paper doesn’t say.” That’s because honest comparison is messy, and anyone who tells you otherwise is either selling something or hasn’t built anything.

The Protocols Under Comparison

We’re comparing eleven protocols (or protocol families) that have appeared throughout this book:

  1. Paxos — Single-decree, Lamport’s original
  2. Multi-Paxos — The practical extension everyone actually means when they say “Paxos”
  3. Raft — The understandable one (allegedly)
  4. Zab — ZooKeeper’s protocol
  5. Viewstamped Replication (VR) — The one nobody reads
  6. PBFT — Castro and Liskov’s Byzantine workhorse
  7. HotStuff — Linear BFT with rotating leaders
  8. Tendermint — BFT for blockchains (and beyond)
  9. EPaxos — Leaderless when it can be
  10. Flexible Paxos — Paxos with relaxed quorums
  11. Kafka ISR — Not quite consensus, but close enough for many

A few caveats before we begin. Paxos and Multi-Paxos are algorithm families, not single implementations — there are dozens of variants, and the performance characteristics depend heavily on which variant you pick. EPaxos has been refined several times since the original paper. Kafka ISR isn’t a consensus algorithm in the formal sense, but it solves a sufficiently similar problem that excluding it would be dishonest. And any comparison that includes both crash-fault-tolerant and Byzantine-fault-tolerant protocols is inherently comparing apples to slightly paranoid oranges.

With those disclaimers lodged, let’s proceed.

Dimension 1: Message Complexity (Normal Case)

Message complexity tells you how many messages need to be exchanged to commit a single operation in the common, happy-path case — no failures, no leader changes, the sun is shining and your network is behaving.

| Protocol | Messages (Normal Case) | Message Pattern | Notes |
|---|---|---|---|
| Paxos (single-decree) | 2n (Prepare + Accept) | 2 round-trips, leader to all | First round can be skipped if leader is stable (Multi-Paxos optimization) |
| Multi-Paxos (steady state) | n | 1 round-trip, leader to all | Prepare phase amortized away after leader election |
| Raft | n | 1 round-trip, leader to all | AppendEntries + responses |
| Zab | n | 1 round-trip, leader to all | Propose + Ack in broadcast phase |
| VR | n | 1 round-trip, primary to all | Prepare + PrepareOK |
| PBFT | ~n^2 | 3 phases, all-to-all in commit | Pre-prepare + Prepare (n msgs each) + Commit (n msgs each) |
| HotStuff | n per phase, 3 phases | Linear, leader collects votes | 3 round-trips in basic; pipelined reduces effective latency |
| Tendermint | ~n^2 | Propose + Prevote + Precommit | Gossip-based, all-to-all in vote steps |
| EPaxos | n (fast path), 2n (slow path) | 1 or 2 round-trips | Fast path if no conflicts; slow path on dependency conflicts |
| Flexible Paxos | n (steady state) | Same as Multi-Paxos | Smaller quorums possible, so “n” may be smaller than majority |
| Kafka ISR | ISR size | 1 round-trip, leader to ISR | Only replicates to in-sync replicas, not all brokers |

The thing to notice immediately is the gap between CFT and BFT protocols. The crash-fault-tolerant protocols all converge on roughly n messages in steady state (one round-trip from leader to replicas). The BFT protocols pay a tax of either n^2 messages or additional round-trips. This is not a coincidence — it’s the fundamental cost of not trusting each other.

EPaxos deserves special attention. Its fast path of n messages with a single round-trip is genuinely impressive, but it requires a “fast-path quorum” that’s larger than a simple majority (in the original protocol, 2f replicas out of 2f + 1 for f faults, which for five nodes means four out of five must respond in agreement). The fast path also only works when there are no conflicting commands. In workloads with high contention, EPaxos degrades to its slow path more often than the paper’s evaluation section might lead you to believe.

What “Message Complexity” Actually Means in Practice

Here’s where we need to be honest about what these numbers don’t tell you.

Message complexity counts messages. It does not count bytes. A Raft AppendEntries RPC carrying a 4KB state machine command is one message, and a Raft AppendEntries RPC carrying a 4MB batch of commands is also one message, but your network disagrees that these are equivalent.

Message complexity also doesn’t account for persistence. Every protocol on this list (except arguably some BFT protocols with sufficiently many replicas) requires writing to stable storage before responding. That fsync call is almost always the latency bottleneck, not the network hop. A protocol with 2n messages but one persistence point may well be faster than a protocol with n messages but two persistence points.

Finally, message complexity ignores batching. Every production implementation batches multiple client requests into a single consensus round. Multi-Paxos with batching and Multi-Paxos without batching have the same message complexity but wildly different throughput. We’ll come back to this in the throughput section.
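The batching point can be quantified with a toy model: if every consensus round costs one quorum round-trip plus one fsync regardless of batch size, throughput scales linearly with the batch. A sketch, with all numbers illustrative:

```python
# Toy throughput model for a leader-based protocol. Assumes each round
# costs one quorum round-trip plus one fsync, independent of batch size.

def ops_per_second(batch_size, rtt_ms, fsync_ms):
    round_ms = rtt_ms + fsync_ms            # cost of one consensus round
    rounds_per_second = 1000.0 / round_ms
    return batch_size * rounds_per_second

# Identical message complexity per round; vastly different throughput.
unbatched = ops_per_second(batch_size=1, rtt_ms=0.5, fsync_ms=0.5)
batched = ops_per_second(batch_size=1000, rtt_ms=0.5, fsync_ms=0.5)
```

Under these made-up numbers the unbatched system commits 1,000 ops/sec and the batched one 1,000,000, with the exact same per-round message count. This is why production implementations treat batching as non-negotiable.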

Dimension 2: Message Complexity (Leader Change / View Change)

The sunny-day numbers are nice, but distributed systems live in the rain. What happens when a leader fails?

| Protocol | Leader Change Messages | Leader Change Latency | Notes |
|---|---|---|---|
| Paxos | 2n (new Prepare phase) | 1 additional round-trip | Re-runs Phase 1; conceptually simple |
| Multi-Paxos | 2n + recovery | 1 round-trip + catch-up | Must discover and fill any gaps in the log |
| Raft | n (RequestVote) + catch-up | 1 round-trip + election timeout | Randomized timeouts to prevent split votes |
| Zab | O(n * log_size) | Multiple round-trips | Discovery + Synchronization + Broadcast phases; can be expensive |
| VR | O(n^2) | 2+ round-trips | View change messages contain full logs; expensive |
| PBFT | O(n^2) | Multiple round-trips | View change requires 2f+1 view-change messages with proofs |
| HotStuff | O(n) | 1 round-trip for pacemaker | Designed for frequent leader rotation; view change is cheap |
| Tendermint | O(n^2) | Round timeout | Validators just move to next round/proposer |
| EPaxos | N/A (leaderless) | N/A | No leader to fail; but recovery of in-progress instances is complex |
| Flexible Paxos | Same as Multi-Paxos | Same as Multi-Paxos | Quorum flexibility doesn’t change recovery |
| Kafka ISR | O(ISR size) | Controller detects + shrinks ISR | Controller elects new partition leader from ISR |

This is where the protocols diverge dramatically.

HotStuff’s selling point is right here — it was specifically designed so that leader rotation (and by extension, leader failure handling) is O(n) rather than O(n^2). This matters enormously in BFT settings where you might want to rotate leaders frequently to limit a malicious leader’s ability to cause damage.

Zab’s view change deserves its reputation for complexity. The discovery phase requires the new leader to contact all followers, determine who has the most up-to-date state, synchronize that state to a quorum, and only then begin accepting new proposals. In a cluster with significant state divergence (which happens when the old leader was partitioned while still accepting writes that some followers received and others didn’t), this process can take a disturbing amount of time.

VR’s view change is similarly expensive because view-change messages historically carry full logs (though practical implementations obviously optimize this). The theoretical message complexity of O(n^2) comes from every replica needing to send its state to every other replica, though in practice you only need the new primary to collect f+1 view-change messages.

EPaxos sidesteps leader election entirely, which is wonderful until you need to recover an in-progress command instance. The recovery protocol requires running an explicit recovery phase that resembles Paxos Phase 1, and if you’re unlucky enough to have multiple concurrent recoveries for dependent commands, you can end up in a situation that will make you appreciate why leaders exist.

Kafka’s approach is characteristically pragmatic — the controller (which is itself elected, previously via ZooKeeper, now via KRaft) simply picks a new leader from the ISR. If the ISR is empty, you have bigger problems, and Kafka lets you choose between unavailability and data loss via the unclean.leader.election.enable configuration flag. That’s not theoretical elegance, but it is honest.

Dimension 3: Latency (Message Delays to Commit)

Latency measures how many sequential network round-trips a client must wait before its request is committed. This is distinct from message complexity — a protocol could send 100 messages in parallel (high message complexity) but complete in one round-trip (low latency).

| Protocol | Message Delays (Normal Case) | Message Delays (With Leader) | Notes |
|---|---|---|---|
| Paxos | 4 (client→leader, Prepare, Accept, reply) | 2 if leader is pre-elected | Single-decree; each decree is independent |
| Multi-Paxos | 2 | 2 | Client→leader, leader→quorum→leader→client |
| Raft | 2 | 2 | Same as Multi-Paxos in steady state |
| Zab | 2 | 2 | Same pattern |
| VR | 2 | 2 | Same pattern |
| PBFT | 5 | 5 | Client→primary, pre-prepare, prepare, commit, reply |
| HotStuff (basic) | 7 | 7 | Client + 3 phases of leader↔replica + reply |
| HotStuff (pipelined) | 7 (amortized ~2) | 7 (amortized ~2) | Pipelining overlaps phases of different commands |
| Tendermint | 4 | 4 | Propose, Prevote, Precommit, Commit |
| EPaxos (fast path) | 2 | N/A (leaderless) | Client→replica→quorum→replica→client |
| EPaxos (slow path) | 4 | N/A | Additional Accept phase on conflict |
| Flexible Paxos | 2 | 2 | Same as Multi-Paxos |
| Kafka ISR | 2 | 2 | Producer→leader, replicate to ISR, ack |

For CFT protocols, the answer is boringly uniform: two message delays in steady state. Client sends to leader, leader replicates to a quorum, leader responds. Everyone has converged on this because it’s the theoretical minimum for fault-tolerant replication with a stable leader.

The BFT protocols show more variation, and this is one of the genuine tradeoffs between PBFT, HotStuff, and Tendermint. PBFT’s five message delays are a consequence of its three-phase protocol (pre-prepare, prepare, commit). HotStuff basic pays seven delays for its three chained QC phases, but pipelining effectively amortizes this to two delays per committed decision (at the cost of increased latency for any individual decision — it takes longer for your specific command to commit, but the system commits one command per round-trip once the pipeline is full). Tendermint sits in between at four delays.

The Persistence Tax

Every one of these latency numbers assumes negligible disk write time. In reality, the fsync (or fdatasync, if your implementation is sophisticated enough to know the difference) at each persistence point adds anywhere from 0.1ms (NVMe SSD) to 10ms (spinning disk) to 50ms+ (cloud storage with “durable” semantics that actually mean “we’ll get around to it”).

Raft implementations typically require two persistence points per commit: one when the leader writes the entry to its own log, and one when each follower writes the entry to its log. But the leader can send AppendEntries and persist in parallel (this is an optimization that the Raft paper mentions in passing and that every production implementation uses, but which trips up everyone building Raft for the first time).

The bottom line: in a data center with modern SSDs, cross-node network latency (0.1-0.5ms per hop) and disk persistence latency (~0.1ms per fsync) are in the same ballpark. Over a WAN, network latency dominates. On spinning disks, persistence dominates. Your protocol choice matters less than your hardware in the typical case.
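The ballpark claim above can be made concrete with a crude latency model. The hardware numbers are the rough figures quoted in the text, not measurements:

```python
# Crude commit-latency model for a two-message-delay protocol where the
# leader persists in parallel with replication and each follower fsyncs
# before acking. Numbers are the text's rough figures, not benchmarks.

def commit_latency_ms(rtt_ms, fsync_ms):
    # One quorum round-trip, plus the follower fsync on the critical path.
    return rtt_ms + fsync_ms

dc_nvme = commit_latency_ms(rtt_ms=0.5, fsync_ms=0.1)    # ~0.6ms: comparable costs
dc_hdd = commit_latency_ms(rtt_ms=0.5, fsync_ms=10.0)    # ~10.5ms: disk dominates
wan_nvme = commit_latency_ms(rtt_ms=60.0, fsync_ms=0.1)  # ~60.1ms: network dominates
```

Plugging in your own RTT and fsync figures tells you which dimension to optimize before you ever start comparing protocols.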

Dimension 4: Throughput Characteristics

Throughput is where the comparison gets genuinely complicated, because throughput depends on factors that are mostly orthogonal to the consensus protocol itself.

| Protocol | Theoretical Throughput Limit | Practical Bottleneck | Batching Potential |
|---|---|---|---|
| Paxos | Low (per-decree overhead) | Per-operation Prepare phase | Low without Multi-Paxos |
| Multi-Paxos | Leader network bandwidth | Leader CPU + NIC | High (batch in Accept) |
| Raft | Leader network bandwidth | Leader CPU + NIC | High (batch in AppendEntries) |
| Zab | Leader network bandwidth | Leader CPU + NIC | High (batch in proposals) |
| VR | Leader network bandwidth | Leader CPU + NIC | High |
| PBFT | All-node network bandwidth | n^2 messages per commit | Moderate (batch in pre-prepare) |
| HotStuff | Leader network bandwidth | Leader CPU (aggregating votes) | High |
| Tendermint | All-node network bandwidth | Gossip overhead + n^2 | Moderate (batch in block) |
| EPaxos | Aggregate cluster bandwidth | Dependency tracking overhead | High per-replica |
| Flexible Paxos | Leader network bandwidth | Same as Multi-Paxos | High |
| Kafka ISR | Leader network bandwidth | Disk I/O (by design) | Very high (batch-optimized) |

The leader-based protocols (Multi-Paxos, Raft, Zab, VR, Flexible Paxos, Kafka ISR) all share the same fundamental throughput limitation: the leader is a bottleneck. Every write goes through the leader. The leader must receive the client request, append it to its log, send it to all followers, wait for a quorum of acknowledgments, and respond to the client. The leader’s NIC, CPU, and disk are the ceiling.

This is the main theoretical advantage of EPaxos — by eliminating the leader, it can distribute load across all replicas. If you have five nodes and the workload has no conflicts, each node can independently process roughly 1/5 of the requests, giving you approximately 5x the throughput of a leader-based protocol. The original EPaxos paper demonstrates this convincingly on conflict-free workloads.

But the moment you introduce conflicts (which you will, because most real workloads have hot keys), EPaxos must run its slower path, which includes additional coordination. The dependency tracking logic also adds CPU overhead per operation. In benchmarks with realistic conflict rates (5-25%), EPaxos’s throughput advantage shrinks considerably, and in high-conflict workloads it can actually perform worse than Raft because the slow path is more expensive than just going through a leader.

For the BFT protocols, the throughput picture is dominated by the O(n^2) message overhead. PBFT with n = 4 (the minimum for f = 1) sends roughly 16 messages per commit. At n = 100, that’s roughly 10,000 messages per commit. This is why BFT protocols are typically deployed with small replica counts (4, 7, maybe 13) and why the HotStuff paper’s reduction to O(n) messages was genuinely significant.
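That arithmetic generalizes to a quick comparison of growth rates. The constants are simplified to the dominant terms, so treat this as an illustration rather than exact protocol accounting:

```python
# Dominant-term message counts per commit, following the text: quadratic
# for PBFT-style all-to-all voting, linear for HotStuff-style
# leader-collected votes. Simplified; not exact protocol accounting.

def pbft_messages(n):
    # Prepare and commit are all-to-all; keep only the n^2 term.
    return n * n

def hotstuff_messages(n, phases=3):
    # Each phase is leader -> replicas plus replicas -> leader.
    return 2 * n * phases

small_bft = pbft_messages(4)        # ~16, the text's n = 4 figure
large_bft = pbft_messages(100)      # ~10,000 messages per commit
linear_bft = hotstuff_messages(100) # 600: why O(n) BFT mattered
```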

Kafka ISR deserves special mention for throughput because its entire architecture is optimized for it. Zero-copy reads, sequential disk I/O, batching at every layer, page cache utilization — Kafka’s throughput numbers are impressive not because the ISR protocol is theoretically superior, but because the implementation has been ruthlessly optimized for the append-only, batched-writes, sequential-reads use case. This is an important lesson: protocol-level message complexity matters less than implementation-level engineering for throughput.

What the Papers Claim vs. What You Actually Get

Let’s talk about performance numbers in academic papers.

Here’s a rough calibration table, assembled from reading too many evaluation sections at 2 AM:

| Protocol | Paper Claims | Reality | Why the Gap |
|---|---|---|---|
| Multi-Paxos | ~100K ops/sec | 10K-100K ops/sec | Papers use small payloads, RAM-only “persistence,” ideal networks |
| Raft | ~20K-100K ops/sec | 5K-50K ops/sec | etcd tops out around 10-20K writes/sec in practice |
| EPaxos | 2-5x Paxos throughput | 1-2x in practice | Conflict rates in real workloads reduce fast-path utilization |
| PBFT | ~10-40K ops/sec | 1K-10K ops/sec | Papers benchmark with 4 nodes, tiny payloads, localhost |
| HotStuff | “Near Raft throughput with BFT” | 50-70% of Raft | Threshold signature aggregation has real CPU cost |
| Tendermint | ~1000 TPS (blockchain mode) | 100-1000 TPS | Block size, gossip overhead, application-level validation |
| Kafka ISR | Millions of msgs/sec | Hundreds of thousands/sec per partition | Kafka’s numbers are aggregate across partitions; single partition is bottlenecked |

The gap exists for several reasons, all of which are understandable but rarely acknowledged:

Paper benchmarks use favorable conditions. Small payloads (often 0 bytes or 16 bytes), no application-level processing, networks with minimal jitter, and usually the minimum number of nodes. These conditions are valid for measuring the protocol overhead, but they’re not your production environment.

Paper benchmarks often skip persistence. The phrase “we configure the system to batch sync writes to disk every 10ms” appears in more evaluation sections than anyone wants to admit. That’s not durable consensus — that’s consensus with a 10ms window of data loss. When you add real fsync on every commit (which you must, for correctness), throughput drops substantially.

Paper benchmarks run on dedicated hardware. Your consensus nodes are probably sharing machines with seventeen other services, running in containers, on a network shared with the analytics team’s nightly Spark jobs. The variability alone kills best-case numbers.

Paper benchmarks don’t measure tail latency. Mean latency at 50% load tells you nothing about p99 latency at 90% load, which is the number that actually determines your SLA.

Dimension 5: Fault Tolerance

| Protocol | Fault Model | Nodes for f Faults | Tolerates Byzantine? | Tolerates Omission? | Tolerates Partition? |
|---|---|---|---|---|---|
| Paxos | Crash | 2f+1 | No | Yes (crash-stop) | Minority partition |
| Multi-Paxos | Crash | 2f+1 | No | Yes | Minority partition |
| Raft | Crash | 2f+1 | No | Yes | Minority partition |
| Zab | Crash | 2f+1 | No | Yes | Minority partition |
| VR | Crash | 2f+1 | No | Yes | Minority partition |
| PBFT | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| HotStuff | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| Tendermint | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| EPaxos | Crash | 2f+1 | No | Yes | Minority partition |
| Flexible Paxos | Crash | Varies | No | Yes | Depends on quorum config |
| Kafka ISR | Crash (tunable) | min.insync.replicas | No | Yes | ISR-dependent |

The fault tolerance story is straightforward in theory and a mess in practice.

All CFT protocols tolerate f crash faults with 2f+1 nodes. All BFT protocols tolerate f Byzantine faults with 3f+1 nodes. If you stopped here, you’d think this was simple.

It isn’t, because “crash fault” is a model, not a reality. Real failures include:

  • Disk corruption without node crash — your node is running but serving garbage data. CFT protocols assume this doesn’t happen.
  • Clock skew — Raft’s leader election depends on timeouts; extreme clock skew can cause unnecessary elections or, worse, split-brain in poorly implemented variants.
  • Network asymmetry — node A can reach B but not C, while B can reach both A and C. Most protocols assume symmetric partitions.
  • Slow nodes — not crashed, not Byzantine, just slow enough to be useless. This falls through the cracks of both fault models.
  • Correlated failures — the assumption of independent failures underlies the f out of 2f+1 math. When all your nodes are in the same availability zone and AWS has a bad day, independence goes out the window.

Flexible Paxos is interesting in the fault tolerance dimension because it lets you tune the tradeoff. The only safety constraint is that every Phase 1 quorum must intersect every Phase 2 quorum (for size-based quorums, q1 + q2 > n). If you set a Phase 1 quorum of n and a Phase 2 quorum of 1, you get amazingly fast writes (only one replica must acknowledge) but terrible leader election (must contact all replicas). The failure tolerance depends on your quorum choices, and choosing wrong means silent data loss. This flexibility is powerful and terrifying in equal measure.
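The tolerance arithmetic can be sketched directly. The helper below and its names are invented, but the intersection rule q1 + q2 > n is the actual Flexible Paxos safety condition for size-based quorums:

```python
# Sketch of Flexible Paxos quorum checking for simple size-based quorums.
# The intersection rule (q1 + q2 > n) is the real safety condition; the
# helper and its names are invented for illustration.

def check_fpaxos(n, q1, q2):
    if q1 + q2 <= n:
        # A Phase 1 quorum could miss a committed Phase 2 quorum entirely:
        # this is the silent data loss the text warns about.
        raise ValueError("phase 1 and phase 2 quorums may not intersect")
    return {
        "write_failures_tolerated": n - q2,     # writes need a Phase 2 quorum
        "election_failures_tolerated": n - q1,  # leader election needs Phase 1
    }

majority = check_fpaxos(n=5, q1=3, q2=3)  # classic Paxos: tolerate 2 either way
extreme = check_fpaxos(n=5, q1=5, q2=1)   # fast writes, fragile elections
```

The extreme configuration tolerates four failures for writes but zero for leader election: lose any single replica and you can never elect a new leader.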

Kafka ISR is fault-tolerant in a way that makes theorists uncomfortable. The ISR can shrink to a single replica (the leader), at which point you have zero fault tolerance but remain available. Whether this is acceptable depends on your min.insync.replicas setting and your willingness to trade safety for availability. In practice, many Kafka deployments run with min.insync.replicas=2 and replication.factor=3, giving them crash tolerance of one broker — equivalent to f=1 with 2f+1=3 nodes, just with a less formal proof.
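The arithmetic behind that common deployment is simple enough to spell out. The function below illustrates how the settings interact; it is not Kafka's implementation:

```python
# Sketch of the durability arithmetic for Kafka-style ISR settings,
# assuming acks=all and all replicas initially in sync. Illustrative only.

def broker_failures_tolerated(replication_factor, min_insync_replicas):
    """Failures the partition survives while remaining writable."""
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed replication.factor")
    return replication_factor - min_insync_replicas

# The common deployment from the text: f = 1, like a 3-node Raft group.
common = broker_failures_tolerated(replication_factor=3, min_insync_replicas=2)

# Drop min.insync.replicas to 1 and writes survive two failures, at the
# price of acknowledging writes that live on a single broker.
risky = broker_failures_tolerated(replication_factor=3, min_insync_replicas=1)
```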

The Jepsen Reality Check

No discussion of fault tolerance claims is complete without mentioning Jepsen, Kyle Kingsbury’s testing framework that has found consistency violations in nearly every distributed system it has tested. Here’s a partial scorecard relevant to our protocols:

| System | Jepsen Findings | Status |
|---|---|---|
| etcd (Raft) | Stale reads under certain configurations; lease-related issues | Fixed in subsequent releases |
| ZooKeeper (Zab) | Issues under network partitions with specific timing | Mostly fixed; some edge cases remain |
| CockroachDB (Raft) | Serializability violations under clock skew | Fixed; improved clock skew handling |
| MongoDB (custom) | Significant data loss during rollback; stale reads | Multiple rounds of fixes |
| Kafka | Under-replicated partitions can lose data; exactly-once edge cases | Configuration-dependent; improved over releases |
| Redis (Sentinel/Cluster) | Split-brain, data loss during failover | Fundamental design limitations |
| Consul (Raft) | Stale reads under default configuration | Configuration clarified |

The pattern is consistent: every system has bugs, and those bugs are most likely to manifest during network partitions, leader elections, and membership changes — exactly the scenarios that matter most. The protocols are correct on paper, but the implementations have gaps. This is not a criticism of the implementers (these are some of the best engineering teams in the industry) — it’s evidence that implementing consensus correctly is genuinely, perhaps irreducibly, difficult.

Dimension 6: Leader Requirement

| Protocol | Requires Leader? | Leader’s Role | Multi-Leader? |
|---|---|---|---|
| Paxos | Proposer (not technically a leader) | Proposes values | Multiple proposers possible (but may conflict) |
| Multi-Paxos | Yes (distinguished proposer) | Sequences all operations | No |
| Raft | Yes (strong leader) | Sequences all operations, log authority | No |
| Zab | Yes (leader) | Sequences all operations | No |
| VR | Yes (primary) | Sequences all operations | No |
| PBFT | Yes (primary) | Orders requests within a view | No, but rotates on view change |
| HotStuff | Yes (rotating leader) | Proposes blocks, collects votes | No, but designed for rotation |
| Tendermint | Yes (proposer per round) | Proposes blocks | Rotates every round |
| EPaxos | No | N/A | All replicas can lead any instance |
| Flexible Paxos | Yes (in Multi-Paxos mode) | Sequences operations | No |
| Kafka ISR | Yes (per-partition leader) | Handles reads and writes | Yes (different leaders per partition) |

The leader question matters for two reasons: performance and availability.

Performance: A single leader is a throughput bottleneck. Every operation must pass through one node. Kafka sidesteps this with per-partition leaders — different partitions can have different leaders, distributing the load. EPaxos sidesteps it by eliminating the leader entirely. Everyone else accepts the bottleneck as the price of simplicity.

Availability: A leader failure causes a period of unavailability while a new leader is elected. For Raft, this is typically 150-300ms (one election timeout). For Zab, it can be seconds. For PBFT view changes, it can be… longer than you’d like. HotStuff was explicitly designed to make leader changes fast, because in a BFT setting where you rotate leaders frequently, expensive view changes are fatal to performance.

Raft’s “strong leader” design is worth highlighting. In Raft, the leader is the sole authority on the log — followers accept whatever the leader tells them, and log entries only flow from leader to follower, never the other way. This simplifies reasoning about the protocol tremendously but means the leader does more work than in protocols like Multi-Paxos, where followers can occasionally know about entries the current leader doesn’t (from a previous leader’s incomplete rounds).

Dimension 7: Ordering Guarantees

| Protocol | Ordering Guarantee | Total Order? | Per-Key Ordering? | Notes |
|---|---|---|---|---|
| Paxos | Per-instance only | No (single decree) | No | Each instance independent |
| Multi-Paxos | Total order | Yes | Yes (implied) | Log is totally ordered |
| Raft | Total order | Yes | Yes (implied) | Log is totally ordered |
| Zab | Total order + causal | Yes | Yes | FIFO ordering for same client |
| VR | Total order | Yes | Yes (implied) | Log is totally ordered |
| PBFT | Total order | Yes | Yes | Within a view; across views with view-change |
| HotStuff | Total order | Yes | Yes | Chained QCs provide total order |
| Tendermint | Total order (per chain) | Yes | Yes | Block sequence provides total order |
| EPaxos | Partial order (fast path) | Only after execution ordering | Per-key with conflicts resolved | Dependency graph, not a log |
| Flexible Paxos | Total order | Yes | Yes | Same as Multi-Paxos |
| Kafka ISR | Total order per partition | Per-partition only | Only within partition | No cross-partition ordering |

EPaxos is the outlier here, and this is both its greatest strength and its most confusing aspect. EPaxos doesn’t maintain a totally ordered log — instead, it builds a dependency graph where commands that don’t conflict can be ordered independently. This allows parallelism (non-conflicting commands at different replicas don’t need to coordinate) but makes the execution layer more complex. You need a deterministic algorithm to linearize the dependency graph at each replica, and that algorithm (based on Tarjan’s strongly connected components) is subtle enough that multiple published versions had bugs.
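To make the execution-layer complexity concrete, here is a toy sketch of the idea (illustrative names, not EPaxos's actual data structures): condense the dependency graph into strongly connected components, execute components in dependency order, and break ties inside a component deterministically by sequence number.

```python
# Toy sketch of EPaxos-style execution ordering: committed commands form
# a dependency graph (possibly cyclic); we find strongly connected
# components (Kosaraju's algorithm here, for brevity), execute a
# command's dependencies first, and order commands inside a cycle
# deterministically by (sequence number, name).

from collections import defaultdict

def execution_order(deps, seq):
    """deps: {cmd: set of cmds it depends on} (every cmd is a key);
    seq: {cmd: sequence number}. Returns a deterministic execution list."""
    order, seen = [], set()

    def dfs(g, v, out):  # iterative DFS, appends nodes in postorder
        stack = [(v, iter(g[v]))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(g[w])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)

    for v in deps:                      # pass 1: finish order on G
        if v not in seen:
            dfs(deps, v, order)

    rev = defaultdict(set)              # pass 2: SCCs on reversed graph
    for v, ds in deps.items():
        rev.setdefault(v, set())
        for d in ds:
            rev[d].add(v)
    seen.clear()
    sccs = []
    for v in reversed(order):
        if v not in seen:
            comp = []
            dfs(rev, v, comp)
            sccs.append(comp)

    # sccs come out sources-first; dependencies must execute first,
    # so walk the condensation in reverse.
    result = []
    for comp in reversed(sccs):
        result.extend(sorted(comp, key=lambda c: (seq[c], c)))
    return result

deps = {"a": {"b"}, "b": {"a"}, "c": {"a"}}  # a<->b conflict cycle; c after a
seq = {"a": 1, "b": 2, "c": 3}
assert execution_order(deps, seq) == ["a", "b", "c"]
```

Even this toy version hints at the trap: every replica must produce the identical order from the same graph, so any nondeterminism in cycle-breaking is a divergence bug.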

Kafka’s per-partition ordering is a practical compromise that works brilliantly for many use cases but catches people off guard when they need cross-partition ordering. If you need events for user A and user B to be ordered relative to each other, they must go to the same partition. If you also need events for user B and user C to be ordered, all three users must be on the same partition. Transitive ordering requirements can collapse your partition scheme into a single partition faster than you’d expect.
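The transitive collapse is exactly the union-find closure of the ordering requirements. A minimal sketch, with hypothetical user names:

```python
# Ordering requirements behave like union-find: each "these two users
# must be mutually ordered" constraint merges their groups, and every
# resulting group must live on a single Kafka partition.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def require_ordering(parent, a, b):
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    parent[find(parent, a)] = find(parent, b)

parent = {}
require_ordering(parent, "userA", "userB")
require_ordering(parent, "userB", "userC")
# A, B, and C are now one group -> one partition for all three.
assert len({find(parent, u) for u in parent}) == 1
```

Running this closure over your real ordering constraints before you pick a partition key is a cheap way to find out whether your scheme quietly degenerates to one partition.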

Dimension 8: Membership Change Support

| Protocol | Membership Changes | Mechanism | Complexity |
|---|---|---|---|
| Paxos | Not specified | (Left as exercise for reader) | N/A |
| Multi-Paxos | Not specified in original | Various ad-hoc approaches | High — many subtle bugs |
| Raft | Joint consensus or single-server | Raft paper specifies both approaches | Moderate — but edge cases abound |
| Zab | Dynamic reconfiguration (3.5.0+) | Reconfiguration proposal | High — added years after initial release |
| VR | Reconfiguration protocol | Epoch-based | Moderate |
| PBFT | Not specified | View change can implicitly handle | Very high if attempted |
| HotStuff | Addressed in some variants | Committee rotation | Moderate in blockchain context |
| Tendermint | Validator set changes | Via application-level EndBlock | Moderate |
| EPaxos | Not specified | Open research problem | Very high |
| Flexible Paxos | Not specified | Quorum changes compound the problem | Very high |
| Kafka ISR | Built-in | Partition reassignment + ISR dynamics | Low (from user perspective) |

Membership changes — adding or removing nodes from the consensus group — are the feature that every protocol paper hand-waves over and every implementation team curses. Lamport’s Paxos paper doesn’t address it. The PBFT paper doesn’t address it. EPaxos doesn’t address it. This is not because the problem is trivial; it’s because it’s orthogonal to the core algorithm and fiendishly hard to get right.

Raft deserves credit for being the first major protocol paper to include a detailed membership change mechanism. The joint consensus approach (where the cluster temporarily operates with a configuration that spans old and new memberships) is correct but complex. The single-server change approach (adding or removing one server at a time) is simpler but requires careful sequencing.

Even Raft’s approach has subtleties that have tripped up implementations. The most notorious: a server that has been removed from the cluster but doesn’t know it yet can disrupt the cluster by starting elections with a higher term number. The pre-vote mechanism (RequestPreVote before RequestVote) was added to address this, but it’s not in the original paper.
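A hedged sketch of the pre-vote idea (simplified: real implementations also compare last log terms, not just indices):

```python
# Sketch of pre-vote: before incrementing its term and starting a real
# election, a candidate asks peers whether they WOULD grant a vote.
# A peer that has heard from a live leader recently refuses, so a
# removed or partitioned server cannot bump the whole cluster's term.

def handle_pre_vote(peer, candidate_term, candidate_last_index):
    # Refuse if we heard from the leader within the election timeout.
    if peer["ms_since_leader_contact"] < peer["election_timeout_ms"]:
        return False
    # Otherwise apply the usual "candidate at least as up to date" check
    # (reduced here to an index comparison for brevity).
    return (candidate_term >= peer["term"]
            and candidate_last_index >= peer["last_log_index"])

peer = {"ms_since_leader_contact": 20, "election_timeout_ms": 150,
        "term": 5, "last_log_index": 100}
# A removed node with an inflated term gets no pre-vote: leader is alive.
assert handle_pre_vote(peer, candidate_term=99, candidate_last_index=0) is False
```

The key property is that a failed pre-vote has no side effects: the disruptive node never forces anyone to adopt its higher term.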

Kafka wins this category handily from a user perspective — partition reassignment is a well-tested operational procedure, and the ISR mechanism naturally handles temporary membership changes (a slow broker drops out of the ISR and rejoins later). The KRaft migration adds complexity, but the partition-level membership story remains solid.

Dimension 9: Implementation Complexity

This is the most subjective dimension, but also one of the most practically important. A protocol that’s theoretically optimal but impossible to implement correctly is worse than a protocol that’s theoretically suboptimal but actually works.

| Protocol | Implementation Complexity | Lines of Code (rough) | Key Difficulties |
|---|---|---|---|
| Paxos | High | 1K-5K | Mapping to practical system; gaps in log |
| Multi-Paxos | Very High | 5K-20K | No canonical specification; endless variants |
| Raft | Moderate | 3K-10K | Canonical spec helps; still many edge cases |
| Zab | High | 10K-30K (in ZooKeeper) | Tightly coupled to ZooKeeper; complex recovery |
| VR | Moderate | 3K-10K | Well-specified; but few reference implementations |
| PBFT | Very High | 10K-30K | Cryptographic operations; state transfer; garbage collection |
| HotStuff | High | 5K-15K | Threshold signatures; pipelining correctness |
| Tendermint | High | 15K-40K (full node) | ABCI interface; gossip layer; evidence handling |
| EPaxos | Very High | 5K-15K | Dependency tracking; execution ordering; recovery |
| Flexible Paxos | Very High | Same as Multi-Paxos + config | All of Multi-Paxos plus quorum configuration safety |
| Kafka ISR | Moderate (within Kafka) | N/A (part of larger system) | ISR management; controller failover; exactly-once |

The lines-of-code numbers are extremely rough and depend heavily on language, coding style, and how much you include (just the consensus layer? The RPC layer? The storage layer? Testing?). They’re meant to give a relative sense, not absolute numbers.

Raft’s moderate complexity is its entire selling point, and it’s a real one. The Raft paper includes enough detail that a graduate student can implement it in a semester (with bugs that will take two more semesters to fix). Multi-Paxos requires reading between the lines of Lamport’s papers, several follow-up papers, and ideally a few blog posts by people who’ve implemented it. PBFT requires all of that plus cryptography.

EPaxos is in the “very high” category not because the core algorithm is inherently more complex than Multi-Paxos, but because the dependency tracking and execution ordering are genuinely difficult to get right. The original paper had a bug in the execution algorithm that was discovered years after publication. If the authors can get it wrong, so can you.

The Grand Comparison Table

For reference, here’s the complete matrix in one (necessarily compressed) view:

| | Paxos | Multi-Paxos | Raft | Zab | VR | PBFT | HotStuff | Tendermint | EPaxos | Flex. Paxos | Kafka ISR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Msg complexity (normal) | 2n | n | n | n | n | n^2 | n/phase | n^2 | n or 2n | n | ISR size |
| Msg complexity (view change) | 2n | 2n+ | n | n*log | n^2 | n^2 | n | n^2 | N/A | 2n+ | O(ISR) |
| Latency (msg delays) | 4/2 | 2 | 2 | 2 | 2 | 5 | 7/~2 | 4 | 2/4 | 2 | 2 |
| Throughput | Low | Leader-bound | Leader-bound | Leader-bound | Leader-bound | n^2-bound | Leader-bound | n^2-bound | Distributed | Leader-bound | Leader/part. |
| Fault tolerance | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Byz, 3f+1 | Byz, 3f+1 | Byz, 3f+1 | Crash, 2f+1 | Crash, varies | Crash, ISR |
| Leader required | Proposer | Yes | Yes (strong) | Yes | Yes | Yes | Yes (rotating) | Yes (rotating) | No | Yes | Yes/partition |
| Total ordering | No | Yes | Yes | Yes+causal | Yes | Yes | Yes | Yes | Partial | Yes | Per-partition |
| Membership change | Unspecified | Unspecified | Specified | Added later | Specified | Unspecified | Varies | App-level | Unspecified | Unspecified | Built-in |
| Impl. complexity | High | Very high | Moderate | High | Moderate | Very high | High | High | Very high | Very high | Moderate |

Why Microbenchmarks Lie

We’ve alluded to this throughout, but it deserves its own section.

A microbenchmark of a consensus protocol typically measures the protocol in isolation: how fast can it commit operations with no application-level processing, minimal payload sizes, and a perfectly behaved network? These numbers are useful for comparing the overhead of the protocol itself, but they tell you almost nothing about the performance of a system built on top of the protocol.

Here’s what microbenchmarks systematically miss:

1. Application processing time. If your state machine takes 1ms to apply a command, and your consensus protocol commits in 0.5ms, the protocol isn’t your bottleneck. Shaving 0.1ms off consensus latency (by, say, switching from Raft to a theoretically faster protocol) saves you 6.7% of total latency, not 20%.

2. Serialization overhead. Consensus protocols send messages. Those messages must be serialized and deserialized. In a benchmark with 16-byte payloads, serialization is negligible. In a system with 64KB payloads containing Protocol Buffers, serialization can account for 10-30% of CPU time.

3. Garbage collection pauses. If your consensus implementation is in Java (ZooKeeper, Kafka) or Go (etcd), GC pauses will periodically spike your latency in ways that the protocol cannot prevent. A C++ or Rust implementation will behave differently, but you don’t get to choose the implementation language of most off-the-shelf systems.

4. Connection management. Benchmarks run on a fixed number of connections in a controlled environment. Production systems have connection churn, TLS handshakes, keep-alive management, and TCP behavior that depends on the physical network path.

5. Interaction with other system components. Raft’s performance in etcd depends on boltdb’s write characteristics, gRPC’s overhead, the Go scheduler’s behavior under load, and how etcd’s MVCC layer interacts with the Raft log. None of these show up in a Raft microbenchmark.
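The arithmetic behind point 1 is worth spelling out, using the illustrative numbers from that paragraph:

```python
# With 1 ms of application time and 0.5 ms of consensus time, shaving
# 0.1 ms off consensus is 20% of the protocol but only ~6.7% of the
# end-to-end latency the user actually sees.

apply_ms, consensus_ms, saving_ms = 1.0, 0.5, 0.1
total_ms = apply_ms + consensus_ms
end_to_end_saving = saving_ms / total_ms          # fraction of total latency
consensus_only_saving = saving_ms / consensus_ms  # fraction of protocol latency
assert round(end_to_end_saving * 100, 1) == 6.7
assert consensus_only_saving == 0.2
```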

The lesson: don’t choose a consensus protocol based on microbenchmark numbers. Choose based on the properties that matter for your system (fault model, ordering requirements, membership flexibility), and then optimize the implementation. A well-engineered Raft implementation will outperform a naive EPaxos implementation in virtually every realistic scenario.

Dimension 10: Production Maturity and Ecosystem

No comparison is complete without acknowledging that protocols don’t exist in a vacuum — they exist in ecosystems. The best protocol with no production-grade implementation loses to a mediocre protocol with excellent tooling.

| Protocol | Notable Implementations | Years in Production | Tooling Quality | Community Size |
|---|---|---|---|---|
| Paxos | Google Chubby, Spanner (internal) | 20+ | Internal (limited public) | Academic + Google |
| Multi-Paxos | Google Spanner, Megastore (internal); Ratis | 20+ | Limited public tooling | Small public community |
| Raft | etcd, CockroachDB, TiDB, Consul, hashicorp/raft, openraft | 10+ | Excellent | Very large |
| Zab | ZooKeeper | 15+ | Good (ZK ecosystem) | Large |
| VR | Viewstamped Replication Revisited implementations | Limited | Minimal | Very small |
| PBFT | Hyperledger Fabric (early), custom implementations | 15+ (mostly academic) | Limited | Small |
| HotStuff | Diem/Libra (defunct), Aptos, Flow blockchain | 5+ | Moderate | Growing (blockchain) |
| Tendermint | Cosmos, Binance Chain, various blockchains | 7+ | Good (Cosmos SDK) | Large (blockchain) |
| EPaxos | Research prototypes, limited production use | Limited | Minimal | Very small |
| Flexible Paxos | Research prototypes; influenced WPaxos | Very limited | Minimal | Very small |
| Kafka ISR | Apache Kafka, Confluent Platform | 12+ | Excellent | Very large |

Raft and Kafka ISR dominate the ecosystem dimension. Raft has at least a dozen production-grade implementations across multiple languages, comprehensive documentation, university courses, interactive visualizations, and thousands of engineers who’ve operated Raft-based systems. Kafka has an even larger ecosystem: client libraries in every language, a rich connector ecosystem (Kafka Connect), a stream processing framework (Kafka Streams), managed service offerings, and a conference circuit.

EPaxos and Flexible Paxos, despite their theoretical advantages, have essentially zero production ecosystem. If you choose EPaxos, you’re implementing it yourself (or using one of a handful of research prototypes) with minimal community support. When you hit a bug — and you will hit bugs — your debugging resources are the original paper and its errata.

VR is in a similar position. Despite being a well-designed protocol (arguably cleaner than Paxos), its lack of a canonical implementation and tiny community make it a risky choice for production systems. Choosing VR is a statement of intellectual independence that your on-call team will not appreciate.

This ecosystem effect is self-reinforcing. More implementations mean more bug fixes, more operational knowledge, more tooling, more documentation, which attracts more implementations. Raft’s dominance in the CFT space is as much a network effect as a technical judgment.

The Real Comparison

After all these dimensions, all these tables, and all these caveats, here’s the uncomfortable truth: for most systems, the choice between Raft, Multi-Paxos, and Zab doesn’t matter much. They have roughly the same message complexity, the same latency, the same fault tolerance, and the same throughput bottleneck (the leader). The difference is in implementation maturity, ecosystem support, and how much you enjoy reading the original papers.

The meaningful distinctions are at a higher level:

  • Do you need Byzantine fault tolerance? If yes, your options are PBFT, HotStuff, Tendermint, or their derivatives. This is the single biggest fork in the decision tree.
  • Is a leader bottleneck unacceptable? If yes, consider EPaxos (if you can handle the complexity) or a partitioned approach (like Kafka) where different partitions have different leaders.
  • Do you need total ordering across all operations? If not, you can potentially use something simpler or use per-partition ordering.
  • Are you willing to trade safety for availability? If yes, Kafka ISR with unclean.leader.election.enable=true or a Flexible Paxos configuration with small Phase 2 quorums might be appropriate.

We’ll turn these distinctions into a concrete decision framework in the next chapter.

But before we leave this comparison behind, one more uncomfortable truth: the protocol you choose matters less than how well you implement, test, and operate it. A mediocre protocol with excellent monitoring, comprehensive testing, well-documented runbooks, and an on-call team that understands the failure modes will outperform an optimal protocol that nobody on the team fully understands. The tradeoff matrix helps you make an informed choice, but the choice is the beginning of the work, not the end.

The tables in this chapter are a map. The territory is your production environment, your team’s expertise, your specific workload, and the 3 AM incidents that will inevitably test everything you thought you knew about your consensus protocol. Choose wisely, but more importantly, implement carefully.

When to Use What (and When to Give Up)

This is the chapter I wish had existed when I started building distributed systems. Not the theory — the theory is in every textbook. What’s missing is someone who’s been through the decision process multiple times saying, plainly, “given your constraints, here’s what to use and here’s why.”

So that’s what this chapter is. A decision framework built from scar tissue.

The Questions You Should Ask First

Before you even look at consensus algorithms, you need to answer some questions about your system. Most teams skip this step, jumping straight to “we’ll use Raft because we’ve heard of it.” That’s not always wrong (Raft is a fine default), but it’s not engineering — it’s reflex.

Question 1: What Is Your Failure Model?

This is the single most important question and the one that narrows the field the fastest.

Crash faults only: Your nodes may crash and restart, but when they’re running, they behave correctly. They don’t lie, they don’t send conflicting messages to different peers, they don’t get compromised. This is the assumption for Paxos, Raft, Zab, VR, EPaxos, Flexible Paxos, and Kafka ISR.

Byzantine faults: Your nodes may behave arbitrarily. They can send wrong data, refuse to participate, collude with other malicious nodes, or actively try to subvert the protocol. This is the assumption for PBFT, HotStuff, and Tendermint.

Here’s the practical test: do all your nodes run the same software, managed by the same team, in infrastructure you control? If yes, crash faults are almost certainly sufficient. If you’re running a multi-organization system, a public blockchain, or anything where a compromised node could lie to its peers, you need Byzantine tolerance.

The honest addendum: even “crash faults only” is an idealization. Disk corruption, memory bit-flips, and kernel bugs can cause nodes to behave incorrectly without crashing. Some teams add application-level checksums on top of a CFT protocol rather than paying the full BFT overhead. This is pragmatic but not formally justified, so don’t put it in your design document unless you’re prepared for that conversation.

Question 2: What Is Your Network Topology?

Single data center, low latency (< 1ms between nodes): You’re in the best case. Any protocol will perform well. Leader-based protocols have minimal latency overhead because leader-to-follower round-trips are fast.

Multi-data center, moderate latency (10-100ms between DCs): Leader placement matters enormously. If your leader is in US-East and your client is in US-West, every write pays a cross-country round-trip. Consider protocols that support follower reads (Raft read leases, Zab sync’d reads) or multi-leader approaches. Or accept the latency and put the leader near the majority of clients.

Global, high latency (100-300ms between regions): This is where consensus hurts the most. Two message delays at 200ms per hop means 400ms minimum commit latency. EPaxos can help if conflicts are rare (the nearest replica handles the request). Flexible Paxos with a commit quorum located near the clients can help. But honestly, at this scale, you should strongly consider whether you actually need synchronous consensus or whether eventual consistency with conflict resolution would suffice. We’ll come back to this.

Unreliable network (frequent partitions, packet loss, high jitter): All consensus protocols degrade under adverse network conditions, but some degrade more gracefully than others. Raft’s leader election, based on randomized timeouts, handles partition healing reasonably well. PBFT’s view change under persistent network instability can be painful. Kafka’s ISR model gracefully handles transient slow brokers by shrinking the ISR.

Question 3: What Is Your Read/Write Ratio?

Write-heavy (> 50% writes): The leader is under constant pressure. Throughput is limited by the leader’s ability to replicate. Batching becomes essential. Consider Kafka ISR (designed for this), or partitioning your data so different partitions have different leaders.

Read-heavy (> 90% reads): The consensus protocol is involved in a small fraction of operations. The interesting question is how you serve reads: from the leader only (simple, consistent, bottleneck), from any replica (fast, potentially stale), or from followers with read leases (a compromise). Raft and Zab both support linearizable reads via the leader and eventually consistent reads from followers. If you’re at 99% reads, the consensus algorithm barely matters — your read path architecture matters more.

Mixed with hot keys: The worst case. If a small number of keys receive a disproportionate share of writes, even partitioning doesn’t help (the hot partition’s leader is still a bottleneck). EPaxos can help with the non-hot keys, but the hot keys still serialize. This is a fundamental problem that no consensus algorithm solves — you need application-level solutions (write combining, buffering, CRDTs for commutative operations).

Question 4: How Many Nodes?

3 nodes (f=1): The minimum for any meaningful fault tolerance. Raft, Zab, and Kafka ISR all work well here. BFT protocols need at least 4 nodes (for f=1), so this rules them out.

5 nodes (f=2): The sweet spot for most deployments. Handles two simultaneous failures, which covers rolling upgrades with one unexpected failure. This is the most common configuration for etcd, ZooKeeper, and most consensus deployments.

7-13 nodes: Uncommon for CFT protocols (diminishing returns, increased replication overhead), but normal for BFT deployments where you need higher fault tolerance.

Hundreds or thousands of nodes: You’re either building a blockchain or doing something unusual. Classical consensus protocols don’t scale to this range. You need either a hierarchical approach (consensus within small groups, coordination between groups) or a protocol specifically designed for large validator sets (e.g., HotStuff variants used in blockchain systems).

Question 5: Can You Tolerate a Leader Bottleneck?

If yes (which is the case for most systems with moderate throughput requirements), use a leader-based protocol. They’re simpler to implement, simpler to reason about, and simpler to debug.

If no, your options are:

  1. EPaxos — eliminates the leader but adds complexity in dependency tracking
  2. Partitioned leader (Kafka model) — different leaders for different partitions
  3. Read-only replicas with consensus only for writes — offloads reads but writes still go through a leader
  4. Give up on consensus — use eventual consistency with CRDTs or last-writer-wins

Most teams that think they can’t tolerate a leader bottleneck actually can, once they add batching. A Raft leader on modern hardware with NVMe storage and 10Gbps networking can handle 50,000+ writes per second with batching. If you need more than that on a single consensus group, you likely need to partition regardless.

Question 6: What Are Your Durability Requirements?

This question is less about the consensus protocol and more about how you configure it, but it has important implications.

Every write must be durable before acknowledgment: This is the “correct” mode. Every node must fsync before responding. This is what the proofs assume. Latency cost: 0.05-0.2ms per fsync on NVMe, 5-15ms on spinning disk.

Batch durability (fsync every N ms): Many systems, including Kafka by default, batch fsync calls. This means a crash within the batch window loses data. If you’re okay with this (and for many use cases, you should be), it dramatically improves throughput.

Replication is sufficient (no fsync): If you replicate to three nodes and assume they won’t all crash simultaneously, you can skip fsync entirely and rely on replication for durability. This is formally incorrect (a correlated failure — like a data center power outage — can lose data) but is used in practice by systems that prioritize throughput over absolute durability.

The choice here doesn’t change which protocol you pick, but it changes what performance you get from it. Quoting throughput numbers without specifying durability settings is meaningless.
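The throughput effect of batch durability is easy to estimate. A sketch with illustrative numbers, not measurements of any real system:

```python
# Amortizing one fsync over a batch of writes trades a bounded
# data-loss window (a crash can lose the unsynced batch) for a
# proportional throughput gain. Numbers here are illustrative.

def max_throughput(fsync_ms, batch_size):
    """Writes/sec if every `batch_size` writes cost one fsync."""
    return batch_size / (fsync_ms / 1000.0)

per_write = max_throughput(fsync_ms=0.2, batch_size=1)    # fsync every write
batched = max_throughput(fsync_ms=0.2, batch_size=100)    # one fsync per 100
assert batched == 100 * per_write
```

This is an upper bound that ignores replication and serialization, but it shows why the durability setting dominates any throughput number a vendor quotes.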

The Decision Tree

Here’s the decision framework, presented as a series of branching decisions. Follow the path that matches your requirements.

```text
START: Do you need Byzantine fault tolerance?
│
├── YES: Are you building a blockchain or multi-organization system?
│   │
│   ├── YES (public blockchain): How many validators?
│   │   ├── < 100: Tendermint or HotStuff variant
│   │   └── > 100: HotStuff variant (linear message complexity essential)
│   │
│   ├── YES (private/consortium): How many organizations?
│   │   ├── < 10: PBFT is viable, Tendermint also works
│   │   └── > 10: HotStuff or Tendermint
│   │
│   └── NO (single org, but paranoid about compromised nodes):
│       └── Consider adding integrity checks on top of a CFT protocol
│           instead of paying the full BFT tax. It's not formally
│           sound but it's practical.
│
└── NO: What's your primary use case?
    │
    ├── Metadata store / coordination service (low volume, strong consistency)
    │   └── Use an existing system:
    │       ├── etcd (Raft-based, good K8s integration)
    │       ├── ZooKeeper (Zab-based, mature, Java ecosystem)
    │       └── Consul (Raft-based, service mesh features)
    │       DON'T build your own.
    │
    ├── Distributed lock / leader election
    │   └── Use an existing system:
    │       ├── etcd with client library
    │       ├── ZooKeeper with recipes
    │       └── Redis (if you accept the caveats about Redlock)
    │       DEFINITELY don't build your own.
    │
    ├── Replicated database (state machine replication)
    │   │
    │   ├── Single-region:
    │   │   ├── < 50K writes/sec: Raft (most tooling, best understood)
    │   │   └── > 50K writes/sec: Partition across multiple Raft groups
    │   │
    │   └── Multi-region:
    │       ├── Can tolerate leader in one region: Raft with geo-aware leader
    │       ├── Need low-latency writes everywhere: EPaxos or Flexible Paxos
    │       └── Can tolerate eventual consistency: CRDTs or last-writer-wins
    │
    ├── Message queue / event streaming
    │   └── Use Kafka or a Kafka-like system (Redpanda, Pulsar)
    │       The ISR model is designed for this. Classical consensus
    │       is overkill for append-only logs with consumer groups.
    │
    ├── Replicated log / write-ahead log
    │   ├── Raft (simplest mental model: the log IS the consensus)
    │   └── Multi-Paxos (if you need gap tolerance for performance)
    │
    └── Something else:
        └── Start with Raft. Seriously. You can always switch later
            (you won't switch later, but the option is comforting).
```

Specific Recommendations

Let’s go through common scenarios with concrete recommendations.

Scenario: You’re Building a Metadata Store

Use case: Storing configuration, service discovery information, cluster membership, leader election state. Low write volume (tens to hundreds of writes per second), strong consistency required, high read volume.

Recommendation: Don’t build one. Use etcd or ZooKeeper.

If you’re in the Kubernetes ecosystem, etcd is the obvious choice — it’s what Kubernetes itself uses, it has good client libraries, and there’s a large community that has collectively discovered and fixed most of the operational foot-guns.

If you’re in the Java/JVM ecosystem, ZooKeeper is battle-tested at massive scale (it runs the configuration layer at LinkedIn, Twitter/X, and hundreds of other companies). Its API is quirky (ephemeral nodes, sequential znodes, watches with session semantics), but the recipes built on top of it (distributed locks, leader election, group membership) work.

When to break this recommendation: When your requirements include very low latency (sub-millisecond), very high read throughput (millions/sec), or you’re in an environment where running etcd or ZooKeeper is operationally impractical (embedded systems, resource-constrained environments). In these cases, embedding a Raft library (like hashicorp/raft in Go or openraft in Rust) into your application might be justified. But understand that you’re taking on the maintenance burden of a consensus-based system, and that burden is significant.

Scenario: You’re Building a Distributed Lock Service

Use case: Mutual exclusion across distributed processes. Correctness is critical — two processes holding the same lock simultaneously means data corruption.

Recommendation: Use etcd or ZooKeeper leases/ephemeral nodes.

Please do not build your own distributed lock using Redis. I know about Redlock. Martin Kleppmann wrote a detailed analysis of why Redlock doesn’t provide the guarantees it claims, and Salvatore Sanfilippo (Redis’s creator) wrote a rebuttal, and Kleppmann wrote a rebuttal to the rebuttal, and the fundamental issue remains: Redis is not a consensus system, and bolting lock semantics onto a non-consensus system requires assumptions about timing that don’t hold in practice.

If you need a distributed lock and you need it to actually be correct (not “correct unless a GC pause happens at the wrong time”), use something backed by a real consensus protocol.

When to break this recommendation: When a lock violation is annoying but not catastrophic. If the worst case of two holders is “we process a message twice” rather than “we corrupt financial records,” then Redlock-style approaches or even optimistic locking with CAS operations might be good enough. “Good enough” is underrated in engineering.

An important nuance: distributed locks have a fencing problem that most implementations ignore. Even with a consensus-backed lock, a process can acquire the lock, pause (GC, swap, CPU scheduling), and resume after the lock has expired and been acquired by another process. The original holder doesn’t know the lock has expired. The solution is fencing tokens — monotonically increasing tokens issued with each lock acquisition, which downstream resources check before accepting operations. Without fencing tokens, your distributed lock is a suggestion, not a guarantee. ZooKeeper’s sequential znodes provide this naturally. Most other lock implementations require you to build it yourself.
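A minimal sketch of the fencing-token check, with a hypothetical resource class:

```python
# Sketch of fencing tokens: the lock service issues a monotonically
# increasing token with each acquisition, and the downstream resource
# rejects any operation carrying a token older than the newest it has
# seen. A paused ex-holder's stale token is then refused.

class FencedResource:
    def __init__(self):
        self.highest_token_seen = 0

    def write(self, token, data):
        if token < self.highest_token_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token_seen = token
        return f"wrote {data!r} under token {token}"

resource = FencedResource()
resource.write(token=33, data="x")     # holder A, before its long pause
resource.write(token=34, data="y")     # holder B, after A's lock expired
try:
    resource.write(token=33, data="z")  # A resumes with a stale token
    assert False, "stale write should have been rejected"
except PermissionError:
    pass
```

Note that the check lives in the *resource*, not the lock service: fencing only works if every downstream system that the lock protects actually enforces the token.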

Scenario: You’re Building a Replicated Database

Use case: A database where writes are replicated to multiple nodes for fault tolerance, and clients need linearizable (or at least serializable) reads.

Recommendation: Raft, embedded via a library.

This is the use case that CockroachDB, TiDB, and YugabyteDB all chose Raft for. The reasons are practical:

  1. Raft has the best documentation and most reference implementations of any consensus protocol
  2. The log-based model maps directly to write-ahead logging, which every database already has
  3. Per-range Raft groups (CockroachDB’s approach) let you scale by partitioning while keeping per-partition consensus simple
  4. Leader leases enable low-latency reads without consensus

If you’re building a multi-region database and can’t tolerate the latency of going to a single leader, consider EPaxos — but only if you have the engineering depth to implement it correctly and the workload characteristics (low conflict rate) to benefit from it.

When to break this recommendation: If your database can tolerate eventual consistency for reads (with read-your-writes or causal consistency guarantees), you may not need consensus for the read path at all. Use consensus for writes and a gossip-based or anti-entropy protocol for read replicas. This is essentially what Amazon DynamoDB does — strong consistency is available but costs more because it goes through the consensus path, while eventually consistent reads go to any replica.

Scenario: You’re Building a Message Queue

Use case: Durable message delivery with ordering guarantees, consumer groups, replay capability.

Recommendation: Use Kafka. Or Redpanda. Or Pulsar. Do not build your own.

The temptation to build “a lightweight Kafka” is one of the most dangerous impulses in distributed systems engineering. You start with “we just need a simple pub/sub with persistence” and end up spending two years reimplementing consumer group coordination, exactly-once semantics, and partition rebalancing.

Kafka’s ISR model is specifically designed for the append-only, high-throughput, per-partition-ordering use case. It’s not academically elegant, but it’s operationally proven at scales that would make most consensus protocols weep.

When to break this recommendation: When you need stronger ordering guarantees than per-partition ordering (total order across all messages), when your messages are very small and Kafka’s per-message overhead is significant, or when you’re in an environment where running a Kafka cluster is impractical (embedded, edge, very small scale). In the last case, consider an embedded Raft log.

Scenario: You’re Building a Blockchain

Use case: A distributed ledger with Byzantine fault tolerance, potentially with untrusted participants.

Recommendation: Tendermint (for application-specific chains) or a HotStuff variant (for high-throughput chains).

Tendermint’s ABCI (Application BlockChain Interface) is a clean separation between consensus and application logic — you implement a state machine, Tendermint handles the consensus. This is genuinely well-designed and saves you from implementing BFT consensus yourself.

For permissionless blockchains with large validator sets, you need something with linear message complexity — HotStuff or its descendants. PBFT’s quadratic overhead makes it impractical beyond ~20 validators.

When to break this recommendation: When your “blockchain” is actually a permissioned system among a small number of known organizations. In that case, ask yourself whether you really need a blockchain or whether a replicated database with audit logging would suffice. The answer is usually the latter, but admitting that doesn’t generate as much funding.

Scenario: You’re Building a Geo-Distributed System

Use case: A system spanning multiple geographic regions, where users in each region expect low-latency access.

Recommendation: It depends on your consistency requirements, and this is where you need to be brutally honest with yourself.

Here’s the latency reality:

| Source → Destination | Round-trip Latency |
|---|---|
| Same data center | 0.1-0.5 ms |
| Same region, different AZ | 1-3 ms |
| US-East ↔ US-West | 60-80 ms |
| US-East ↔ EU-West | 80-100 ms |
| US-East ↔ Asia-Pacific | 150-250 ms |

A consensus commit that needs two cross-region round trips (client to leader, then leader to a quorum) at 200ms each is 400ms minimum. That’s visible to users. Your options:

  1. Accept the latency. Put the leader in the region with the most users, accept that other regions pay cross-region latency for writes. Use follower reads (with stale reads) for the read path. This is the simplest option and often the right one.

  2. Use EPaxos or Flexible Paxos. EPaxos can commit locally for non-conflicting operations. Flexible Paxos can be configured with commit quorums biased toward specific regions. Both reduce latency for some operations at the cost of complexity.

  3. Use per-region consensus with cross-region reconciliation. Run independent consensus groups in each region and reconcile asynchronously. This requires application-level conflict resolution and is essentially eventual consistency with extra steps.

  4. Use CRDTs or eventual consistency. If your data model permits it (counters, sets, LWW registers), skip consensus for the write path and use CRDTs. Convergence is guaranteed without coordination. This is the approach used by Riak, Redis CRDTs, and many mobile/edge systems.

  5. Give up on geo-distribution. Run everything in one region. Use a CDN for read-heavy content. Accept that users far from the region see higher latency. This is what most systems actually do, and there’s no shame in it.

Scenario: You Need Exactly-Once Semantics

Use case: Each client operation must be applied exactly once, even in the presence of retries, leader changes, and network duplicates.

Recommendation: Build idempotency into your application layer, regardless of which consensus protocol you use.

No consensus protocol gives you exactly-once semantics out of the box. They give you at-most-once (if you don’t retry) or at-least-once (if you do retry). Exactly-once requires the application to deduplicate, typically by assigning unique IDs to operations and tracking which IDs have been applied.
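The deduplication mechanism can be sketched in a few lines. The names here (`StateMachine`, `apply_once`) are illustrative, not from any particular library:

```python
# Hedged sketch: application-level deduplication for exactly-once apply.
class StateMachine:
    def __init__(self):
        self.state = {}
        self.applied_ids = set()  # in production: persisted atomically with the state

    def apply_once(self, op_id, key, value):
        """Apply an operation at most once, even if the log delivers it twice."""
        if op_id in self.applied_ids:
            return False  # duplicate: a client retry or a redelivered log entry
        self.state[key] = value
        self.applied_ids.add(op_id)
        return True

sm = StateMachine()
assert sm.apply_once("op-1", "x", 1) is True
assert sm.apply_once("op-1", "x", 1) is False  # the retry is a no-op
```

The hard part in production is not this check but making the dedup set durable and bounded (e.g., per-client sequence numbers with expiry) so it survives leader changes without growing forever.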

Kafka’s exactly-once semantics (introduced in KIP-98, refined over several releases) is the most mature implementation of this in the consensus-adjacent space. It works through idempotent producers (per-partition sequence numbers) and transactional writes (two-phase commit across partitions). It took the Kafka team years to get right, which should calibrate your expectations for implementing it yourself.

The “Just Use etcd/ZooKeeper” Advice

You’ll hear this advice a lot: “just use etcd” or “just use ZooKeeper.” It’s often good advice. Let’s examine when it is and when it isn’t.

When It’s Good Advice

  • You need a coordination primitive (locks, leader election, configuration) and your write volume is low (< 1000 writes/sec)
  • You’re already running etcd (e.g., for Kubernetes) and adding another use case is operationally simple
  • Your data fits in memory (etcd memory-maps its entire data store; ZooKeeper holds its full data tree in RAM)
  • You need strong consistency and don’t want to think about consensus protocol details

When It’s Lazy Advice

  • Your data is large (GB+). etcd and ZooKeeper are not databases — they’re coordination services. etcd limits individual requests (about 1.5MB by default) and recommends keeping the total data store under 8GB.
  • Your write volume is high. etcd’s write throughput is typically 10,000-20,000 operations per second, which is a hard ceiling imposed by Raft’s single-leader design and boltdb’s write characteristics.
  • You need per-key watches at massive scale. ZooKeeper’s watch mechanism doesn’t scale well to millions of watches, and etcd’s watch revision model has its own scaling constraints.
  • You need complex queries. These are key-value stores with range scans, not SQL databases.
  • You need different consistency levels for different operations. etcd and ZooKeeper give you strong consistency for everything, which is great for correctness and wasteful for operations that don’t need it.

The Hidden Costs

Even when “just use etcd” is the right answer, the hidden costs are worth acknowledging:

Operational overhead. Running a consensus-based system requires monitoring (leader elections, follower lag, disk usage, snapshot frequency), capacity planning (etcd performance degrades as data size grows), and upgrade procedures (rolling upgrades of a Raft cluster require care).

Client complexity. Using etcd or ZooKeeper correctly requires understanding sessions, leases, watches, and their failure modes. A ZooKeeper client that doesn’t handle session expiry correctly can hold a lock it no longer owns. An etcd client that doesn’t handle lease revocation can operate on stale data.

Blast radius. If your coordination service goes down, everything that depends on it goes down. etcd is a single point of failure for Kubernetes, and a struggling etcd cluster can take down an entire Kubernetes deployment. This is not hypothetical — it happens with distressing regularity.

When to Give Up on Consensus

Sometimes the right answer is: don’t use consensus at all.

Give Up and Use Eventual Consistency When:

  • Your data model supports it. If your operations are commutative (counters), idempotent (sets), or have a natural conflict resolution (last-writer-wins with timestamps), you don’t need consensus. CRDTs formalize this — they guarantee convergence without coordination.

  • Availability matters more than consistency. Consensus requires a majority quorum. If a network partition isolates a minority, that minority is unavailable. If your system must remain available during arbitrary partitions (e.g., mobile apps, edge devices, multi-region systems where “just wait for the partition to heal” isn’t acceptable), eventual consistency is the only option. This is the CAP theorem, and no amount of clever protocol design changes it.

  • Latency is more important than ordering. If serving a stale read in 1ms is better than serving a fresh read in 100ms, you don’t need consensus for reads. Many systems use consensus for writes but serve reads from local replicas without coordination.

  • Your operations are naturally partitioned and independent. If user A’s data never interacts with user B’s data, you can use per-user consensus (or per-user single-node ownership) without global consensus. This is the fundamental insight behind sharding, and it’s more widely applicable than people think.
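As a concrete illustration of convergence without coordination, here is a minimal grow-only counter (G-Counter) sketch; the class and method names are illustrative:

```python
# Hedged sketch of a G-Counter CRDT: each replica increments only its own
# slot, and merge takes the element-wise max, so merge is commutative,
# associative, and idempotent -- replicas converge without consensus.
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.replica_id] += 1

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # replica 0 sees two increments
b.increment()                  # replica 1 sees one, concurrently
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3  # converged, no coordination
```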

Give Up and Use a Single Node When:

I’m serious about this one. The most underrated distributed systems strategy is “don’t distribute.”

A single modern server with:

  • 128 cores
  • 1TB RAM
  • NVMe storage doing 1M+ IOPS
  • 100Gbps networking

…can handle more load than most systems will ever see. A well-optimized single-node database can serve 100,000+ write transactions per second. If your system’s total throughput requirement is below this (and most are), a single node with good backups provides:

  • Perfect consistency — no consensus needed, one node is the source of truth
  • Lowest possible latency — no network hops for commits
  • Simplest possible operations — no quorum management, no leader election, no split-brain
  • Recovery via restore from backup — your RTO is “how fast can you restore a snapshot and replay a WAL,” which for most systems is minutes, not hours

The downsides are real: no automatic failover (you need a human or a script to promote a standby), and if the node truly dies (disk failure, motherboard failure), you lose data since the last backup. But for many systems, “5 minutes of downtime during failover + potential loss of a few seconds of data” is an acceptable tradeoff for eliminating all consensus complexity.

The psychological barrier to this approach is that it feels insufficiently distributed. We’ve been trained to believe that single points of failure are always unacceptable. But a single point of failure with 99.99% uptime (52 minutes of downtime per year) and a clear recovery procedure is often better than a distributed system with 99.9% uptime (8.7 hours of downtime per year, distributed across a dozen confusing partial-failure scenarios that your on-call engineer has never seen before).
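The uptime arithmetic is worth spelling out:

```python
# Back-of-envelope availability math behind the comparison above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    return MINUTES_PER_YEAR * (1 - availability)

assert abs(downtime_minutes(0.9999) - 52.6) < 0.1   # ~52 minutes/year
assert abs(downtime_minutes(0.999) - 525.6) < 0.5   # ~8.7 hours/year
```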

Give Up and Use a Managed Service

The most underrated option of all: let someone else suffer.

| Need | Managed Service | What You Get |
|---|---|---|
| Metadata/coordination | Amazon DynamoDB (strong consistency mode), Google Cloud Spanner, Azure Cosmos DB | Consensus-backed storage without managing consensus |
| Message queue | Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, Confluent Cloud (managed Kafka) | Kafka-like semantics without Kafka operations |
| Distributed lock | Amazon DynamoDB (conditional writes), Google Cloud Spanner (transactions) | Lock semantics via transactional storage |
| Replicated database | Google Cloud Spanner, CockroachDB Cloud, Amazon Aurora | Consensus-backed SQL without the agony |
| Blockchain | Hyperledger on managed services, various BaaS offerings | BFT consensus without the BFT operations |

The cost of a managed service is money and vendor lock-in. The cost of running your own consensus system is engineering time, on-call burden, and career years lost to debugging leader elections at 3 AM. For most organizations, the managed service is cheaper.

This isn’t a cop-out — it’s a recognition that implementing and operating consensus-based systems is genuinely hard, and that unless consensus is your core business, your engineering effort is better spent on the things that differentiate your product.

Common Mistakes in Protocol Selection

Before we close with the final framework, let’s catalog the mistakes I’ve seen teams make when choosing a consensus protocol. Each of these has cost real engineering time and real production incidents.

Mistake 1: Choosing Based on Microbenchmarks

“EPaxos is 2x faster than Raft in this benchmark, so we should use EPaxos.” The benchmark used 0-byte payloads, a conflict-free workload, and three nodes in the same rack. Your workload has 4KB payloads, 15% key contention, five nodes across two data centers, and a state machine that takes 2ms to apply a command. The protocol isn’t your bottleneck, and the benchmark doesn’t represent your workload.

Mistake 2: Over-Engineering the Fault Model

“We need Byzantine fault tolerance because what if a node gets hacked?” If all your nodes run the same software, managed by the same team, in the same cloud account, a compromised node has bigger implications than consensus failure. You probably need better infrastructure security, not BFT. The overhead of BFT (3f+1 replicas where crash tolerance needs only 2f+1, plus extra message rounds) is a steep price for a threat model that doesn’t match your actual risks.

Mistake 3: Under-Engineering the Fault Model

The opposite mistake: “crash faults are fine because our nodes never behave byzantinely.” But then you run on cloud VMs with ephemeral storage, and a VM migration causes data corruption that your CFT protocol interprets as valid. Or you run on machines with faulty ECC memory, and bit-flips cause state divergence between replicas. The crash-fault model assumes that non-crashed nodes are correct. If your environment doesn’t guarantee this, you need additional safeguards (checksums, validation, periodic consistency checks) even with a CFT protocol.

Mistake 4: Ignoring Operational Complexity

“We’ll implement Multi-Paxos because it has better theoretical properties than Raft.” Have you considered who will operate it? Who will debug it at 3 AM? Who will explain to the next team member how it works? The team that can’t explain their consensus protocol to a new hire has a protocol they can’t safely operate.

Mistake 5: Premature Distribution

“We need consensus because we need high availability.” Do you? What’s your actual uptime requirement? If 99.9% is sufficient (8.7 hours of downtime per year), a single node with automated failover to a standby might achieve that. Consensus gives you sub-second failover, which is necessary for 99.99%+ uptime but overkill for 99.9%.

Mistake 6: Forgetting About Reads

Teams spend months optimizing the write path (consensus protocol, replication, durability) and then serve all reads from the leader. If your workload is 95% reads, the read path is 20x more important than the write path. Linearizable reads from the leader, read leases from followers, stale reads from any replica — these are the decisions that determine your system’s practical performance, and they’re largely orthogonal to the choice of consensus protocol.

Mistake 7: Assuming the Paper Is Complete

Every consensus protocol paper omits details that are essential for production. Snapshotting, state transfer, log compaction, membership changes, client interaction, exactly-once semantics — some papers address some of these, but no paper addresses all of them. Budget 2-3x the implementation effort you estimate from reading the paper. If you’re new to consensus implementation, budget 5x.

The Final Framework

If you’ve made it this far and still aren’t sure, here’s the simplest possible decision framework:

  1. Can you avoid distribution entirely? Use a single node with backups.
  2. Can you use a managed service? Do that.
  3. Can you use an existing system (etcd, ZooKeeper, Kafka)? Do that.
  4. Must you implement consensus yourself? Use Raft. Not because it’s the best protocol for every situation, but because it has the most documentation, the most reference implementations, the most battle-tested libraries, and the most community knowledge about what goes wrong. The theoretical advantages of other protocols rarely outweigh the practical advantages of Raft’s ecosystem.
  5. Does Raft not work for your specific requirements? Now, and only now, should you consider alternatives:
    • Need BFT → Tendermint or HotStuff
    • Need leaderless → EPaxos
    • Need tunable quorums → Flexible Paxos
    • Need high-throughput append-only log → Kafka ISR model

The fact that step 5 exists for a handful of scenarios doesn’t change the reality that most teams should stop at steps 1 through 4. The agony of consensus is real, but it’s optional for most of us. The trick is knowing whether you’re one of the few who genuinely need to experience it firsthand — or whether you can learn from those who already have and use their implementations instead.

A Decision Checklist

For the engineer who needs to make a decision by the end of the week, here’s the checklist version:

  • Have I confirmed that I actually need distributed consensus? (Not caching, not pub/sub, not eventual consistency — actual consensus?)
  • Have I checked whether a managed service solves my problem?
  • Have I checked whether an existing open-source system (etcd, ZooKeeper, Kafka) solves my problem?
  • Have I documented my failure model (crash vs. Byzantine)?
  • Have I measured my expected write throughput and confirmed it exceeds what a single node can handle?
  • Have I measured my latency requirements and confirmed they’re compatible with consensus round-trips?
  • Have I considered how I’ll handle the operations burden (monitoring, upgrades, debugging)?
  • Have I identified who on my team can debug consensus issues in production?
  • Have I allocated time for testing with a tool like Jepsen or at least a chaos testing framework?
  • Have I accepted that this will be harder than I think?

If you can check all of these boxes, you’re better prepared than 90% of teams that embark on building consensus-based systems. If you can’t check the last one, go back and check it. It will be harder than you think. It always is.

Why Everyone Just Copies What Kafka Does

There’s a pattern in distributed systems engineering that goes like this: a team faces a problem involving reliable data movement between services. They consider their options. Someone mentions Kafka. The room divides into two camps: those who say “let’s just use Kafka” and those who say “let’s build something simpler — like Kafka, but lighter.” Both camps end up implementing some variant of Kafka’s design. One just does it with more steps.

This chapter is about why that happens. Not because Kafka is perfect — it isn’t — but because Kafka landed on a set of design decisions that turn out to be very hard to improve upon for a surprisingly wide range of problems. Understanding what those decisions are and why they work tells us something important about the relationship between theoretical consensus and practical systems.

The Log-Centric Worldview

Kafka’s foundational insight, articulated by Jay Kreps in his 2013 blog post “The Log: What every software engineer should know about real-time data’s unifying abstraction,” is that an append-only log is the most natural primitive for distributed data infrastructure.

This isn’t a new idea — databases have used write-ahead logs since the 1970s, and Lamport’s state machine replication framework is fundamentally about replicating a log of commands. What Kafka did was elevate the log from an implementation detail to the primary abstraction. In Kafka, the log isn’t something that happens behind the scenes to support replication — the log IS the product. Producers append to it. Consumers read from it. That’s the API.

This has several consequences that cascade through the entire system design:

Decoupling of producers and consumers. Because the log is persistent, producers don’t need to know about consumers and consumers don’t need to coordinate with producers. A producer writes a record and moves on. A consumer reads at its own pace. If a consumer falls behind, the log retains the data (up to the retention limit). If a consumer crashes, it resumes from where it left off. This is publish-subscribe done right, and it eliminates an enormous class of coordination problems.

Replay as a first-class operation. Because the log is immutable and indexed by offset, any consumer can re-read from any point. This turns what would be an exceptional operation in a traditional message queue (re-processing old messages) into a routine operation. Need to reprocess yesterday’s events because you deployed a bug? Reset the consumer offset. Need to bootstrap a new service with historical data? Read from the beginning of the topic. This capability is so valuable that teams build entire architectures around it (event sourcing, CQRS, the “streaming ETL” pattern).

Natural ordering guarantee. Within a partition, records have a total order defined by their offset. This is exactly the guarantee that most applications need — events for a given entity (user, account, device) are ordered, while events for different entities can be processed independently. The partition-level ordering maps cleanly to the application-level requirement without over-ordering (total order across all events) or under-ordering (no ordering at all).

Offset-based consumption model. Consumers track their position in the log via an offset (a simple integer). This is dramatically simpler than the acknowledgment models in traditional message queues, where individual message acknowledgment creates the need for complex state tracking, redelivery logic, and dead letter queues. In Kafka, advancing your offset means “I have processed everything up to here.” It’s idempotent, it’s resumable, and it compresses the consumer’s state to a single number per partition.
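A toy version of this model makes the simplicity tangible. Everything here (`committed`, `consume`) is illustrative, not Kafka’s actual client API:

```python
# Hedged sketch of offset-based consumption: the consumer's only durable
# state is one integer per partition (the committed offset).
committed = {}  # partition -> next offset to process

def consume(log, partition, process):
    offset = committed.get(partition, 0)  # resume where we left off
    while offset < len(log):
        process(log[offset])
        offset += 1
        committed[partition] = offset  # "everything before here is processed"

seen = []
consume(["a", "b", "c"], "p0", seen.append)
consume(["a", "b", "c"], "p0", seen.append)  # re-run after a crash: no-op
assert seen == ["a", "b", "c"] and committed["p0"] == 3
```

Replay is just resetting the integer: set `committed["p0"] = 0` and the consumer reprocesses the whole log.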

These properties aren’t unique to Kafka — any append-only log has them. But Kafka was the system that packaged them into an operational, scalable, production-ready product and demonstrated that you could build a company’s entire data infrastructure around them. That packaging matters more than the theory.

What Kafka Actually Does Well

Let’s be specific about where Kafka excels, because “Kafka is good” is too vague to be useful.

Throughput

Kafka achieves high throughput through several design decisions that prioritize sequential I/O over everything else:

  • Append-only writes. All writes are sequential appends. No random I/O for inserts, no B-tree rebalancing, no compaction (unless you use log compaction, which is a separate beast). Sequential disk I/O on modern hardware can sustain 500MB/sec+ on a single disk, and Kafka exploits this fully.

  • Zero-copy transfers. When a consumer reads data, Kafka uses the sendfile system call (zero-copy) to transfer data directly from the page cache to the network socket without copying through user space. This eliminates one of the biggest CPU bottlenecks in data transfer.

  • Batching everywhere. Producers batch records before sending. The broker batches records before writing to disk. Consumers fetch batches of records. Compression is applied at the batch level. Every layer is optimized for amortizing overhead across many records.

  • Page cache reliance. Kafka deliberately uses the OS page cache rather than managing its own buffer pool. This means that recent data (which is what most consumers are reading) is served from RAM without Kafka doing any caching logic. It also means Kafka’s heap usage is low, reducing GC pressure — a non-trivial concern for a long-running JVM process.

The result is that a single Kafka broker can sustain hundreds of thousands of messages per second, and a well-configured cluster can handle millions. These are real numbers, not microbenchmark fantasies.

Consumer Groups

Kafka’s consumer group mechanism is one of its most copied features, and for good reason. A consumer group provides:

  • Automatic partition assignment. Partitions are distributed among consumers in the group. If you have 12 partitions and 3 consumers, each consumer gets 4 partitions. Add a 4th consumer and the partitions rebalance to 3 each.

  • Automatic failover. If a consumer crashes, its partitions are reassigned to surviving consumers. This provides fault tolerance at the consumption layer without the application implementing any failure detection.

  • Parallel consumption with ordering. Each partition is consumed by exactly one consumer in the group, maintaining per-partition ordering while allowing parallel processing across partitions.
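The assignment logic itself is simple; here is a hedged sketch of round-robin assignment (the rebalancing protocol around it, not the arithmetic, is where the real complexity lives):

```python
# Hedged sketch of round-robin partition assignment within a consumer group.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 12 partitions over 3 consumers: 4 partitions each.
a = assign(list(range(12)), ["c1", "c2", "c3"])
assert all(len(ps) == 4 for ps in a.values())

# Add a 4th consumer and the same logic rebalances to 3 each.
a = assign(list(range(12)), ["c1", "c2", "c3", "c4"])
assert all(len(ps) == 3 for ps in a.values())
```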

The consumer group protocol has gone through several iterations (the original ZooKeeper-based coordination, the broker-based group coordinator, cooperative rebalancing, static membership, server-side assignors), each addressing real operational pain points. The evolution reflects the reality that getting distributed consumption right is genuinely hard, and Kafka has been iterating on it for over a decade.

Retention and Replay

Kafka’s retention model — keep all messages for a configurable time or size limit, regardless of whether they’ve been consumed — was radical when Kafka was introduced. Traditional message queues deleted messages after acknowledgment, treating the queue as a transient buffer. Kafka treats the log as a database of events.

This enables patterns that are impossible or awkward with traditional queues:

  • Multiple independent consumers. Each consumer group has its own offset. Ten different services can independently consume the same topic at their own pace without interfering with each other.

  • Backfilling new consumers. A new service that needs historical data can consume from the beginning of the topic, processing days, weeks, or months of events to build up its state.

  • Operational replay. Deployed a bug that corrupted your downstream database? Reset the consumer offset and reprocess. This has saved more engineering teams than any monitoring system.

The ISR Model’s Quiet Influence

Chapter 16 covered Kafka’s ISR (In-Sync Replica) model in detail. Here, we’ll focus on its influence beyond Kafka.

The ISR model’s key insight is that you don’t need a fixed quorum. Instead, you maintain a set of replicas that are “in sync” with the leader, and you only require acknowledgment from this dynamic set. Replicas that fall behind are removed from the ISR and re-added when they catch up.

This is not consensus in the formal sense. It doesn’t provide the same guarantees as Paxos or Raft. Specifically:

  • If the entire ISR is lost simultaneously (all in-sync replicas crash before their data is flushed), data is lost
  • The ISR can shrink to one (the leader), at which point there’s no replication at all
  • The decision about whether to allow an “unclean” leader election (promoting a replica that was behind) is a configuration choice, not a protocol guarantee

But the ISR model has properties that make it attractive for systems that could have used classical consensus but chose not to:

Dynamic membership without consensus. Adding or removing replicas from the ISR doesn’t require a membership-change protocol. The controller tracks ISR membership and updates it based on replication lag. This is operationally simpler than the joint-consensus or single-server-change approaches used by Raft.

Graceful degradation. When a replica falls behind (due to load, slow disk, network issues), the ISR shrinks and the system continues. There’s no complex failure detection — just “is this replica keeping up?” If the ISR shrinks too much (below min.insync.replicas), writes are rejected, which is a clear and understandable failure mode.

Tunable consistency. With acks=all, you get durability to all ISR members. With acks=1, you get durability only to the leader (lower latency, higher risk). With acks=0, you get fire-and-forget. This per-request tunability is more flexible than most consensus protocols, which provide a single consistency level.
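Putting these pieces together, the commit rule for acks=all can be sketched as follows; this is a simplification of the real broker logic, with illustrative names:

```python
# Hedged sketch of the ISR write path: a write with acks=all succeeds only
# if the ISR is at least min_insync_replicas and every ISR member acked.
def can_commit(isr, acked, min_insync_replicas):
    if len(isr) < min_insync_replicas:
        return False        # writes rejected: too few in-sync replicas
    return isr <= acked     # every in-sync replica has acknowledged

isr = {"leader", "r2", "r3"}
assert can_commit(isr, {"leader", "r2", "r3"}, 2) is True
isr -= {"r3"}  # r3 falls behind and is dropped from the ISR
assert can_commit(isr, {"leader", "r2"}, 2) is True   # still enough replicas
isr -= {"r2"}
assert can_commit(isr, {"leader"}, 2) is False        # ISR below minimum
```

Note what the sketch makes visible: a lagging replica shrinks the commit requirement rather than blocking writes, until the min.insync.replicas floor is hit.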

The influence of this model shows up in systems you might not expect. MongoDB’s replica sets use a similar write-concern model (write to primary, wait for replication to secondaries, configurable acknowledgment level). Amazon Aurora uses a quorum-based model but with a storage layer that independently manages replica health. The pattern of “dynamic replica set with tunable durability” has become the default approach for systems that need replication but don’t want the full weight of formal consensus.

Kafka’s Approach to Exactly-Once Semantics

For years, Kafka’s answer to “does Kafka support exactly-once?” was “no, but at-least-once is good enough for most use cases.” This was honest but frustrating.

Then, in 2017, KIP-98 introduced exactly-once semantics (EOS) for Kafka. The implementation required:

  1. Idempotent producers. Each producer is assigned a producer ID, and each message within a producer session is assigned a sequence number. The broker deduplicates based on (producer ID, partition, sequence number). This eliminates duplicates from producer retries.

  2. Transactional writes. For writes spanning multiple partitions (e.g., a stream processing job that reads from partition A and writes to partition B), Kafka implements a two-phase commit protocol using a transaction coordinator. The producer begins a transaction, writes to multiple partitions, and either commits or aborts atomically.

  3. Consumer-side offset management. Consumer offsets are committed as part of the transaction, ensuring that “read input, process, write output, advance offset” is atomic.
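The idempotent-producer half of this can be sketched as broker-side sequence tracking. This is simplified (real brokers also reject sequence gaps and bound the per-producer state they track):

```python
# Hedged sketch of broker-side deduplication for idempotent producers: the
# broker remembers the last sequence number per producer on each partition
# and drops retries whose sequence it has already appended.
class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last appended sequence number

    def append(self, producer_id, seq, record):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate from a producer retry
        self.records.append(record)
        self.last_seq[producer_id] = seq
        return True

log = PartitionLog()
assert log.append("p1", 0, "a") is True
assert log.append("p1", 0, "a") is False  # retried send is dropped
assert log.append("p1", 1, "b") is True
assert log.records == ["a", "b"]
```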

This took years to implement, stabilize, and optimize. The initial EOS release had performance overhead that made many users disable it. Subsequent releases (particularly the Kafka Streams “exactly-once v2” in KIP-447) reduced the overhead to the point where EOS is practical for most workloads.

The lesson here is instructive: exactly-once semantics are not a property of the consensus/replication protocol — they’re a property of the entire system, including producers, consumers, and the transaction coordinator. Anyone who tells you their consensus protocol provides exactly-once semantics is either confused about what they mean or has quietly built a transaction coordinator on top of their consensus protocol.

The KRaft Migration: Kafka Confronts Consensus

For most of its history, Kafka outsourced its own consensus needs to ZooKeeper. The broker cluster used ZooKeeper for:

  • Controller election (which broker is the controller)
  • Broker registration (which brokers are alive)
  • Topic and partition metadata
  • ISR tracking
  • ACL storage

This worked but created a dependency that was operationally painful. Running Kafka meant running two distributed systems — Kafka itself and a ZooKeeper cluster — each with its own failure modes, monitoring requirements, and upgrade procedures. ZooKeeper became the weak link not because it’s a bad system (it’s not), but because most Kafka operators weren’t ZooKeeper experts, and a struggling ZooKeeper cluster manifests as mysterious Kafka problems.

KRaft (Kafka Raft) is Kafka’s project to replace ZooKeeper with a built-in Raft-based metadata quorum. The key details:

  • A set of broker nodes (or dedicated controller nodes) form a Raft quorum for metadata management
  • The metadata is stored as an event log (naturally), replicating Kafka’s own log-centric philosophy
  • The Raft implementation is specifically optimized for Kafka’s needs (batched writes, snapshot-based state transfer)

The KRaft migration tells us several things about Kafka’s relationship with consensus:

Kafka always needed consensus — it just outsourced it. The ISR model handles data replication, but Kafka still needs consensus for metadata: who’s the controller, what’s the partition assignment, what’s the current ISR. This metadata must be consistent across all brokers. ZooKeeper provided that consensus. KRaft provides it in-process.

Raft was the obvious choice. When the Kafka team needed to implement consensus, they chose Raft — not because it’s the theoretically optimal protocol, but because it’s the most well-understood and implementable one. Even a team as experienced as the Kafka committers, with deep knowledge of consensus theory, chose the pragmatic option.

The migration is painful. Replacing a running system’s consensus layer while maintaining availability is, to use a technical term, extremely hard. The KRaft migration has been in progress since 2020, with ZooKeeper finally deprecated in 3.5 (2023) and full ZooKeeper removal targeted for 4.0. A multi-year migration for a foundational component is not unusual — it’s expected. This should calibrate your expectations for “we’ll just swap out the consensus layer later.”

A Comparison of Replication Models

To understand why Kafka’s ISR model has been so influential, it helps to see it side by side with classical consensus approaches as applied to log replication.

| Property | Raft-style Replication | Kafka ISR | Flexible Paxos |
|---|---|---|---|
| Quorum definition | Fixed majority (n/2 + 1) | Dynamic (all in-sync replicas) | Configurable (Phase 1 and Phase 2 quorums) |
| Quorum membership | Static (requires reconfiguration to change) | Dynamic (ISR shrinks/grows automatically) | Static (requires reconfiguration) |
| Minimum write quorum | Majority of cluster | min.insync.replicas (configurable) | Phase 2 quorum size (configurable) |
| Failure detection | Heartbeat timeout → leader election | Replication lag → ISR removal | Heartbeat timeout → Phase 1 |
| Recovery after failure | Follower replays from leader’s log | Broker catches up and rejoins ISR | Follower replays from leader’s log |
| Availability during failures | Available if majority alive | Available if ISR >= min.insync.replicas | Depends on quorum configuration |
| Consistency guarantee | Linearizable | Linearizable (with acks=all and min.insync.replicas >= 2) | Linearizable |
| Cost of adding a replica | Reconfiguration protocol | Just start replicating (auto-joins ISR when caught up) | Reconfiguration protocol |
| Operational overhead | Moderate (fixed membership, explicit reconfiguration) | Low (ISR is self-managing) | High (must understand quorum intersection requirements) |

The table reveals why ISR is attractive for operators: it’s self-managing. A slow replica drops out of the ISR automatically. A recovered replica rejoins automatically. You don’t need to run a reconfiguration procedure — the system adapts. For a system like Kafka, where brokers routinely experience load spikes, garbage collection pauses, and rolling restarts, this adaptability is operationally invaluable.

The formal consensus community would note that ISR’s “automatic” behavior is precisely what makes it less safe — the ISR can shrink to one node without any operator approval, and if that one node fails, data is lost. This is true, and it’s why min.insync.replicas exists. But the fact that ISR requires one configuration parameter to be safe, while Raft requires correct implementation of a reconfiguration protocol to be flexible, tells you something about where each approach puts its complexity budget.
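
To make those moving parts concrete, here is a toy model of the ISR write path. This is illustrative Python only: none of Kafka's actual networking, persistence, or high-watermark machinery appears, and only the parameter name mirrors min.insync.replicas.

```python
class Partition:
    def __init__(self, replicas, min_insync_replicas=2):
        self.replicas = set(replicas)   # all assigned replicas
        self.isr = set(replicas)        # in-sync replicas (starts full)
        self.min_insync = min_insync_replicas
        self.log = []

    def report_lag(self, replica, lagging):
        """Broker-side lag tracking: a lagging replica leaves the ISR
        automatically; a caught-up one rejoins. No reconfiguration step."""
        if lagging:
            self.isr.discard(replica)
        elif replica in self.replicas:
            self.isr.add(replica)

    def append(self, record):
        """acks=all semantics: refuse the write when the ISR is too small,
        rather than commit to a lone survivor and risk silent data loss."""
        if len(self.isr) < self.min_insync:
            raise RuntimeError(
                "NotEnoughReplicas: ISR size %d < min.insync.replicas %d"
                % (len(self.isr), self.min_insync))
        self.log.append(record)        # committed once all ISR members have it
        return len(self.log) - 1       # the record's offset

p = Partition(["b1", "b2", "b3"])
p.append("event-1")                    # ISR = {b1, b2, b3}: accepted
p.report_lag("b2", lagging=True)       # b2 falls behind and leaves the ISR
p.append("event-2")                    # ISR = {b1, b3}, size 2: still accepted
p.report_lag("b3", lagging=True)       # ISR shrinks to {b1}
try:
    p.append("event-3")                # rejected instead of risking data loss
except RuntimeError as err:
    print(err)
```

Note where the complexity sits: the ISR manages itself with a few lines of bookkeeping, and safety hangs on a single threshold check, exactly the "one configuration parameter" point above.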

Systems That Successfully Copied Kafka

Several systems have taken Kafka’s design principles and adapted them, some more successfully than others.

Apache Pulsar

Pulsar separated the compute layer (brokers) from the storage layer (BookKeeper). This is arguably an improvement on Kafka’s architecture, where brokers are both compute and storage nodes. Pulsar’s approach allows independent scaling of serving capacity and storage capacity.

Pulsar uses BookKeeper’s quorum-write protocol for replication, which is similar to the ISR model but with explicit write quorums and ack quorums (similar in spirit to Flexible Paxos). A message is written to W replicas and considered committed when A replicas acknowledge (where A <= W).
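
The commit rule can be sketched as follows. This is illustrative only: BookKeeper's real client calls these the ensemble, write quorum, and ack quorum of a ledger, and the striping shown here is a simplification.

```python
def write_entry(entry_id, ensemble, write_quorum, ack_quorum, ack_of):
    """Write one entry: pick write_quorum bookies from the ensemble by
    round-robin striping, then report whether enough acks arrived.
    ack_of is the set of bookies that acknowledged the write."""
    assert ack_quorum <= write_quorum <= len(ensemble)
    start = entry_id % len(ensemble)
    targets = [ensemble[(start + i) % len(ensemble)]
               for i in range(write_quorum)]
    acks = sum(1 for b in targets if b in ack_of)
    return targets, acks >= ack_quorum   # committed iff acks >= ack quorum

ensemble = ["bookie-1", "bookie-2", "bookie-3"]

# W=2, A=2: both targeted bookies must ack before the entry commits.
targets, ok = write_entry(0, ensemble, write_quorum=2, ack_quorum=2,
                          ack_of={"bookie-1", "bookie-2"})
print(targets, ok)   # ['bookie-1', 'bookie-2'] True

# Entry 1 stripes onto bookie-2 and bookie-3; bookie-3 never acks.
_, ok = write_entry(1, ensemble, write_quorum=2, ack_quorum=2,
                    ack_of={"bookie-1", "bookie-2"})
print(ok)            # False
```

Setting A < W is the Flexible-Paxos-flavored move: the write path tolerates W - A slow replicas without waiting on them.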

Where Pulsar differs from Kafka:

| Feature | Kafka | Pulsar |
|---|---|---|
| Storage architecture | Broker = storage | Separated (broker + BookKeeper) |
| Replication | ISR (dynamic) | Quorum writes (configured W and A) |
| Multi-tenancy | Limited (quotas) | First-class (namespaces, quotas, isolation) |
| Geo-replication | MirrorMaker (async) | Built-in (async) |
| Consumer model | Pull-based | Pull and push |
| Message acknowledgment | Offset-based | Individual or cumulative |

Pulsar demonstrates that you can take the log-centric model and make different implementation choices while preserving the core benefits. The question most teams face isn’t “is Pulsar better than Kafka?” but “is Pulsar enough better to justify the smaller community and ecosystem?”

Redpanda

Redpanda took a different approach: implement Kafka’s protocol exactly (wire-compatible) but with a C++ implementation using the Seastar framework, eliminating the JVM and its garbage collection overhead.

Redpanda uses Raft for both data replication and metadata management (no ZooKeeper dependency from day one — anticipating Kafka’s own direction). Each partition is backed by a Raft group, which provides stronger guarantees than Kafka’s ISR model but at the cost of the flexibility that ISR provides.

The Redpanda story validates Kafka’s API and operational model while suggesting that the ISR model itself might not be the only viable replication approach. You can use Raft for per-partition replication and get Kafka-compatible behavior with stronger consistency guarantees. The tradeoff is that Raft’s fixed quorum is less flexible than ISR’s dynamic membership — a slow replica in Raft still participates in the quorum, while Kafka simply removes it from the ISR.

Amazon Kinesis

Kinesis is Amazon’s managed streaming service, clearly inspired by Kafka’s model (topics are “streams,” partitions are “shards,” consumers use checkpoints analogous to offsets). The replication and consensus details are hidden behind the managed service boundary, which is both the advantage (you don’t have to care) and the limitation (you can’t tune it).

Kinesis validated that the log-centric model works as a managed service and that most users don’t need or want to think about the replication protocol. They want the abstraction: append records, read records, partition for parallelism, retain for replay.

Systems That Probably Shouldn’t Have Copied Kafka

Not every team that says “we’re building something like Kafka” should be building something like Kafka.

The “Lightweight Kafka” Trap

The conversation usually starts like this:

“Kafka is too heavy for our use case. We just need a simple message queue with persistence and ordering. Let’s build something lightweight.”

Two years later, the team has built:

  • A log-structured storage engine (because they need persistence)
  • A replication protocol (because they need fault tolerance)
  • A consumer group coordinator (because they need parallel consumption)
  • A partition assignment algorithm (because they need scalability)
  • A compaction mechanism (because the disk isn’t infinite)
  • An offset management system (because consumers need to track their position)
  • A metrics and monitoring layer (because they need to operate it)
  • A client library (because applications need to talk to it)
  • A wire protocol (because the client library needs a wire protocol)

They’ve built Kafka. Except it’s less tested, less documented, less understood, and maintained by a team of three instead of a community of thousands.

The “lightweight Kafka” trap springs from a misunderstanding of where Kafka’s complexity comes from. It’s not the JVM (though that adds operational overhead). It’s not ZooKeeper (though that also adds overhead). It’s the fundamental problem space. Any system that provides durable, ordered, fault-tolerant, scalable message delivery will converge on a similar set of mechanisms, and those mechanisms are inherently complex.

Internal Event Buses

Many organizations build internal “event bus” or “event backbone” systems that are spiritually Kafka-like but use a different replication strategy. Sometimes it’s a custom protocol over Redis Streams. Sometimes it’s a PostgreSQL-backed queue with logical replication. Sometimes it’s a hand-rolled TCP server with file-based persistence.

These systems work fine at small scale and become increasingly painful as scale grows. The usual failure mode is that the team discovers, one production incident at a time, all the problems that Kafka has already solved:

  1. What happens when a consumer is slow? (Backpressure and consumer lag monitoring)
  2. What happens when the disk fills up? (Retention policies and segment deletion)
  3. What happens when you need to add a partition? (Partition reassignment)
  4. What happens when the leader dies? (Leader election and ISR management)
  5. What happens when a message is produced but the ack is lost? (Idempotent producers)

Each of these problems takes weeks to months to solve correctly, and the solutions look increasingly like Kafka. By the time you’ve solved them all, you’ve built Kafka with more bugs and fewer features.

When “Not Kafka” Is Actually Right

To be fair, there are legitimate cases where something other than Kafka is appropriate:

Embedded systems / edge devices. Kafka is a server-side system. If you need message ordering on a device with 256MB of RAM, you need something different (an embedded Raft log, a local SQLite-based queue, etc.).

Very small scale with simple requirements. If you have one producer, one consumer, and need a persistent queue, PostgreSQL with SKIP LOCKED is simpler and sufficient. Not everything needs to be a distributed system.

Extreme low latency (< 1ms). Kafka’s batching-oriented design trades latency for throughput. If you need sub-millisecond end-to-end latency, you need a system designed for that (Aeron, custom shared-memory queues, etc.). These systems typically sacrifice durability and fault tolerance for speed.

Total ordering across all events. If you genuinely need a single totally-ordered log (not per-partition ordering), Kafka’s partitioned model doesn’t help. You need a single-partition topic (which is a single-leader bottleneck) or a different system entirely.

The Lessons

Kafka’s outsized influence on distributed systems design carries several lessons that extend beyond Kafka itself.

Lesson 1: “Good Enough” Consensus Often Beats “Correct” Consensus

Kafka’s ISR model is not consensus in the formal, provably-correct sense. It doesn’t provide the same guarantees as Raft or Paxos. It has corner cases (unclean leader election, ISR shrinking to one) where data loss is possible.

And yet, Kafka handles more data reliably than probably any other distributed system in existence. How?

Because the remaining failure modes — correlated failures that take out the entire ISR, unclean leader elections that are now disabled by default — are rare enough that the practical reliability is extremely high. And the system provides operational controls (monitoring ISR size, alerting on under-replicated partitions, configuring min.insync.replicas) that let operators manage the residual risk.

This is a different philosophy from formal consensus. Formal consensus says “prove that no execution can violate safety.” Kafka’s approach says “make the unsafe executions rare enough that operational monitoring can catch them.” It’s not as theoretically satisfying, but for the vast majority of use cases, the difference is immaterial.

Lesson 2: The Interface Matters More Than the Protocol

Kafka’s success isn’t primarily about the ISR protocol — it’s about the log abstraction, the consumer group model, the offset-based consumption, the retention and replay capabilities. These are interface-level decisions, not protocol-level decisions. You could implement the Kafka API on top of Raft (Redpanda does) or on top of a different quorum protocol (Pulsar/BookKeeper) and get a system that’s functionally equivalent from the application’s perspective.

This suggests that most of the energy spent debating consensus protocols would be better spent designing the right abstraction for the application. The protocol is an implementation detail. The abstraction is the product.

Lesson 3: Operational Simplicity Wins

ZooKeeper is a well-designed, formally verified, battle-tested consensus system. Kafka used it for a decade and then spent years replacing it. Why? Because operational simplicity matters more than protocol elegance.

Running two distributed systems (Kafka + ZooKeeper) is harder than running one (Kafka with KRaft). The operational complexity isn’t additive — it’s multiplicative, because failure modes can combine in unexpected ways. A ZooKeeper session timeout causes a Kafka controller election, which triggers partition reassignment, which causes consumer rebalancing, which causes a processing delay, which causes backpressure, which causes producer timeouts. Each component is well-designed in isolation; the emergent behavior is chaos.

KRaft doesn’t improve Kafka’s theoretical properties. It improves Kafka’s operational properties. And in production systems, operational properties dominate.

Lesson 4: Engineering Culture Matters

Why does engineering culture gravitate toward Kafka’s pragmatic approach over academic elegance?

Because most engineers are evaluated on shipping features, not on proof correctness. A system that works 99.999% of the time, is operationally understandable, has good monitoring, and can be debugged by on-call engineers at 3 AM is more valuable than a system that works 100% of the time (in theory) but requires a PhD to understand when something goes wrong.

This isn’t anti-intellectual — it’s a recognition that production systems are sociotechnical artifacts. They’re operated by humans, and the human factors (understandability, debuggability, operational familiarity) are as important as the algorithmic factors (message complexity, latency bounds, safety proofs).

Kafka’s documentation explains what happens when a broker fails. Paxos’s documentation explains what happens in an abstract model of asynchronous message passing. Both are valuable, but only one of them helps you at 3 AM.

Lesson 5: Sometimes the Industry Is Right

The academic consensus community has produced protocols that are theoretically superior to what most production systems use. EPaxos has better throughput than Raft for multi-leader workloads. Flexible Paxos provides stronger configurability. Various BFT protocols offer safety guarantees that CFT protocols can’t match.

And yet, the industry gravitates toward Kafka-style ISR, Raft, and “just use a single leader.” Is the industry wrong?

Sometimes, yes. There are genuine cases where a team uses Raft when EPaxos would be better, or uses eventual consistency when they actually need strong consistency. But more often, the industry is right — not because the simpler protocols are theoretically superior (they usually aren’t), but because the total cost of ownership (implementation effort + operational burden + debugging difficulty + hiring difficulty) is lower.

The team that builds on Raft can hire engineers who’ve used Raft. The team that builds on EPaxos can hire… the small number of engineers who’ve read the EPaxos paper and the even smaller number who’ve implemented it. For most organizations, the pool of available expertise is a more binding constraint than the theoretical properties of the protocol.

The Anatomy of “We’re Building Something Like Kafka”

Having discussed why Kafka is influential, let’s dissect what actually happens when a team utters the fateful words “we’re building something like Kafka.”

Phase 1: The Prototype (Weeks 1-4)

The team builds a single-node append-only log with a simple TCP protocol. Producers connect and append messages. Consumers connect and read by offset. It works beautifully. Performance is excellent (sequential writes to a single node are fast). The team is pleased. “See? Kafka is over-engineered.”
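
That prototype fits in a few dozen lines, which is exactly why it breeds overconfidence. A minimal sketch, assuming an in-memory single node with no persistence, replication, or consumer coordination:

```python
class SingleNodeLog:
    """Phase 1 in miniature: one node, append-only, offset-addressed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Producers append; the returned offset is the record's position."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        """Consumers read by offset; the server keeps no per-consumer state."""
        return self._records[offset : offset + max_records]

log = SingleNodeLog()
assert log.append("order-created") == 0
assert log.append("order-paid") == 1
print(log.read(0))   # ['order-created', 'order-paid']
print(log.read(1))   # ['order-paid']
```

Every phase that follows in this chapter is what happens when this class meets a second machine.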

Phase 2: Replication (Months 2-4)

Someone asks what happens when the node dies. The team adds replication. “It’s just sending the same data to another node.” They implement leader-follower replication. It mostly works, except:

  • What happens when the leader dies? They need leader election. They reach for a consensus protocol or an external coordinator.
  • What about messages that the leader accepted but didn’t replicate? They discover the meaning of “committed” vs “accepted” and why the distinction matters.
  • How does a new follower catch up? They implement state transfer. It’s harder than expected, especially while the leader is still accepting writes.

Phase 3: Consumers (Months 4-8)

Adding multiple consumers reveals new problems:

  • How do consumers coordinate who reads which partition? They need a group coordinator.
  • What happens when a consumer crashes mid-processing? They need to distinguish between “read” and “processed” and implement offset commits.
  • What about rebalancing when consumers join or leave? They implement a rebalance protocol. The first version has a stop-the-world pause. The second version has bugs during concurrent rebalances. The third version works but is complex.

Phase 4: Operations (Months 8-12)

The system is in production. New requirements emerge:

  • Retention policies (the disk isn’t infinite)
  • Monitoring (how to detect under-replicated partitions, consumer lag, leader skew)
  • Partition reassignment (a broker is being decommissioned)
  • Rolling upgrades (deploying a new version without downtime)
  • Performance tuning (batch sizes, buffer sizes, compression)

Each of these is a week to a month of work, with production incidents along the way.

Phase 5: Acceptance (Month 12+)

The team realizes they have built 80% of Kafka with 20% of the features, 10% of the testing, and 5% of the documentation. The “lightweight” system is no longer lightweight. The maintenance burden is significant. New team members struggle to understand the custom replication protocol.

Someone suggests migrating to actual Kafka. The migration takes six months.

This cycle has played out at enough companies to be a pattern. It’s not that the team is incompetent — it’s that the problem space is inherently complex, and Kafka’s complexity reflects the problem, not engineering excess.

The Influence on System Design Patterns

Kafka’s influence extends beyond systems that compete with or copy Kafka. It has shaped how the industry thinks about several design patterns.

Event Sourcing and CQRS

The event sourcing pattern (storing state changes as an immutable log of events, rather than storing current state directly) was known before Kafka, but Kafka made it practical. Before Kafka, implementing event sourcing required building your own durable event store. After Kafka, the event store was a commodity.

CQRS (Command Query Responsibility Segregation) — separating read models from write models — becomes natural when your writes go to a Kafka topic and your read models are materialized by consuming from that topic. The topic is the source of truth, and read models are derived views that can be rebuilt by replaying.

This pattern has become so common that it’s sometimes applied where it isn’t needed (not every CRUD application benefits from event sourcing), but where it fits, it’s powerful. And it only became widespread because Kafka provided the infrastructure to support it.

Change Data Capture

CDC (Change Data Capture) — streaming database changes to downstream consumers — was an old idea implemented with triggers, polling, or database-specific log readers. Kafka Connect and Debezium standardized CDC as “read the database’s WAL, publish to Kafka, consume from Kafka.” This pattern is now the default approach for replicating data between systems.

The consensus implication is interesting: the database’s own WAL (which is built on some form of consensus or durable write protocol) is re-published through Kafka’s ISR protocol. The data passes through two replication systems, each with its own durability guarantees. The end-to-end guarantee is the weaker of the two, which is worth understanding but rarely causes problems in practice.

The “Kafka as a Database” Debate

An ongoing debate in the Kafka community is whether Kafka itself can serve as a database. Proponents point to Kafka Streams’ state stores (backed by RocksDB, with changelog topics for replication), ksqlDB (SQL queries over Kafka topics), and log compaction (which retains the latest value per key, making a topic function like a key-value store).

Opponents point out that Kafka lacks transactions across topics (partially addressed by exactly-once semantics), efficient point lookups (Kafka is optimized for sequential reads, not random access), and the query capabilities of a real database.

The truth is somewhere in between. Kafka can serve as the system of record for event-stream data, and Kafka Streams/ksqlDB can provide materialized views over that data. But using Kafka as a general-purpose database (replacing PostgreSQL or MySQL) is a category error. The log abstraction is powerful but not universal.

What Kafka Gets Wrong

Fairness demands that we also discuss where Kafka falls short.

Tail latency. Kafka’s batching-optimized design means tail latency can be unpredictable. A slow consumer or a compaction storm on the broker can cause latency spikes that are difficult to diagnose.

Small message overhead. If your messages are 100 bytes each, Kafka’s per-message overhead (metadata, headers, CRC) is a significant fraction of the total. Kafka is optimized for messages of 1KB-1MB; below that range, the overhead becomes noticeable.

Partition count scaling. Each partition has a leader, a set of replicas, and associated file handles, memory buffers, and controller state. Kafka clusters with hundreds of thousands of partitions experience controller bottlenecks, slow leader elections, and elevated memory usage. Improvements are ongoing, but this remains a practical limitation.

No server-side filtering. Kafka delivers all records in a partition to the consumer; the consumer must filter client-side. If you only care about 1% of records in a partition, you’re reading and discarding 99%. Some competing systems (Pulsar with server-side filtering, various cloud offerings with subscription filters) do better here.

Ordering across partitions. If you need events for different keys to be ordered relative to each other, Kafka can’t help unless they’re in the same partition. There’s no cross-partition ordering primitive.

These aren’t fundamental flaws — they’re design tradeoffs that are correct for Kafka’s target use case (high-throughput event streaming) and incorrect for other use cases (low-latency request-response, small-message-high-volume telemetry, complex event processing).

The Verdict

Everyone copies what Kafka does because what Kafka does is, for a broad class of problems, the right thing to do. The log abstraction is powerful. The consumer group model is practical. The ISR replication model is simple enough to understand and reliable enough to trust. The operational model is well-documented and well-tooled.

If your problem looks like “move data reliably between services with ordering and persistence,” the Kafka model is your starting point. Whether you use Kafka itself, a Kafka-compatible alternative (Redpanda), a Kafka-inspired alternative (Pulsar), or a managed service (Kinesis, Event Hubs), the architectural pattern is the same.

The mistake is thinking that because Kafka’s pattern works for data streaming, it must also work for coordination (use a consensus protocol), transactional processing (use a database), or low-latency RPC (use a proper RPC framework). Kafka’s influence is broad, but it’s not universal. Knowing where the pattern applies and where it doesn’t is the difference between copying Kafka wisely and copying Kafka reflexively.

A Field Guide to Kafka-Influenced Design Decisions

To close this chapter, here’s a quick reference for the design decisions that Kafka popularized and their applicability beyond Kafka:

| Design Decision | Kafka’s Approach | When to Copy It | When Not To |
|---|---|---|---|
| Append-only log as primary abstraction | All writes are appends; no in-place updates | Event streaming, audit logs, CDC, event sourcing | OLTP databases, in-place update workloads |
| Per-partition ordering (not total ordering) | Total order within partition; no cross-partition ordering | When entities can be partitioned independently | When you need global ordering (financial ledgers, serializable transactions) |
| Consumer-managed offsets | Consumers track their position; server doesn’t track per-consumer state | High fan-out (many consumers per topic) | Low fan-out with complex acknowledgment needs |
| Retention-based lifecycle | Messages retained by time/size, not by consumption | When replay and backfill are important | When storage cost is critical and messages are consumed once |
| Dynamic replica set (ISR) | Replicas self-manage based on replication lag | When operational simplicity matters more than formal guarantees | When you need provable consensus properties |
| Batching at every layer | Producers, brokers, and consumers all batch | High-throughput workloads | Ultra-low-latency workloads (< 1ms) |
| Leader-per-partition | Different partitions can have different leaders | When aggregate throughput matters more than per-key throughput | When you need a single authority for all operations |

These aren’t Kafka-specific inventions — most have precedents in databases, message queues, or academic systems. But Kafka assembled them into a coherent package and proved they work at scale. That packaging is Kafka’s real contribution, and it’s the reason everyone keeps copying it.

The Future of Consensus

Predicting the future of any technology field is a reliable way to look foolish in hindsight. Predicting the future of consensus algorithms, where the foundational impossibility result (FLP) is over forty years old and the most widely deployed protocol (Paxos, in various disguises) is over thirty, requires a particular kind of hubris.

Let’s proceed anyway.

The trends we’ll discuss in this chapter aren’t speculative — they’re research directions with working prototypes, some with production deployments. What’s uncertain isn’t whether these ideas work, but whether they’ll achieve the adoption necessary to displace the current generation of consensus protocols. History suggests that most won’t. But the few that do will reshape how we think about agreement in distributed systems.

Disaggregated Consensus: Separating Ordering from Execution

The traditional model of consensus-based state machine replication bundles two concerns: ordering (deciding the sequence of commands) and execution (applying those commands to the state machine). Every replica orders the commands and executes them. This is clean conceptually but wasteful practically — why should every replica burn CPU executing the same computation when only the ordering needs agreement?

Disaggregated consensus separates these concerns. A small consensus group handles ordering (producing a totally ordered log of commands), and a potentially larger set of execution nodes consumes this log and applies the commands. The ordering group doesn’t need to know what the commands mean. The execution nodes don’t need to participate in consensus.

This pattern has appeared in several forms:

Shared log architectures. Systems like Corfu (from Microsoft Research), Delos (from Facebook/Meta), and Virtual Consensus use a shared log as the foundation. The log provides total order. Applications built on top of the log get this ordering “for free” by reading from the log. Different applications can share the same log (amortizing the consensus cost) or use separate logs (isolating failure domains).

Separation in databases. Amazon Aurora separates compute from storage, with the storage layer handling replication. The compute layer doesn’t run a consensus protocol — it writes to the storage layer, which handles durability and replication using a quorum-based protocol. This means Aurora can scale compute independently of storage, and adding a read replica doesn’t affect the consensus protocol.

The Delos approach. Meta’s Delos system takes this further by making the shared log implementation pluggable. The VirtualLog abstraction presents a log API to applications, but the underlying log implementation can be swapped (from a ZooKeeper-based log to a custom NativeLoglet, for example) without changing the application. This is disaggregation not just of ordering and execution, but of the consensus protocol itself.

The appeal of disaggregation is operational: the consensus group can be small (3 or 5 nodes), simple (just ordering bytes, no application logic), and generic (shared across many applications). The execution layer can be scaled independently, can tolerate execution failures without affecting consensus, and can even use different execution engines for different use cases.

The challenge is latency. Adding an indirection layer between clients and the consensus group adds at least one additional network hop. For latency-sensitive applications, this overhead may not be acceptable. But for applications where throughput matters more than single-operation latency, disaggregation is a clear win.

This trend is likely to accelerate. As more systems adopt microservices architectures and as serverless computing grows, the idea of a shared, managed ordering service becomes increasingly natural. Why should each microservice run its own Raft group when they could share an ordering layer?

Hardware-Assisted Consensus

The most exciting (and most uncertain) trend in consensus research is the use of specialized hardware to accelerate consensus protocols. The basic idea: if the bottleneck in consensus is network round-trips and message processing, what if the network infrastructure itself participated in the protocol?

RDMA-Based Consensus

Remote Direct Memory Access (RDMA) allows one machine to read from or write to another machine’s memory without involving the remote CPU. This eliminates the kernel networking stack, context switches, and much of the software overhead of traditional RPC.

RDMA-based consensus protocols (like DARE and Hermes) exploit this by having the leader write directly to followers’ memory. The consensus round-trip becomes:

  1. Leader writes to followers’ memory via RDMA (one-sided write, no follower CPU involvement)
  2. Leader polls followers’ memory to check for acknowledgment
  3. Done
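
In a toy, single-process rendering of those three steps (assumption: follower memory is a plain Python object, and real RDMA involves registered memory regions and verbs APIs, none of which appear here), the pattern looks like this:

```python
class FollowerMemory:
    """Memory region the leader writes without follower CPU involvement."""
    def __init__(self):
        self.log_slot = None   # where the leader deposits the entry
        self.ack_word = 0      # leader polls this to detect acceptance

def replicate(entry, followers, quorum):
    # Step 1: one-sided writes; the leader deposits the entry directly.
    for mem in followers:
        mem.log_slot = entry
        # In real RDMA the remote NIC makes the write visible on its own;
        # a healthy follower is modeled by flipping the ack word at once.
        mem.ack_word = 1
    # Step 2: poll followers' memory for acknowledgments.
    acks = sum(mem.ack_word for mem in followers)
    # Step 3: committed once a quorum of acks is visible.
    return acks >= quorum

followers = [FollowerMemory(), FollowerMemory()]
print(replicate("cmd-42", followers, quorum=2))   # True
```

The polling in step 2 is also the protocol's Achilles' heel: a write can land in the memory of a machine whose CPU is already dead, which is why these systems need separate failure detection.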

The latency for a single RDMA round-trip is typically 1-3 microseconds, compared to 50-200 microseconds for a traditional TCP round-trip. This means consensus commits in single-digit microseconds — two orders of magnitude faster than software-based consensus.

The limitations are significant:

  • RDMA requires specialized network hardware (InfiniBand or RoCE-capable NICs). This hardware is common in HPC and cloud data centers but not ubiquitous.
  • RDMA-based protocols are limited to a single network domain (typically a single data center). You can’t do RDMA over the public internet.
  • The failure model is different. RDMA one-sided writes can succeed even if the remote node has crashed (the NIC performs the write independently of the CPU). Detecting failure requires additional mechanisms.
  • The programming model is complex. RDMA requires careful memory management, and bugs in RDMA code can corrupt remote memory silently.

Despite these limitations, RDMA-based consensus is being adopted in production systems where latency matters more than generality. Microsoft’s FaRM (Fast Remote Memory) uses RDMA-based replication for an in-memory key-value store, achieving millions of operations per second with microsecond-level latency.

SmartNIC and Programmable Switch Consensus

Even more exotic: running parts of the consensus protocol on network devices themselves.

NetPaxos (from Dang et al.) implements Paxos’s acceptor logic on a programmable network switch (using P4, a language for programming network data planes). The idea is that the switch, which already sees every packet, can act as the acceptor — stamping each proposal with an acceptance as the packet passes through. This eliminates the round-trip to separate acceptor nodes entirely.

Speculative Paxos (from Ports et al.) uses the network to provide an ordering guarantee. If the network fabric can deliver messages to all replicas in the same order (which some data center network topologies can approximate), then replicas can speculatively execute commands in that order, and the consensus protocol only needs to handle the rare cases where the network order isn’t consistent.

P4xos and related systems push consensus logic into the switch ASIC, achieving consensus in nanoseconds rather than microseconds. The tradeoff: switch memory and computation are extremely limited, so only the critical path of the consensus protocol runs on the switch, with everything else (recovery, membership changes, state transfer) handled in software.

These approaches share a common theme: exploiting the data center network as a computational resource rather than a dumb pipe. The results are impressive — sub-microsecond consensus in the best case. But the limitations are equally significant:

| Approach | Latency Improvement | Limitations |
|---|---|---|
| RDMA one-sided writes | ~100x (tens to hundreds of microseconds → single-digit microseconds) | Special NICs, single DC, complex failure handling |
| SmartNIC offload | ~10-100x | NIC-specific code, limited processing power |
| Programmable switch | ~1000x (nanoseconds) | P4-capable switches, very limited state, vendor-specific |
| Network-ordered consensus | ~10x (eliminates ordering round-trip) | Requires network topology guarantees, fragile |

The question isn’t whether these approaches work — they do, in controlled environments. The question is whether the hardware assumptions will become common enough to make them practical for general use. The trend in data center networking (toward programmable switches, SmartNICs, and RDMA-capable fabrics) suggests yes, but the timeline is uncertain.

FPGAs and Custom ASICs

Some researchers have gone further, implementing consensus logic directly in FPGAs. "Consensus in a Box" (István et al.) and similar projects demonstrate that the hot path of a consensus protocol can run entirely in hardware, with the FPGA accepting proposals, checking quorums, and producing decisions without software involvement.

This is the extreme end of the hardware spectrum: maximum performance, minimum flexibility. An FPGA-based consensus accelerator can commit in nanoseconds, but changing the protocol requires reprogramming the hardware. For use cases where the consensus protocol is fixed and performance is paramount (high-frequency trading, real-time control systems), this makes sense. For general-purpose distributed systems, it’s overkill.

The Move Toward Shared Log Architectures

The shared log pattern deserves deeper examination because it represents a genuine architectural shift, not just an optimization.

In a shared log architecture, consensus is a service, not a library. Applications don’t embed a consensus protocol — they connect to a log service and append entries. The log service handles ordering, replication, and durability. Applications handle everything else.

This separation has several architectural benefits:

Amortized consensus overhead. If ten applications share one log, the consensus cost (leader election, quorum management, replication) is paid once, not ten times. Each application adds incremental cost for its log entries, but the fixed overhead is shared.

Simplified application development. Application developers don’t need to understand consensus. They need to understand “append to log” and “read from log.” This is a much lower bar and eliminates an enormous class of bugs (incorrect consensus implementations, failure to handle leader changes, etc.).

Flexible consistency. Different applications can read the same log at different points, providing different consistency levels. An application that reads at the tail of the log sees the latest data (strong consistency). An application that reads at a lag sees slightly stale data (eventual consistency). The log itself provides the consistency, and applications choose where to read.

Time travel and debugging. Because the log is the source of truth, you can replay from any point to reconstruct state. This is invaluable for debugging (“what did the system look like at 3:47 AM when the incident started?”) and for building new derived views (materialize a new index by replaying the log).
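The append/read/replay model above is small enough to sketch end to end. This is a hypothetical in-memory stand-in for a log service (the names `SharedLog` and `KVView` are illustrative, not any real API): applications append totally ordered entries, derived views consume from a position, and replaying a prefix reconstructs past state.

```python
class SharedLog:
    """Stands in for a replicated log service (Corfu/Delos-style API)."""
    def __init__(self):
        self._entries = []

    def append(self, entry) -> int:
        """Totally ordered append; returns the entry's log position."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read(self, from_pos: int = 0):
        """Read from a position: the tail gives fresh data, a lagging
        position gives stale-but-consistent data."""
        return self._entries[from_pos:]


class KVView:
    """A derived view: materializes a dict by consuming the log."""
    def __init__(self, log: SharedLog):
        self.log, self.state, self.pos = log, {}, 0

    def catch_up(self):
        entries = self.log.read(self.pos)
        for key, value in entries:
            self.state[key] = value
        self.pos += len(entries)


log = SharedLog()
log.append(("x", 1))
log.append(("y", 2))
log.append(("x", 3))

view = KVView(log)
view.catch_up()
print(view.state)  # {'x': 3, 'y': 2}

# "Time travel": rebuild state as of position 2 by replaying a prefix.
past = {}
for key, value in log.read(0)[:2]:
    past[key] = value
print(past)  # {'x': 1, 'y': 2}
```

Note that the consensus machinery lives entirely behind `append`: the application code never sees leader elections or quorums.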

The shared log architecture is essentially the architecture of databases (WAL-based recovery) elevated to the system level. Instead of each database having its own WAL, the entire system shares a distributed WAL. It’s an old idea, but one whose time may have come as the tooling for building on top of logs (Kafka Streams, Apache Flink, materialized views) has matured.

Meta’s Delos, Microsoft’s Tango/Corfu, and various “log-structured everything” proposals all point in this direction. The pattern is also visible in the way CockroachDB and TiDB use Raft groups as per-range logs, with the state machine (SQL engine) consuming from these logs.

Leaderless and Multi-Leader Consensus: The Next Generation

The leader bottleneck — a single node that sequences all operations — remains the most important practical limitation of deployed consensus protocols. Several research directions aim to eliminate or mitigate it.

Beyond EPaxos

EPaxos demonstrated that leaderless consensus is possible, but its complexity has limited adoption. Newer protocols attempt to capture EPaxos’s benefits with simpler designs.

Atlas (Enes et al., 2020) simplifies EPaxos by restricting the dependency tracking to a cleaner model. While EPaxos tracks arbitrary dependencies between commands, Atlas uses a more structured approach that reduces the complexity of the execution ordering algorithm — arguably the hardest part of EPaxos to get right.

Tempo (Enes et al., 2021) goes further, providing leaderless consensus with clock-based ordering. By using loosely synchronized clocks (not for correctness, but for performance — clock skew degrades performance, not safety), Tempo can order most commands without explicit dependency tracking.

These protocols represent a trend toward making leaderless consensus practical rather than merely possible. The question is whether any of them will achieve the implementation maturity and ecosystem support necessary to challenge Raft’s dominance. As of this writing, none has.

Multi-Leader for Geo-Distribution

For geo-distributed systems where a single leader creates unacceptable latency, multi-leader approaches are gaining traction:

WPaxos (Ailijiang et al.) extends Flexible Paxos with object-level leadership. Different objects can have leaders in different regions. When a region’s client accesses an object whose leader is remote, the leader can be migrated to the accessing region. This provides locality for frequently accessed objects while maintaining consensus.

Mencius (Mao et al.) partitions the consensus log among multiple leaders. Leader 1 owns slots 1, 4, 7, …; leader 2 owns slots 2, 5, 8, …; etc. Each leader can independently propose for its slots, providing multi-leader throughput. The challenge is handling “holes” when a leader has nothing to propose for its slot (resolved with no-op messages).
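The slot-partitioning scheme is just modular arithmetic. A minimal sketch (0-indexed for simplicity, where the text's example is 1-indexed):

```python
# Mencius-style round-robin slot ownership among n leaders:
# leader i owns slots i, i+n, i+2n, ...

def slot_owner(slot: int, n_leaders: int) -> int:
    return slot % n_leaders

def slots_for(leader: int, n_leaders: int, upto: int):
    return [s for s in range(upto) if slot_owner(s, n_leaders) == leader]

print(slots_for(0, 3, 10))  # [0, 3, 6, 9]
print(slots_for(1, 3, 10))  # [1, 4, 7]

# A "hole": slot 5 belongs to leader 2, so if leader 2 has nothing to
# propose, it must fill slot 5 with a no-op before later slots can be
# executed in order.
assert slot_owner(5, 3) == 2
```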

These approaches represent a middle ground between single-leader consensus (simple, high latency for remote clients) and full leaderlessness (low latency, complex). Whether the middle ground is the right tradeoff depends on the workload, which brings us back to the unsatisfying but accurate answer: it depends.

Consensus at the Edge

The edge computing trend — pushing computation closer to users, onto devices at the network edge — presents new challenges for consensus.

Resource constraints. Edge devices may have limited CPU, memory, and storage. Running a full Raft implementation on a device with 64MB of RAM requires careful engineering. Running PBFT is likely impractical.

Intermittent connectivity. Edge devices may lose connectivity to the cloud and to each other. Consensus protocols that require a quorum of always-connected nodes don’t work when connectivity is intermittent.

Heterogeneous nodes. Edge deployments may include devices with very different capabilities — a powerful edge server in a base station alongside resource-constrained IoT sensors. Symmetric consensus protocols (where all nodes play the same role) don’t fit well.

Geo-distribution. Edge nodes are, by definition, distributed across geographic locations. The latency between an edge node in Seattle and an edge node in Miami makes classical consensus impractical for coordination between them.

These constraints push toward a few directions:

Hierarchical consensus. Local consensus among co-located edge nodes (fast, small quorum), with asynchronous replication to a central cloud (for durability and global coordination). This is not new — it’s the multi-master replication model that databases have used for decades — but the edge context adds constraints around resource limits and intermittent connectivity.

CRDTs and eventual consistency. For many edge use cases (counters, sensor aggregation, last-writer-wins registers), eventual consistency with CRDTs is sufficient and avoids the need for consensus entirely. The CRDT model is a natural fit for edge/IoT: each device can operate independently and merge state when connectivity is available.
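A state-based G-Counter is the canonical example of the CRDT model described above: each device increments its own slot while offline, and merge (element-wise max) is commutative, associative, and idempotent, so replicas converge regardless of the order in which they sync.

```python
# Minimal state-based grow-only counter (G-Counter) CRDT.

def g_counter(n_nodes):
    return [0] * n_nodes          # one slot per node

def increment(counter, node_id, amount=1):
    counter[node_id] += amount    # a node only touches its own slot

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]  # element-wise max

def value(counter):
    return sum(counter)

# Two edge devices count events while partitioned from each other.
dev0, dev1 = g_counter(2), g_counter(2)
increment(dev0, 0)
increment(dev0, 0)   # device 0 saw 2 events
increment(dev1, 1)   # device 1 saw 1 event

# When connectivity returns, merging in either order gives the same state.
assert merge(dev0, dev1) == merge(dev1, dev0) == [2, 1]
print(value(merge(dev0, dev1)))  # 3
```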

Lightweight consensus protocols. Research into consensus protocols specifically designed for resource-constrained environments (fewer messages, smaller state, lower CPU requirements) is ongoing. These aren’t fundamentally new algorithms — they’re typically optimized variants of Raft or Paxos — but the optimization target is different (minimize resource usage rather than maximize throughput).

The edge will probably not produce a fundamentally new consensus algorithm. Instead, it will produce new architectures that combine existing consensus protocols (for the parts that need strong consistency) with eventual consistency mechanisms (for the parts that don’t) in configurations that reflect the edge’s unique constraints.

Machine Learning and Consensus

The intersection of machine learning and consensus is nascent but intriguing.

Learned Protocol Configuration

Consensus protocols have many tunable parameters: timeout values, batch sizes, quorum configurations, leader placement. Traditionally, these are set by humans based on experience and benchmarking. ML-based approaches can potentially optimize these parameters automatically.

Timeout tuning. Raft’s election timeout must be long enough to avoid unnecessary elections (which cause disruption) but short enough to detect actual failures promptly. The optimal timeout depends on network conditions, which change over time. An ML model that observes network latency patterns and adjusts timeouts accordingly could improve both stability and failover speed.
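A non-ML baseline makes the opportunity concrete: track recent heartbeat intervals and set the election timeout to mean plus a few standard deviations. The sketch below uses illustrative constants (window size, multiplier `k`, floor); an ML model would replace this simple statistical estimator with one that learns richer patterns.

```python
import random
import statistics

class AdaptiveTimeout:
    """Election timeout = max(floor, mean + k * stddev) over a sliding
    window of observed heartbeat intervals."""
    def __init__(self, window=100, k=4.0, floor_ms=150.0):
        self.samples, self.window, self.k, self.floor = [], window, k, floor_ms

    def observe(self, interval_ms: float):
        self.samples.append(interval_ms)
        if len(self.samples) > self.window:
            self.samples.pop(0)

    def timeout_ms(self) -> float:
        if len(self.samples) < 2:
            return self.floor
        mean = statistics.fmean(self.samples)
        std = statistics.stdev(self.samples)
        return max(self.floor, mean + self.k * std)

random.seed(1)
t = AdaptiveTimeout()
for _ in range(50):                      # calm network: ~50ms heartbeats
    t.observe(random.gauss(50, 5))
calm = t.timeout_ms()
for _ in range(50):                      # congested: slower and noisier
    t.observe(random.gauss(120, 30))
congested = t.timeout_ms()
assert congested > calm                  # timeout stretches under congestion
```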

Batch size optimization. The optimal batch size for consensus depends on message sizes, arrival rates, and latency targets. Too small and you waste per-batch overhead. Too large and you increase latency while waiting for the batch to fill. An RL (reinforcement learning) agent could learn the optimal batching strategy online.

Leader placement. In a geo-distributed consensus group, placing the leader near the majority of clients minimizes commit latency. As workload patterns shift (different regions are active at different times of day), the optimal leader placement changes. An ML model could predict workload patterns and trigger proactive leader migration.

Adaptive Protocol Selection

More speculatively: could a system automatically choose between different consensus protocols (or protocol configurations) based on the current workload?

Consider a system that uses EPaxos’s fast path when the workload has low contention (most commands don’t conflict) and falls back to a Raft-like leader-based approach when contention is high (too many commands conflict for EPaxos’s fast path to be effective). The switch point depends on the workload’s conflict rate, which an ML model could estimate in real-time.

This is technically feasible — the challenge is ensuring correctness during transitions. Switching between protocols mid-stream requires careful state synchronization, and any bug in the transition logic could violate safety. The engineering cost of getting this right probably outweighs the performance benefit for most systems today, but as consensus-as-a-service offerings mature, the cost-benefit calculation may shift.

Predictive Failure Detection

A more immediately practical application of ML is improving failure detection. Current consensus protocols use fixed timeouts to detect node failures — if a heartbeat isn’t received within T milliseconds, the node is presumed dead. This is crude: too short a timeout causes false positives (healthy nodes declared dead during a network hiccup), too long a timeout delays failover.

ML models can learn the distribution of heartbeat delays under normal conditions and flag anomalies. If heartbeat latency normally follows a pattern (lower during the day, higher during batch processing at night), an ML model can adjust the effective timeout dynamically. Systems like the Falcon failure detector and academic projects on ML-based failure detection have shown promising results — reducing false positive rates while maintaining fast detection of actual failures.

This is one area where ML provides unambiguous value: it’s a pure optimization of existing mechanisms, the safety properties don’t depend on the ML model being correct (a false positive just triggers an unnecessary but safe leader election), and the training data is readily available from production telemetry.
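The classic non-ML version of this idea is the phi-accrual failure detector (Hayashibara et al.), which the ML approaches extend: instead of a binary alive/dead decision at a fixed T milliseconds, it outputs a continuous suspicion level that grows as the current silence becomes statistically unusual relative to observed heartbeat history. A sketch under a normal-distribution assumption:

```python
import math
import statistics

def phi(elapsed_ms: float, intervals) -> float:
    """Suspicion level: -log10 of the probability that a heartbeat
    arrives later than `elapsed_ms`, fitting history as Normal(mean, std)."""
    mean = statistics.fmean(intervals)
    std = max(statistics.stdev(intervals), 1e-3)  # avoid division by zero
    z = (elapsed_ms - mean) / std
    # P(X > elapsed) via the complementary error function.
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log10(max(p_later, 1e-15))       # clamp to avoid log(0)

history = [100.0, 98.0, 103.0, 101.0, 99.0]  # heartbeats every ~100ms
print(phi(105.0, history))   # mild silence: low suspicion
print(phi(400.0, history))   # long silence: very high suspicion
assert phi(400.0, history) > phi(105.0, history)
```

The caller then compares phi against a threshold; raising the threshold trades faster detection for more false positives, exactly the dial the ML work tries to tune automatically.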

What ML Won’t Fix

It’s worth being clear about what machine learning cannot do for consensus:

  • ML cannot eliminate the fundamental round-trip overhead. No amount of prediction can substitute for actual agreement among nodes.
  • ML cannot fix the FLP impossibility result. An asynchronous system with even one faulty process cannot guarantee consensus termination, regardless of how clever the ML model is.
  • ML cannot substitute for formal correctness. A consensus protocol must be safe under all executions, not just the ones the ML model has seen in training. Using ML to learn a consensus protocol from scratch (rather than to tune parameters of a correct protocol) is a terrible idea.

The role of ML in consensus is optimization at the margins, not fundamental improvement. The margins can matter — a 20% latency reduction from better timeout tuning is valuable — but they don’t change the game.

The “Death of Consensus” Argument

Every few years, someone publishes a provocative paper or blog post arguing that consensus is obsolete. The argument typically goes: CRDTs and causal consistency can handle most use cases without coordination, so strong consensus is an unnecessary overhead that we should eliminate.

Let’s take this argument seriously.

Where the Argument Is Right

CRDTs (Conflict-Free Replicated Data Types) have genuinely expanded the space of problems solvable without consensus. For data types with commutative, associative, and idempotent operations (counters, sets, LWW-registers, OR-sets, etc.), CRDTs provide eventual convergence without any coordination. This is not a theoretical curiosity — it’s used in production by systems like Riak, Redis, and various collaborative editing applications.

Causal consistency (where operations are ordered according to their causal dependencies, but causally independent operations can be ordered arbitrarily) is weaker than linearizability but strong enough for many applications. If you only need “read your own writes” and “monotonic reads,” causal consistency provides this without the latency and availability costs of consensus.

The CALM theorem (Consistency As Logical Monotonicity) provides a formal framework for identifying which computations require coordination and which don’t. Monotonic computations (where adding information never invalidates previous conclusions) can be done without coordination. This is a powerful insight that suggests many programs could be written to avoid consensus entirely.

Together, these results suggest that the domain of problems requiring strong consensus is smaller than we traditionally assumed. Many systems that use consensus don’t actually need it — they could achieve their requirements with weaker consistency models.

Where the Argument Is Wrong

The argument that consensus is obsolete breaks down in several places:

Leader election and mutual exclusion are inherently non-monotonic. Choosing one leader among many, or granting a lock to one process among many, requires agreement. No CRDT can implement a distributed lock. No eventually consistent system can provide mutual exclusion. If your system needs “exactly one process does this thing at this time,” you need consensus.

Transactions require ordering. If you need atomic multi-key updates (transfer money from account A to account B), you need agreement on the order of operations. CRDTs handle individual operations beautifully but don’t compose into transactions without additional coordination.

Configuration changes require agreement. The membership of a distributed system (which nodes are alive, which are primary, what’s the current schema) must be consistent across nodes. A system where different nodes disagree about who the leader is will behave incorrectly. Ironically, even a system that uses CRDTs for application data needs consensus for its own configuration.

Exactly-once semantics require deduplication state. If you need to ensure an operation is applied exactly once, you need consistent deduplication state. This requires consensus (or something equivalent, like a consistent hash ring backed by consensus).

The Numbers

To put some perspective on the “CRDTs will replace consensus” argument, consider what fraction of a typical system’s operations actually require strong consistency:

| System Type | Operations Requiring Consensus | Operations Suitable for Eventual Consistency |
| --- | --- | --- |
| Social media feed | < 1% (account creation, deactivation) | > 99% (posts, likes, comments, reads) |
| E-commerce | ~5-10% (checkout, inventory decrement) | ~90-95% (browsing, cart updates, recommendations) |
| Banking | ~50-80% (transfers, balance updates) | ~20-50% (statement reads, notifications) |
| Configuration management | ~100% (all writes) | ~0% (even reads need strong consistency) |
| Collaborative editing | ~1-5% (document creation, permissions) | ~95-99% (edit operations via OT/CRDTs) |

For the social media and e-commerce cases, the “death of consensus” argument has merit — most operations don’t need it. For banking and configuration management, consensus remains essential. The future isn’t “consensus vs. no consensus” — it’s knowing which operations fall into which category.

The Realistic Prediction

Consensus won’t die. But its domain will shrink.

The pattern that’s emerging is a layered architecture:

  1. Consensus layer (small, critical): Handles leader election, membership, configuration, and the small amount of data that requires strong consistency. Runs on a small number of nodes (3-5). Uses Raft or an equivalent protocol.

  2. Coordination-free layer (large, performance-sensitive): Handles application data using CRDTs, eventual consistency, or causal consistency. Runs on many nodes. No consensus overhead.

  3. Selective consensus (as needed): Some operations require strong consistency even on application data (e.g., unique username registration, inventory decrement below zero). These operations go through the consensus layer on demand.

This is essentially what many modern databases already do (consensus for the metadata layer, weaker consistency for the data layer with optional strong reads). The trend is toward making this architecture more explicit and more configurable.
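The routing decision at the heart of this layered architecture can be sketched as a tiny dispatcher. The operation names and handler bodies below are illustrative placeholders, not any real system's API:

```python
# Operations that must go through the consensus layer (step 3 above);
# everything else takes the coordination-free path.
CONSENSUS_REQUIRED = {"register_username", "decrement_inventory", "change_config"}

def consensus_commit(op: str, payload: dict) -> str:
    # Placeholder for a quorum round-trip (e.g. a Raft log append).
    return f"committed-by-quorum:{op}"

def crdt_apply(op: str, payload: dict) -> str:
    # Placeholder for a local CRDT update, merged asynchronously.
    return f"applied-locally:{op}"

def dispatch(op: str, payload: dict) -> str:
    if op in CONSENSUS_REQUIRED:
        return consensus_commit(op, payload)  # slow, strongly consistent
    return crdt_apply(op, payload)            # fast, eventually consistent

print(dispatch("register_username", {"name": "alice"}))
# committed-by-quorum:register_username
print(dispatch("like_post", {"post": 42}))
# applied-locally:like_post
```

The interesting engineering is not the dispatch table but deciding which set each operation belongs in — which is exactly the categorization exercise the previous section argued for.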

The Modularity Trend: Consensus as a Pluggable Component

The traditional approach to building a consensus-based system is to deeply integrate the consensus protocol into the application. etcd’s Raft implementation is tightly coupled to etcd’s storage engine. ZooKeeper’s Zab implementation is tightly coupled to ZooKeeper’s data model.

A growing trend is toward modular, composable consensus:

Consensus libraries. Instead of building a complete system, use a consensus library and add your application logic on top. Examples include hashicorp/raft (Go), openraft (Rust), ratis (Java), and dragonboat (Go). These libraries handle leader election, log replication, and snapshot management, exposing a state machine interface that the application implements.
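The state machine interface these libraries expose looks roughly like this — a Python paraphrase loosely modeled on hashicorp/raft's FSM interface (exact method names vary by library). The library drives `apply` in log order on every replica; the application supplies only deterministic state transitions plus snapshot/restore:

```python
import json
from abc import ABC, abstractmethod

class StateMachine(ABC):
    @abstractmethod
    def apply(self, entry: bytes):       # called once per committed log entry
        ...

    @abstractmethod
    def snapshot(self) -> bytes:         # serialize state for log compaction
        ...

    @abstractmethod
    def restore(self, snap: bytes):      # rebuild state from a snapshot
        ...

class KVStore(StateMachine):
    """Toy application state machine: entries are 'key=value' strings."""
    def __init__(self):
        self.data = {}

    def apply(self, entry: bytes):
        key, _, value = entry.decode().partition("=")
        self.data[key] = value

    def snapshot(self) -> bytes:
        return json.dumps(self.data).encode()

    def restore(self, snap: bytes):
        self.data = json.loads(snap.decode())

# What the library does for you: replay committed entries, in order,
# on every replica, so all replicas reach the same state.
sm = KVStore()
for entry in [b"x=1", b"y=2", b"x=3"]:
    sm.apply(entry)
print(sm.data)  # {'x': '3', 'y': '2'}
```

The correctness obligation that remains with the application — and that no library can take away — is that `apply` must be deterministic: no clocks, no randomness, no local I/O.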

Consensus-as-a-service. Instead of running your own consensus group, use a managed ordering service. This is the shared log approach discussed earlier — the consensus is someone else’s problem, and your application just appends to and reads from a log.

Pluggable consensus in blockchain frameworks. Hyperledger Fabric allows plugging in different consensus protocols (Raft, BFT variants). Cosmos SDK, built on Tendermint, separates the consensus engine from the application via ABCI. This modularity lets the same application run with different consensus backends depending on the deployment requirements.

The modularity trend reduces the barrier to using consensus correctly. If you’re using a well-tested consensus library rather than implementing your own, you inherit the library’s correctness (and its bugs, but at least they’re shared, well-known bugs). If you’re using consensus-as-a-service, you don’t need to understand the protocol at all.

The risk is that modularity can create a false sense of security. A consensus library handles consensus, but it doesn’t handle the interaction between consensus and your application. Incorrect use of the state machine interface, improper handling of leadership changes, or misunderstanding of the consistency guarantees provided by the library can all lead to bugs that the library can’t prevent.

What Problems Remain Unsolved

Despite decades of research, several problems in consensus remain genuinely open:

Optimal Byzantine Fault Tolerance

We know that BFT requires 3f+1 nodes for f Byzantine faults, and we know that HotStuff achieves linear message complexity. But the constant factors in BFT protocols remain high. The throughput gap between CFT and BFT protocols is roughly 10x, even with optimized BFT implementations. Closing this gap — or proving that it can’t be closed — is an open problem.
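The node-count arithmetic behind that comparison, for reference: crash fault tolerance (CFT) needs 2f+1 nodes with majority quorums, while BFT needs 3f+1 nodes with quorums of 2f+1.

```python
# Minimum cluster sizes to tolerate f faults.
def cft_nodes(f):  return 2 * f + 1   # crash faults: majority quorum
def bft_nodes(f):  return 3 * f + 1   # Byzantine faults
def bft_quorum(f): return 2 * f + 1   # quorum size within a BFT cluster

for f in (1, 2, 3):
    print(f"f={f}: CFT needs {cft_nodes(f)} nodes, "
          f"BFT needs {bft_nodes(f)} (quorum {bft_quorum(f)})")
# f=1: CFT needs 3 nodes, BFT needs 4 (quorum 3)
# f=2: CFT needs 5 nodes, BFT needs 7 (quorum 5)
# f=3: CFT needs 7 nodes, BFT needs 10 (quorum 7)
```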

Threshold signatures (used by HotStuff to aggregate votes) help with message complexity but add CPU overhead. Post-quantum threshold signatures will add even more overhead. Whether hardware acceleration (SmartNICs, FPGAs) can close the gap is an active research question.

Dynamic Membership with Formal Guarantees

Adding and removing nodes from a consensus group while maintaining safety is a solved problem in theory (Raft’s joint consensus, Lamport’s reconfigurable Paxos) but remains fragile in practice. Systems regularly hit edge cases during membership changes — nodes that don’t know they’ve been removed, configurations that briefly lack a quorum, state transfer that races with new proposals.

A consensus protocol that handles dynamic membership as cleanly as it handles normal operation (same safety proofs, same performance characteristics, same implementation simplicity) does not yet exist.

Consensus Under Partial Synchrony with Tight Bounds

The FLP impossibility result tells us that consensus is impossible in a fully asynchronous system. Practical protocols assume partial synchrony — eventually, messages are delivered within a bounded time. But the bound is unknown, and choosing it wrong has consequences (too short: unnecessary leader elections; too long: slow failure detection).

An adaptive protocol that provides optimal performance under current network conditions without requiring any timing assumptions as input is a holy grail that remains out of reach. Machine learning approaches (as discussed above) are a step in this direction but lack formal guarantees.

Consensus at Planetary Scale

Current consensus protocols work well within a data center (sub-millisecond latency) and tolerably across regions (tens to hundreds of milliseconds). But as we push toward truly global systems — satellite networks, interplanetary communication, systems spanning light-seconds of latency — current protocols break down.

The speed of light imposes a hard floor on latency. Earth to Mars is 3-22 minutes one-way. A Raft round-trip of 6-44 minutes is not a practical commit latency. New models of consistency — perhaps based on speculative execution, hierarchical consensus, or fundamentally different assumptions about agreement — will be needed for truly planetary-scale systems. This is admittedly a niche concern today, but it’s a fascinating theoretical question.
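The arithmetic behind those numbers is worth making explicit — one-way light time at Mars's closest and farthest approach, and the resulting floor on a single round-trip (roughly one commit under a stable Raft leader):

```python
# Light-time floor on Earth-Mars consensus latency.
C = 299_792_458       # speed of light, m/s
CLOSEST_M = 54.6e9    # Earth-Mars distance at closest approach, meters
FARTHEST_M = 401e9    # at farthest approach, meters

for label, d in [("closest", CLOSEST_M), ("farthest", FARTHEST_M)]:
    one_way_min = d / C / 60
    print(f"{label}: one-way {one_way_min:.1f} min, "
          f"round-trip {2 * one_way_min:.1f} min")
# closest: one-way 3.0 min, round-trip 6.1 min
# farthest: one-way 22.3 min, round-trip 44.6 min
```

No protocol optimization touches this floor; only changing what "agreement" means (speculation, hierarchy, weaker guarantees) can.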

Making Consensus Understandable

This might be the most important unsolved problem. Despite Raft’s explicit goal of understandability, consensus algorithms remain difficult for most engineers to understand, implement correctly, and debug. The gap between reading a protocol description and building a correct implementation is enormous.

Better tools (model checkers like TLA+ and formal verification frameworks), better education (interactive visualizations, worked examples), and better abstractions (consensus libraries and services that hide the protocol details) all help. But the fundamental complexity of agreement in the presence of failures is irreducible. You can hide it behind an abstraction, but someone has to understand what’s behind the abstraction when things go wrong.

Formal Verification: Proving Consensus Correct

One trend that deserves dedicated attention is the increasing use of formal verification for consensus implementations — not just protocols on paper, but actual running code.

The State of the Art

The gap between a provably correct protocol and a provably correct implementation has historically been wide. Lamport's Paxos correctness proof circulated in 1989. The first formally verified implementations of running consensus code (Raft in the Verdi framework, a Paxos-based system in IronFleet) appeared in 2015. That's 26 years between "we know the protocol is correct" and "we can prove the code is correct."

Several projects have pushed this boundary:

IronFleet (Microsoft Research) provided the first formally verified implementation of a practical Paxos-based replicated state machine, including the network layer, state transfer, and liveness. The verification covered not just the core consensus logic but the entire system stack down to the compiled binary. The cost: roughly 4x the development effort compared to an unverified implementation.

Verdi provides a framework for verified distributed systems in Coq, with consensus protocols as a primary use case — its flagship result is a verified Raft implementation. Verdi's approach separates the protocol specification from the system-level concerns (network semantics, failure handling), allowing each to be verified independently.

CockroachDB’s use of TLA+ demonstrates a more pragmatic approach: specify the protocol in TLA+, model-check the specification to find bugs, but don’t formally verify the implementation. This catches protocol-level bugs (of which the CockroachDB team has found several) without the cost of full verification.

The Raft TLA+ specifications (both the original by Ongaro and subsequent refinements) have been used to find bugs in multiple Raft implementations by checking that the implementation’s behavior matches the specification. This isn’t full verification, but it’s far better than testing alone.

Why This Matters

Consensus bugs are among the most dangerous bugs in distributed systems because they violate the assumptions that everything else is built on. A bug in the application layer loses one user’s data. A bug in the consensus layer can corrupt the replicated state across all nodes, making the corruption durable and replicated — the opposite of what replication is supposed to do.

Real examples of consensus bugs that formal methods could have caught (or did catch):

| Bug | System | Impact | Found By |
| --- | --- | --- | --- |
| Raft pre-vote missing case | Multiple Raft impls | Disrupted clusters after partition healing | Testing + analysis |
| EPaxos execution ordering bug | Original EPaxos paper | Incorrect command ordering | Manual proof review, years after publication |
| ZooKeeper atomic broadcast edge case | ZooKeeper | Potential data inconsistency during leader change | Jepsen testing |
| etcd lease revocation race | etcd | Stale lease could grant access after revocation | Production incident |
| MongoDB replication rollback data loss | MongoDB | Committed writes lost during rollback | Jepsen testing |

The trend toward formal verification in consensus is not about theoretical purity — it’s about preventing these bugs. As consensus-based systems handle increasingly critical data (financial transactions, medical records, infrastructure control), the cost of consensus bugs increases, and the investment in formal verification becomes more justified.

The Practical Compromise

Full formal verification of a consensus implementation remains expensive — roughly 4-10x the development cost. For most teams, this isn’t practical. The practical compromise that’s emerging is a layered approach:

  1. Formally specify the protocol in TLA+ or a similar specification language
  2. Model-check the specification to find protocol-level bugs
  3. Test the implementation against the specification using trace checking or conformance testing
  4. Chaos-test the deployment using tools like Jepsen, Chaos Monkey, or Toxiproxy
  5. Monitor invariants in production (e.g., alert if a follower’s committed index exceeds the leader’s — this should never happen)

Each layer catches a different class of bugs at a different cost. Together, they provide much stronger assurance than testing alone, without the full cost of formal verification.
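Step 5 of the layered approach is the cheapest to adopt and is easy to sketch. The field names below are illustrative — real systems would scrape these values from each node's metrics endpoint — and the invariants are the "should never happen" kind: any alert is a bug, not an operational condition.

```python
def check_invariants(nodes):
    """nodes: list of dicts with id, role, term, commit_index, applied_index."""
    alerts = []
    leaders = [n for n in nodes if n["role"] == "leader"]
    max_leader_commit = max((n["commit_index"] for n in leaders), default=None)

    for n in nodes:
        # A node must never apply entries beyond what it has committed.
        if n["applied_index"] > n["commit_index"]:
            alerts.append(f"{n['id']}: applied past commit index")
        # A follower's commit index must never exceed the leader's.
        if (n["role"] == "follower" and max_leader_commit is not None
                and n["commit_index"] > max_leader_commit):
            alerts.append(f"{n['id']}: follower commit index ahead of leader")

    # Two leaders in the *same term* is a safety violation
    # (two leaders in different terms can transiently happen).
    terms = [n["term"] for n in leaders]
    if len(terms) != len(set(terms)):
        alerts.append("two leaders in the same term")
    return alerts

cluster = [
    {"id": "n1", "role": "leader",   "term": 7, "commit_index": 100, "applied_index": 100},
    {"id": "n2", "role": "follower", "term": 7, "commit_index": 98,  "applied_index": 98},
    {"id": "n3", "role": "follower", "term": 7, "commit_index": 105, "applied_index": 90},
]
print(check_invariants(cluster))
# ['n3: follower commit index ahead of leader']
```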

The Eternal Tension

Throughout this book, we’ve encountered the same tension repeatedly: the tension between theoretical elegance and production engineering.

The theory gives us impossibility results that define what can’t be done (FLP, CAP). It gives us protocols with provable safety and liveness properties (Paxos, PBFT, HotStuff). It gives us lower bounds on message complexity and fault tolerance. This theory is valuable — it prevents us from attempting the impossible and gives us confidence that our protocols are correct.

But the theory doesn’t build systems. The gap between a protocol on paper and a protocol in production is filled with engineering decisions that the theory doesn’t address: how to do state transfer, when to take snapshots, how to handle disk failures, what to do when the clock is wrong, how to upgrade without downtime, how to monitor health, how to debug a stuck consensus group, how to explain to management why the system is unavailable when two out of five nodes are down even though “we have replication.”

This gap is not a failure of theory or of engineering — it’s an inherent consequence of the fact that distributed systems exist at the intersection of mathematics and the physical world. The mathematics is clean. The physical world is not. Consensus algorithms are our best attempt to bridge the two, and the difficulty of that bridge is why they continue to cause agony.

Final Thoughts

When I started writing this book, I expected to arrive at a neat conclusion: here’s the best consensus algorithm, here’s when to use it, here’s how the field will evolve. What I found instead is that consensus — like most problems that have occupied great minds for decades — doesn’t have a neat conclusion. It has tradeoffs, context-dependent recommendations, and a healthy dose of “it depends.”

Here’s what I do believe, after spending far too long thinking about this:

Consensus will remain necessary. The arguments for eventually consistent alternatives are valid but bounded. As long as systems need leaders, locks, transactions, or configuration agreement, they need consensus in some form.

Raft will remain the default. Not because it’s optimal, but because it’s understood. The next decade’s consensus innovations will likely be built as optimizations on top of Raft-like foundations rather than as replacements for them.

Hardware will change the performance picture. RDMA, SmartNICs, and programmable switches will push consensus latency into the microsecond range for systems that can use them. This will enable new use cases but won’t eliminate the fundamental constraints.

The abstraction boundary will move up. Fewer teams will implement consensus protocols directly. More teams will use consensus through libraries, services, and managed offerings. This is unambiguously good — the fewer people who have to worry about the details of leader election, the fewer production incidents caused by getting leader election wrong.

The agony will continue. The gap between understanding consensus in theory and operating it in production will not close. New engineers will continue to discover, with dismay, that “just use Raft” is the beginning of the journey, not the end. They’ll struggle with leader elections, membership changes, state transfer, and the thousand other details that make consensus hard in practice.

And that’s fine. Not every problem should be easy. The agony of consensus algorithms is the price we pay for making multiple computers agree on something, and that capability — fragile, expensive, and maddening as it is — underpins nearly every reliable distributed system in existence. The computers that manage your bank account, route your network traffic, store your data, and coordinate your infrastructure all rely on some form of consensus. The fact that it works at all, given the theoretical impossibility results and the practical engineering challenges, is a minor miracle of computer science.

It’s a miracle that causes a lot of suffering. But it’s a miracle nonetheless.

If this book has done its job, you now understand not just the algorithms themselves, but the landscape in which they operate — the tradeoffs, the failure modes, the gap between theory and practice, and the reasons why every choice involves giving something up. The agony of consensus is not a problem to be solved. It is a condition to be managed, with clear eyes, good tools, and a healthy respect for the difficulty of making machines agree.

Good luck. You’ll need it.