Multi-Paxos and the Gap Between Paper and Production
Here is a dirty secret of distributed systems: there is no paper called “Multi-Paxos.” Lamport sketched the idea in a few paragraphs at the end of “The Part-Time Parliament” and elaborated slightly in “Paxos Made Simple.” The rest — the actual protocol that every production system implements — was reverse-engineered from those sketches, from folklore passed between implementers, from conference hallway conversations, and from the painful experience of people who tried to build real systems from a description that amounted to “just run Paxos repeatedly, with a stable leader, and optimize the obvious things.”
This is the gap between paper and production. It is not a gap. It is a chasm, and the bodies of failed implementations line the bottom.
Why Single-Decree Paxos Is Not Enough
Single-decree Paxos agrees on one value. A replicated state machine needs to agree on a sequence of commands — a log of operations that every replica executes in the same order. You need consensus not once but continuously, for every operation the system processes.
The naive approach is straightforward: for the i-th entry in the log, run an independent instance of single-decree Paxos. Instance 1 agrees on the first command. Instance 2 agrees on the second command. And so on.
This works. It is also catastrophically slow:
- Each instance requires two round trips (Phase 1 and Phase 2).
- Each round trip includes at least one fsync on each acceptor.
- Instances must be sequential if commands can depend on each other.
- Every instance might face contention from competing proposers.
At, say, 1ms per round trip and 0.2ms per fsync, you’re looking at roughly 2.4ms per command in the best case. That’s about 400 operations per second. Your average laptop’s SQLite database laughs at this throughput.
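The arithmetic above can be checked directly. A quick sketch, using the text's assumed figures (1ms per round trip, 0.2ms per fsync, one fsync per round trip on the critical path):

```python
# Back-of-envelope latency for naive per-slot Paxos, using the text's
# assumed figures: 1 ms per round trip, 0.2 ms per fsync, and one
# fsync on the critical path of each round trip.
ROUND_TRIP_MS = 1.0
FSYNC_MS = 0.2
ROUND_TRIPS_PER_COMMAND = 2  # Phase 1 + Phase 2

per_command_ms = ROUND_TRIPS_PER_COMMAND * (ROUND_TRIP_MS + FSYNC_MS)
ops_per_second = 1000.0 / per_command_ms
# per_command_ms is about 2.4; ops_per_second is about 417
```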
We need to do better.
The Multi-Paxos Optimization
The key insight is that Phase 1 is only necessary to establish the proposer’s right to propose. If a proposer has already completed Phase 1 for some proposal number n and nobody has tried to preempt it, then that proposer can skip Phase 1 for subsequent instances and go directly to Phase 2.
This is the stable leader optimization, and it transforms the performance of the protocol:
- In steady state, the leader sends Accept messages and receives Accepted responses — one round trip per command.
- The leader can pipeline — sending Accept for instance i+1 before receiving responses for instance i.
- Phase 1 is only needed when the leader changes (which should be rare in a healthy system).
Here’s what steady-state Multi-Paxos looks like:
Leader Followers (Acceptors)
| F1 F2 F3
| | | |
|--- Accept(n, slot=1, cmd_A)->| | |
|--- Accept(n, slot=1, cmd_A)--------->| |
|--- Accept(n, slot=1, cmd_A)------------------>|
| | | |
|--- Accept(n, slot=2, cmd_B)->| | |
|--- Accept(n, slot=2, cmd_B)--------->| |
|--- Accept(n, slot=2, cmd_B)------------------>|
| | | |
|<-- Accepted(n, slot=1) ------| | |
|<-- Accepted(n, slot=1) --------------| |
| | | |
| (slot 1 committed — majority accepted) |
| | | |
|<-- Accepted(n, slot=2) ------| | |
|<-- Accepted(n, slot=2) --------------| |
| | | |
| (slot 2 committed — majority accepted) |
One round trip. Pipelining. This is a protocol you can actually build a database on.
The Stable Leader
The term “leader” doesn’t appear in single-decree Paxos. In Multi-Paxos, it’s the entire game. The leader is a distinguished proposer that handles all client requests and drives the protocol. Everyone else is a follower.
But how do you elect the leader? Lamport’s papers are deliberately vague — the leader election mechanism is outside the consensus protocol itself. In practice, you need:
- Failure detection: Followers must detect when the leader has failed (usually via heartbeat timeouts).
- Leader election: When the leader fails, a new leader must be elected. This is itself a consensus problem, which creates a delightful circularity. In practice, most systems use the Paxos mechanism itself: a prospective leader runs Phase 1 with a higher proposal number, which implicitly establishes it as the new leader if it succeeds.
- Leader stability: You want the leader to be stable — frequent leader changes kill performance because each change requires Phase 1.
class MultiPaxosNode:
state:
role: Leader | Follower
current_leader: NodeId = null
leader_lease_expiry: Timestamp = 0
heartbeat_interval: Duration = 100ms
election_timeout: Duration = random(300ms, 500ms)
last_heartbeat_received: Timestamp = 0
function follower_loop():
while self.role == Follower:
if now() - self.last_heartbeat_received > self.election_timeout:
// Leader seems dead — try to become leader
self.start_election()
sleep(self.heartbeat_interval)
function start_election():
new_proposal = generate_proposal_number(self.my_id, self.highest_seen)
// Run Phase 1 for ALL active log slots
promises = send_prepare_to_majority(new_proposal)
if |promises| >= quorum_size:
self.role = Leader
self.current_leader = self.my_id
self.proposal_number = new_proposal
// Process any already-accepted values from promises
self.reconcile_log(promises)
// Start sending heartbeats
self.leader_loop()
function leader_loop():
while self.role == Leader:
// Send heartbeats to maintain leadership
broadcast(Heartbeat(self.proposal_number))
// Process client requests
for request in pending_requests:
self.replicate(request)
sleep(self.heartbeat_interval)
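The pseudocode calls generate_proposal_number without defining it. One common scheme — an assumption here, not something the chapter specifies — is to encode a proposal number as a (round, node_id) pair, which makes numbers from different nodes unique and totally ordered:

```python
# Hypothetical sketch of generate_proposal_number: a proposal number is
# a (round, node_id) pair. Bumping the round past the highest number
# seen guarantees the new number exceeds anything any node has used,
# and the node id breaks ties between nodes choosing the same round.
def generate_proposal_number(my_id: int,
                             highest_seen: tuple) -> tuple:
    round_num, _ = highest_seen
    return (round_num + 1, my_id)

# Tuples compare lexicographically: round first, then node id.
```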
Steady-State Leader Operation
Here is the pseudocode for the leader’s steady-state operation — accepting a client request and replicating it:
class MultiPaxosLeader:
state:
log: Map<SlotNumber, LogEntry>
next_slot: SlotNumber = 1
commit_index: SlotNumber = 0 // highest committed slot
proposal_number: ProposalNumber
match_index: Map<NodeId, SlotNumber> // per-follower progress
function handle_client_request(command: Command) -> Result:
// Assign the command to the next log slot
slot = self.next_slot
self.next_slot += 1
entry = LogEntry {
slot: slot,
proposal: self.proposal_number,
command: command,
accepted_by: {self.my_id} // Leader accepts its own entries
}
self.log[slot] = entry
persist(self.log[slot])
// Send Accept to all followers
for follower in self.followers:
send(follower, Accept {
proposal: self.proposal_number,
slot: slot,
command: command
})
        // Wait for majority (could be async with pipelining)
        wait_until(|entry.accepted_by| >= self.quorum_size)
        // Committed! advance_commit_index applies the command to the
        // state machine in log order; applying or executing it again
        // here would run it twice
        result = self.advance_commit_index()
        return Result(success=true, response=result)
function on_accepted(slot: SlotNumber, from: NodeId):
if slot in self.log:
self.log[slot].accepted_by.add(from)
self.match_index[from] = max(self.match_index[from], slot)
if |self.log[slot].accepted_by| >= self.quorum_size:
self.advance_commit_index()
    function advance_commit_index() -> Result:
        // Advance past every contiguous slot accepted by a majority,
        // applying each command to the state machine in log order
        result = null
        while self.commit_index + 1 in self.log and
              |self.log[self.commit_index + 1].accepted_by| >= self.quorum_size:
            self.commit_index += 1
            result = apply_to_state_machine(self.log[self.commit_index].command)
        // Notify followers of new commit index
        // (often piggybacked on next Accept message)
        return result
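The commit rule above can be condensed into a runnable sketch (the names here are illustrative): a slot commits once a majority has accepted it, but the commit index only ever advances through contiguous committed slots:

```python
# Minimal model of the leader's commit rule: log maps each slot to the
# set of acceptors that have accepted it. The commit index advances
# only through contiguous slots that each have a quorum of acceptances.
def advance_commit_index(log: dict, commit_index: int, quorum: int) -> int:
    while commit_index + 1 in log and len(log[commit_index + 1]) >= quorum:
        commit_index += 1
    return commit_index

# Slot 2 has only the leader's own acceptance, so slot 3 cannot commit
# yet even though slot 3 already has a quorum.
log = {1: {"L", "F1"}, 2: {"L"}, 3: {"L", "F1", "F2"}}
advance_commit_index(log, 0, 2)  # returns 1
```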
The Log: Gaps, Holes, and Ordering
Here is where Multi-Paxos diverges significantly from Raft, and where many implementation bugs lurk.
In Multi-Paxos, the log can have gaps. If a leader assigns slot 5 but crashes before its Accept message reaches any follower, and a new leader takes over, the new leader might assign slot 5 to a different command — or might leave slot 5 empty and start from slot 6. This happens because Multi-Paxos runs independent Paxos instances per slot, and there’s no requirement that slots be filled contiguously.
This is theoretically fine but practically awful:
- Your state machine expects commands in order. You can’t apply slot 6 until slot 5 is resolved.
- A gap in the log means you need to run Paxos for that slot to determine whether a value was previously chosen, and to propose a no-op if nothing was.
- Clients might be waiting for a response to the command that was in the gap.
function new_leader_fill_gaps():
// After winning election, fill any gaps in the log
for slot in range(1, self.next_slot):
if slot not in self.log or self.log[slot].status != COMMITTED:
// Run Phase 2 for this slot
// If promises reported an accepted value, use it
// Otherwise, propose a no-op
if slot in values_from_promises:
value = values_from_promises[slot]
else:
value = NO_OP
replicate_to_slot(slot, value)
Raft (Chapter 8) avoids this entirely by requiring the log to be contiguous — no holes allowed. This is one of Raft’s genuine simplifications, and it’s worth appreciating how much complexity it eliminates.
Reconfiguration: Changing the Membership
At some point, you need to add or remove servers from your cluster. A server’s disk fails, you want to replace it. You’re scaling from three nodes to five. You’re decommissioning a data center.
This is the membership change problem, and it is — to put it charitably — under-specified in the Paxos literature. Lamport’s papers mention that the state machine can include reconfiguration commands, but the details of how to do this safely are left as an exercise for the reader. An exercise that has kept the reader up at night for decades.
The fundamental danger is this: during a reconfiguration, the old configuration and the new configuration might independently form majorities that don’t overlap. If the old configuration is {A, B, C} and the new configuration is {B, C, D}, the old majority {A, B} and the new majority {C, D} don’t overlap. Two different values could be chosen for the same slot. Consensus is violated. Your database is corrupt. Your pager is going off.
Approach 1: Alpha-Based Reconfiguration
Lamport’s approach is to have the state machine command at slot i determine the configuration used for slot i + alpha, where alpha is a parameter large enough to ensure that the reconfiguration command has been committed and applied before the new configuration takes effect.
// Configuration is determined by state machine commands
function get_configuration_for_slot(slot):
// Look at all committed reconfiguration commands
// with slot numbers <= (slot - ALPHA)
config = initial_configuration
for s in range(1, slot - ALPHA + 1):
if log[s].command is ReconfigCommand:
config = log[s].command.new_config
return config
This works but has a serious limitation: you can only process one reconfiguration command every alpha slots. If alpha is 100 (a common choice), and each slot takes 1ms, that’s one reconfiguration per 100ms. This is usually fine, but it means you can’t rapidly reconfigure in an emergency.
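A toy version of the alpha rule makes the delay concrete (the tiny alpha, slot numbers, and configurations are invented for illustration):

```python
# The alpha rule: the configuration governing slot i is determined only
# by reconfiguration commands committed at slots <= i - ALPHA.
ALPHA = 3  # deliberately tiny for illustration; the text suggests e.g. 100

def config_for_slot(slot: int, reconfigs: dict, initial: set) -> set:
    """reconfigs maps a slot number to the new config committed there."""
    config = initial
    for s in sorted(reconfigs):
        if s <= slot - ALPHA:
            config = reconfigs[s]
    return config

old, new = {"A", "B", "C"}, {"B", "C", "D"}
config_for_slot(4, {2: new}, old)  # still old: 4 - ALPHA < 2
config_for_slot(5, {2: new}, old)  # new config takes effect at slot 5
```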
Approach 2: Joint Consensus
Raft’s joint consensus approach (described in detail in Chapter 8) can be adapted for Multi-Paxos. The idea is to transition through an intermediate configuration that requires majorities from BOTH the old and new configurations.
Approach 3: The “Just Use a Configuration Log” Approach
Many practical systems (including Chubby) use a separate Paxos instance to agree on configuration changes. The configuration log is itself replicated using Paxos with a fixed (or very rarely changing) set of members. This is the “turtles all the way down” approach — your configuration is managed by consensus, which requires a configuration, which is managed by consensus…
In practice, the base configuration is typically hardcoded or stored in a file that is updated manually. We pretend this is fine.
Log Compaction and Snapshotting
An append-only log grows without bound. In a long-running system, you’ll eventually run out of disk space, and replaying the entire log on startup takes longer and longer. You need to compact the log.
The standard approach is snapshotting: periodically take a snapshot of the state machine’s state, then discard all log entries up to the snapshot point.
class SnapshotManager:
state:
last_snapshot_slot: SlotNumber = 0
snapshot_interval: int = 10000 // slots between snapshots
function maybe_snapshot():
if self.commit_index - self.last_snapshot_slot >= self.snapshot_interval:
self.take_snapshot()
function take_snapshot():
// Must be atomic with respect to the state machine
snapshot = serialize(self.state_machine.state)
snapshot_metadata = {
last_included_slot: self.commit_index,
last_included_proposal: self.log[self.commit_index].proposal,
configuration: self.current_config
}
write_snapshot_to_disk(snapshot, snapshot_metadata)
fsync()
// Now we can discard old log entries
for slot in range(1, self.commit_index):
delete self.log[slot]
self.last_snapshot_slot = self.commit_index
function install_snapshot(snapshot, metadata):
// Called when a follower is too far behind and the leader
// sends its snapshot instead of individual log entries
write_snapshot_to_disk(snapshot, metadata)
self.state_machine.restore(deserialize(snapshot))
self.commit_index = metadata.last_included_slot
self.last_snapshot_slot = metadata.last_included_slot
// Discard any log entries covered by the snapshot
for slot in range(1, metadata.last_included_slot + 1):
delete self.log[slot]
Snapshotting interacts with the rest of the system in several unpleasant ways:
Follower catchup. If a follower is too far behind (its next needed log entry has been discarded), the leader must send its snapshot. This is a large transfer that can take seconds or minutes, during which the follower is unavailable.
Snapshot consistency. The snapshot must represent a consistent state machine state at a specific log position. If the state machine is large (gigabytes of data), taking a consistent snapshot while continuing to process requests requires copy-on-write semantics or a quiescent period.
Snapshot transfer. Sending a multi-gigabyte snapshot over the network is not a trivial operation. You need flow control, checksumming, resumability (what if the transfer is interrupted?), and you need to do it without starving normal replication traffic.
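A hedged sketch of what resumable, checksummed transfer can look like (the chunk size, field names, and helpers are all invented for illustration, not taken from any real system):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use e.g. 1 MiB

def chunk_snapshot(snapshot: bytes):
    """Split a snapshot into chunks, each carrying its offset and checksum."""
    for offset in range(0, len(snapshot), CHUNK_SIZE):
        data = snapshot[offset:offset + CHUNK_SIZE]
        yield {"offset": offset, "data": data,
               "checksum": hashlib.sha256(data).hexdigest()}

def receive(chunks, resume_from: int = 0) -> bytes:
    """Reassemble chunks, skipping any already received before an interruption."""
    buf = bytearray()
    for c in chunks:
        if c["offset"] < resume_from:
            continue  # already persisted by an earlier attempt
        if hashlib.sha256(c["data"]).hexdigest() != c["checksum"]:
            raise IOError("corrupt chunk at offset %d" % c["offset"])
        buf.extend(c["data"])
    return bytes(buf)
```

Because each chunk is self-describing, a follower can acknowledge the highest contiguous offset it has durably stored and ask the leader to resume from there.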
What Google Actually Built
The most instructive documentation of the gap between Paxos-the-protocol and Paxos-the-system comes from Google, which built several systems on Multi-Paxos and had the good grace to publish papers about the experience.
Chubby (2006)
Chubby is Google’s distributed lock service, built on Multi-Paxos. The Chubby paper by Mike Burrows is remarkably candid about the implementation challenges:
- They found it extremely difficult to implement Multi-Paxos from the paper alone.
- They had to invent solutions for master leases, client session management, and event notifications — none of which are part of Paxos.
- The system’s complexity was dominated by these “engineering” concerns, not by the consensus protocol.
Paxos Made Live (2007)
This paper by Chandra, Griesemer, and Redstone is perhaps the most valuable document ever written about implementing Paxos. It catalogs the challenges they encountered building Chubby’s Paxos implementation:
- Disk corruption. The paper assumes reliable persistent storage. Disks lie. Google had to add checksums to every on-disk data structure and handle corruption gracefully.
- Master leases. To serve reads without running Paxos, the master needs a lease — a time-bounded promise that no other node will become master. Implementing leases correctly requires carefully reasoned clock assumptions.
- Group membership. Changing the set of participants is, in their words, “one of the most complex and least well-understood aspects of Paxos.”
- Snapshots. They describe the snapshot mechanism as “the single hardest thing to get right” in their implementation.
- Testing. They built an elaborate testing framework including a simulated network, simulated disk, and simulated clock to test failure scenarios. Even with this, bugs slipped through.
The paper’s conclusion is worth paraphrasing: the algorithm’s correctness proof gave them confidence in the design, but the implementation challenges were enormous and mostly orthogonal to the consensus protocol itself.
Spanner (2012)
Google’s Spanner is a globally distributed database that uses Paxos for replication within each shard. Spanner’s contribution to the Paxos implementation landscape includes:
- Long-lived leaders with 10-second lease periods, enabling efficient reads.
- Pipelining of writes to achieve high throughput despite global latencies.
- TrueTime — GPS and atomic clock-based time synchronization that enables externally consistent reads without running Paxos. This is not a Paxos optimization; it’s a way to avoid Paxos for reads entirely.
Why Every Implementation Is Different
Here is a non-exhaustive list of decisions you must make when implementing Multi-Paxos, none of which are specified by the protocol:
| Decision | Options | Tradeoffs |
|---|---|---|
| Leader election trigger | Heartbeat timeout / external failure detector / client-triggered | Latency vs. complexity vs. false positives |
| Heartbeat mechanism | Separate messages / piggybacked on Accept | Message overhead vs. implementation simplicity |
| Log indexing | Contiguous / sparse with gaps | Complexity vs. flexibility |
| Follower catchup | Send individual entries / send snapshot / hybrid | Network vs. latency vs. complexity |
| Read handling | Read through Paxos / leader leases / follower reads | Consistency vs. latency vs. throughput |
| Batching | Per-entry / batched / adaptive | Latency vs. throughput |
| Pipelining depth | Fixed / adaptive / unlimited | Throughput vs. memory vs. ordering complexity |
| Disk persistence | Write-ahead log / separate state files / embedded DB | Performance vs. simplicity vs. recovery time |
| Snapshot format | Full state / incremental / log-structured | Space vs. time vs. complexity |
| Membership changes | Alpha-based / joint consensus / separate config log | Safety vs. availability vs. complexity |
| Client interaction | Leader-only / redirect / proxy | Latency vs. load balancing vs. complexity |
That’s eleven fundamental design decisions, most with three or more reasonable options. The combinatorial space is enormous, and the interactions between these choices are subtle. This is why every Multi-Paxos implementation is different, and why the phrase “we use Paxos” tells you almost nothing about what the system actually does.
Steady-State Follower
For completeness, here is the follower’s steady-state logic:
class MultiPaxosFollower:
state:
log: Map<SlotNumber, LogEntry>
commit_index: SlotNumber = 0
highest_promised: ProposalNumber = 0
current_leader: NodeId = null
last_heartbeat: Timestamp = 0
function on_accept(msg: AcceptMessage):
if msg.proposal < self.highest_promised:
// Stale leader — reject
reply(msg.from, Nack {
slot: msg.slot,
highest_promised: self.highest_promised
})
return
// Accept the entry
self.highest_promised = msg.proposal
self.log[msg.slot] = LogEntry {
slot: msg.slot,
proposal: msg.proposal,
command: msg.command,
status: ACCEPTED
}
persist(self.log[msg.slot], self.highest_promised)
reply(msg.from, Accepted {
slot: msg.slot,
proposal: msg.proposal
})
// Update commit index if leader told us
if msg.leader_commit > self.commit_index:
old_commit = self.commit_index
self.commit_index = min(msg.leader_commit,
self.highest_contiguous_slot())
for slot in range(old_commit + 1, self.commit_index + 1):
apply_to_state_machine(self.log[slot].command)
function on_heartbeat(msg: HeartbeatMessage):
if msg.proposal >= self.highest_promised:
self.current_leader = msg.from
self.last_heartbeat = now()
self.highest_promised = msg.proposal
// Heartbeats often carry commit index updates
if msg.commit_index > self.commit_index:
self.advance_commit_index(msg.commit_index)
function on_prepare(msg: PrepareMessage):
// New leader election in progress
if msg.proposal > self.highest_promised:
self.highest_promised = msg.proposal
persist(self.highest_promised)
// Send all our accepted entries
accepted_entries = {}
for slot, entry in self.log:
if entry.status == ACCEPTED:
accepted_entries[slot] = (entry.proposal, entry.command)
reply(msg.from, Promise {
proposal: msg.proposal,
accepted_entries: accepted_entries
})
else:
reply(msg.from, Nack {
proposal: msg.proposal,
highest_promised: self.highest_promised
})
function highest_contiguous_slot():
slot = 0
while (slot + 1) in self.log:
slot += 1
return slot
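The min() clamp in on_accept deserves to be seen in isolation: a follower cannot advance its commit index past a hole in its own log, no matter what commit index the leader reports. A small sketch:

```python
# A follower clamps the leader's reported commit index to its own
# highest contiguous slot: it cannot apply commands across a hole.
def clamped_commit(leader_commit: int, log_slots: set) -> int:
    slot = 0
    while slot + 1 in log_slots:
        slot += 1
    return min(leader_commit, slot)

clamped_commit(5, {1, 2, 4, 5})  # returns 2: the hole at slot 3 blocks
```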
Leader Takeover: The Messy Part
When a leader fails and a new leader takes over, the new leader must reconcile the state of all the followers. This is the most complex part of Multi-Paxos and where most bugs hide.
function leader_takeover():
new_proposal = generate_proposal_number(self.my_id, self.highest_seen)
// Phase 1: Send Prepare to all acceptors
promises = send_prepare_and_collect_majority(new_proposal)
if |promises| < quorum_size:
return // Failed — someone else is leader or network partition
// Merge all accepted entries from promises
// For each slot, take the value with the highest proposal number
merged_log = {}
max_slot = 0
for promise in promises:
for slot, (proposal, command) in promise.accepted_entries:
max_slot = max(max_slot, slot)
if slot not in merged_log or proposal > merged_log[slot].proposal:
merged_log[slot] = (proposal, command)
// Fill gaps with no-ops and re-replicate all uncertain entries
for slot in range(self.commit_index + 1, max_slot + 1):
if slot in merged_log:
command = merged_log[slot].command
else:
command = NO_OP // Fill the gap
// Run Phase 2 for this slot with the new proposal number
self.replicate_to_slot(slot, new_proposal, command)
// Now we're caught up — start normal operation
self.next_slot = max_slot + 1
self.role = Leader
self.start_heartbeats()
The no-op commands deserve special attention. When the new leader finds a gap in the log — a slot where no acceptor reports an accepted value — it must still fill that slot before it can advance the commit index past it. A no-op is a command that, when applied to the state machine, does nothing. But it must still go through the full Paxos Accept phase to be committed.
This means leader takeover latency scales with the number of uncommitted slots. If the old leader was pipelining aggressively and had 100 outstanding uncommitted entries when it failed, the new leader must re-replicate all 100 of them before it can process new client requests. This is one of the reasons that tuning the pipeline depth involves a tradeoff between throughput and recovery time.
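The merge step in leader_takeover is subtle enough to warrant a standalone sketch. Proposal numbers here are (round, node_id) pairs, an assumed encoding: for each slot, keep the value with the highest reported proposal, and fill unreported slots with no-ops:

```python
NO_OP = "NO_OP"

def merge_promises(promises: list, commit_index: int) -> dict:
    """promises: list of {slot: (proposal, command)} maps from Phase 1
    replies. Returns {slot: command} for every uncommitted slot the new
    leader must re-replicate, with no-ops filling unreported gaps."""
    merged = {}
    max_slot = commit_index
    for accepted in promises:
        for slot, (proposal, command) in accepted.items():
            max_slot = max(max_slot, slot)
            if slot not in merged or proposal > merged[slot][0]:
                merged[slot] = (proposal, command)
    return {s: merged[s][1] if s in merged else NO_OP
            for s in range(commit_index + 1, max_slot + 1)}

promises = [{2: ((3, 1), "x")}, {2: ((5, 2), "y"), 4: ((5, 2), "z")}]
merge_promises(promises, 1)
# slot 2 takes "y" (proposal (5, 2) beats (3, 1)), slot 3 becomes a no-op
```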
Pipelining in Detail
Pipelining is essential for throughput. Without it, the leader must wait for each Accept to be committed before sending the next one. With pipelining, the leader can have multiple outstanding uncommitted entries:
class PipelinedLeader:
state:
pipeline_window: int = 50 // max outstanding uncommitted entries
pending_commits: Map<SlotNumber, PendingEntry>
function handle_client_request(command: Command):
// Check pipeline backpressure
while |self.pending_commits| >= self.pipeline_window:
wait_for_any_commit()
slot = self.next_slot
self.next_slot += 1
entry = PendingEntry {
slot: slot,
command: command,
accepted_by: {self.my_id},
client_callback: current_client_callback
}
self.pending_commits[slot] = entry
// Send Accept — don't wait for response
broadcast_accept(self.proposal_number, slot, command)
function on_accepted(slot, from):
if slot in self.pending_commits:
self.pending_commits[slot].accepted_by.add(from)
if |self.pending_commits[slot].accepted_by| >= quorum_size:
self.try_advance_commit_index()
function try_advance_commit_index():
// Commit entries in order
while (self.commit_index + 1) in self.pending_commits and
|self.pending_commits[self.commit_index + 1].accepted_by| >= quorum_size:
self.commit_index += 1
entry = self.pending_commits.remove(self.commit_index)
result = apply_to_state_machine(entry.command)
entry.client_callback(result)
Note the crucial detail: entries are committed in order. Even if slot 5 gets a majority before slot 4, you cannot commit slot 5 until slot 4 is committed. The client callback for slot 5 must wait. This preserves the linearizability of the replicated log.
Batching
In practice, you combine pipelining with batching: instead of one command per Accept message, you pack multiple commands into a single message.
function batch_and_send():
// Collect commands for up to BATCH_TIMEOUT or BATCH_SIZE
batch = []
deadline = now() + BATCH_TIMEOUT // e.g., 1ms
while |batch| < MAX_BATCH_SIZE and now() < deadline:
if command = try_dequeue_request(deadline - now()):
batch.append(command)
if |batch| > 0:
slot = self.next_slot
self.next_slot += 1
broadcast_accept(self.proposal_number, slot, batch)
Batching introduces a tradeoff between latency and throughput. A 1ms batching window means the first command in each batch waits up to 1ms before being sent. But if you’re processing 10,000 commands per second, each batch contains about 10 commands, and you’ve reduced your message overhead by 10x.
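The batching arithmetic generalizes: at a steady request rate, the window size determines the mean batch size, which is also the factor by which per-message overhead shrinks. A one-function sketch:

```python
# Mean batch size for a fixed batching window at a steady request rate:
# requests arriving during the window are packed into one Accept.
def mean_batch_size(requests_per_second: float, window_ms: float) -> float:
    return requests_per_second * window_ms / 1000.0

mean_batch_size(10_000, 1.0)  # the text's example: 10 commands per batch
```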
Google’s Spanner uses batching extensively. Amazon’s DynamoDB Paxos implementation reportedly uses adaptive batching that adjusts the window based on current load.
Comparison: Multi-Paxos vs. What Google Described
| Aspect | Academic Multi-Paxos | Google’s Implementation (per “Paxos Made Live”) |
|---|---|---|
| Leader election | Unspecified | Lease-based with 10s lease period |
| Log gaps | Allowed | Filled with no-ops on leader change |
| Reads | Through Paxos | Leader leases for local reads |
| Persistence | “Write to stable storage” | Checksummed WAL with corruption detection |
| Snapshots | Not discussed | Custom snapshot format with incremental support |
| Membership | “Exercise for the reader” | Complex migration protocol, single most difficult part |
| Testing | Prove it correct | 2000+ failure injection tests |
| Performance | Analyze message complexity | Batching, pipelining, flow control, congestion avoidance |
The right column is approximately 10x more work than the left column. This is the gap.
Why Everyone Builds Something Different
In 2014, a group of researchers surveyed several Paxos implementations and found that they all differed in significant ways. The paper, “Paxos Quorum Leases,” noted that systems like Chubby, Spanner, Megastore, and ZooKeeper (which uses Zab, a Paxos variant) all made different choices about:
- How to handle reads
- How to manage leader leases
- How to compact the log
- How to handle membership changes
- How to batch and pipeline requests
The reason is simple: Multi-Paxos, as described in the literature, is not a complete system design. It is a collection of optimization ideas applied to a theoretical protocol. To build a real system, you must make dozens of engineering decisions that the theory doesn’t address, and different systems face different requirements that lead to different decisions.
This is not necessarily a problem — different systems SHOULD make different tradeoffs. But it means that “we use Multi-Paxos” is not a specification. It’s a statement of philosophy, roughly equivalent to “we use a leader-based consensus protocol inspired by Lamport’s work.” The details — which is where all the bugs and performance characteristics live — are entirely implementation-specific.
The Legacy of Multi-Paxos
Multi-Paxos is the workhorse of distributed consensus. It powers (or powered) Google’s Chubby, Spanner, Megastore, and Bigtable. It powers (in spirit) numerous other systems that use the stable-leader optimization regardless of whether they call it “Multi-Paxos.”
Its legacy is also one of frustration. Generations of engineers have stared at Lamport’s papers, understood the single-decree protocol, and then discovered that building a system requires solving a hundred problems the papers don’t address. The popularity of Raft (Chapter 8) is largely a response to this frustration — not because Raft solves a fundamentally different problem, but because Raft’s specification is a system design, not just a protocol sketch.
But Multi-Paxos also left a positive legacy: it established the blueprint that all leader-based consensus protocols follow. Stable leader, log replication, snapshotting, membership changes — these concepts, first explored (however informally) in the Multi-Paxos context, became the standard vocabulary of the field. Every protocol in the rest of Part II owes a debt to this protocol that was never fully specified but somehow became the foundation of modern distributed systems.
The gap between paper and production is not unique to Paxos. It exists for every distributed systems paper. But Paxos is where the gap was first discovered, first documented, and first lamented. And for that, we should be grateful — even as we curse it.