Multi-Paxos and the Gap Between Paper and Production
Here is a dirty secret of distributed systems: there is no paper called “Multi-Paxos.” Lamport sketched the idea in a few paragraphs at the end of “The Part-Time Parliament” and elaborated slightly in “Paxos Made Simple.” The rest — the actual protocol that every production system implements — was reverse-engineered from those sketches, from folklore passed between implementers, from conference hallway conversations, and from the painful experience of people who tried to build real systems from a description that amounted to “just run Paxos repeatedly, with a stable leader, and optimize the obvious things.”
This is the gap between paper and production. It is not a gap. It is a chasm, and the bodies of failed implementations line the bottom.
Why Single-Decree Paxos Is Not Enough
Single-decree Paxos agrees on one value. A replicated state machine needs to agree on a sequence of commands — a log of operations that every replica executes in the same order. You need consensus not once but continuously, for every operation the system processes.
The naive approach is straightforward: for the i-th entry in the log, run an independent instance of single-decree Paxos. Instance 1 agrees on the first command. Instance 2 agrees on the second command. And so on.
This works. It is also catastrophically slow:
- Each instance requires two round trips (Phase 1 and Phase 2).
- Each round trip includes at least one fsync on each acceptor.
- Instances must be sequential if commands can depend on each other.
- Every instance might face contention from competing proposers.
At, say, 1ms per round trip and 0.2ms per fsync, you’re looking at roughly 2.4ms per command in the best case. That’s about 400 operations per second. Your average laptop’s SQLite database laughs at this throughput.
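The arithmetic above can be checked directly. A quick sketch, using the text's assumed figures (1ms per round trip, 0.2ms per fsync, one fsync per round trip on the critical path):

```python
# Back-of-envelope latency for naive per-slot Paxos, using the text's
# assumed figures: 1 ms per round trip, 0.2 ms per fsync, and one
# fsync on the critical path of each round trip.
ROUND_TRIP_MS = 1.0
FSYNC_MS = 0.2
ROUND_TRIPS_PER_COMMAND = 2  # Phase 1 + Phase 2

per_command_ms = ROUND_TRIPS_PER_COMMAND * (ROUND_TRIP_MS + FSYNC_MS)
ops_per_second = 1000.0 / per_command_ms
# per_command_ms is about 2.4; ops_per_second is about 417
```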
We need to do better.
The Multi-Paxos Optimization
The key insight is that Phase 1 is only necessary to establish the proposer’s right to propose. If a proposer has already completed Phase 1 for some proposal number n and nobody has tried to preempt it, then that proposer can skip Phase 1 for subsequent instances and go directly to Phase 2.
This is the stable leader optimization, and it transforms the performance of the protocol:
- In steady state, the leader sends Accept messages and receives Accepted responses — one round trip per command.
- The leader can pipeline — sending Accept for instance i+1 before receiving responses for instance i.
- Phase 1 is only needed when the leader changes (which should be rare in a healthy system).
Here’s what steady-state Multi-Paxos looks like:
Leader Followers (Acceptors)
| F1 F2 F3
| | | |
|--- Accept(n, slot=1, cmd_A)->| | |
|--- Accept(n, slot=1, cmd_A)--------->| |
|--- Accept(n, slot=1, cmd_A)------------------>|
| | | |
|--- Accept(n, slot=2, cmd_B)->| | |
|--- Accept(n, slot=2, cmd_B)--------->| |
|--- Accept(n, slot=2, cmd_B)------------------>|
| | | |
|<-- Accepted(n, slot=1) ------| | |
|<-- Accepted(n, slot=1) --------------| |
| | | |
| (slot 1 committed — majority accepted) |
| | | |
|<-- Accepted(n, slot=2) ------| | |
|<-- Accepted(n, slot=2) --------------| |
| | | |
| (slot 2 committed — majority accepted) |
One round trip. Pipelining. This is a protocol you can actually build a database on.
The Stable Leader
The term “leader” doesn’t appear in single-decree Paxos. In Multi-Paxos, it’s the entire game. The leader is a distinguished proposer that handles all client requests and drives the protocol. Everyone else is a follower.
But how do you elect the leader? Lamport’s papers are deliberately vague — the leader election mechanism is outside the consensus protocol itself. In practice, you need:
- Failure detection: Followers must detect when the leader has failed (usually via heartbeat timeouts).
- Leader election: When the leader fails, a new leader must be elected. This is itself a consensus problem, which creates a delightful circularity. In practice, most systems use the Paxos mechanism itself: a prospective leader runs Phase 1 with a higher proposal number, which implicitly establishes it as the new leader if it succeeds.
- Leader stability: You want the leader to be stable — frequent leader changes kill performance because each change requires Phase 1.
class MultiPaxosNode:
state:
role: Leader | Follower
current_leader: NodeId = null
leader_lease_expiry: Timestamp = 0
heartbeat_interval: Duration = 100ms
election_timeout: Duration = random(300ms, 500ms)
last_heartbeat_received: Timestamp = 0
function follower_loop():
while self.role == Follower:
if now() - self.last_heartbeat_received > self.election_timeout:
// Leader seems dead — try to become leader
self.start_election()
sleep(self.heartbeat_interval)
function start_election():
new_proposal = generate_proposal_number(self.my_id, self.highest_seen)
// Run Phase 1 for ALL active log slots
promises = send_prepare_to_majority(new_proposal)
if |promises| >= quorum_size:
self.role = Leader
self.current_leader = self.my_id
self.proposal_number = new_proposal
// Process any already-accepted values from promises
self.reconcile_log(promises)
// Start sending heartbeats
self.leader_loop()
function leader_loop():
while self.role == Leader:
// Send heartbeats to maintain leadership
broadcast(Heartbeat(self.proposal_number))
// Process client requests
for request in pending_requests:
self.replicate(request)
sleep(self.heartbeat_interval)
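The pseudocode calls generate_proposal_number without defining it. One common scheme — an assumption here, not something the chapter specifies — is to encode a proposal number as a (round, node_id) pair, which makes numbers from different nodes unique and totally ordered:

```python
# Hypothetical sketch of generate_proposal_number: a proposal number is
# a (round, node_id) pair. Bumping the round past the highest number
# seen guarantees the new number exceeds anything any node has used,
# and the node id breaks ties between nodes choosing the same round.
def generate_proposal_number(my_id: int,
                             highest_seen: tuple) -> tuple:
    round_num, _ = highest_seen
    return (round_num + 1, my_id)

# Tuples compare lexicographically: round first, then node id.
```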
Steady-State Leader Operation
Here is the pseudocode for the leader’s steady-state operation — accepting a client request and replicating it:
class MultiPaxosLeader:
state:
log: Map<SlotNumber, LogEntry>
next_slot: SlotNumber = 1
commit_index: SlotNumber = 0 // highest committed slot
proposal_number: ProposalNumber
match_index: Map<NodeId, SlotNumber> // per-follower progress
function handle_client_request(command: Command) -> Result:
// Assign the command to the next log slot
slot = self.next_slot
self.next_slot += 1
entry = LogEntry {
slot: slot,
proposal: self.proposal_number,
command: command,
accepted_by: {self.my_id} // Leader accepts its own entries
}
self.log[slot] = entry
persist(self.log[slot])
// Send Accept to all followers
for follower in self.followers:
send(follower, Accept {
proposal: self.proposal_number,
slot: slot,
command: command
})
        // Wait for majority (could be async with pipelining)
        wait_until(|entry.accepted_by| >= self.quorum_size)
        // Committed! advance_commit_index applies the command to the
        // state machine in log order; applying or executing it again
        // here would run it twice
        result = self.advance_commit_index()
        return Result(success=true, response=result)
function on_accepted(slot: SlotNumber, from: NodeId):
if slot in self.log:
self.log[slot].accepted_by.add(from)
self.match_index[from] = max(self.match_index[from], slot)
if |self.log[slot].accepted_by| >= self.quorum_size:
self.advance_commit_index()
    function advance_commit_index() -> Result:
        // Advance past every contiguous slot accepted by a majority,
        // applying each command to the state machine in log order
        result = null
        while self.commit_index + 1 in self.log and
              |self.log[self.commit_index + 1].accepted_by| >= self.quorum_size:
            self.commit_index += 1
            result = apply_to_state_machine(self.log[self.commit_index].command)
        // Notify followers of new commit index
        // (often piggybacked on next Accept message)
        return result
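The commit rule above can be condensed into a runnable sketch (the names here are illustrative): a slot commits once a majority has accepted it, but the commit index only ever advances through contiguous committed slots:

```python
# Minimal model of the leader's commit rule: log maps each slot to the
# set of acceptors that have accepted it. The commit index advances
# only through contiguous slots that each have a quorum of acceptances.
def advance_commit_index(log: dict, commit_index: int, quorum: int) -> int:
    while commit_index + 1 in log and len(log[commit_index + 1]) >= quorum:
        commit_index += 1
    return commit_index

# Slot 2 has only the leader's own acceptance, so slot 3 cannot commit
# yet even though slot 3 already has a quorum.
log = {1: {"L", "F1"}, 2: {"L"}, 3: {"L", "F1", "F2"}}
advance_commit_index(log, 0, 2)  # returns 1
```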
The Log: Gaps, Holes, and Ordering
Here is where Multi-Paxos diverges significantly from Raft, and where many implementation bugs lurk.
In Multi-Paxos, the log can have gaps. If a leader assigns slot 5 but crashes before its Accept message reaches any follower, and a new leader takes over, the new leader might assign slot 5 to a different command — or might leave slot 5 empty and start from slot 6. This happens because Multi-Paxos runs independent Paxos instances per slot, and there’s no requirement that slots be filled contiguously.
This is theoretically fine but practically awful:
- Your state machine expects commands in order. You can’t apply slot 6 until slot 5 is resolved.
- A gap in the log means you need to run Paxos for that slot to determine whether a value was previously chosen, and to propose a no-op if nothing was.
- Clients might be waiting for a response to the command that was in the gap.
function new_leader_fill_gaps():
// After winning election, fill any gaps in the log
for slot in range(1, self.next_slot):
if slot not in self.log or self.log[slot].status != COMMITTED:
// Run Phase 2 for this slot
// If promises reported an accepted value, use it
// Otherwise, propose a no-op
if slot in values_from_promises:
value = values_from_promises[slot]
else:
value = NO_OP
replicate_to_slot(slot, value)
Raft (Chapter 8) avoids this entirely by requiring the log to be contiguous — no holes allowed. This is one of Raft’s genuine simplifications, and it’s worth appreciating how much complexity it eliminates.
Reconfiguration: Changing the Membership
At some point, you need to add or remove servers from your cluster. A server’s disk fails, you want to replace it. You’re scaling from three nodes to five. You’re decommissioning a data center.
This is the membership change problem, and it is — to put it charitably — under-specified in the Paxos literature. Lamport’s papers mention that the state machine can include reconfiguration commands, but the details of how to do this safely are left as an exercise for the reader. An exercise that has kept the reader up at night for decades.
The fundamental danger is this: during a reconfiguration, the old configuration and the new configuration might independently form majorities that don’t overlap. If the old configuration is {A, B, C} and the new configuration is {B, C, D}, the old majority {A, B} and the new majority {C, D} don’t overlap. Two different values could be chosen for the same slot. Consensus is violated. Your database is corrupt. Your pager is going off.
Approach 1: Alpha-Based Reconfiguration
Lamport’s approach is to have the state machine command at slot i determine the configuration used for slot i + alpha, where alpha is a parameter large enough to ensure that the reconfiguration command has been committed and applied before the new configuration takes effect.
// Configuration is determined by state machine commands
function get_configuration_for_slot(slot):
// Look at all committed reconfiguration commands
// with slot numbers <= (slot - ALPHA)
config = initial_configuration
for s in range(1, slot - ALPHA + 1):
if log[s].command is ReconfigCommand:
config = log[s].command.new_config
return config
This works but has a serious limitation: you can only process one reconfiguration command every alpha slots. If alpha is 100 (a common choice), and each slot takes 1ms, that’s one reconfiguration per 100ms. This is usually fine, but it means you can’t rapidly reconfigure in an emergency.
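A toy version of the alpha rule makes the delay concrete (the tiny alpha, slot numbers, and configurations are invented for illustration):

```python
# The alpha rule: the configuration governing slot i is determined only
# by reconfiguration commands committed at slots <= i - ALPHA.
ALPHA = 3  # deliberately tiny for illustration; the text suggests e.g. 100

def config_for_slot(slot: int, reconfigs: dict, initial: set) -> set:
    """reconfigs maps a slot number to the new config committed there."""
    config = initial
    for s in sorted(reconfigs):
        if s <= slot - ALPHA:
            config = reconfigs[s]
    return config

old, new = {"A", "B", "C"}, {"B", "C", "D"}
config_for_slot(4, {2: new}, old)  # still old: 4 - ALPHA < 2
config_for_slot(5, {2: new}, old)  # new config takes effect at slot 5
```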
Approach 2: Joint Consensus
Raft’s joint consensus approach (described in detail in Chapter 8) can be adapted for Multi-Paxos. The idea is to transition through an intermediate configuration that requires majorities from BOTH the old and new configurations.
Approach 3: The “Just Use a Configuration Log” Approach
Many practical systems (including Chubby) use a separate Paxos instance to agree on configuration changes. The configuration log is itself replicated using Paxos with a fixed (or very rarely changing) set of members. This is the “turtles all the way down” approach — your configuration is managed by consensus, which requires a configuration, which is managed by consensus…
In practice, the base configuration is typically hardcoded or stored in a file that is updated manually. We pretend this is fine.
Log Compaction and Snapshotting
An append-only log grows without bound. In a long-running system, you’ll eventually run out of disk space, and replaying the entire log on startup takes longer and longer. You need to compact the log.
The standard approach is snapshotting: periodically take a snapshot of the state machine’s state, then discard all log entries up to the snapshot point.
class SnapshotManager:
state:
last_snapshot_slot: SlotNumber = 0
snapshot_interval: int = 10000 // slots between snapshots
function maybe_snapshot():
if self.commit_index - self.last_snapshot_slot >= self.snapshot_interval:
self.take_snapshot()
function take_snapshot():
// Must be atomic with respect to the state machine
snapshot = serialize(self.state_machine.state)
snapshot_metadata = {
last_included_slot: self.commit_index,
last_included_proposal: self.log[self.commit_index].proposal,
configuration: self.current_config
}
write_snapshot_to_disk(snapshot, snapshot_metadata)
fsync()
// Now we can discard old log entries
for slot in range(1, self.commit_index):
delete self.log[slot]
self.last_snapshot_slot = self.commit_index
function install_snapshot(snapshot, metadata):
// Called when a follower is too far behind and the leader
// sends its snapshot instead of individual log entries
write_snapshot_to_disk(snapshot, metadata)
self.state_machine.restore(deserialize(snapshot))
self.commit_index = metadata.last_included_slot
self.last_snapshot_slot = metadata.last_included_slot
// Discard any log entries covered by the snapshot
for slot in range(1, metadata.last_included_slot + 1):
delete self.log[slot]
Snapshotting interacts with the rest of the system in several unpleasant ways:
Follower catchup. If a follower is too far behind (its next needed log entry has been discarded), the leader must send its snapshot. This is a large transfer that can take seconds or minutes, during which the follower is unavailable.
Snapshot consistency. The snapshot must represent a consistent state machine state at a specific log position. If the state machine is large (gigabytes of data), taking a consistent snapshot while continuing to process requests requires copy-on-write semantics or a quiescent period.
Snapshot transfer. Sending a multi-gigabyte snapshot over the network is not a trivial operation. You need flow control, checksumming, resumability (what if the transfer is interrupted?), and you need to do it without starving normal replication traffic.
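A hedged sketch of what resumable, checksummed transfer can look like (the chunk size, field names, and helpers are all invented for illustration, not taken from any real system):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use e.g. 1 MiB

def chunk_snapshot(snapshot: bytes):
    """Split a snapshot into chunks, each carrying its offset and checksum."""
    for offset in range(0, len(snapshot), CHUNK_SIZE):
        data = snapshot[offset:offset + CHUNK_SIZE]
        yield {"offset": offset, "data": data,
               "checksum": hashlib.sha256(data).hexdigest()}

def receive(chunks, resume_from: int = 0) -> bytes:
    """Reassemble chunks, skipping any already received before an interruption."""
    buf = bytearray()
    for c in chunks:
        if c["offset"] < resume_from:
            continue  # already persisted by an earlier attempt
        if hashlib.sha256(c["data"]).hexdigest() != c["checksum"]:
            raise IOError("corrupt chunk at offset %d" % c["offset"])
        buf.extend(c["data"])
    return bytes(buf)
```

Because each chunk is self-describing, a follower can acknowledge the highest contiguous offset it has durably stored and ask the leader to resume from there.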
What Google Actually Built
The most instructive documentation of the gap between Paxos-the-protocol and Paxos-the-system comes from Google, which built several systems on Multi-Paxos and had the good grace to publish papers about the experience.
Chubby (2006)
Chubby is Google’s distributed lock service, built on Multi-Paxos. The Chubby paper by Mike Burrows is remarkably candid about the implementation challenges:
- They found it extremely difficult to implement Multi-Paxos from the paper alone.
- They had to invent solutions for master leases, client session management, and event notifications — none of which are part of Paxos.
- The system’s complexity was dominated by these “engineering” concerns, not by the consensus protocol.
Paxos Made Live (2007)
This paper by Chandra, Griesemer, and Redstone is perhaps the most valuable document ever written about implementing Paxos. It catalogs the challenges they encountered building Chubby’s Paxos implementation:
- Disk corruption. The paper assumes reliable persistent storage. Disks lie. Google had to add checksums to every on-disk data structure and handle corruption gracefully.
- Master leases. To serve reads without running Paxos, the master needs a lease — a time-bounded promise that no other node will become master. Implementing leases correctly requires carefully reasoned clock assumptions.
- Group membership. Changing the set of participants is, in their words, “one of the most complex and least well-understood aspects of Paxos.”
- Snapshots. They describe the snapshot mechanism as “the single hardest thing to get right” in their implementation.
- Testing. They built an elaborate testing framework including a simulated network, simulated disk, and simulated clock to test failure scenarios. Even with this, bugs slipped through.
The paper’s conclusion is worth paraphrasing: the algorithm’s correctness proof gave them confidence in the design, but the implementation challenges were enormous and mostly orthogonal to the consensus protocol itself.
Spanner (2012)
Google’s Spanner is a globally distributed database that uses Paxos for replication within each shard. Spanner’s contribution to the Paxos implementation landscape includes:
- Long-lived leaders with 10-second lease periods, enabling efficient reads.
- Pipelining of writes to achieve high throughput despite global latencies.
- TrueTime — GPS and atomic clock-based time synchronization that enables externally consistent reads without running Paxos. This is not a Paxos optimization; it’s a way to avoid Paxos for reads entirely.
Why Every Implementation Is Different
Here is a non-exhaustive list of decisions you must make when implementing Multi-Paxos, none of which are specified by the protocol:
| Decision | Options | Tradeoffs |
|---|---|---|
| Leader election trigger | Heartbeat timeout / external failure detector / client-triggered | Latency vs. complexity vs. false positives |
| Heartbeat mechanism | Separate messages / piggybacked on Accept | Message overhead vs. implementation simplicity |
| Log indexing | Contiguous / sparse with gaps | Complexity vs. flexibility |
| Follower catchup | Send individual entries / send snapshot / hybrid | Network vs. latency vs. complexity |
| Read handling | Read through Paxos / leader leases / follower reads | Consistency vs. latency vs. throughput |
| Batching | Per-entry / batched / adaptive | Latency vs. throughput |
| Pipelining depth | Fixed / adaptive / unlimited | Throughput vs. memory vs. ordering complexity |
| Disk persistence | Write-ahead log / separate state files / embedded DB | Performance vs. simplicity vs. recovery time |
| Snapshot format | Full state / incremental / log-structured | Space vs. time vs. complexity |
| Membership changes | Alpha-based / joint consensus / separate config log | Safety vs. availability vs. complexity |
| Client interaction | Leader-only / redirect / proxy | Latency vs. load balancing vs. complexity |
That’s eleven fundamental design decisions, most with three or more reasonable options. The combinatorial space is enormous, and the interactions between these choices are subtle. This is why every Multi-Paxos implementation is different, and why the phrase “we use Paxos” tells you almost nothing about what the system actually does.
Steady-State Follower
For completeness, here is the follower’s steady-state logic:
class MultiPaxosFollower:
state:
log: Map<SlotNumber, LogEntry>
commit_index: SlotNumber = 0
highest_promised: ProposalNumber = 0
current_leader: NodeId = null
last_heartbeat: Timestamp = 0
function on_accept(msg: AcceptMessage):
if msg.proposal < self.highest_promised:
// Stale leader — reject
reply(msg.from, Nack {
slot: msg.slot,
highest_promised: self.highest_promised
})
return
// Accept the entry
self.highest_promised = msg.proposal
self.log[msg.slot] = LogEntry {
slot: msg.slot,
proposal: msg.proposal,
command: msg.command,
status: ACCEPTED
}
persist(self.log[msg.slot], self.highest_promised)
reply(msg.from, Accepted {
slot: msg.slot,
proposal: msg.proposal
})
// Update commit index if leader told us
if msg.leader_commit > self.commit_index:
old_commit = self.commit_index
self.commit_index = min(msg.leader_commit,
self.highest_contiguous_slot())
for slot in range(old_commit + 1, self.commit_index + 1):
apply_to_state_machine(self.log[slot].command)
function on_heartbeat(msg: HeartbeatMessage):
if msg.proposal >= self.highest_promised:
self.current_leader = msg.from
self.last_heartbeat = now()
self.highest_promised = msg.proposal
// Heartbeats often carry commit index updates
if msg.commit_index > self.commit_index:
self.advance_commit_index(msg.commit_index)
function on_prepare(msg: PrepareMessage):
// New leader election in progress
if msg.proposal > self.highest_promised:
self.highest_promised = msg.proposal
persist(self.highest_promised)
// Send all our accepted entries
accepted_entries = {}
for slot, entry in self.log:
if entry.status == ACCEPTED:
accepted_entries[slot] = (entry.proposal, entry.command)
reply(msg.from, Promise {
proposal: msg.proposal,
accepted_entries: accepted_entries
})
else:
reply(msg.from, Nack {
proposal: msg.proposal,
highest_promised: self.highest_promised
})
function highest_contiguous_slot():
slot = 0
while (slot + 1) in self.log:
slot += 1
return slot
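The min() clamp in on_accept deserves to be seen in isolation: a follower cannot advance its commit index past a hole in its own log, no matter what commit index the leader reports. A small sketch:

```python
# A follower clamps the leader's reported commit index to its own
# highest contiguous slot: it cannot apply commands across a hole.
def clamped_commit(leader_commit: int, log_slots: set) -> int:
    slot = 0
    while slot + 1 in log_slots:
        slot += 1
    return min(leader_commit, slot)

clamped_commit(5, {1, 2, 4, 5})  # returns 2: the hole at slot 3 blocks
```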
Leader Takeover: The Messy Part
When a leader fails and a new leader takes over, the new leader must reconcile the state of all the followers. This is the most complex part of Multi-Paxos and where most bugs hide.
function leader_takeover():
new_proposal = generate_proposal_number(self.my_id, self.highest_seen)
// Phase 1: Send Prepare to all acceptors
promises = send_prepare_and_collect_majority(new_proposal)
if |promises| < quorum_size:
return // Failed — someone else is leader or network partition
// Merge all accepted entries from promises
// For each slot, take the value with the highest proposal number
merged_log = {}
max_slot = 0
for promise in promises:
for slot, (proposal, command) in promise.accepted_entries:
max_slot = max(max_slot, slot)
if slot not in merged_log or proposal > merged_log[slot].proposal:
merged_log[slot] = (proposal, command)
// Fill gaps with no-ops and re-replicate all uncertain entries
for slot in range(self.commit_index + 1, max_slot + 1):
if slot in merged_log:
command = merged_log[slot].command
else:
command = NO_OP // Fill the gap
// Run Phase 2 for this slot with the new proposal number
self.replicate_to_slot(slot, new_proposal, command)
// Now we're caught up — start normal operation
self.next_slot = max_slot + 1
self.role = Leader
self.start_heartbeats()
The no-op commands deserve special attention. When the new leader finds a gap in the log — a slot where no acceptor reports an accepted value — it must still fill that slot before it can advance the commit index past it. A no-op is a command that, when applied to the state machine, does nothing. But it must still go through the full Paxos Accept phase to be committed.
This means leader takeover latency scales with the number of uncommitted slots. If the old leader was pipelining aggressively and had 100 outstanding uncommitted entries when it failed, the new leader must re-replicate all 100 of them before it can process new client requests. This is one of the reasons that tuning the pipeline depth involves a tradeoff between throughput and recovery time.
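The merge step in leader_takeover is subtle enough to warrant a standalone sketch. Proposal numbers here are (round, node_id) pairs, an assumed encoding: for each slot, keep the value with the highest reported proposal, and fill unreported slots with no-ops:

```python
NO_OP = "NO_OP"

def merge_promises(promises: list, commit_index: int) -> dict:
    """promises: list of {slot: (proposal, command)} maps from Phase 1
    replies. Returns {slot: command} for every uncommitted slot the new
    leader must re-replicate, with no-ops filling unreported gaps."""
    merged = {}
    max_slot = commit_index
    for accepted in promises:
        for slot, (proposal, command) in accepted.items():
            max_slot = max(max_slot, slot)
            if slot not in merged or proposal > merged[slot][0]:
                merged[slot] = (proposal, command)
    return {s: merged[s][1] if s in merged else NO_OP
            for s in range(commit_index + 1, max_slot + 1)}

promises = [{2: ((3, 1), "x")}, {2: ((5, 2), "y"), 4: ((5, 2), "z")}]
merge_promises(promises, 1)
# slot 2 takes "y" (proposal (5, 2) beats (3, 1)), slot 3 becomes a no-op
```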
Pipelining in Detail
Pipelining is essential for throughput. Without it, the leader must wait for each Accept to be committed before sending the next one. With pipelining, the leader can have multiple outstanding uncommitted entries:
class PipelinedLeader:
state:
pipeline_window: int = 50 // max outstanding uncommitted entries
pending_commits: Map<SlotNumber, PendingEntry>
function handle_client_request(command: Command):
// Check pipeline backpressure
while |self.pending_commits| >= self.pipeline_window:
wait_for_any_commit()
slot = self.next_slot
self.next_slot += 1
entry = PendingEntry {
slot: slot,
command: command,
accepted_by: {self.my_id},
client_callback: current_client_callback
}
self.pending_commits[slot] = entry
// Send Accept — don't wait for response
broadcast_accept(self.proposal_number, slot, command)
function on_accepted(slot, from):
if slot in self.pending_commits:
self.pending_commits[slot].accepted_by.add(from)
if |self.pending_commits[slot].accepted_by| >= quorum_size:
self.try_advance_commit_index()
function try_advance_commit_index():
// Commit entries in order
while (self.commit_index + 1) in self.pending_commits and
|self.pending_commits[self.commit_index + 1].accepted_by| >= quorum_size:
self.commit_index += 1
entry = self.pending_commits.remove(self.commit_index)
result = apply_to_state_machine(entry.command)
entry.client_callback(result)
Note the crucial detail: entries are committed in order. Even if slot 5 gets a majority before slot 4, you cannot commit slot 5 until slot 4 is committed. The client callback for slot 5 must wait. This preserves the linearizability of the replicated log.
Batching
In practice, you combine pipelining with batching: instead of one command per Accept message, you pack multiple commands into a single message.
function batch_and_send():
// Collect commands for up to BATCH_TIMEOUT or BATCH_SIZE
batch = []
deadline = now() + BATCH_TIMEOUT // e.g., 1ms
while |batch| < MAX_BATCH_SIZE and now() < deadline:
if command = try_dequeue_request(deadline - now()):
batch.append(command)
if |batch| > 0:
slot = self.next_slot
self.next_slot += 1
broadcast_accept(self.proposal_number, slot, batch)
Batching introduces a tradeoff between latency and throughput. A 1ms batching window means the first command in each batch waits up to 1ms before being sent. But if you’re processing 10,000 commands per second, each batch contains about 10 commands, and you’ve reduced your message overhead by 10x.
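The batching arithmetic generalizes: at a steady request rate, the window size determines the mean batch size, which is also the factor by which per-message overhead shrinks. A one-function sketch:

```python
# Mean batch size for a fixed batching window at a steady request rate:
# requests arriving during the window are packed into one Accept.
def mean_batch_size(requests_per_second: float, window_ms: float) -> float:
    return requests_per_second * window_ms / 1000.0

mean_batch_size(10_000, 1.0)  # the text's example: 10 commands per batch
```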
Google’s Spanner uses batching extensively. Amazon’s DynamoDB Paxos implementation reportedly uses adaptive batching that adjusts the window based on current load.
Comparison: Multi-Paxos vs. What Google Described
| Aspect | Academic Multi-Paxos | Google’s Implementation (per “Paxos Made Live”) |
|---|---|---|
| Leader election | Unspecified | Lease-based with 10s lease period |
| Log gaps | Allowed | Filled with no-ops on leader change |
| Reads | Through Paxos | Leader leases for local reads |
| Persistence | “Write to stable storage” | Checksummed WAL with corruption detection |
| Snapshots | Not discussed | Custom snapshot format with incremental support |
| Membership | “Exercise for the reader” | Complex migration protocol, single most difficult part |
| Testing | Prove it correct | 2000+ failure injection tests |
| Performance | Analyze message complexity | Batching, pipelining, flow control, congestion avoidance |
The right column is approximately 10x more work than the left column. This is the gap.
Why Everyone Builds Something Different
In 2014, a group of researchers surveyed several Paxos implementations and found that they all differed in significant ways. The paper, “Paxos Quorum Leases,” noted that systems like Chubby, Spanner, Megastore, and ZooKeeper (which uses Zab, a Paxos variant) all made different choices about:
- How to handle reads
- How to manage leader leases
- How to compact the log
- How to handle membership changes
- How to batch and pipeline requests
The reason is simple: Multi-Paxos, as described in the literature, is not a complete system design. It is a collection of optimization ideas applied to a theoretical protocol. To build a real system, you must make dozens of engineering decisions that the theory doesn’t address, and different systems face different requirements that lead to different decisions.
This is not necessarily a problem — different systems SHOULD make different tradeoffs. But it means that “we use Multi-Paxos” is not a specification. It’s a statement of philosophy, roughly equivalent to “we use a leader-based consensus protocol inspired by Lamport’s work.” The details — which is where all the bugs and performance characteristics live — are entirely implementation-specific.
The Legacy of Multi-Paxos
Multi-Paxos is the workhorse of distributed consensus. It powers (or powered) Google’s Chubby, Spanner, Megastore, and Bigtable. It powers (in spirit) numerous other systems that use the stable-leader optimization regardless of whether they call it “Multi-Paxos.”
Its legacy is also one of frustration. Generations of engineers have stared at Lamport’s papers, understood the single-decree protocol, and then discovered that building a system requires solving a hundred problems the papers don’t address. The popularity of Raft (Chapter 8) is largely a response to this frustration — not because Raft solves a fundamentally different problem, but because Raft’s specification is a system design, not just a protocol sketch.
But Multi-Paxos also left a positive legacy: it established the blueprint that all leader-based consensus protocols follow. Stable leader, log replication, snapshotting, membership changes — these concepts, first explored (however informally) in the Multi-Paxos context, became the standard vocabulary of the field. Every protocol in the rest of Part II owes a debt to this protocol that was never fully specified but somehow became the foundation of modern distributed systems.
The gap between paper and production is not unique to Paxos. It exists for every distributed systems paper. But Paxos is where the gap was first discovered, first documented, and first lamented. And for that, we should be grateful — even as we curse it.