Tradeoff Matrix: Latency, Throughput, Fault Tolerance
Every few months, someone publishes a blog post titled something like “Consensus Algorithm Comparison” that contains a table with five columns and six rows, each cell filled with a confident one-word summary. “Fast.” “Slow.” “Complex.” These tables are worse than useless — they give you enough information to feel informed while leaving out everything that actually matters.
This chapter is our attempt to build the comparison table we wish we’d had. It will be large. It will have footnotes. Some cells will contain uncomfortable phrases like “it depends” and “the paper doesn’t say.” That’s because honest comparison is messy, and anyone who tells you otherwise is either selling something or hasn’t built anything.
The Protocols Under Comparison
We’re comparing eleven protocols (or protocol families) that have appeared throughout this book:
- Paxos — Single-decree, Lamport’s original
- Multi-Paxos — The practical extension everyone actually means when they say “Paxos”
- Raft — The understandable one (allegedly)
- Zab — ZooKeeper’s protocol
- Viewstamped Replication (VR) — The one nobody reads
- PBFT — Castro and Liskov’s Byzantine workhorse
- HotStuff — Linear BFT with rotating leaders
- Tendermint — BFT for blockchains (and beyond)
- EPaxos — Leaderless when it can be
- Flexible Paxos — Paxos with relaxed quorums
- Kafka ISR — Not quite consensus, but close enough for many
A few caveats before we begin. Paxos and Multi-Paxos are algorithm families, not single implementations — there are dozens of variants, and the performance characteristics depend heavily on which variant you pick. EPaxos has been refined several times since the original paper. Kafka ISR isn’t a consensus algorithm in the formal sense, but it solves a sufficiently similar problem that excluding it would be dishonest. And any comparison that includes both crash-fault-tolerant and Byzantine-fault-tolerant protocols is inherently comparing apples to slightly paranoid oranges.
With those disclaimers lodged, let’s proceed.
Dimension 1: Message Complexity (Normal Case)
Message complexity tells you how many messages need to be exchanged to commit a single operation in the common, happy-path case — no failures, no leader changes, the sun is shining and your network is behaving.
| Protocol | Messages (Normal Case) | Message Pattern | Notes |
|---|---|---|---|
| Paxos (single-decree) | 2n (Prepare + Accept) | 2 round-trips, leader to all | First round can be skipped if leader is stable (Multi-Paxos optimization) |
| Multi-Paxos (steady state) | n | 1 round-trip, leader to all | Prepare phase amortized away after leader election |
| Raft | n | 1 round-trip, leader to all | AppendEntries + responses |
| Zab | n | 1 round-trip, leader to all | Propose + Ack in broadcast phase |
| VR | n | 1 round-trip, primary to all | Prepare + PrepareOK |
| PBFT | ~n^2 | 3 phases, all-to-all in commit | Pre-prepare + Prepare (n msgs each) + Commit (n msgs each) |
| HotStuff | n per phase, 3 phases | Linear, leader collects votes | 3 round-trips in basic; pipelined reduces effective latency |
| Tendermint | ~n^2 | Propose + Prevote + Precommit | Gossip-based, all-to-all in vote steps |
| EPaxos | n (fast path), 2n (slow path) | 1 or 2 round-trips | Fast path if no conflicts; slow path on dependency conflicts |
| Flexible Paxos | n (steady state) | Same as Multi-Paxos | Smaller quorums possible, so “n” may be smaller than majority |
| Kafka ISR | ISR size | 1 round-trip, leader to ISR | Only replicates to in-sync replicas, not all brokers |
The thing to notice immediately is the gap between CFT and BFT protocols. The crash-fault-tolerant protocols all converge on roughly n messages in steady state (one round-trip from leader to replicas). The BFT protocols pay a tax of either n^2 messages or additional round-trips. This is not a coincidence — it’s the fundamental cost of not trusting each other.
EPaxos deserves special attention. Its fast path of n messages with a single round-trip is genuinely impressive, but it requires a “fast-path quorum” that’s larger than a simple majority (2f of the 2f+1 replicas in the original protocol, which for five nodes means four out of five must respond in agreement; a later optimization shrinks this to f + ⌊(f+1)/2⌋). The fast path also only works when there are no conflicting commands. In workloads with high contention, EPaxos degrades to its slow path more often than the paper’s evaluation section might lead you to believe.
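To make the quorum arithmetic concrete, here’s a quick sketch (the `quorum_sizes` helper is our own, not from any implementation; it uses the original EPaxos fast-quorum size of 2f):

```python
def quorum_sizes(f: int) -> dict:
    """Quorum sizes for a cluster sized to tolerate f crash faults.

    Classic (Multi-)Paxos and Raft use n = 2f + 1 replicas and majority
    quorums; the original EPaxos fast path needs 2f replicas to agree
    (the slow path falls back to a simple majority).
    """
    n = 2 * f + 1
    return {
        "replicas": n,
        "majority": f + 1,      # classic Paxos/Raft quorum
        "epaxos_fast": 2 * f,   # original EPaxos fast-path quorum
    }

# For f = 2 (five replicas), the fast path needs four of five replicas
# to agree, while ordinary consensus needs only three.
sizes = quorum_sizes(2)
```

Note how the gap between the fast quorum and the majority grows with f: the bigger the cluster, the harder it is to hit the fast path.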
What “Message Complexity” Actually Means in Practice
Here’s where we need to be honest about what these numbers don’t tell you.
Message complexity counts messages. It does not count bytes. A Raft AppendEntries RPC carrying a 4KB state machine command is one message, and a Raft AppendEntries RPC carrying a 4MB batch of commands is also one message, but your network disagrees that these are equivalent.
Message complexity also doesn’t account for persistence. Every protocol on this list (except arguably some BFT protocols with sufficiently many replicas) requires writing to stable storage before responding. That fsync call is almost always the latency bottleneck, not the network hop. A protocol with 2n messages but one persistence point may well be faster than a protocol with n messages but two persistence points.
Finally, message complexity ignores batching. Every production implementation batches multiple client requests into a single consensus round. Multi-Paxos with batching and Multi-Paxos without batching have the same message complexity but wildly different throughput. We’ll come back to this in the throughput section.
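The amortization is worth seeing as arithmetic. A toy model (our own, assuming one batch commits per consensus round and that a round costs roughly n messages):

```python
def messages_per_request(n: int, batch_size: int) -> float:
    """Steady-state messages per client request for a leader-based
    protocol that commits one batch of requests per consensus round.

    One round costs roughly n messages (leader to followers plus acks),
    amortized across every request in the batch.
    """
    round_cost = n  # one append plus one ack per follower, roughly
    return round_cost / batch_size

# Five replicas: unbatched, each request pays ~5 messages; with a
# batch of 100, ~0.05. Same message complexity, very different
# throughput.
unbatched = messages_per_request(5, 1)
batched = messages_per_request(5, 100)
```

This is why “message complexity per operation” quietly becomes “message complexity per batch” in every production system.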
Dimension 2: Message Complexity (Leader Change / View Change)
The sunny-day numbers are nice, but distributed systems live in the rain. What happens when a leader fails?
| Protocol | Leader Change Messages | Leader Change Latency | Notes |
|---|---|---|---|
| Paxos | 2n (new Prepare phase) | 1 additional round-trip | Re-runs Phase 1; conceptually simple |
| Multi-Paxos | 2n + recovery | 1 round-trip + catch-up | Must discover and fill any gaps in the log |
| Raft | n (RequestVote) + catch-up | 1 round-trip + election timeout | Randomized timeouts to prevent split votes |
| Zab | O(n * log_size) | Multiple round-trips | Discovery + Synchronization + Broadcast phases; can be expensive |
| VR | O(n^2) | 2+ round-trips | View change messages contain full logs; expensive |
| PBFT | O(n^2) | Multiple round-trips | View change requires 2f+1 view-change messages with proofs |
| HotStuff | O(n) | 1 round-trip for pacemaker | Designed for frequent leader rotation; view change is cheap |
| Tendermint | O(n^2) | Round timeout | Validators just move to next round/proposer |
| EPaxos | N/A (leaderless) | N/A | No leader to fail; but recovery of in-progress instances is complex |
| Flexible Paxos | Same as Multi-Paxos | Same as Multi-Paxos | Quorum flexibility doesn’t change recovery |
| Kafka ISR | O(ISR size) | Controller detects + shrinks ISR | Controller elects new partition leader from ISR |
This is where the protocols diverge dramatically.
HotStuff’s selling point is right here — it was specifically designed so that leader rotation (and by extension, leader failure handling) is O(n) rather than O(n^2). This matters enormously in BFT settings where you might want to rotate leaders frequently to limit a malicious leader’s ability to cause damage.
Zab’s view change deserves its reputation for complexity. The discovery phase requires the new leader to contact all followers, determine who has the most up-to-date state, synchronize that state to a quorum, and only then begin accepting new proposals. In a cluster with significant state divergence (which happens when the old leader was partitioned while still accepting writes that some followers received and others didn’t), this process can take a disturbing amount of time.
VR’s view change is similarly expensive because view-change messages historically carry full logs (though practical implementations obviously optimize this). The theoretical message complexity of O(n^2) comes from every replica needing to send its state to every other replica, though in practice you only need the new primary to collect f+1 view-change messages.
EPaxos sidesteps leader election entirely, which is wonderful until you need to recover an in-progress command instance. The recovery protocol requires running an explicit recovery phase that resembles Paxos Phase 1, and if you’re unlucky enough to have multiple concurrent recoveries for dependent commands, you can end up in a situation that will make you appreciate why leaders exist.
Kafka’s approach is characteristically pragmatic — the controller (which is itself elected, previously via ZooKeeper, now via KRaft) simply picks a new leader from the ISR. If the ISR is empty, you have bigger problems, and Kafka lets you choose between unavailability and data loss via the unclean.leader.election.enable configuration flag. That’s not theoretical elegance, but it is honest.
Dimension 3: Latency (Message Delays to Commit)
Latency measures how many sequential one-way message delays (network hops) a client must wait before its request is committed. This is distinct from message complexity — a protocol could send 100 messages in parallel (high message complexity) but complete in one round-trip (low latency). One counting caveat for the table below: the CFT rows count only the consensus exchanges themselves (the client’s request and the final reply are excluded), while the BFT rows follow their papers in counting from client request to client reply. Add two hops to the CFT numbers for the client’s-eye view.
| Protocol | Message Delays (Normal Case) | Message Delays (With Leader) | Notes |
|---|---|---|---|
| Paxos | 4 (Prepare/Promise + Accept/Accepted) | 2 if leader is pre-elected | Single-decree; each decree is independent |
| Multi-Paxos | 2 | 2 | Leader→quorum→leader; client request and reply excluded |
| Raft | 2 | 2 | Same as Multi-Paxos in steady state |
| Zab | 2 | 2 | Same pattern |
| VR | 2 | 2 | Same pattern |
| PBFT | 5 | 5 | Client→primary, pre-prepare, prepare, commit, reply |
| HotStuff (basic) | 7 | 7 | Client + 3 phases of leader↔replica + reply |
| HotStuff (pipelined) | 7 (amortized ~2) | 7 (amortized ~2) | Pipelining overlaps phases of different commands |
| Tendermint | 4 | 4 | Propose, Prevote, Precommit, Commit |
| EPaxos (fast path) | 2 | N/A (leaderless) | Command leader→fast quorum→command leader; client hops excluded |
| EPaxos (slow path) | 4 | N/A | Additional Accept phase on conflict |
| Flexible Paxos | 2 | 2 | Same as Multi-Paxos |
| Kafka ISR | 2 | 2 | Producer→leader, replicate to ISR, ack |
For CFT protocols, the answer is boringly uniform: two message delays in steady state. The leader replicates to a quorum, the quorum acknowledges, and the leader responds to the client (the client’s request and the reply add two more hops end to end). Everyone has converged on this because it’s the theoretical minimum for fault-tolerant replication with a stable leader.
The BFT protocols show more variation, and this is one of the genuine tradeoffs between PBFT, HotStuff, and Tendermint. PBFT’s five message delays are a consequence of its three-phase protocol (pre-prepare, prepare, commit). HotStuff basic pays seven delays for its successive vote-collection phases, each of which forms a quorum certificate, but pipelining effectively amortizes this to two delays per committed decision (at the cost of increased latency for any individual decision — it takes longer for your specific command to commit, but the system commits one command per round-trip once the pipeline is full). Tendermint sits in between at four delays.
The Persistence Tax
Every one of these latency numbers assumes negligible disk write time. In reality, the fsync (or fdatasync, if your implementation is sophisticated enough to know the difference) at each persistence point adds anywhere from 0.1ms (NVMe SSD) to 10ms (spinning disk) to 50ms+ (cloud storage with “durable” semantics that actually mean “we’ll get around to it”).
Raft implementations typically require two persistence points per commit: one when the leader writes the entry to its own log, and one when each follower writes the entry to its log. But the leader can send AppendEntries and persist in parallel (this is an optimization that the Raft paper mentions in passing and that every production implementation uses, but which trips up everyone building Raft for the first time).
The bottom line: in a data center with modern SSDs, cross-node network latency (0.1-0.5ms per hop) and disk persistence latency (~0.1ms per fsync) are in the same ballpark. Over a WAN, network latency dominates. On spinning disks, persistence dominates. Your protocol choice matters less than your hardware in the typical case.
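That bottom line can be captured in a back-of-envelope model (our own; the hop and fsync figures are the illustrative numbers from above, not measurements):

```python
def commit_latency_ms(hops: int, hop_ms: float,
                      fsyncs: int, fsync_ms: float) -> float:
    """Back-of-envelope commit latency: sequential network hops plus
    sequential persistence points. Ignores queuing, batching, and the
    leader/follower fsync overlap discussed above."""
    return hops * hop_ms + fsyncs * fsync_ms

# Data center + NVMe: network and disk are in the same ballpark.
lan_nvme = commit_latency_ms(hops=2, hop_ms=0.25, fsyncs=2, fsync_ms=0.1)
# Same protocol over a WAN: the network dominates completely.
wan_nvme = commit_latency_ms(hops=2, hop_ms=40.0, fsyncs=2, fsync_ms=0.1)
```

Run the same model with spinning-disk fsync times (~10ms) and the conclusion flips again: persistence dominates. The protocol term in this equation is usually the smallest one.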
Dimension 4: Throughput Characteristics
Throughput is where the comparison gets genuinely complicated, because throughput depends on factors that are mostly orthogonal to the consensus protocol itself.
| Protocol | Theoretical Throughput Limit | Practical Bottleneck | Batching Potential |
|---|---|---|---|
| Paxos | Low (per-decree overhead) | Per-operation Prepare phase | Low without Multi-Paxos |
| Multi-Paxos | Leader network bandwidth | Leader CPU + NIC | High (batch in Accept) |
| Raft | Leader network bandwidth | Leader CPU + NIC | High (batch in AppendEntries) |
| Zab | Leader network bandwidth | Leader CPU + NIC | High (batch in proposals) |
| VR | Leader network bandwidth | Leader CPU + NIC | High |
| PBFT | All-node network bandwidth | n^2 messages per commit | Moderate (batch in pre-prepare) |
| HotStuff | Leader network bandwidth | Leader CPU (aggregating votes) | High |
| Tendermint | All-node network bandwidth | Gossip overhead + n^2 | Moderate (batch in block) |
| EPaxos | Aggregate cluster bandwidth | Dependency tracking overhead | High per-replica |
| Flexible Paxos | Leader network bandwidth | Same as Multi-Paxos | High |
| Kafka ISR | Leader network bandwidth | Disk I/O (by design) | Very high (batch-optimized) |
The leader-based protocols (Multi-Paxos, Raft, Zab, VR, Flexible Paxos, Kafka ISR) all share the same fundamental throughput limitation: the leader is a bottleneck. Every write goes through the leader. The leader must receive the client request, append it to its log, send it to all followers, wait for a quorum of acknowledgments, and respond to the client. The leader’s NIC, CPU, and disk are the ceiling.
This is the main theoretical advantage of EPaxos — by eliminating the leader, it can distribute load across all replicas. If you have five nodes and the workload has no conflicts, each node can independently process roughly 1/5 of the requests, giving you approximately 5x the throughput of a leader-based protocol. The original EPaxos paper demonstrates this convincingly on conflict-free workloads.
But the moment you introduce conflicts (which you will, because most real workloads have hot keys), EPaxos must run its slower path, which includes additional coordination. The dependency tracking logic also adds CPU overhead per operation. In benchmarks with realistic conflict rates (5-25%), EPaxos’s throughput advantage shrinks considerably, and in high-conflict workloads it can actually perform worse than Raft because the slow path is more expensive than just going through a leader.
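The degradation is easy to model. A sketch using the fast-path and slow-path delay counts from the latency table, under the simplifying assumption that a fixed fraction of commands conflicts (real behavior also depends on dependency-graph shape and recovery traffic):

```python
def epaxos_expected_delays(conflict_rate: float,
                           fast_delays: int = 2,
                           slow_delays: int = 4) -> float:
    """Expected message delays per command when a `conflict_rate`
    fraction of commands takes the slow path."""
    return (1 - conflict_rate) * fast_delays + conflict_rate * slow_delays

# Conflict-free workload: pure fast path. At 25% conflicts the
# average is already a quarter of the way to all-slow-path.
best = epaxos_expected_delays(0.0)
typical = epaxos_expected_delays(0.25)
```

The CPU cost of dependency tracking isn’t in this model, which is part of why the measured gap is even smaller than the expected-delay arithmetic suggests.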
For the BFT protocols, the throughput picture is dominated by the O(n^2) message overhead. PBFT with n = 4 (the minimum for f = 1) sends roughly 16 messages per commit. At n = 100, that’s roughly 10,000 messages per commit. This is why BFT protocols are typically deployed with small replica counts (4, 7, maybe 13) and why the HotStuff paper’s reduction to O(n) messages was genuinely significant.
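The arithmetic behind those counts, as a sketch (the quadratic figure is the rough approximation used above, not an exact PBFT message count):

```python
def bft_quadratic_messages(n: int) -> int:
    """The ~n^2 approximation from the text: all-to-all vote rounds
    dominate, so messages per commit grow quadratically."""
    return n * n

def hotstuff_messages(n: int, phases: int = 3) -> int:
    """HotStuff's linear pattern: each phase is leader->replicas plus
    replicas->leader, roughly 2n messages, times the phase count."""
    return phases * 2 * n

# At n = 100, the quadratic pattern sends ~10,000 messages per commit;
# HotStuff's linear pattern sends ~600.
quadratic = bft_quadratic_messages(100)
linear = hotstuff_messages(100)
```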
Kafka ISR deserves special mention for throughput because its entire architecture is optimized for it. Zero-copy reads, sequential disk I/O, batching at every layer, page cache utilization — Kafka’s throughput numbers are impressive not because the ISR protocol is theoretically superior, but because the implementation has been ruthlessly optimized for the append-only, batched-writes, sequential-reads use case. This is an important lesson: protocol-level message complexity matters less than implementation-level engineering for throughput.
What the Papers Claim vs. What You Actually Get
Let’s talk about performance numbers in academic papers.
Here’s a rough calibration table, assembled from reading too many evaluation sections at 2 AM:
| Protocol | Paper Claims | Reality | Why the Gap |
|---|---|---|---|
| Multi-Paxos | ~100K ops/sec | 10K-100K ops/sec | Papers use small payloads, RAM-only “persistence,” ideal networks |
| Raft | ~20K-100K ops/sec | 5K-50K ops/sec | etcd tops out around 10-20K writes/sec in practice |
| EPaxos | 2-5x Paxos throughput | 1-2x in practice | Conflict rates in real workloads reduce fast-path utilization |
| PBFT | ~10-40K ops/sec | 1K-10K ops/sec | Papers benchmark with 4 nodes, tiny payloads, localhost |
| HotStuff | “Near Raft throughput with BFT” | 50-70% of Raft | Threshold signature aggregation has real CPU cost |
| Tendermint | ~1000 TPS (blockchain mode) | 100-1000 TPS | Block size, gossip overhead, application-level validation |
| Kafka ISR | Millions of msgs/sec | Hundreds of thousands/sec per partition | Kafka’s numbers are aggregate across partitions; single partition is bottlenecked |
The gap exists for several reasons, all of which are understandable but rarely acknowledged:
Paper benchmarks use favorable conditions. Small payloads (often 0 bytes or 16 bytes), no application-level processing, networks with minimal jitter, and usually the minimum number of nodes. These conditions are valid for measuring the protocol overhead, but they’re not your production environment.
Paper benchmarks often skip persistence. The phrase “we configure the system to batch sync writes to disk every 10ms” appears in more evaluation sections than anyone wants to admit. That’s not durable consensus — that’s consensus with a 10ms window of data loss. When you add real fsync on every commit (which you must, for correctness), throughput drops substantially.
Paper benchmarks run on dedicated hardware. Your consensus nodes are probably sharing machines with seventeen other services, running in containers, on a network shared with the analytics team’s nightly Spark jobs. The variability alone kills best-case numbers.
Paper benchmarks don’t measure tail latency. Mean latency at 50% load tells you nothing about p99 latency at 90% load, which is the number that actually determines your SLA.
Dimension 5: Fault Tolerance
| Protocol | Fault Model | Nodes for f Faults | Tolerates Byzantine? | Tolerates Omission? | Tolerates Partition? |
|---|---|---|---|---|---|
| Paxos | Crash | 2f+1 | No | Yes (crash-stop) | Minority partition |
| Multi-Paxos | Crash | 2f+1 | No | Yes | Minority partition |
| Raft | Crash | 2f+1 | No | Yes | Minority partition |
| Zab | Crash | 2f+1 | No | Yes | Minority partition |
| VR | Crash | 2f+1 | No | Yes | Minority partition |
| PBFT | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| HotStuff | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| Tendermint | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| EPaxos | Crash | 2f+1 | No | Yes | Minority partition |
| Flexible Paxos | Crash | Varies | No | Yes | Depends on quorum config |
| Kafka ISR | Crash (tunable) | min.insync.replicas | No | Yes | ISR-dependent |
The fault tolerance story is straightforward in theory and a mess in practice.
All CFT protocols tolerate f crash faults with 2f+1 nodes. All BFT protocols tolerate f Byzantine faults with 3f+1 nodes. If you stopped here, you’d think this was simple.
It isn’t, because “crash fault” is a model, not a reality. Real failures include:
- Disk corruption without node crash — your node is running but serving garbage data. CFT protocols assume this doesn’t happen.
- Clock skew — Raft’s leader election depends on timeouts; extreme clock skew can cause unnecessary elections or, worse, split-brain in poorly implemented variants.
- Network asymmetry — node A can reach B but not C, while B can reach both A and C. Most protocols assume symmetric partitions.
- Slow nodes — not crashed, not Byzantine, just slow enough to be useless. This falls through the cracks of both fault models.
- Correlated failures — the assumption of independent failures underlies the f-out-of-2f+1 math. When all your nodes are in the same availability zone and AWS has a bad day, independence goes out the window.
Flexible Paxos is interesting in the fault tolerance dimension because it lets you tune the tradeoff. Safety requires that every Phase 1 quorum intersect every Phase 2 quorum, which means their sizes must sum to more than n. So if you set a Phase 1 quorum of n and a Phase 2 quorum of 1, you get amazingly fast writes (only one replica must acknowledge) but terrible leader election (must contact all replicas). The failure tolerance depends on your quorum choices, and choosing wrong means silent data loss. This flexibility is powerful and terrifying in equal measure.
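The intersection condition is small enough to state as code. A sketch of the check (our own helper; it encodes the requirement that Phase 1 and Phase 2 quorum sizes sum to more than n):

```python
def quorums_intersect(n: int, q1: int, q2: int) -> bool:
    """Flexible Paxos safety condition: every Phase 1 quorum must
    overlap every Phase 2 quorum, which holds iff q1 + q2 > n."""
    return q1 + q2 > n

# n = 5: majorities work, and so does the lopsided "fast writes,
# slow elections" configuration -- but shave one replica off the
# Phase 1 quorum and safety silently evaporates.
assert quorums_intersect(5, 3, 3)      # classic majorities
assert quorums_intersect(5, 5, 1)      # write to 1, elect from all
assert not quorums_intersect(5, 4, 1)  # looks plausible, loses data
```

The terrifying part is the last line: a configuration that is off by one replica doesn’t fail loudly, it just stops being consensus.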
Kafka ISR is fault-tolerant in a way that makes theorists uncomfortable. The ISR can shrink to a single replica (the leader), at which point you have zero fault tolerance but remain available. Whether this is acceptable depends on your min.insync.replicas setting and your willingness to trade safety for availability. In practice, many Kafka deployments run with min.insync.replicas=2 and replication.factor=3, giving them crash tolerance of one broker — equivalent to f=1 with 2f+1=3 nodes, just with a less formal proof.
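The arithmetic behind that equivalence, as a sketch (our own helper, not Kafka code; it assumes producers use acks=all so a write is acknowledged only once min.insync.replicas brokers have it):

```python
def kafka_fault_margin(replication_factor: int,
                       min_insync_replicas: int) -> dict:
    """How many broker failures a topic absorbs, assuming acks=all.

    Writes stay available while the ISR has at least
    min.insync.replicas members, and an acknowledged write exists on
    at least min.insync.replicas brokers.
    """
    return {
        # brokers that can fail while writes remain possible
        "failures_while_writable": replication_factor - min_insync_replicas,
        # brokers that can fail without losing an acknowledged write
        "failures_without_data_loss": min_insync_replicas - 1,
    }

# The common RF=3, min.insync.replicas=2 deployment: one broker
# failure tolerated on both axes -- the informal f=1.
margin = kafka_fault_margin(3, 2)
```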
The Jepsen Reality Check
No discussion of fault tolerance claims is complete without mentioning Jepsen, Kyle Kingsbury’s testing framework that has found consistency violations in nearly every distributed system it has tested. Here’s a partial scorecard relevant to our protocols:
| System | Jepsen Findings | Status |
|---|---|---|
| etcd (Raft) | Stale reads under certain configurations; lease-related issues | Fixed in subsequent releases |
| ZooKeeper (Zab) | Issues under network partitions with specific timing | Mostly fixed; some edge cases remain |
| CockroachDB (Raft) | Serializability violations under clock skew | Fixed; improved clock skew handling |
| MongoDB (custom) | Significant data loss during rollback; stale reads | Multiple rounds of fixes |
| Kafka | Under-replicated partitions can lose data; exactly-once edge cases | Configuration-dependent; improved over releases |
| Redis (Sentinel/Cluster) | Split-brain, data loss during failover | Fundamental design limitations |
| Consul (Raft) | Stale reads under default configuration | Configuration clarified |
The pattern is consistent: every system has bugs, and those bugs are most likely to manifest during network partitions, leader elections, and membership changes — exactly the scenarios that matter most. The protocols are correct on paper, but the implementations have gaps. This is not a criticism of the implementers (these are some of the best engineering teams in the industry) — it’s evidence that implementing consensus correctly is genuinely, perhaps irreducibly, difficult.
Dimension 6: Leader Requirement
| Protocol | Requires Leader? | Leader’s Role | Multi-Leader? |
|---|---|---|---|
| Paxos | No (any proposer may drive a round) | Proposes values | Multiple proposers possible (but may conflict) |
| Multi-Paxos | Yes (distinguished proposer) | Sequences all operations | No |
| Raft | Yes (strong leader) | Sequences all operations, log authority | No |
| Zab | Yes (leader) | Sequences all operations | No |
| VR | Yes (primary) | Sequences all operations | No |
| PBFT | Yes (primary) | Orders requests within a view | No, but rotates on view change |
| HotStuff | Yes (rotating leader) | Proposes blocks, collects votes | No, but designed for rotation |
| Tendermint | Yes (proposer per round) | Proposes blocks | Rotates every round |
| EPaxos | No | N/A | All replicas can lead any instance |
| Flexible Paxos | Yes (in Multi-Paxos mode) | Sequences operations | No |
| Kafka ISR | Yes (per-partition leader) | Handles reads and writes | Yes (different leaders per partition) |
The leader question matters for two reasons: performance and availability.
Performance: A single leader is a throughput bottleneck. Every operation must pass through one node. Kafka sidesteps this with per-partition leaders — different partitions can have different leaders, distributing the load. EPaxos sidesteps it by eliminating the leader entirely. Everyone else accepts the bottleneck as the price of simplicity.
Availability: A leader failure causes a period of unavailability while a new leader is elected. For Raft, this is typically 150-300ms (one election timeout). For Zab, it can be seconds. For PBFT view changes, it can be… longer than you’d like. HotStuff was explicitly designed to make leader changes fast, because in a BFT setting where you rotate leaders frequently, expensive view changes are fatal to performance.
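Raft’s randomized timeouts are what keep that 150-300ms window from turning into repeated split votes, and the effect is easy to see in simulation. A toy model (our own; it only checks whether the two earliest timeouts fire within one round-trip of each other, which is when competing candidacies become likely):

```python
import random

def split_vote_rate(n_followers: int, timeout_range_ms: tuple,
                    rtt_ms: float, trials: int, seed: int = 42) -> float:
    """Estimate how often two followers time out within one round-trip
    of each other, so the first candidate's RequestVotes can't land
    before a second candidate starts campaigning."""
    rng = random.Random(seed)
    lo, hi = timeout_range_ms
    splits = 0
    for _ in range(trials):
        timeouts = sorted(rng.uniform(lo, hi) for _ in range(n_followers))
        if timeouts[1] - timeouts[0] < rtt_ms:
            splits += 1
    return splits / trials

# Identical timeouts collide every single time; the 150-300ms
# randomized range makes near-simultaneous candidacies rare.
always = split_vote_rate(4, (150, 150), rtt_ms=1.0, trials=1000)
rarely = split_vote_rate(4, (150, 300), rtt_ms=1.0, trials=1000)
```

This is also why the timeout range must be much larger than the network round-trip: shrink the range toward the RTT and the collision rate climbs right back up.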
Raft’s “strong leader” design is worth highlighting. In Raft, the leader is the sole authority on the log — followers accept whatever the leader tells them, and log entries only flow from leader to follower, never the other way. This simplifies reasoning about the protocol tremendously but means the leader does more work than in protocols like Multi-Paxos, where followers can occasionally know about entries the current leader doesn’t (from a previous leader’s incomplete rounds).
Dimension 7: Ordering Guarantees
| Protocol | Ordering Guarantee | Total Order? | Per-Key Ordering? | Notes |
|---|---|---|---|---|
| Paxos | Per-instance only | No (single decree) | No | Each instance independent |
| Multi-Paxos | Total order | Yes | Yes (implied) | Log is totally ordered |
| Raft | Total order | Yes | Yes (implied) | Log is totally ordered |
| Zab | Total order + causal | Yes | Yes | FIFO ordering for same client |
| VR | Total order | Yes | Yes (implied) | Log is totally ordered |
| PBFT | Total order | Yes | Yes | Within a view; across views with view-change |
| HotStuff | Total order | Yes | Yes | Chained QCs provide total order |
| Tendermint | Total order (per chain) | Yes | Yes | Block sequence provides total order |
| EPaxos | Partial order (fast path) | Only after execution ordering | Per-key with conflicts resolved | Dependency graph, not a log |
| Flexible Paxos | Total order | Yes | Yes | Same as Multi-Paxos |
| Kafka ISR | Total order per partition | Per-partition only | Only within partition | No cross-partition ordering |
EPaxos is the outlier here, and this is both its greatest strength and its most confusing aspect. EPaxos doesn’t maintain a totally ordered log — instead, it builds a dependency graph where commands that don’t conflict can be ordered independently. This allows parallelism (non-conflicting commands at different replicas don’t need to coordinate) but makes the execution layer more complex. You need a deterministic algorithm to linearize the dependency graph at each replica, and that algorithm (based on Tarjan’s strongly connected components) is subtle enough that multiple published versions had bugs.
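To see what that execution layer has to do, here is a simplified sketch of the linearization step (our own code, not the paper’s algorithm verbatim: it finds strongly connected components with an iterative Tarjan’s algorithm and executes components dependencies-first, ordering commands within a component by sequence number):

```python
def execution_order(deps: dict, seq: dict) -> list:
    """Linearize an EPaxos-style dependency graph.

    `deps` maps each command to the set of commands it depends on;
    `seq` gives each command's sequence number. Mutually dependent
    commands form a strongly connected component and are ordered by
    sequence number; components execute in dependency order.
    """
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        # Iterative Tarjan, to avoid Python's recursion limit.
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        work = [(v, iter(sorted(deps[v])))]
        while work:
            node, it = work[-1]
            advanced = False
            for w in it:
                if w not in index:
                    index[w] = low[w] = counter[0]; counter[0] += 1
                    stack.append(w); on_stack.add(w)
                    work.append((w, iter(sorted(deps[w]))))
                    advanced = True
                    break
                elif w in on_stack:
                    low[node] = min(low[node], index[w])
            if advanced:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:
                comp = []
                while True:
                    w = stack.pop(); on_stack.discard(w); comp.append(w)
                    if w == node:
                        break
                sccs.append(comp)

    for v in sorted(deps):
        if v not in index:
            strongconnect(v)

    # Tarjan emits SCCs dependencies-first, which is exactly the
    # execution order we need; break ties inside a component by seq.
    order = []
    for comp in sccs:
        order.extend(sorted(comp, key=lambda c: (seq[c], c)))
    return order
```

Every replica must run this deterministically and get the same answer, which is why subtle bugs here are so damaging: they don’t crash anything, they just make replicas diverge.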
Kafka’s per-partition ordering is a practical compromise that works brilliantly for many use cases but catches people off guard when they need cross-partition ordering. If you need events for user A and user B to be ordered relative to each other, they must go to the same partition. If you also need events for user B and user C to be ordered, all three users must be on the same partition. Transitive ordering requirements can collapse your partition scheme into a single partition faster than you’d expect.
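That transitive collapse is exactly the behavior of a union-find structure over keys. A sketch (hypothetical helper, not a Kafka API):

```python
from collections import defaultdict

class PartitionGroups:
    """Union-find over keys that must share a partition. Each pairwise
    ordering requirement merges two groups; transitive requirements
    collapse groups whether you intended it or not."""

    def __init__(self):
        self.parent = {}

    def find(self, k):
        self.parent.setdefault(k, k)
        while self.parent[k] != k:
            self.parent[k] = self.parent[self.parent[k]]  # path halving
            k = self.parent[k]
        return k

    def must_order(self, a, b):
        """Record that events for a and b need a relative order."""
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        out = defaultdict(set)
        for k in list(self.parent):
            out[self.find(k)].add(k)
        return list(out.values())

# A-B ordered and B-C ordered forces all three onto one partition,
# even though A and C were never directly related.
g = PartitionGroups()
g.must_order("user_a", "user_b")
g.must_order("user_b", "user_c")
```

Run your actual ordering requirements through something like this before choosing a partition key; if everything lands in one group, partitioning isn’t buying you any parallelism.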
Dimension 8: Membership Change Support
| Protocol | Membership Changes | Mechanism | Complexity |
|---|---|---|---|
| Paxos | Not specified | (Left as exercise for reader) | N/A |
| Multi-Paxos | Not specified in original | Various ad-hoc approaches | High — many subtle bugs |
| Raft | Joint consensus or single-server | Raft paper specifies both approaches | Moderate — but edge cases abound |
| Zab | Dynamic reconfiguration (3.5.0+) | Reconfiguration proposal | High — added years after initial release |
| VR | Reconfiguration protocol | Epoch-based | Moderate |
| PBFT | Not specified | View change can implicitly handle | Very high if attempted |
| HotStuff | Addressed in some variants | Committee rotation | Moderate in blockchain context |
| Tendermint | Validator set changes | Via application-level EndBlock | Moderate |
| EPaxos | Not specified | Open research problem | Very high |
| Flexible Paxos | Not specified | Quorum changes compound the problem | Very high |
| Kafka ISR | Built-in | Partition reassignment + ISR dynamics | Low (from user perspective) |
Membership changes — adding or removing nodes from the consensus group — are the feature that every protocol paper hand-waves over and every implementation team curses. Lamport’s Paxos paper doesn’t address it. The PBFT paper doesn’t address it. EPaxos doesn’t address it. This is not because the problem is trivial; it’s because it’s orthogonal to the core algorithm and fiendishly hard to get right.
Raft deserves credit for being the first major protocol paper to include a detailed membership change mechanism. The joint consensus approach (where the cluster temporarily operates with a configuration that spans old and new memberships) is correct but complex. The single-server change approach (adding or removing one server at a time) is simpler but requires careful sequencing.
Even Raft’s approach has subtleties that have tripped up implementations. The most notorious: a server that has been removed from the cluster but doesn’t know it yet can disrupt the cluster by starting elections with a higher term number. The Pre-Vote mechanism (a non-binding poll asking peers whether they would grant a vote, before incrementing the term and sending real RequestVotes) was added to address this; it appears in Ongaro’s dissertation but not in the original paper.
Kafka wins this category handily from a user perspective — partition reassignment is a well-tested operational procedure, and the ISR mechanism naturally handles temporary membership changes (a slow broker drops out of the ISR and rejoins later). The KRaft migration adds complexity, but the partition-level membership story remains solid.
Dimension 9: Implementation Complexity
This is the most subjective dimension, but also one of the most practically important. A protocol that’s theoretically optimal but impossible to implement correctly is worse than a protocol that’s theoretically suboptimal but actually works.
| Protocol | Implementation Complexity | Lines of Code (rough) | Key Difficulties |
|---|---|---|---|
| Paxos | High | 1K-5K | Mapping to practical system; gaps in log |
| Multi-Paxos | Very High | 5K-20K | No canonical specification; endless variants |
| Raft | Moderate | 3K-10K | Canonical spec helps; still many edge cases |
| Zab | High | 10K-30K (in ZooKeeper) | Tightly coupled to ZooKeeper; complex recovery |
| VR | Moderate | 3K-10K | Well-specified; but few reference implementations |
| PBFT | Very High | 10K-30K | Cryptographic operations; state transfer; garbage collection |
| HotStuff | High | 5K-15K | Threshold signatures; pipelining correctness |
| Tendermint | High | 15K-40K (full node) | ABCI interface; gossip layer; evidence handling |
| EPaxos | Very High | 5K-15K | Dependency tracking; execution ordering; recovery |
| Flexible Paxos | Very High | Same as Multi-Paxos + config | All of Multi-Paxos plus quorum configuration safety |
| Kafka ISR | Moderate (within Kafka) | N/A (part of larger system) | ISR management; controller failover; exactly-once |
The lines-of-code numbers are extremely rough and depend heavily on language, coding style, and how much you include (just the consensus layer? The RPC layer? The storage layer? Testing?). They’re meant to give a relative sense, not absolute numbers.
Raft’s moderate complexity is its entire selling point, and it’s a real one. The Raft paper includes enough detail that a graduate student can implement it in a semester (with bugs that will take two more semesters to fix). Multi-Paxos requires reading between the lines of Lamport’s papers, several follow-up papers, and ideally a few blog posts by people who’ve implemented it. PBFT requires all of that plus cryptography.
EPaxos is in the “very high” category not because the core algorithm is inherently more complex than Multi-Paxos, but because the dependency tracking and execution ordering are genuinely difficult to get right. The original paper had a bug in the execution algorithm that was discovered years after publication. If the authors can get it wrong, so can you.
The Grand Comparison Table
For reference, here’s the complete matrix in one (necessarily compressed) view:
| | Paxos | Multi-Paxos | Raft | Zab | VR | PBFT | HotStuff | Tendermint | EPaxos | Flex. Paxos | Kafka ISR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Msg complexity (normal) | 2n | n | n | n | n | n^2 | n/phase | n^2 | n or 2n | n | ISR size |
| Msg complexity (view change) | 2n | 2n+ | n | n*log | n^2 | n^2 | n | n^2 | N/A | 2n+ | O(ISR) |
| Latency (msg delays) | 4/2 | 2 | 2 | 2 | 2 | 5 | 7/~2 | 4 | 2/4 | 2 | 2 |
| Throughput | Low | Leader-bound | Leader-bound | Leader-bound | Leader-bound | n^2-bound | Leader-bound | n^2-bound | Distributed | Leader-bound | Leader/part. |
| Fault tolerance | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Byz, 3f+1 | Byz, 3f+1 | Byz, 3f+1 | Crash, 2f+1 | Crash, varies | Crash, ISR |
| Leader required | Proposer | Yes | Yes (strong) | Yes | Yes | Yes | Yes (rotating) | Yes (rotating) | No | Yes | Yes/partition |
| Total ordering | No | Yes | Yes | Yes+causal | Yes | Yes | Yes | Yes | Partial | Yes | Per-partition |
| Membership change | Unspecified | Unspecified | Specified | Added later | Specified | Unspecified | Varies | App-level | Unspecified | Unspecified | Built-in |
| Impl. complexity | High | Very high | Moderate | High | Moderate | Very high | High | High | Very high | Very high | Moderate |
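Two rows of this matrix reduce to simple arithmetic that is worth internalizing: cluster size as a function of tolerated failures, and normal-case message counts for leader-based versus all-to-all protocols. The helpers below encode the textbook formulas; they are back-of-envelope aids, not measurements of any implementation.

```python
# Quorum and message-count arithmetic behind the matrix rows above.

def crash_cluster_size(f):
    """Crash-fault-tolerant protocols (Paxos, Raft, Zab, VR) need 2f + 1 nodes."""
    return 2 * f + 1

def byzantine_cluster_size(f):
    """BFT protocols (PBFT, HotStuff, Tendermint) need 3f + 1 nodes."""
    return 3 * f + 1

def leader_based_messages(n):
    """Leader broadcasts to n-1 followers and collects n-1 replies: O(n)."""
    return 2 * (n - 1)

def all_to_all_messages(n):
    """Every replica sends to every other replica in a phase: O(n^2)."""
    return n * (n - 1)

for f in (1, 2, 3):
    n_cft, n_bft = crash_cluster_size(f), byzantine_cluster_size(f)
    print(f"f={f}: CFT {n_cft} nodes, {leader_based_messages(n_cft)} msgs/round; "
          f"BFT {n_bft} nodes, {all_to_all_messages(n_bft)} msgs/phase")
```

Note how fast the all-to-all column grows: tolerating one more Byzantine failure adds three nodes and the per-phase message count grows quadratically, which is the cost HotStuff's linear communication pattern was designed to avoid.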
Why Microbenchmarks Lie
We’ve alluded to this throughout, but it deserves its own section.
A microbenchmark of a consensus protocol typically measures the protocol in isolation: how fast can it commit operations with no application-level processing, minimal payload sizes, and a perfectly behaved network? These numbers are useful for comparing the overhead of the protocol itself, but they tell you almost nothing about the performance of a system built on top of the protocol.
Here’s what microbenchmarks systematically miss:
1. Application processing time. If your state machine takes 1ms to apply a command, and your consensus protocol commits in 0.5ms, the protocol isn’t your bottleneck. Shaving 0.1ms off consensus latency (by, say, switching from Raft to a theoretically faster protocol) saves you 6.7% of total latency, not 20%.
2. Serialization overhead. Consensus protocols send messages. Those messages must be serialized and deserialized. In a benchmark with 16-byte payloads, serialization is negligible. In a system with 64KB payloads containing Protocol Buffers, serialization can account for 10-30% of CPU time.
3. Garbage collection pauses. If your consensus implementation is in Java (ZooKeeper, Kafka) or Go (etcd), GC pauses will periodically spike your latency in ways that the protocol cannot prevent. A C++ or Rust implementation will behave differently, but you don’t get to choose the implementation language of most off-the-shelf systems.
4. Connection management. Benchmarks run on a fixed number of connections in a controlled environment. Production systems have connection churn, TLS handshakes, keep-alive management, and TCP behavior that depends on the physical network path.
5. Interaction with other system components. Raft’s performance in etcd depends on boltdb’s write characteristics, gRPC’s overhead, the Go scheduler’s behavior under load, and how etcd’s MVCC layer interacts with the Raft log. None of these show up in a Raft microbenchmark.
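Point 1 is worth restating numerically. The function below reproduces the 1ms-apply, 0.5ms-commit example from the list: shaving time off the commit path only saves you that time as a fraction of the whole request, not as a fraction of the consensus step.

```python
# End-to-end latency savings from a protocol-level speedup, using the
# 1ms apply / 0.5ms commit numbers from point 1 above.

def end_to_end_savings(apply_ms, commit_ms, commit_speedup_ms):
    """Fraction of total request latency saved by speeding up commit."""
    total = apply_ms + commit_ms
    return commit_speedup_ms / total

saving = end_to_end_savings(apply_ms=1.0, commit_ms=0.5, commit_speedup_ms=0.1)
print(f"{saving:.1%}")  # 6.7% of total latency, even though it's 20% of commit time
```

The same arithmetic applies to serialization and GC overhead: each component caps how much a faster protocol can help, which is why Amdahl's-law-style reasoning belongs in every protocol evaluation.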
The lesson: don’t choose a consensus protocol based on microbenchmark numbers. Choose based on the properties that matter for your system (fault model, ordering requirements, membership flexibility), and then optimize the implementation. A well-engineered Raft implementation will outperform a naive EPaxos implementation in virtually every realistic scenario.
Dimension 10: Production Maturity and Ecosystem
No comparison is complete without acknowledging that protocols don’t exist in a vacuum — they exist in ecosystems. The best protocol with no production-grade implementation loses to a mediocre protocol with excellent tooling.
| Protocol | Notable Implementations | Years in Production | Tooling Quality | Community Size |
|---|---|---|---|---|
| Paxos | Google Chubby, Spanner (internal) | 20+ | Internal (limited public) | Academic + Google |
| Multi-Paxos | Google Spanner, Megastore (internal); Ratis | 20+ | Limited public tooling | Small public community |
| Raft | etcd, CockroachDB, TiDB, Consul, hashicorp/raft, openraft | 10+ | Excellent | Very large |
| Zab | ZooKeeper | 15+ | Good (ZK ecosystem) | Large |
| VR | Viewstamped Replication Revisited implementations | Limited | Minimal | Very small |
| PBFT | Hyperledger Fabric (early), custom implementations | 15+ (mostly academic) | Limited | Small |
| HotStuff | Diem/Libra (defunct), Aptos, Flow blockchain | 5+ | Moderate | Growing (blockchain) |
| Tendermint | Cosmos, Binance Chain, various blockchains | 7+ | Good (Cosmos SDK) | Large (blockchain) |
| EPaxos | Research prototypes, limited production use | Limited | Minimal | Very small |
| Flexible Paxos | Research prototypes; influenced WPaxos | Very limited | Minimal | Very small |
| Kafka ISR | Apache Kafka, Confluent Platform | 12+ | Excellent | Very large |
Raft and Kafka ISR dominate the ecosystem dimension. Raft has at least a dozen production-grade implementations across multiple languages, comprehensive documentation, university courses, interactive visualizations, and thousands of engineers who’ve operated Raft-based systems. Kafka has an even larger ecosystem: client libraries in every language, a rich connector ecosystem (Kafka Connect), a stream processing framework (Kafka Streams), managed service offerings, and a conference circuit.
EPaxos and Flexible Paxos, despite their theoretical advantages, have essentially zero production ecosystem. If you choose EPaxos, you’re implementing it yourself (or using one of a handful of research prototypes) with minimal community support. When you hit a bug — and you will hit bugs — your debugging resources are the original paper and its errata.
VR is in a similar position. Despite being a well-designed protocol (arguably cleaner than Paxos), its lack of a canonical implementation and tiny community make it a risky choice for production systems. Choosing VR is a statement of intellectual independence that your on-call team will not appreciate.
This ecosystem effect is self-reinforcing. More implementations mean more bug fixes, more operational knowledge, more tooling, more documentation, which attracts more implementations. Raft’s dominance in the CFT space is as much a network effect as a technical judgment.
The Real Comparison
After all these dimensions, all these tables, and all these caveats, here’s the uncomfortable truth: for most systems, the choice between Raft, Multi-Paxos, and Zab doesn’t matter much. They have roughly the same message complexity, the same latency, the same fault tolerance, and the same throughput bottleneck (the leader). The difference is in implementation maturity, ecosystem support, and how much you enjoy reading the original papers.
The meaningful distinctions are at a higher level:
- Do you need Byzantine fault tolerance? If yes, your options are PBFT, HotStuff, Tendermint, or their derivatives. This is the single biggest fork in the decision tree.
- Is a leader bottleneck unacceptable? If yes, consider EPaxos (if you can handle the complexity) or a partitioned approach (like Kafka) where different partitions have different leaders.
- Do you need total ordering across all operations? If not, you can potentially use something simpler or use per-partition ordering.
- Are you willing to trade safety for availability? If yes, Kafka ISR with `unclean.leader.election.enable=true` or a Flexible Paxos configuration with small Phase 2 quorums might be appropriate.
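These distinctions can be sketched as a function. This is purely illustrative (our encoding of the bullets above, not a complete framework): it returns protocol families worth investigating for a given set of requirements, and a real decision would weigh ecosystem maturity and team expertise alongside these properties.

```python
# An illustrative encoding of the decision points above. Returns families
# to investigate, not a verdict.

def candidate_protocols(byzantine: bool, leader_bottleneck_ok: bool,
                        total_order_needed: bool) -> list[str]:
    if byzantine:
        # The single biggest fork in the decision tree.
        return ["PBFT", "HotStuff", "Tendermint"]
    if not leader_bottleneck_ok:
        return ["EPaxos", "partitioned log (Kafka-style)"]
    if not total_order_needed:
        return ["per-partition ordering (Kafka ISR)", "Raft per shard"]
    # The common case, where the choice matters less than the ecosystem.
    return ["Raft", "Multi-Paxos", "Zab"]

print(candidate_protocols(byzantine=False, leader_bottleneck_ok=True,
                          total_order_needed=True))
```

Notice that the most common set of answers lands in the branch where, as argued above, the protocols are roughly interchangeable and ecosystem factors dominate.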
We’ll turn these distinctions into a concrete decision framework in the next chapter.
But before we leave this comparison behind, one more uncomfortable truth: the protocol you choose matters less than how well you implement, test, and operate it. A mediocre protocol with excellent monitoring, comprehensive testing, well-documented runbooks, and an on-call team that understands the failure modes will outperform an optimal protocol that nobody on the team fully understands. The tradeoff matrix helps you make an informed choice, but the choice is the beginning of the work, not the end.
The tables in this chapter are a map. The territory is your production environment, your team’s expertise, your specific workload, and the 3 AM incidents that will inevitably test everything you thought you knew about your consensus protocol. Choose wisely, but more importantly, implement carefully.