Tradeoff Matrix: Latency, Throughput, Fault Tolerance
Every few months, someone publishes a blog post titled something like “Consensus Algorithm Comparison” that contains a table with five columns and six rows, each cell filled with a confident one-word summary. “Fast.” “Slow.” “Complex.” These tables are worse than useless — they give you enough information to feel informed while leaving out everything that actually matters.
This chapter is our attempt to build the comparison table we wish we’d had. It will be large. It will have footnotes. Some cells will contain uncomfortable phrases like “it depends” and “the paper doesn’t say.” That’s because honest comparison is messy, and anyone who tells you otherwise is either selling something or hasn’t built anything.
The Protocols Under Comparison
We’re comparing eleven protocols (or protocol families) that have appeared throughout this book:
- Paxos — Single-decree, Lamport’s original
- Multi-Paxos — The practical extension everyone actually means when they say “Paxos”
- Raft — The understandable one (allegedly)
- Zab — ZooKeeper’s protocol
- Viewstamped Replication (VR) — The one nobody reads
- PBFT — Castro and Liskov’s Byzantine workhorse
- HotStuff — Linear BFT with rotating leaders
- Tendermint — BFT for blockchains (and beyond)
- EPaxos — Leaderless when it can be
- Flexible Paxos — Paxos with relaxed quorums
- Kafka ISR — Not quite consensus, but close enough for many
A few caveats before we begin. Paxos and Multi-Paxos are algorithm families, not single implementations — there are dozens of variants, and the performance characteristics depend heavily on which variant you pick. EPaxos has been refined several times since the original paper. Kafka ISR isn’t a consensus algorithm in the formal sense, but it solves a sufficiently similar problem that excluding it would be dishonest. And any comparison that includes both crash-fault-tolerant and Byzantine-fault-tolerant protocols is inherently comparing apples to slightly paranoid oranges.
With those disclaimers lodged, let’s proceed.
Dimension 1: Message Complexity (Normal Case)
Message complexity tells you how many messages need to be exchanged to commit a single operation in the common, happy-path case — no failures, no leader changes, the sun is shining and your network is behaving.
| Protocol | Messages (Normal Case) | Message Pattern | Notes |
|---|---|---|---|
| Paxos (single-decree) | 2n (Prepare + Accept) | 2 round-trips, leader to all | First round can be skipped if leader is stable (Multi-Paxos optimization) |
| Multi-Paxos (steady state) | n | 1 round-trip, leader to all | Prepare phase amortized away after leader election |
| Raft | n | 1 round-trip, leader to all | AppendEntries + responses |
| Zab | n | 1 round-trip, leader to all | Propose + Ack in broadcast phase |
| VR | n | 1 round-trip, primary to all | Prepare + PrepareOK |
| PBFT | ~n^2 | 3 phases, all-to-all in commit | Pre-prepare + Prepare (n msgs each) + Commit (n msgs each) |
| HotStuff | n per phase, 3 phases | Linear, leader collects votes | 3 round-trips in basic; pipelined reduces effective latency |
| Tendermint | ~n^2 | Propose + Prevote + Precommit | Gossip-based, all-to-all in vote steps |
| EPaxos | n (fast path), 2n (slow path) | 1 or 2 round-trips | Fast path if no conflicts; slow path on dependency conflicts |
| Flexible Paxos | n (steady state) | Same as Multi-Paxos | Smaller quorums possible, so “n” may be smaller than majority |
| Kafka ISR | ISR size | 1 round-trip, leader to ISR | Only replicates to in-sync replicas, not all brokers |
The thing to notice immediately is the gap between CFT and BFT protocols. The crash-fault-tolerant protocols all converge on roughly n messages in steady state (one round-trip from leader to replicas). The BFT protocols pay a tax of either n^2 messages or additional round-trips. This is not a coincidence — it’s the fundamental cost of not trusting each other.
EPaxos deserves special attention. Its fast path of n messages with a single round-trip is genuinely impressive, but it requires a “fast-path quorum” that’s larger than a simple majority (2f of the 2f+1 replicas in the original protocol, which for five nodes means four out of five must respond in agreement; a later optimization shrinks this to f + ⌊(f+1)/2⌋). The fast path also only works when there are no conflicting commands. In workloads with high contention, EPaxos degrades to its slow path more often than the paper’s evaluation section might lead you to believe.
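To make the quorum arithmetic concrete, here’s a quick sketch (the `quorum_sizes` helper is our own, not from any implementation; it uses the original EPaxos fast-quorum size of 2f):

```python
def quorum_sizes(f: int) -> dict:
    """Quorum sizes for a cluster sized to tolerate f crash faults.

    Classic (Multi-)Paxos and Raft use n = 2f + 1 replicas and majority
    quorums; the original EPaxos fast path needs 2f replicas to agree
    (the slow path falls back to a simple majority).
    """
    n = 2 * f + 1
    return {
        "replicas": n,
        "majority": f + 1,      # classic Paxos/Raft quorum
        "epaxos_fast": 2 * f,   # original EPaxos fast-path quorum
    }

# For f = 2 (five replicas), the fast path needs four of five replicas
# to agree, while ordinary consensus needs only three.
sizes = quorum_sizes(2)
```

Note how the gap between the fast quorum and the majority grows with f: the bigger the cluster, the harder it is to hit the fast path.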
What “Message Complexity” Actually Means in Practice
Here’s where we need to be honest about what these numbers don’t tell you.
Message complexity counts messages. It does not count bytes. A Raft AppendEntries RPC carrying a 4KB state machine command is one message, and a Raft AppendEntries RPC carrying a 4MB batch of commands is also one message, but your network disagrees that these are equivalent.
Message complexity also doesn’t account for persistence. Every protocol on this list (except arguably some BFT protocols with sufficiently many replicas) requires writing to stable storage before responding. That fsync call is almost always the latency bottleneck, not the network hop. A protocol with 2n messages but one persistence point may well be faster than a protocol with n messages but two persistence points.
Finally, message complexity ignores batching. Every production implementation batches multiple client requests into a single consensus round. Multi-Paxos with batching and Multi-Paxos without batching have the same message complexity but wildly different throughput. We’ll come back to this in the throughput section.
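The amortization is worth seeing as arithmetic. A toy model (our own, assuming one batch commits per consensus round and that a round costs roughly n messages):

```python
def messages_per_request(n: int, batch_size: int) -> float:
    """Steady-state messages per client request for a leader-based
    protocol that commits one batch of requests per consensus round.

    One round costs roughly n messages (leader to followers plus acks),
    amortized across every request in the batch.
    """
    round_cost = n  # one append plus one ack per follower, roughly
    return round_cost / batch_size

# Five replicas: unbatched, each request pays ~5 messages; with a
# batch of 100, ~0.05. Same message complexity, very different
# throughput.
unbatched = messages_per_request(5, 1)
batched = messages_per_request(5, 100)
```

This is why “message complexity per operation” quietly becomes “message complexity per batch” in every production system.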
Dimension 2: Message Complexity (Leader Change / View Change)
The sunny-day numbers are nice, but distributed systems live in the rain. What happens when a leader fails?
| Protocol | Leader Change Messages | Leader Change Latency | Notes |
|---|---|---|---|
| Paxos | 2n (new Prepare phase) | 1 additional round-trip | Re-runs Phase 1; conceptually simple |
| Multi-Paxos | 2n + recovery | 1 round-trip + catch-up | Must discover and fill any gaps in the log |
| Raft | n (RequestVote) + catch-up | 1 round-trip + election timeout | Randomized timeouts to prevent split votes |
| Zab | O(n * log_size) | Multiple round-trips | Discovery + Synchronization + Broadcast phases; can be expensive |
| VR | O(n^2) | 2+ round-trips | View change messages contain full logs; expensive |
| PBFT | O(n^2) | Multiple round-trips | View change requires 2f+1 view-change messages with proofs |
| HotStuff | O(n) | 1 round-trip for pacemaker | Designed for frequent leader rotation; view change is cheap |
| Tendermint | O(n^2) | Round timeout | Validators just move to next round/proposer |
| EPaxos | N/A (leaderless) | N/A | No leader to fail; but recovery of in-progress instances is complex |
| Flexible Paxos | Same as Multi-Paxos | Same as Multi-Paxos | Quorum flexibility doesn’t change recovery |
| Kafka ISR | O(ISR size) | Controller detects + shrinks ISR | Controller elects new partition leader from ISR |
This is where the protocols diverge dramatically.
HotStuff’s selling point is right here — it was specifically designed so that leader rotation (and by extension, leader failure handling) is O(n) rather than O(n^2). This matters enormously in BFT settings where you might want to rotate leaders frequently to limit a malicious leader’s ability to cause damage.
Zab’s view change deserves its reputation for complexity. The discovery phase requires the new leader to contact all followers, determine who has the most up-to-date state, synchronize that state to a quorum, and only then begin accepting new proposals. In a cluster with significant state divergence (which happens when the old leader was partitioned while still accepting writes that some followers received and others didn’t), this process can take a disturbing amount of time.
VR’s view change is similarly expensive because view-change messages historically carry full logs (though practical implementations obviously optimize this). The theoretical message complexity of O(n^2) comes from every replica needing to send its state to every other replica, though in practice you only need the new primary to collect f+1 view-change messages.
EPaxos sidesteps leader election entirely, which is wonderful until you need to recover an in-progress command instance. The recovery protocol requires running an explicit recovery phase that resembles Paxos Phase 1, and if you’re unlucky enough to have multiple concurrent recoveries for dependent commands, you can end up in a situation that will make you appreciate why leaders exist.
Kafka’s approach is characteristically pragmatic — the controller (which is itself elected, previously via ZooKeeper, now via KRaft) simply picks a new leader from the ISR. If the ISR is empty, you have bigger problems, and Kafka lets you choose between unavailability and data loss via the unclean.leader.election.enable configuration flag. That’s not theoretical elegance, but it is honest.
Dimension 3: Latency (Message Delays to Commit)
Latency measures how many sequential one-way message delays (network hops) a client must wait before its request is committed. This is distinct from message complexity — a protocol could send 100 messages in parallel (high message complexity) but complete in one round-trip (low latency). One counting caveat for the table below: the CFT rows count only the consensus exchanges themselves (the client’s request and the final reply are excluded), while the BFT rows follow their papers in counting from client request to client reply. Add two hops to the CFT numbers for the client’s-eye view.
| Protocol | Message Delays (Normal Case) | Message Delays (With Leader) | Notes |
|---|---|---|---|
| Paxos | 4 (Prepare/Promise + Accept/Accepted) | 2 if leader is pre-elected | Single-decree; each decree is independent |
| Multi-Paxos | 2 | 2 | Leader→quorum→leader; client request and reply excluded |
| Raft | 2 | 2 | Same as Multi-Paxos in steady state |
| Zab | 2 | 2 | Same pattern |
| VR | 2 | 2 | Same pattern |
| PBFT | 5 | 5 | Client→primary, pre-prepare, prepare, commit, reply |
| HotStuff (basic) | 7 | 7 | Client + 3 phases of leader↔replica + reply |
| HotStuff (pipelined) | 7 (amortized ~2) | 7 (amortized ~2) | Pipelining overlaps phases of different commands |
| Tendermint | 4 | 4 | Propose, Prevote, Precommit, Commit |
| EPaxos (fast path) | 2 | N/A (leaderless) | Command leader→fast quorum→command leader; client hops excluded |
| EPaxos (slow path) | 4 | N/A | Additional Accept phase on conflict |
| Flexible Paxos | 2 | 2 | Same as Multi-Paxos |
| Kafka ISR | 2 | 2 | Producer→leader, replicate to ISR, ack |
For CFT protocols, the answer is boringly uniform: two message delays in steady state. The leader replicates to a quorum, the quorum acknowledges, and the leader responds to the client (the client’s request and the reply add two more hops end to end). Everyone has converged on this because it’s the theoretical minimum for fault-tolerant replication with a stable leader.
The BFT protocols show more variation, and this is one of the genuine tradeoffs between PBFT, HotStuff, and Tendermint. PBFT’s five message delays are a consequence of its three-phase protocol (pre-prepare, prepare, commit). HotStuff basic pays seven delays for its successive vote-collection phases, each of which forms a quorum certificate, but pipelining effectively amortizes this to two delays per committed decision (at the cost of increased latency for any individual decision — it takes longer for your specific command to commit, but the system commits one command per round-trip once the pipeline is full). Tendermint sits in between at four delays.
The Persistence Tax
Every one of these latency numbers assumes negligible disk write time. In reality, the fsync (or fdatasync, if your implementation is sophisticated enough to know the difference) at each persistence point adds anywhere from 0.1ms (NVMe SSD) to 10ms (spinning disk) to 50ms+ (cloud storage with “durable” semantics that actually mean “we’ll get around to it”).
Raft implementations typically require two persistence points per commit: one when the leader writes the entry to its own log, and one when each follower writes the entry to its log. But the leader can send AppendEntries and persist in parallel (this is an optimization that the Raft paper mentions in passing and that every production implementation uses, but which trips up everyone building Raft for the first time).
The bottom line: in a data center with modern SSDs, cross-node network latency (0.1-0.5ms per hop) and disk persistence latency (~0.1ms per fsync) are in the same ballpark. Over a WAN, network latency dominates. On spinning disks, persistence dominates. Your protocol choice matters less than your hardware in the typical case.
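That bottom line can be captured in a back-of-envelope model (our own; the hop and fsync figures are the illustrative numbers from above, not measurements):

```python
def commit_latency_ms(hops: int, hop_ms: float,
                      fsyncs: int, fsync_ms: float) -> float:
    """Back-of-envelope commit latency: sequential network hops plus
    sequential persistence points. Ignores queuing, batching, and the
    leader/follower fsync overlap discussed above."""
    return hops * hop_ms + fsyncs * fsync_ms

# Data center + NVMe: network and disk are in the same ballpark.
lan_nvme = commit_latency_ms(hops=2, hop_ms=0.25, fsyncs=2, fsync_ms=0.1)
# Same protocol over a WAN: the network dominates completely.
wan_nvme = commit_latency_ms(hops=2, hop_ms=40.0, fsyncs=2, fsync_ms=0.1)
```

Run the same model with spinning-disk fsync times (~10ms) and the conclusion flips again: persistence dominates. The protocol term in this equation is usually the smallest one.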
Dimension 4: Throughput Characteristics
Throughput is where the comparison gets genuinely complicated, because throughput depends on factors that are mostly orthogonal to the consensus protocol itself.
| Protocol | Theoretical Throughput Limit | Practical Bottleneck | Batching Potential |
|---|---|---|---|
| Paxos | Low (per-decree overhead) | Per-operation Prepare phase | Low without Multi-Paxos |
| Multi-Paxos | Leader network bandwidth | Leader CPU + NIC | High (batch in Accept) |
| Raft | Leader network bandwidth | Leader CPU + NIC | High (batch in AppendEntries) |
| Zab | Leader network bandwidth | Leader CPU + NIC | High (batch in proposals) |
| VR | Leader network bandwidth | Leader CPU + NIC | High |
| PBFT | All-node network bandwidth | n^2 messages per commit | Moderate (batch in pre-prepare) |
| HotStuff | Leader network bandwidth | Leader CPU (aggregating votes) | High |
| Tendermint | All-node network bandwidth | Gossip overhead + n^2 | Moderate (batch in block) |
| EPaxos | Aggregate cluster bandwidth | Dependency tracking overhead | High per-replica |
| Flexible Paxos | Leader network bandwidth | Same as Multi-Paxos | High |
| Kafka ISR | Leader network bandwidth | Disk I/O (by design) | Very high (batch-optimized) |
The leader-based protocols (Multi-Paxos, Raft, Zab, VR, Flexible Paxos, Kafka ISR) all share the same fundamental throughput limitation: the leader is a bottleneck. Every write goes through the leader. The leader must receive the client request, append it to its log, send it to all followers, wait for a quorum of acknowledgments, and respond to the client. The leader’s NIC, CPU, and disk are the ceiling.
This is the main theoretical advantage of EPaxos — by eliminating the leader, it can distribute load across all replicas. If you have five nodes and the workload has no conflicts, each node can independently process roughly 1/5 of the requests, giving you approximately 5x the throughput of a leader-based protocol. The original EPaxos paper demonstrates this convincingly on conflict-free workloads.
But the moment you introduce conflicts (which you will, because most real workloads have hot keys), EPaxos must run its slower path, which includes additional coordination. The dependency tracking logic also adds CPU overhead per operation. In benchmarks with realistic conflict rates (5-25%), EPaxos’s throughput advantage shrinks considerably, and in high-conflict workloads it can actually perform worse than Raft because the slow path is more expensive than just going through a leader.
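The degradation is easy to model. A sketch using the fast-path and slow-path delay counts from the latency table, under the simplifying assumption that a fixed fraction of commands conflicts (real behavior also depends on dependency-graph shape and recovery traffic):

```python
def epaxos_expected_delays(conflict_rate: float,
                           fast_delays: int = 2,
                           slow_delays: int = 4) -> float:
    """Expected message delays per command when a `conflict_rate`
    fraction of commands takes the slow path."""
    return (1 - conflict_rate) * fast_delays + conflict_rate * slow_delays

# Conflict-free workload: pure fast path. At 25% conflicts the
# average is already a quarter of the way to all-slow-path.
best = epaxos_expected_delays(0.0)
typical = epaxos_expected_delays(0.25)
```

The CPU cost of dependency tracking isn’t in this model, which is part of why the measured gap is even smaller than the expected-delay arithmetic suggests.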
For the BFT protocols, the throughput picture is dominated by the O(n^2) message overhead. PBFT with n = 4 (the minimum for f = 1) sends roughly 16 messages per commit. At n = 100, that’s roughly 10,000 messages per commit. This is why BFT protocols are typically deployed with small replica counts (4, 7, maybe 13) and why the HotStuff paper’s reduction to O(n) messages was genuinely significant.
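The arithmetic behind those counts, as a sketch (the quadratic figure is the rough approximation used above, not an exact PBFT message count):

```python
def bft_quadratic_messages(n: int) -> int:
    """The ~n^2 approximation from the text: all-to-all vote rounds
    dominate, so messages per commit grow quadratically."""
    return n * n

def hotstuff_messages(n: int, phases: int = 3) -> int:
    """HotStuff's linear pattern: each phase is leader->replicas plus
    replicas->leader, roughly 2n messages, times the phase count."""
    return phases * 2 * n

# At n = 100, the quadratic pattern sends ~10,000 messages per commit;
# HotStuff's linear pattern sends ~600.
quadratic = bft_quadratic_messages(100)
linear = hotstuff_messages(100)
```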
Kafka ISR deserves special mention for throughput because its entire architecture is optimized for it. Zero-copy reads, sequential disk I/O, batching at every layer, page cache utilization — Kafka’s throughput numbers are impressive not because the ISR protocol is theoretically superior, but because the implementation has been ruthlessly optimized for the append-only, batched-writes, sequential-reads use case. This is an important lesson: protocol-level message complexity matters less than implementation-level engineering for throughput.
What the Papers Claim vs. What You Actually Get
Let’s talk about performance numbers in academic papers.
Here’s a rough calibration table, assembled from reading too many evaluation sections at 2 AM:
| Protocol | Paper Claims | Reality | Why the Gap |
|---|---|---|---|
| Multi-Paxos | ~100K ops/sec | 10K-100K ops/sec | Papers use small payloads, RAM-only “persistence,” ideal networks |
| Raft | ~20K-100K ops/sec | 5K-50K ops/sec | etcd tops out around 10-20K writes/sec in practice |
| EPaxos | 2-5x Paxos throughput | 1-2x in practice | Conflict rates in real workloads reduce fast-path utilization |
| PBFT | ~10-40K ops/sec | 1K-10K ops/sec | Papers benchmark with 4 nodes, tiny payloads, localhost |
| HotStuff | “Near Raft throughput with BFT” | 50-70% of Raft | Threshold signature aggregation has real CPU cost |
| Tendermint | ~1000 TPS (blockchain mode) | 100-1000 TPS | Block size, gossip overhead, application-level validation |
| Kafka ISR | Millions of msgs/sec | Hundreds of thousands/sec per partition | Kafka’s numbers are aggregate across partitions; single partition is bottlenecked |
The gap exists for several reasons, all of which are understandable but rarely acknowledged:
Paper benchmarks use favorable conditions. Small payloads (often 0 bytes or 16 bytes), no application-level processing, networks with minimal jitter, and usually the minimum number of nodes. These conditions are valid for measuring the protocol overhead, but they’re not your production environment.
Paper benchmarks often skip persistence. The phrase “we configure the system to batch sync writes to disk every 10ms” appears in more evaluation sections than anyone wants to admit. That’s not durable consensus — that’s consensus with a 10ms window of data loss. When you add real fsync on every commit (which you must, for correctness), throughput drops substantially.
Paper benchmarks run on dedicated hardware. Your consensus nodes are probably sharing machines with seventeen other services, running in containers, on a network shared with the analytics team’s nightly Spark jobs. The variability alone kills best-case numbers.
Paper benchmarks don’t measure tail latency. Mean latency at 50% load tells you nothing about p99 latency at 90% load, which is the number that actually determines your SLA.
Dimension 5: Fault Tolerance
| Protocol | Fault Model | Nodes for f Faults | Tolerates Byzantine? | Tolerates Omission? | Tolerates Partition? |
|---|---|---|---|---|---|
| Paxos | Crash | 2f+1 | No | Yes (crash-stop) | Minority partition |
| Multi-Paxos | Crash | 2f+1 | No | Yes | Minority partition |
| Raft | Crash | 2f+1 | No | Yes | Minority partition |
| Zab | Crash | 2f+1 | No | Yes | Minority partition |
| VR | Crash | 2f+1 | No | Yes | Minority partition |
| PBFT | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| HotStuff | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| Tendermint | Byzantine | 3f+1 | Yes | Yes | Minority partition |
| EPaxos | Crash | 2f+1 | No | Yes | Minority partition |
| Flexible Paxos | Crash | Varies | No | Yes | Depends on quorum config |
| Kafka ISR | Crash (tunable) | min.insync.replicas | No | Yes | ISR-dependent |
The fault tolerance story is straightforward in theory and a mess in practice.
All CFT protocols tolerate f crash faults with 2f+1 nodes. All BFT protocols tolerate f Byzantine faults with 3f+1 nodes. If you stopped here, you’d think this was simple.
It isn’t, because “crash fault” is a model, not a reality. Real failures include:
- Disk corruption without node crash — your node is running but serving garbage data. CFT protocols assume this doesn’t happen.
- Clock skew — Raft’s leader election depends on timeouts; extreme clock skew can cause unnecessary elections or, worse, split-brain in poorly implemented variants.
- Network asymmetry — node A can reach B but not C, while B can reach both A and C. Most protocols assume symmetric partitions.
- Slow nodes — not crashed, not Byzantine, just slow enough to be useless. This falls through the cracks of both fault models.
- Correlated failures — the assumption of independent failures underlies the f-out-of-2f+1 math. When all your nodes are in the same availability zone and AWS has a bad day, independence goes out the window.
Flexible Paxos is interesting in the fault tolerance dimension because it lets you tune the tradeoff. Safety requires that every Phase 1 quorum intersect every Phase 2 quorum, which means their sizes must sum to more than n. So if you set a Phase 1 quorum of n and a Phase 2 quorum of 1, you get amazingly fast writes (only one replica must acknowledge) but terrible leader election (must contact all replicas). The failure tolerance depends on your quorum choices, and choosing wrong means silent data loss. This flexibility is powerful and terrifying in equal measure.
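The intersection condition is small enough to state as code. A sketch of the check (our own helper; it encodes the requirement that Phase 1 and Phase 2 quorum sizes sum to more than n):

```python
def quorums_intersect(n: int, q1: int, q2: int) -> bool:
    """Flexible Paxos safety condition: every Phase 1 quorum must
    overlap every Phase 2 quorum, which holds iff q1 + q2 > n."""
    return q1 + q2 > n

# n = 5: majorities work, and so does the lopsided "fast writes,
# slow elections" configuration -- but shave one replica off the
# Phase 1 quorum and safety silently evaporates.
assert quorums_intersect(5, 3, 3)      # classic majorities
assert quorums_intersect(5, 5, 1)      # write to 1, elect from all
assert not quorums_intersect(5, 4, 1)  # looks plausible, loses data
```

The terrifying part is the last line: a configuration that is off by one replica doesn’t fail loudly, it just stops being consensus.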
Kafka ISR is fault-tolerant in a way that makes theorists uncomfortable. The ISR can shrink to a single replica (the leader), at which point you have zero fault tolerance but remain available. Whether this is acceptable depends on your min.insync.replicas setting and your willingness to trade safety for availability. In practice, many Kafka deployments run with min.insync.replicas=2 and replication.factor=3, giving them crash tolerance of one broker — equivalent to f=1 with 2f+1=3 nodes, just with a less formal proof.
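The arithmetic behind that equivalence, as a sketch (our own helper, not Kafka code; it assumes producers use acks=all so a write is acknowledged only once min.insync.replicas brokers have it):

```python
def kafka_fault_margin(replication_factor: int,
                       min_insync_replicas: int) -> dict:
    """How many broker failures a topic absorbs, assuming acks=all.

    Writes stay available while the ISR has at least
    min.insync.replicas members, and an acknowledged write exists on
    at least min.insync.replicas brokers.
    """
    return {
        # brokers that can fail while writes remain possible
        "failures_while_writable": replication_factor - min_insync_replicas,
        # brokers that can fail without losing an acknowledged write
        "failures_without_data_loss": min_insync_replicas - 1,
    }

# The common RF=3, min.insync.replicas=2 deployment: one broker
# failure tolerated on both axes -- the informal f=1.
margin = kafka_fault_margin(3, 2)
```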
The Jepsen Reality Check
No discussion of fault tolerance claims is complete without mentioning Jepsen, Kyle Kingsbury’s testing framework that has found consistency violations in nearly every distributed system it has tested. Here’s a partial scorecard relevant to our protocols:
| System | Jepsen Findings | Status |
|---|---|---|
| etcd (Raft) | Stale reads under certain configurations; lease-related issues | Fixed in subsequent releases |
| ZooKeeper (Zab) | Issues under network partitions with specific timing | Mostly fixed; some edge cases remain |
| CockroachDB (Raft) | Serializability violations under clock skew | Fixed; improved clock skew handling |
| MongoDB (custom) | Significant data loss during rollback; stale reads | Multiple rounds of fixes |
| Kafka | Under-replicated partitions can lose data; exactly-once edge cases | Configuration-dependent; improved over releases |
| Redis (Sentinel/Cluster) | Split-brain, data loss during failover | Fundamental design limitations |
| Consul (Raft) | Stale reads under default configuration | Configuration clarified |
The pattern is consistent: every system has bugs, and those bugs are most likely to manifest during network partitions, leader elections, and membership changes — exactly the scenarios that matter most. The protocols are correct on paper, but the implementations have gaps. This is not a criticism of the implementers (these are some of the best engineering teams in the industry) — it’s evidence that implementing consensus correctly is genuinely, perhaps irreducibly, difficult.
Dimension 6: Leader Requirement
| Protocol | Requires Leader? | Leader’s Role | Multi-Leader? |
|---|---|---|---|
| Paxos | No (any proposer may drive a round) | Proposes values | Multiple proposers possible (but may conflict) |
| Multi-Paxos | Yes (distinguished proposer) | Sequences all operations | No |
| Raft | Yes (strong leader) | Sequences all operations, log authority | No |
| Zab | Yes (leader) | Sequences all operations | No |
| VR | Yes (primary) | Sequences all operations | No |
| PBFT | Yes (primary) | Orders requests within a view | No, but rotates on view change |
| HotStuff | Yes (rotating leader) | Proposes blocks, collects votes | No, but designed for rotation |
| Tendermint | Yes (proposer per round) | Proposes blocks | Rotates every round |
| EPaxos | No | N/A | All replicas can lead any instance |
| Flexible Paxos | Yes (in Multi-Paxos mode) | Sequences operations | No |
| Kafka ISR | Yes (per-partition leader) | Handles reads and writes | Yes (different leaders per partition) |
The leader question matters for two reasons: performance and availability.
Performance: A single leader is a throughput bottleneck. Every operation must pass through one node. Kafka sidesteps this with per-partition leaders — different partitions can have different leaders, distributing the load. EPaxos sidesteps it by eliminating the leader entirely. Everyone else accepts the bottleneck as the price of simplicity.
Availability: A leader failure causes a period of unavailability while a new leader is elected. For Raft, this is typically 150-300ms (one election timeout). For Zab, it can be seconds. For PBFT view changes, it can be… longer than you’d like. HotStuff was explicitly designed to make leader changes fast, because in a BFT setting where you rotate leaders frequently, expensive view changes are fatal to performance.
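Raft’s randomized timeouts are what keep that 150-300ms window from turning into repeated split votes, and the effect is easy to see in simulation. A toy model (our own; it only checks whether the two earliest timeouts fire within one round-trip of each other, which is when competing candidacies become likely):

```python
import random

def split_vote_rate(n_followers: int, timeout_range_ms: tuple,
                    rtt_ms: float, trials: int, seed: int = 42) -> float:
    """Estimate how often two followers time out within one round-trip
    of each other, so the first candidate's RequestVotes can't land
    before a second candidate starts campaigning."""
    rng = random.Random(seed)
    lo, hi = timeout_range_ms
    splits = 0
    for _ in range(trials):
        timeouts = sorted(rng.uniform(lo, hi) for _ in range(n_followers))
        if timeouts[1] - timeouts[0] < rtt_ms:
            splits += 1
    return splits / trials

# Identical timeouts collide every single time; the 150-300ms
# randomized range makes near-simultaneous candidacies rare.
always = split_vote_rate(4, (150, 150), rtt_ms=1.0, trials=1000)
rarely = split_vote_rate(4, (150, 300), rtt_ms=1.0, trials=1000)
```

This is also why the timeout range must be much larger than the network round-trip: shrink the range toward the RTT and the collision rate climbs right back up.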
Raft’s “strong leader” design is worth highlighting. In Raft, the leader is the sole authority on the log — followers accept whatever the leader tells them, and log entries only flow from leader to follower, never the other way. This simplifies reasoning about the protocol tremendously but means the leader does more work than in protocols like Multi-Paxos, where followers can occasionally know about entries the current leader doesn’t (from a previous leader’s incomplete rounds).
Dimension 7: Ordering Guarantees
| Protocol | Ordering Guarantee | Total Order? | Per-Key Ordering? | Notes |
|---|---|---|---|---|
| Paxos | Per-instance only | No (single decree) | No | Each instance independent |
| Multi-Paxos | Total order | Yes | Yes (implied) | Log is totally ordered |
| Raft | Total order | Yes | Yes (implied) | Log is totally ordered |
| Zab | Total order + causal | Yes | Yes | FIFO ordering for same client |
| VR | Total order | Yes | Yes (implied) | Log is totally ordered |
| PBFT | Total order | Yes | Yes | Within a view; across views with view-change |
| HotStuff | Total order | Yes | Yes | Chained QCs provide total order |
| Tendermint | Total order (per chain) | Yes | Yes | Block sequence provides total order |
| EPaxos | Partial order (fast path) | Only after execution ordering | Per-key with conflicts resolved | Dependency graph, not a log |
| Flexible Paxos | Total order | Yes | Yes | Same as Multi-Paxos |
| Kafka ISR | Total order per partition | Per-partition only | Only within partition | No cross-partition ordering |
EPaxos is the outlier here, and this is both its greatest strength and its most confusing aspect. EPaxos doesn’t maintain a totally ordered log — instead, it builds a dependency graph where commands that don’t conflict can be ordered independently. This allows parallelism (non-conflicting commands at different replicas don’t need to coordinate) but makes the execution layer more complex. You need a deterministic algorithm to linearize the dependency graph at each replica, and that algorithm (based on Tarjan’s strongly connected components) is subtle enough that multiple published versions had bugs.
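To see what that execution layer has to do, here is a simplified sketch of the linearization step (our own code, not the paper’s algorithm verbatim: it finds strongly connected components with an iterative Tarjan’s algorithm and executes components dependencies-first, ordering commands within a component by sequence number):

```python
def execution_order(deps: dict, seq: dict) -> list:
    """Linearize an EPaxos-style dependency graph.

    `deps` maps each command to the set of commands it depends on;
    `seq` gives each command's sequence number. Mutually dependent
    commands form a strongly connected component and are ordered by
    sequence number; components execute in dependency order.
    """
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        # Iterative Tarjan, to avoid Python's recursion limit.
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        work = [(v, iter(sorted(deps[v])))]
        while work:
            node, it = work[-1]
            advanced = False
            for w in it:
                if w not in index:
                    index[w] = low[w] = counter[0]; counter[0] += 1
                    stack.append(w); on_stack.add(w)
                    work.append((w, iter(sorted(deps[w]))))
                    advanced = True
                    break
                elif w in on_stack:
                    low[node] = min(low[node], index[w])
            if advanced:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:
                comp = []
                while True:
                    w = stack.pop(); on_stack.discard(w); comp.append(w)
                    if w == node:
                        break
                sccs.append(comp)

    for v in sorted(deps):
        if v not in index:
            strongconnect(v)

    # Tarjan emits SCCs dependencies-first, which is exactly the
    # execution order we need; break ties inside a component by seq.
    order = []
    for comp in sccs:
        order.extend(sorted(comp, key=lambda c: (seq[c], c)))
    return order
```

Every replica must run this deterministically and get the same answer, which is why subtle bugs here are so damaging: they don’t crash anything, they just make replicas diverge.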
Kafka’s per-partition ordering is a practical compromise that works brilliantly for many use cases but catches people off guard when they need cross-partition ordering. If you need events for user A and user B to be ordered relative to each other, they must go to the same partition. If you also need events for user B and user C to be ordered, all three users must be on the same partition. Transitive ordering requirements can collapse your partition scheme into a single partition faster than you’d expect.
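That transitive collapse is exactly the behavior of a union-find structure over keys. A sketch (hypothetical helper, not a Kafka API):

```python
from collections import defaultdict

class PartitionGroups:
    """Union-find over keys that must share a partition. Each pairwise
    ordering requirement merges two groups; transitive requirements
    collapse groups whether you intended it or not."""

    def __init__(self):
        self.parent = {}

    def find(self, k):
        self.parent.setdefault(k, k)
        while self.parent[k] != k:
            self.parent[k] = self.parent[self.parent[k]]  # path halving
            k = self.parent[k]
        return k

    def must_order(self, a, b):
        """Record that events for a and b need a relative order."""
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        out = defaultdict(set)
        for k in list(self.parent):
            out[self.find(k)].add(k)
        return list(out.values())

# A-B ordered and B-C ordered forces all three onto one partition,
# even though A and C were never directly related.
g = PartitionGroups()
g.must_order("user_a", "user_b")
g.must_order("user_b", "user_c")
```

Run your actual ordering requirements through something like this before choosing a partition key; if everything lands in one group, partitioning isn’t buying you any parallelism.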
Dimension 8: Membership Change Support
| Protocol | Membership Changes | Mechanism | Complexity |
|---|---|---|---|
| Paxos | Not specified | (Left as exercise for reader) | N/A |
| Multi-Paxos | Not specified in original | Various ad-hoc approaches | High — many subtle bugs |
| Raft | Joint consensus or single-server | Raft paper specifies both approaches | Moderate — but edge cases abound |
| Zab | Dynamic reconfiguration (3.5.0+) | Reconfiguration proposal | High — added years after initial release |
| VR | Reconfiguration protocol | Epoch-based | Moderate |
| PBFT | Not specified | View change can implicitly handle | Very high if attempted |
| HotStuff | Addressed in some variants | Committee rotation | Moderate in blockchain context |
| Tendermint | Validator set changes | Via application-level EndBlock | Moderate |
| EPaxos | Not specified | Open research problem | Very high |
| Flexible Paxos | Not specified | Quorum changes compound the problem | Very high |
| Kafka ISR | Built-in | Partition reassignment + ISR dynamics | Low (from user perspective) |
Membership changes — adding or removing nodes from the consensus group — are the feature that every protocol paper hand-waves over and every implementation team curses. Lamport’s Paxos paper doesn’t address it. The PBFT paper doesn’t address it. EPaxos doesn’t address it. This is not because the problem is trivial; it’s because it’s orthogonal to the core algorithm and fiendishly hard to get right.
Raft deserves credit for being the first major protocol paper to include a detailed membership change mechanism. The joint consensus approach (where the cluster temporarily operates with a configuration that spans old and new memberships) is correct but complex. The single-server change approach (adding or removing one server at a time) is simpler but requires careful sequencing.
Even Raft’s approach has subtleties that have tripped up implementations. The most notorious: a server that has been removed from the cluster but doesn’t know it yet can disrupt the cluster by starting elections with a higher term number. The Pre-Vote mechanism (a non-binding poll asking peers whether they would grant a vote, before incrementing the term and sending real RequestVotes) was added to address this; it appears in Ongaro’s dissertation but not in the original paper.
Kafka wins this category handily from a user perspective — partition reassignment is a well-tested operational procedure, and the ISR mechanism naturally handles temporary membership changes (a slow broker drops out of the ISR and rejoins later). The KRaft migration adds complexity, but the partition-level membership story remains solid.
Dimension 9: Implementation Complexity
This is the most subjective dimension, but also one of the most practically important. A protocol that’s theoretically optimal but impossible to implement correctly is worse than a protocol that’s theoretically suboptimal but actually works.
| Protocol | Implementation Complexity | Lines of Code (rough) | Key Difficulties |
|---|---|---|---|
| Paxos | High | 1K-5K | Mapping to practical system; gaps in log |
| Multi-Paxos | Very High | 5K-20K | No canonical specification; endless variants |
| Raft | Moderate | 3K-10K | Canonical spec helps; still many edge cases |
| Zab | High | 10K-30K (in ZooKeeper) | Tightly coupled to ZooKeeper; complex recovery |
| VR | Moderate | 3K-10K | Well-specified; but few reference implementations |
| PBFT | Very High | 10K-30K | Cryptographic operations; state transfer; garbage collection |
| HotStuff | High | 5K-15K | Threshold signatures; pipelining correctness |
| Tendermint | High | 15K-40K (full node) | ABCI interface; gossip layer; evidence handling |
| EPaxos | Very High | 5K-15K | Dependency tracking; execution ordering; recovery |
| Flexible Paxos | Very High | Same as Multi-Paxos + config | All of Multi-Paxos plus quorum configuration safety |
| Kafka ISR | Moderate (within Kafka) | N/A (part of larger system) | ISR management; controller failover; exactly-once |
The lines-of-code numbers are extremely rough and depend heavily on language, coding style, and how much you include (just the consensus layer? The RPC layer? The storage layer? Testing?). They’re meant to give a relative sense, not absolute numbers.
Raft’s moderate complexity is its entire selling point, and it’s a real one. The Raft paper includes enough detail that a graduate student can implement it in a semester (with bugs that will take two more semesters to fix). Multi-Paxos requires reading between the lines of Lamport’s papers, several follow-up papers, and ideally a few blog posts by people who’ve implemented it. PBFT requires all of that plus cryptography.
EPaxos is in the “very high” category not because the core algorithm is inherently more complex than Multi-Paxos, but because the dependency tracking and execution ordering are genuinely difficult to get right. The original paper had a bug in the execution algorithm that was discovered years after publication. If the authors can get it wrong, so can you.
The Grand Comparison Table
For reference, here’s the complete matrix in one (necessarily compressed) view:
| | Paxos | Multi-Paxos | Raft | Zab | VR | PBFT | HotStuff | Tendermint | EPaxos | Flex. Paxos | Kafka ISR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Msg complexity (normal) | 2n | n | n | n | n | n^2 | n/phase | n^2 | n or 2n | n | ISR size |
| Msg complexity (view change) | 2n | 2n+ | n | n*log | n^2 | n^2 | n | n^2 | N/A | 2n+ | O(ISR) |
| Latency (msg delays) | 4/2 | 2 | 2 | 2 | 2 | 5 | 7/~2 | 4 | 2/4 | 2 | 2 |
| Throughput | Low | Leader-bound | Leader-bound | Leader-bound | Leader-bound | n^2-bound | Leader-bound | n^2-bound | Distributed | Leader-bound | Leader/part. |
| Fault tolerance | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Crash, 2f+1 | Byz, 3f+1 | Byz, 3f+1 | Byz, 3f+1 | Crash, 2f+1 | Crash, varies | Crash, ISR |
| Leader required | Proposer | Yes | Yes (strong) | Yes | Yes | Yes | Yes (rotating) | Yes (rotating) | No | Yes | Yes/partition |
| Total ordering | No | Yes | Yes | Yes+causal | Yes | Yes | Yes | Yes | Partial | Yes | Per-partition |
| Membership change | Unspecified | Unspecified | Specified | Added later | Specified | Unspecified | Varies | App-level | Unspecified | Unspecified | Built-in |
| Impl. complexity | High | Very high | Moderate | High | Moderate | Very high | High | High | Very high | Very high | Moderate |
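Two rows of this matrix reduce to simple arithmetic that is worth internalizing: cluster size as a function of tolerated failures, and normal-case message counts for leader-based versus all-to-all protocols. The helpers below encode the textbook formulas; they are back-of-envelope aids, not measurements of any implementation.

```python
# Quorum and message-count arithmetic behind the matrix rows above.

def crash_cluster_size(f):
    """Crash-fault-tolerant protocols (Paxos, Raft, Zab, VR) need 2f + 1 nodes."""
    return 2 * f + 1

def byzantine_cluster_size(f):
    """BFT protocols (PBFT, HotStuff, Tendermint) need 3f + 1 nodes."""
    return 3 * f + 1

def leader_based_messages(n):
    """Leader broadcasts to n-1 followers and collects n-1 replies: O(n)."""
    return 2 * (n - 1)

def all_to_all_messages(n):
    """Every replica sends to every other replica in a phase: O(n^2)."""
    return n * (n - 1)

for f in (1, 2, 3):
    n_cft, n_bft = crash_cluster_size(f), byzantine_cluster_size(f)
    print(f"f={f}: CFT {n_cft} nodes, {leader_based_messages(n_cft)} msgs/round; "
          f"BFT {n_bft} nodes, {all_to_all_messages(n_bft)} msgs/phase")
```

Note how fast the all-to-all column grows: tolerating one more Byzantine failure adds three nodes and the per-phase message count grows quadratically, which is the cost HotStuff's linear communication pattern was designed to avoid.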
Why Microbenchmarks Lie
We’ve alluded to this throughout, but it deserves its own section.
A microbenchmark of a consensus protocol typically measures the protocol in isolation: how fast can it commit operations with no application-level processing, minimal payload sizes, and a perfectly behaved network? These numbers are useful for comparing the overhead of the protocol itself, but they tell you almost nothing about the performance of a system built on top of the protocol.
Here’s what microbenchmarks systematically miss:
1. Application processing time. If your state machine takes 1ms to apply a command, and your consensus protocol commits in 0.5ms, the protocol isn’t your bottleneck. Shaving 0.1ms off consensus latency (by, say, switching from Raft to a theoretically faster protocol) saves you 6.7% of total latency, not 20%.
2. Serialization overhead. Consensus protocols send messages. Those messages must be serialized and deserialized. In a benchmark with 16-byte payloads, serialization is negligible. In a system with 64KB payloads containing Protocol Buffers, serialization can account for 10-30% of CPU time.
3. Garbage collection pauses. If your consensus implementation is in Java (ZooKeeper, Kafka) or Go (etcd), GC pauses will periodically spike your latency in ways that the protocol cannot prevent. A C++ or Rust implementation will behave differently, but you don’t get to choose the implementation language of most off-the-shelf systems.
4. Connection management. Benchmarks run on a fixed number of connections in a controlled environment. Production systems have connection churn, TLS handshakes, keep-alive management, and TCP behavior that depends on the physical network path.
5. Interaction with other system components. Raft’s performance in etcd depends on boltdb’s write characteristics, gRPC’s overhead, the Go scheduler’s behavior under load, and how etcd’s MVCC layer interacts with the Raft log. None of these show up in a Raft microbenchmark.
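Point 1 is worth restating numerically. The function below reproduces the 1ms-apply, 0.5ms-commit example from the list: shaving time off the commit path only saves you that time as a fraction of the whole request, not as a fraction of the consensus step.

```python
# End-to-end latency savings from a protocol-level speedup, using the
# 1ms apply / 0.5ms commit numbers from point 1 above.

def end_to_end_savings(apply_ms, commit_ms, commit_speedup_ms):
    """Fraction of total request latency saved by speeding up commit."""
    total = apply_ms + commit_ms
    return commit_speedup_ms / total

saving = end_to_end_savings(apply_ms=1.0, commit_ms=0.5, commit_speedup_ms=0.1)
print(f"{saving:.1%}")  # 6.7% of total latency, even though it's 20% of commit time
```

The same arithmetic applies to serialization and GC overhead: each component caps how much a faster protocol can help, which is why Amdahl's-law-style reasoning belongs in every protocol evaluation.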
The lesson: don’t choose a consensus protocol based on microbenchmark numbers. Choose based on the properties that matter for your system (fault model, ordering requirements, membership flexibility), and then optimize the implementation. A well-engineered Raft implementation will outperform a naive EPaxos implementation in virtually every realistic scenario.
Dimension 10: Production Maturity and Ecosystem
No comparison is complete without acknowledging that protocols don’t exist in a vacuum — they exist in ecosystems. The best protocol with no production-grade implementation loses to a mediocre protocol with excellent tooling.
| Protocol | Notable Implementations | Years in Production | Tooling Quality | Community Size |
|---|---|---|---|---|
| Paxos | Google Chubby, Spanner (internal) | 20+ | Internal (limited public) | Academic + Google |
| Multi-Paxos | Google Spanner, Megastore (internal); Ratis | 20+ | Limited public tooling | Small public community |
| Raft | etcd, CockroachDB, TiDB, Consul, hashicorp/raft, openraft | 10+ | Excellent | Very large |
| Zab | ZooKeeper | 15+ | Good (ZK ecosystem) | Large |
| VR | Viewstamped Replication Revisited implementations | Limited | Minimal | Very small |
| PBFT | Hyperledger Fabric (early), custom implementations | 15+ (mostly academic) | Limited | Small |
| HotStuff | Diem/Libra (defunct), Aptos, Flow blockchain | 5+ | Moderate | Growing (blockchain) |
| Tendermint | Cosmos, Binance Chain, various blockchains | 7+ | Good (Cosmos SDK) | Large (blockchain) |
| EPaxos | Research prototypes, limited production use | Limited | Minimal | Very small |
| Flexible Paxos | Research prototypes; influenced WPaxos | Very limited | Minimal | Very small |
| Kafka ISR | Apache Kafka, Confluent Platform | 12+ | Excellent | Very large |
Raft and Kafka ISR dominate the ecosystem dimension. Raft has at least a dozen production-grade implementations across multiple languages, comprehensive documentation, university courses, interactive visualizations, and thousands of engineers who’ve operated Raft-based systems. Kafka has an even larger ecosystem: client libraries in every language, a rich connector ecosystem (Kafka Connect), a stream processing framework (Kafka Streams), managed service offerings, and a conference circuit.
EPaxos and Flexible Paxos, despite their theoretical advantages, have essentially zero production ecosystem. If you choose EPaxos, you’re implementing it yourself (or using one of a handful of research prototypes) with minimal community support. When you hit a bug — and you will hit bugs — your debugging resources are the original paper and its errata.
VR is in a similar position. Despite being a well-designed protocol (arguably cleaner than Paxos), its lack of a canonical implementation and tiny community make it a risky choice for production systems. Choosing VR is a statement of intellectual independence that your on-call team will not appreciate.
This ecosystem effect is self-reinforcing. More implementations mean more bug fixes, more operational knowledge, more tooling, more documentation, which attracts more implementations. Raft’s dominance in the CFT space is as much a network effect as a technical judgment.
The Real Comparison
After all these dimensions, all these tables, and all these caveats, here’s the uncomfortable truth: for most systems, the choice between Raft, Multi-Paxos, and Zab doesn’t matter much. They have roughly the same message complexity, the same latency, the same fault tolerance, and the same throughput bottleneck (the leader). The difference is in implementation maturity, ecosystem support, and how much you enjoy reading the original papers.
The meaningful distinctions are at a higher level:
- Do you need Byzantine fault tolerance? If yes, your options are PBFT, HotStuff, Tendermint, or their derivatives. This is the single biggest fork in the decision tree.
- Is a leader bottleneck unacceptable? If yes, consider EPaxos (if you can handle the complexity) or a partitioned approach (like Kafka) where different partitions have different leaders.
- Do you need total ordering across all operations? If not, you can potentially use something simpler or use per-partition ordering.
- Are you willing to trade safety for availability? If yes, Kafka ISR with `unclean.leader.election.enable=true` or a Flexible Paxos configuration with small Phase 2 quorums might be appropriate.
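These distinctions can be sketched as a function. This is purely illustrative (our encoding of the bullets above, not a complete framework): it returns protocol families worth investigating for a given set of requirements, and a real decision would weigh ecosystem maturity and team expertise alongside these properties.

```python
# An illustrative encoding of the decision points above. Returns families
# to investigate, not a verdict.

def candidate_protocols(byzantine: bool, leader_bottleneck_ok: bool,
                        total_order_needed: bool) -> list[str]:
    if byzantine:
        # The single biggest fork in the decision tree.
        return ["PBFT", "HotStuff", "Tendermint"]
    if not leader_bottleneck_ok:
        return ["EPaxos", "partitioned log (Kafka-style)"]
    if not total_order_needed:
        return ["per-partition ordering (Kafka ISR)", "Raft per shard"]
    # The common case, where the choice matters less than the ecosystem.
    return ["Raft", "Multi-Paxos", "Zab"]

print(candidate_protocols(byzantine=False, leader_bottleneck_ok=True,
                          total_order_needed=True))
```

Notice that the most common set of answers lands in the branch where, as argued above, the protocols are roughly interchangeable and ecosystem factors dominate.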
We’ll turn these distinctions into a concrete decision framework in the next chapter.
But before we leave this comparison behind, one more uncomfortable truth: the protocol you choose matters less than how well you implement, test, and operate it. A mediocre protocol with excellent monitoring, comprehensive testing, well-documented runbooks, and an on-call team that understands the failure modes will outperform an optimal protocol that nobody on the team fully understands. The tradeoff matrix helps you make an informed choice, but the choice is the beginning of the work, not the end.
The tables in this chapter are a map. The territory is your production environment, your team’s expertise, your specific workload, and the 3 AM incidents that will inevitably test everything you thought you knew about your consensus protocol. Choose wisely, but more importantly, implement carefully.