Introduction: The Book Blockchain Wasn't
If you read the news between 2016 and 2022, you could be forgiven for thinking that blockchain had invented distributed consensus. Every think piece about supply chains, elections, medical records, and loyalty points reached for the same vocabulary — miners, validators, gas, proof-of-work, proof-of-stake — as if these were the primitives of agreement itself.
They are not.
Consensus — getting a group of computers to agree on a single answer despite failures, delays, and concurrency — is a solved problem in computer science, and it was solved decades before Bitcoin. The catch is that the classical solutions assume something blockchain refuses to: that you know who is in the room. When you control membership, the problem gets dramatically easier. You need fewer messages. You don't need economic incentives. You don't need to boil oceans. You get stronger guarantees, faster finality, and cheaper operation.
This is the book about those solutions. The ones running inside etcd, which in turn runs Kubernetes, which in turn runs most of the world's production software. The ones inside Consul, ZooKeeper, the Kafka controller (now KRaft), Spanner, CockroachDB, FoundationDB, and the configuration plane of nearly every cloud. The ones in permissioned financial networks that happen to call themselves blockchains because the word polled well with executives. And the research lineage behind all of them: Paxos, Raft, Viewstamped Replication, PBFT, HotStuff, and their relatives.
Who this book is for
Working engineers. You have used a distributed system. You have read a post-mortem. You know what a split-brain is, have strong opinions about retries, and are tired of being told that every problem is a blockchain problem. You want to understand, concretely, what's happening when your key-value store "achieves consensus" — enough that you can read an architecture decision record or a Jepsen report and evaluate it on its merits.
The prerequisites are modest. You should be comfortable with the idea of a state machine, know that networks can drop messages and reorder them, have at least heard the phrase CAP theorem, and be able to read a code snippet in a C-family language. You do not need to know any specific algorithm. You do not need formal-methods training. We translate the math into pictures and invariants a practitioner can hold in their head.
What this book is not
It is not a blockchain book. There is a companion volume for that: How Blockchains Actually Work (Without the Hype). Here, we mention blockchain only when it earns the mention — in Chapter 12, where permissioned chains overlap with classical BFT, and occasionally as a contrast class.
It is not a proofs book. Correctness arguments are given in the form an engineer can check by hand ("if two quorums overlap in at least one node, and that node refuses to accept two conflicting proposals, then..."), not the form a theorem prover can check. Pointers to the formal treatments live in Chapter 15.
It is not a library tour. We will use etcd, ZooKeeper, and their peers as case studies, but the goal is that when a new consensus-based system lands on your desk, you recognize the shape.
How the book is built
The story is a chain of problems and responses.
- First we establish the problem: state machine replication — the canonical use case that every consensus algorithm is trying to solve.
- Then we meet the walls: FLP (you cannot guarantee consensus in a purely asynchronous system with even one crash) and CAP (during a partition, you cannot preserve both consistency and availability). These are the constraints every practical algorithm has to route around.
- Then we get precise about what failure means — crash, omission, Byzantine — because algorithms are cheap or expensive depending on which of these you sign up for.
- With those in hand, we take the tour. Paxos (the one everyone names), Raft (the one everyone uses), Viewstamped Replication (the one that predates them both and deserves the credit), PBFT (when you can't trust your peers), HotStuff (when you have a lot of peers), and randomized consensus (when you want to sidestep FLP with a coin).
- Then we walk the production floor — etcd, Consul, ZooKeeper, Kafka, Spanner, CockroachDB — and ask what these algorithms actually look like in systems people run for a living.
- We take a fair look at permissioned blockchains.
- We discuss what goes wrong: liveness, safety, split-brain, and the incidents that teach the theory better than the theory does.
- We give you a decision procedure for the next time someone asks you, "should we use Raft?"
- We point you at the papers and the textbooks for everything we had to cut.
A note on stance
Every algorithm in this book exists because the previous one had a problem. Paxos exists because Lamport found a way to reach consensus despite FLP. Raft exists because Paxos-the-paper was incomprehensible enough to hold the field back for a decade. PBFT exists because crash-fault tolerance is not enough when your peers might lie. HotStuff exists because PBFT's communication costs got painful at scale.
We try to tell this story the way you'd tell it to a colleague over coffee: respectful of the reader, unimpressed by jargon, willing to say "the part of the original paper that confused everyone was...", and honest about tradeoffs. "Raft is easier to implement than Paxos" is a genuine engineering win. "PBFT needs 3f+1 nodes to tolerate f failures" is a real cost. Both things are true and both are interesting.
There are no revolutionary, game-changing, or next-generation algorithms here. Just the slow, patient accumulation of good ideas that keep the lights on.
Let's go.
The Problem: State Machine Replication
Before we can talk about consensus algorithms, we need to be precise about the problem they solve. Vague problems invite vague solutions; precise problems invite good ones.
The word consensus gets overloaded. Sometimes it means "agreement" in an abstract political sense. Sometimes it means "the particular value a group of processes settle on." Sometimes it means "the protocol those processes run." All three meanings appear in the literature, sometimes in the same paper. We will use it in the technical sense: a group of processes, each starting with an input, must all decide on a single output, such that:
- Agreement — no two correct processes decide different values.
- Validity — the decided value was proposed by at least one process.
- Termination — every correct process eventually decides.
This is the definition Lamport and others have refined over decades. Any protocol that satisfies these three is a consensus protocol. The hard part is satisfying all three at once under realistic assumptions about failures and timing.
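The three properties are concrete enough to check after the fact. Here is a toy sketch in Go (the Run type and all names are ours, invented for illustration, not from any real system) that validates a finished run of a single consensus instance:

```go
package main

import "fmt"

// Run records a finished run of one consensus instance: what was proposed,
// what each correct process decided, and which processes stayed correct.
type Run struct {
	Proposed []string       // inputs, one per process
	Decided  map[int]string // process ID -> decided value (absent = never decided)
	Correct  []int          // IDs of processes that did not fail
}

// Agreement: no two correct processes decide different values.
func agreement(r Run) bool {
	var first string
	seen := false
	for _, p := range r.Correct {
		v, ok := r.Decided[p]
		if !ok {
			continue
		}
		if seen && v != first {
			return false
		}
		first, seen = v, true
	}
	return true
}

// Validity: every decided value was proposed by at least one process.
func validity(r Run) bool {
	proposed := map[string]bool{}
	for _, v := range r.Proposed {
		proposed[v] = true
	}
	for _, v := range r.Decided {
		if !proposed[v] {
			return false
		}
	}
	return true
}

// Termination: every correct process eventually decided.
func termination(r Run) bool {
	for _, p := range r.Correct {
		if _, ok := r.Decided[p]; !ok {
			return false
		}
	}
	return true
}

func main() {
	r := Run{
		Proposed: []string{"x=1", "x=2", "x=1"},
		Decided:  map[int]string{0: "x=2", 2: "x=2"}, // process 1 crashed
		Correct:  []int{0, 2},
	}
	fmt.Println(agreement(r), validity(r), termination(r)) // true true true
}
```

Note what the checker cannot do: it only judges runs that have already finished. The hard part, as the rest of this chapter explains, is designing a protocol whose every run passes these checks.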
But consensus as defined above is a one-shot problem. You propose, you agree, you decide — once. Real systems want to agree on a sequence of values. That is where state machine replication comes in.
The canonical use case
Suppose you have a service — a database, a configuration store, a lock manager, whatever. The service is a deterministic state machine: given the current state and an input command, it produces a new state and an output. On a single node, this is trivial: determinism does all the work.
Now you want this service to survive node failures without downtime or data loss. So you run it on several nodes. The challenge: every replica has to end up in the same state. If one replica thinks the balance is $100 and another thinks it is $200, we have a problem that no amount of retry logic will fix.
The classical trick, due to Leslie Lamport in the late 1970s and refined by Fred Schneider in a 1990 survey paper, is:
State Machine Replication (SMR): If every replica starts in the same initial state and applies the same sequence of deterministic commands in the same order, every replica ends up in the same state.
That is, the problem of "keep replicas in sync" reduces to "agree on a totally-ordered log of commands." The state machine is deterministic by assumption; all the hard work of replication lives in the log.
clients ──▶ consensus protocol ──▶ ordered log ──▶ each replica applies
                                                    commands in order

┌───────┐          ┌───────────┐
│ put x │─────┐    │ 1: put x  │───▶ replica A: apply 1,2,3 ── state S
│ put y │─────┼───▶│ 2: put y  │───▶ replica B: apply 1,2,3 ── state S
│ cas z │─────┘    │ 3: cas z  │───▶ replica C: apply 1,2,3 ── state S
└───────┘          └───────────┘
Every consensus algorithm in this book — Paxos, Raft, PBFT, HotStuff, all of it — is, at its heart, a mechanism for producing that totally-ordered log of commands despite failures and asynchrony. The service on top of the log is an implementation detail. The log is the point.
This decomposition is worth internalizing:
- Below the log: consensus. Hard. Full of tradeoffs. The subject of this book.
- Above the log: deterministic state machine. Easy. Just a function.
When people say "we built our own Raft" they almost always mean they built the below-the-log part. The above-the-log part is usually whatever they already had.
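The above-the-log half really is just a function. A minimal sketch in Go of a toy key-value state machine (the commands and names are our own inventions, not any real system's API) makes the reduction concrete: two replicas fed the same log cannot help but reach the same state.

```go
package main

import "fmt"

// Command is one entry in the agreed, totally-ordered log.
type Command struct {
	Op, Key, Val string
}

// Apply is the deterministic state transition: (state, command) -> state.
func Apply(state map[string]string, c Command) {
	switch c.Op {
	case "put":
		state[c.Key] = c.Val
	case "del":
		delete(state, c.Key)
	}
}

// Replay builds a replica's state by applying a log prefix in order,
// starting from the empty initial state.
func Replay(log []Command) map[string]string {
	state := map[string]string{}
	for _, c := range log {
		Apply(state, c)
	}
	return state
}

func main() {
	// The log is what consensus produced; every replica sees the same one.
	log := []Command{
		{"put", "x", "1"},
		{"put", "y", "2"},
		{"del", "x", ""},
	}
	a, b := Replay(log), Replay(log) // two replicas, same log
	fmt.Println(a, b)                // identical states: map[y:2] map[y:2]
}
```

Everything interesting in this book happens before Replay is called: producing that log despite crashes, partitions, and delay.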
What "same state" actually means
"All replicas end up in the same state" sounds simple until you pick at it. Same state when? What if one replica has applied commands 1–100 and another has applied 1–95? Are they in the same state?
The standard move is to say: replicas are eventually in the same state for any prefix of the log they have both applied. Specifically:
- All correct replicas apply the same totally-ordered sequence of commands.
- A replica that has applied commands 1..k is in the state that results from applying 1..k to the initial state.
- Replicas may lag — replica B may be at 95 while A is at 100 — but they never disagree on commands they have both applied.
This is sometimes called linearizability of the log, or when you push the guarantee up through the state machine, linearizability of the service (Herlihy and Wing, 1990). Linearizability means, roughly, that even though operations are happening concurrently across many clients, the system behaves as if each operation took effect instantaneously at some point between its invocation and response.
Linearizability is the strongest consistency model in common use. It is also not the only useful one — there's sequential consistency, causal consistency, eventual consistency, and more. We'll get to the tradeoffs in Chapter 11. For now, hold on to:
Consensus produces an ordered log. An ordered log makes replicas applying deterministic commands indistinguishable from a single machine.
Not every system needs this
State machine replication is powerful and expensive. Every write has to go through consensus, which means every write pays for multiple rounds of network I/O. If all you need is a cache, or a log of events where duplicate or out-of-order delivery is tolerable, or a search index that can lose a few documents without ruining anyone's day — SMR is overkill. Eventually-consistent systems (Dynamo-style) make a different tradeoff: they accept staleness and sometimes conflicts in exchange for availability under partitions and much higher write throughput.
We'll meet those systems later. For now, assume we are building the strict systems — the ones where order matters and correctness beats throughput. The kind of systems whose bugs get called incidents.
Why this isn't the blockchain problem
Blockchain is also in the business of producing an ordered log — the chain is nothing but a sequence of blocks, each containing an ordered list of transactions. So why do blockchains look nothing like Raft?
Because they solve a different problem:
| Dimension | Classical SMR | Public blockchain |
|---|---|---|
| Membership | Fixed, known roster | Open; anyone can join |
| Identity | Authenticated (keys, certs) | Pseudonymous |
| Failure model | Crash or Byzantine, bounded fraction | Byzantine, Sybil-prone |
| Network | Private or semi-trusted | Public internet |
| Incentives | Operators are paid to run nodes | Participants must be economically motivated |
| Finality | Usually within milliseconds | Probabilistic, or eventual after minutes |
Every classical consensus algorithm assumes you know who is in the room — that you can enumerate the participants, authenticate their messages, and count them. The moment you can do that, you can have a 3f+1 or 2f+1 quorum. You can run leader-based protocols without worrying about Sybil attacks. You can finalize decisions in one network round-trip.
Blockchains cannot assume that. A public blockchain is trying to bootstrap agreement among a dynamic, pseudonymous, adversarial crowd. That's a fundamentally harder problem, and it needs expensive tools — proof-of-work, proof-of-stake, economic incentives, probabilistic finality.
Different problem, different tools. Neither is "the" consensus problem; they sit at different points on the tradeoff surface.
A note on terminology
A few terms we will use throughout the book, defined once so we are consistent:
- Process / node / replica / server — used interchangeably to mean one member of the consensus group. When distinction matters we'll say so.
- Client — the external caller submitting commands to the service on top of the log.
- Command / operation / request — the thing submitted by a client, destined for the state machine.
- Proposal — a candidate value (often, but not always, a command) that a consensus instance is trying to decide on.
- Decision / commit — the moment a value becomes durable and cannot be reversed.
- Quorum — a subset of processes large enough that any two quorums intersect in at least one correct process. We will derive the sizes in later chapters.
- Leader / primary / proposer — the process currently driving decisions. Some protocols have one; some rotate; some have none.
- View / epoch / term — a logical time period during which a particular leader (or configuration) is in charge. When the leader changes, the view changes.
Those last three words — view, epoch, term — are the same idea wearing different hats in different papers. Paxos calls it a ballot number; Raft calls it a term; Viewstamped Replication and PBFT call it a view; some papers say epoch. We'll use the native term for each algorithm when we are inside its chapter.
What you should take away
- Consensus in the technical sense means: agreement, validity, termination.
- Real systems want a log of decisions, and they get it by running consensus repeatedly — that is what we call state machine replication.
- SMR converts the hard problem ("keep replicas in sync") into a pair of easier ones: "agree on an ordered log" (consensus) plus "apply commands deterministically" (state machine).
- Blockchains also produce an ordered log, but under different assumptions — open membership, pseudonymous identity, Byzantine networks — so their solutions look different.
- The rest of this book is about the "agree on an ordered log" half, under the assumption of known, authenticated participants.
Next up, the constraints: what the theorists proved we can't do, and what the escape hatches look like.
The Impossibility Results
Before we get to the algorithms, we have to know what the rules of the game forbid. Two results above all define the shape of the field:
- FLP — Fischer, Lynch, and Paterson (1985). "Impossibility of Distributed Consensus with One Faulty Process."
- CAP — Brewer (2000), Gilbert and Lynch (2002). "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services."
If you have worked in distributed systems for a while, you have seen these cited as banners waved to end arguments. That is a shame, because both are more interesting and more nuanced than the banner versions suggest. We'll unpack each on its own terms, say what it actually rules out, and describe the escape hatches real algorithms use.
FLP: asynchronous consensus is impossible
Let's start with what FLP says.
Claim: In a fully asynchronous system with even one process that might crash, there is no deterministic protocol that solves consensus.
"Fully asynchronous" is crucial. In FLP's model:
- There is no upper bound on message delay — any finite delay is possible.
- There is no upper bound on processing speed — any finite time to compute is possible.
- Processes have no clocks.
- The adversary scheduling messages and processing steps is arbitrary (but finite).
In that world, FLP proved you cannot satisfy all three of agreement, validity, and termination, deterministically, if any process might fail by crashing.
The intuition, stripped of formalism:
Imagine a protocol that purports to solve consensus. At any point, the protocol is in some configuration — the state of every process plus all in-flight messages. Configurations can be classified:
- 0-valent: every run from here decides 0.
- 1-valent: every run from here decides 1.
- bivalent: some runs from here decide 0, some decide 1.
Initial configurations must include at least one bivalent one (otherwise the decision is fixed by the inputs and not really "agreement"). FLP's elegant trick is to show that from any bivalent configuration, the adversary can choose a message delivery order that leads to another bivalent configuration — forever. The protocol can be kept in a bivalent state indefinitely by carefully timing one message at a time. So it never terminates.
Notice what the adversary is doing. It isn't crashing processes arbitrarily (it only gets one crash). It's scheduling messages to keep the protocol from making up its mind. This is the heart of the result: asynchrony plus even a single crash gives an adversary enough power to prevent termination.
What FLP does NOT say. This is where confusion reigns. FLP does not say consensus is impossible in practice. It does not say you shouldn't bother. What it says is precise:
- In the fully asynchronous model,
- for a deterministic protocol,
- tolerating even one crash,
- you cannot guarantee termination in every run.
Each italicized word is a relaxation that unlocks possibility.
- Relax "fully asynchronous": assume the network is eventually synchronous — that after some unknown time, messages are delivered within some bound. This is the partial synchrony model of Dwork, Lynch, and Stockmeyer (1988). Real networks are partially synchronous most of the time. All practical consensus algorithms assume partial synchrony for liveness, while remaining safe in full asynchrony.
- Relax "deterministic": use randomization. If each process can flip a coin, you can break the adversary's scheduling power. Ben-Or's 1983 algorithm does exactly this, at the cost of probabilistic termination. Chapter 10 is about this family.
- Relax "tolerating a crash": if no processes fail, consensus is easy. The scary part of FLP is that even one failure ruins deterministic asynchronous agreement.
- Relax "termination": if you give up liveness, safety alone is easy — just never decide. Obviously not useful, but it shows where the impossibility bites.
Practical algorithms combine these relaxations in predictable ways:
- Paxos, Raft, VR, PBFT, HotStuff are all safe under asynchrony, live under partial synchrony. They will never produce conflicting decisions no matter how adversarial the network. They will make progress when the network is "well-behaved enough."
- Randomized protocols (Ben-Or, and modern descendants like HoneyBadger-style common-coin protocols) are safe always, live with probability 1 — meaning they terminate with probability approaching 1 over time, even in full asynchrony.
This is the FLP escape hatch every practical system takes. Safety always, liveness under timing assumptions (or in probability). Keep that phrase in your pocket; you'll see it repeatedly.
CAP: pick two
CAP is the most cited, most misquoted, most vehemently argued-about theorem in distributed systems. It was informally stated by Eric Brewer in a 2000 PODC keynote and formally proven by Seth Gilbert and Nancy Lynch in 2002. Let's do it correctly.
Claim: A networked shared-data system cannot simultaneously provide Consistency, Availability, and Partition tolerance.
Now the words:
- Consistency (C), in CAP, means linearizability — the strict ordering we discussed in Chapter 1. (Note: the C in CAP is not the C in ACID. The C in ACID is "constraints preserved." Same letter, different concepts. Blame computer science.)
- Availability (A) means every request to a non-failed node receives a non-error response — not an error, not a timeout, a real response with data.
- Partition tolerance (P) means the system continues to operate despite an arbitrary number of messages being dropped between nodes.
The theorem says you can't have all three at once when a partition happens.
Why people draw the triangle wrong
A thousand slide decks present CAP as "pick two out of three" and draw a triangle with arrows suggesting you choose two corners: CA, CP, or AP. This framing is essentially wrong.
Partitions are not optional. Networks drop messages. Switches fail. Cables get unplugged. Cloud providers have bad days. A system that doesn't tolerate partitions doesn't exist; it just fails badly when partitions inevitably happen. So P is not a dial you choose — it's reality.
What CAP really says is: when a partition happens, you have to choose between C and A. Either you refuse to serve some requests (sacrifice availability) so that clients on different sides of the partition don't see conflicting states (preserve consistency); or you keep serving (preserve availability) and accept that different sides will diverge (sacrifice consistency).
In the no-partition case, you can absolutely have both consistency and availability. That's what you build a system for. CAP is a statement about degraded operation, not a design matrix.
CP and AP systems
With that cleaned up:
- CP systems sacrifice availability during partitions. They refuse writes (and maybe reads) to minority partitions. etcd, Consul, ZooKeeper, Spanner, most consensus-based stores — CP.
- AP systems sacrifice consistency during partitions. They keep serving on both sides and reconcile later. Dynamo, Cassandra, Riak (classic Dynamo-style) — AP.
Nearly every consensus algorithm in this book sits firmly on the CP side. That's by design: SMR wants strict ordering, which means some node has to be authoritatively "current," and a minority partition can't be sure it still is.
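The CP choice fits in one function. A toy sketch (the function and its name are ours, not any real system's API): a node serves writes only while it can reach a majority of the cluster, counting itself.

```go
package main

import "fmt"

// canServeWrites is the CP decision in miniature: a node accepts writes only
// if the set of nodes it can currently reach (itself plus reachable peers)
// forms a strict majority of the cluster. In a minority partition this
// returns false — the node chooses consistency over availability.
func canServeWrites(clusterSize, reachablePeers int) bool {
	return reachablePeers+1 > clusterSize/2
}

func main() {
	// A 5-node cluster splits 3 / 2.
	fmt.Println(canServeWrites(5, 2)) // majority side (self + 2 peers): true
	fmt.Println(canServeWrites(5, 1)) // minority side (self + 1 peer): false
}
```

An AP system would skip this check, serve both sides, and reconcile the divergence later; that single omitted guard is most of the difference between the two columns above.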
PACELC: the more honest theorem
Daniel Abadi proposed PACELC in 2010 as a refinement: if there is a Partition, choose between Availability and Consistency; Else (no partition), choose between Latency and Consistency.
This matters because consensus algorithms trade latency for consistency even when the network is fine. Every write has to go through a quorum; that's at least one round-trip to f+1 or 2f+1 peers. If you want single-digit-millisecond tail latencies and you have cross-region replicas, consensus will cost you. The "no partition" choice is not free.
PACELC is less famous than CAP but more useful for reasoning about real systems. When someone asks "should we use Raft?" the honest answer often starts with "how much latency are you willing to pay per write?"
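That per-write latency cost is easy to estimate. A sketch (our own helper, not a real API): the leader's commit latency is roughly the round-trip time of the peer whose ack completes the quorum, because the leader's own vote is free.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommitLatency: a leader needs acks from a majority of the cluster.
// It already has its own vote, so it waits for the k-th fastest peer
// round-trip, where k = majority - 1. rttMs are leader->peer round-trip
// times in milliseconds.
func quorumCommitLatency(rttMs []float64) float64 {
	n := len(rttMs) + 1 // peers plus the leader itself
	acksNeeded := n / 2 // majority (n/2 + 1) minus the leader's own vote
	sorted := append([]float64(nil), rttMs...)
	sort.Float64s(sorted)
	return sorted[acksNeeded-1] // the ack that completes the quorum
}

func main() {
	// Hypothetical deployment: leader in us-east; peers in us-east,
	// us-west, eu, ap.
	rtt := []float64{1.2, 62, 78, 160}
	fmt.Println(quorumCommitLatency(rtt)) // 62 — cross-region, even with no partition
}
```

Notice that the slowest peers never matter: the leader commits as soon as the quorum fills. This is why a 5-node cluster with one slow replica still performs well, and why geography, not node count, usually dominates the bill.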
What these results actually rule out, and what they don't
FLP and CAP together forbid:
- A deterministic protocol that, under fully asynchronous networks with at least one possible crash, always terminates.
- A system that, during a partition, keeps all nodes responsive and strictly consistent.
They do not forbid:
- A deterministic protocol that is always safe and terminates under partial synchrony. (Paxos, Raft, PBFT, HotStuff.)
- A system that is consistent when the network is healthy and unavailable only during partitions. (etcd, Consul, ZooKeeper.)
- A system that is always available and converges after partitions heal. (Cassandra, DynamoDB in default modes.)
- A randomized protocol that terminates with probability 1 under asynchrony. (Ben-Or family.)
So the results are not obstacles to building systems; they are shape constraints on the systems you can build. Every algorithm in this book hugs one of the escape hatches.
The practical escape hatches, summarized
Here's a compact table you can carry:
| Escape hatch | What it gets you | What it costs |
|---|---|---|
| Partial synchrony | Deterministic liveness when network is "normal." | Liveness pauses during bad periods (leader elections, retries). |
| Randomization | Probabilistic liveness without timing assumptions. | Extra message complexity; harder to reason about; rarely used alone in production. |
| Weaker consistency | Always available, high throughput. | You have to handle conflicts in application code. |
| Unanimous quorum | Simplicity. | Can't make progress if any node is down. |
| Hierarchical designs | Limits the scope of consensus. | More moving parts; harder to operate. |
| Clock synchronization (Spanner) | Faster reads, external consistency. | Requires tightly engineered clock infrastructure (TrueTime). |
The rest of this book is a study of how each algorithm chooses among these.
A philosophical aside
FLP and CAP are often taught like ghost stories — "the distributed systems gods forbid certain things." That framing makes them seem bigger than they are. Both are simply statements about model-level tradeoffs under adversarial conditions. They are useful precisely because they tell you which corners of the design space are empty, so you don't waste time looking there.
The analog in classical physics would be conservation of energy. Nobody rants about conservation of energy on Twitter; they just use it to rule out perpetual motion machines and move on. FLP and CAP deserve the same treatment. They rule out a few specific kinds of perpetual motion — and then they get out of your way.
With the impossibility landscape mapped, we can turn to the next piece of the puzzle: the kinds of failures you actually need your algorithm to tolerate.
The Failure Models You Actually Face
Choose a consensus algorithm without being clear about your failure model, and you will pay for it. Either you will buy an algorithm that is more paranoid (and more expensive) than you need, or — worse — you will trust one that makes assumptions your environment violates, and it will betray you at the worst possible moment.
This chapter is an attempt to make you specific about what "a node fails" actually means, because the algorithms differ wildly in what they promise under different flavors of failure.
The spectrum, from least to most adversarial
The classical taxonomy, roughly in order of increasing nastiness:
- Crash-stop (fail-stop) — a node either works correctly or stops entirely. When it stops, it never sends another message.
- Crash-recovery — a node can stop, then come back later, possibly with its persistent state intact.
- Omission — a node is running, but some of its messages are lost (send-omission) or ignored (receive-omission).
- Byzantine — a node may do anything: lie, collude, replay old messages, fabricate signatures for keys it does not hold (if the model allows it), or behave correctly just long enough to be trusted before betraying that trust.
There is also the theoretical purity of fail-silent (you can't tell if a node stopped or is just slow) vs. fail-signal (there's some oracle that tells you a node is dead). Almost all real systems are fail-silent — dead and slow are indistinguishable over the network, which is why timeouts are hard.
Let's walk each one with the detail a practitioner needs.
Crash-stop
What it assumes: nodes are correct until they aren't. "Not being correct" means halting — ceasing to send or receive messages, forever, from that point on.
Why it's the bedrock of the field: most of the crash-fault-tolerant algorithms (Paxos, Raft, VR) were originally specified in this model. It is simple, it is analytically clean, and it yields clean quorum arithmetic: to tolerate f crash failures, you need 2f+1 nodes, because any two majority quorums (f+1 nodes each) intersect in at least one node, and with at most f nodes down there is always a fully live majority to make progress.
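That arithmetic is small enough to check by brute force. A sketch (our own code) that enumerates every pair of majority quorums in a 2f+1 cluster as bitmasks and confirms they always share a node:

```go
package main

import "fmt"

// majoritiesIntersect checks the quorum intersection argument exhaustively:
// in a cluster of n = 2f+1 nodes, every pair of majority quorums (size f+1)
// shares at least one node. Each subset of nodes is encoded as a bitmask.
func majoritiesIntersect(f int) bool {
	n := 2*f + 1
	size := f + 1
	var quorums []uint
	for mask := uint(0); mask < 1<<n; mask++ {
		if popcount(mask) == size {
			quorums = append(quorums, mask)
		}
	}
	for _, a := range quorums {
		for _, b := range quorums {
			if a&b == 0 { // disjoint quorums would allow split-brain
				return false
			}
		}
	}
	return true
}

// popcount counts set bits (Kernighan's trick).
func popcount(x uint) int {
	c := 0
	for ; x != 0; x &= x - 1 {
		c++
	}
	return c
}

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d, n=%d: intersect=%v\n", f, 2*f+1, majoritiesIntersect(f))
	}
}
```

The same style of exhaustive check, run with quorum size f instead of f+1, finds disjoint pairs immediately — which is exactly why minorities are never allowed to decide.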
Why it is almost always a lie in practice: real nodes don't crash-stop. Real nodes have OOM killers. Real nodes get stop-the-world garbage-collection pauses that make them look dead for thirty seconds and then wake up. Real nodes get their clocks jumped forward an hour by NTP. Real nodes get their TCP sessions killed by a cloud provider's idle timeout but keep processing. The model assumes an honest, simple death. Production failures are messy.
So why do we use it? Because for many failure scenarios, crash-stop is a sound approximation if you build the system carefully. Specifically, if:
- You fence stale leaders (so a leader who thought they were the leader, took a pause, and woke up has no authority).
- You persist state before sending messages based on it, so a recovered node doesn't contradict its own past.
- You restart pathological nodes rather than trusting them to self-heal.
...then you can treat misbehaving-but-not-malicious nodes as crashed without losing correctness. This is why production crash-fault systems are more paranoid in practice than the model suggests.
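The first of those bullets, fencing, is worth seeing in miniature. A toy sketch (the Store type and epoch scheme are ours, invented for illustration): state changes carry the leader's epoch, and the store rejects anything older than the newest epoch it has seen.

```go
package main

import "fmt"

// Store rejects writes from stale leaders using a fencing token: an epoch
// number that only ever increases. A leader that paused, lost leadership,
// and woke up still carries its old epoch, so its writes bounce.
type Store struct {
	highestEpoch int
	data         map[string]string
}

func (s *Store) Write(epoch int, key, val string) error {
	if epoch < s.highestEpoch {
		return fmt.Errorf("fenced: epoch %d < %d", epoch, s.highestEpoch)
	}
	s.highestEpoch = epoch // the first write at a new epoch fences all older ones
	s.data[key] = val
	return nil
}

func main() {
	s := &Store{data: map[string]string{}}

	// Leader A (epoch 1) writes, then stalls in a long GC pause.
	fmt.Println(s.Write(1, "x", "from-A")) // <nil>

	// Leader B is elected at epoch 2 and writes.
	fmt.Println(s.Write(2, "x", "from-B")) // <nil>

	// A wakes up, still believing it is the leader. Its write is fenced.
	fmt.Println(s.Write(1, "x", "from-A-again")) // fenced: epoch 1 < 2
	fmt.Println(s.data["x"])                     // from-B
}
```

The crucial property is that the check happens at the resource, not at the leader: a paused process cannot be trusted to notice its own demotion, so the store enforces it instead.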
Crash-recovery
What it assumes: nodes may crash and later resume. If they resume, they come back with persistent storage intact. No persistent storage loss is allowed in the basic form; variants consider amnesia (coming back with nothing) as a separate failure class.
Why it matters: real servers reboot. They may be down for seconds, minutes, or hours. If your protocol is only correct under "once a node is gone, it stays gone," you are in trouble.
What this adds to the algorithm:
- The recovering node has to reconstruct enough state to participate safely. It cannot assume its earlier peers still hold its in-flight messages.
- The protocol must be idempotent with respect to retried messages — the same message arriving twice should not cause conflicting state.
- Persistent storage must survive the crash. This usually means fsync, which is the most misunderstood system call in the distributed-systems world. If you don't fsync before acknowledging a write, you've lied about durability.
Most modern implementations (etcd, ZooKeeper, Consul) operate in the crash-recovery model, not the idealized crash-stop model. They persist to disk aggressively and the protocol is structured so that a recovering node catches up rather than corrupting the group.
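The idempotence requirement is usually met by deduplicating on a client-supplied request ID. A toy sketch (the names and the counter state machine are ours):

```go
package main

import "fmt"

// Applier applies each client request exactly once, even when the same
// request is delivered twice (a retry after a crash or timeout). Each
// request carries a client-chosen ID; the replica remembers which IDs
// it has already applied.
type Applier struct {
	applied map[string]bool
	counter int // the state: a simple counter the commands increment
}

func (a *Applier) Apply(requestID string, delta int) {
	if a.applied[requestID] {
		return // duplicate delivery: ignore, state already reflects it
	}
	a.applied[requestID] = true
	a.counter += delta
}

func main() {
	a := &Applier{applied: map[string]bool{}}
	a.Apply("req-1", 10)
	a.Apply("req-1", 10) // the client retried after a timeout
	a.Apply("req-2", 5)
	fmt.Println(a.counter) // 15, not 25
}
```

In real systems the applied-ID table is itself part of the replicated state (and bounded, usually per-client with a sliding window), so a recovering node rebuilds it along with everything else.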
A word on fsync
write(2) to a file in Linux does not commit data to disk. It hands the bytes to the kernel's page cache. The kernel will eventually write them to the storage device, but "eventually" is measured in seconds, and a crash in the meantime loses the data. To make a write durable, you must call fsync(2) (or fdatasync), which forces the kernel to flush the page cache and wait for the device to acknowledge.
fsync is slow — milliseconds per call, sometimes tens of milliseconds on networked storage. Every consensus implementation makes some choice about how often to fsync. The safe choice is "before acknowledging any state change." The fast choice is "batch several state changes and fsync once." Most production systems batch; the batching window is a tuning parameter with real durability tradeoffs.
If a system claims to be "Raft" but doesn't fsync before acknowledging, it is running a different protocol than Raft and will, under the right conditions, lose committed writes. This is not a hypothetical — it is one of Kyle Kingsbury's greatest hits in the Jepsen testing series.
Omission
What it assumes: nodes are running, but messages they send or receive can be dropped.
Why it sits between crash and Byzantine: a node experiencing send-omission looks externally like a node that is sometimes dead and sometimes alive — its peers see inconsistent behavior. But internally, the node is not malicious. It just can't reliably communicate.
Send-omission is the natural model for a node behind a flaky link. Receive-omission is the natural model for a node whose inbox is dropping packets. In practice, TCP hides much of this from you, at the cost of translating omissions into long delays, which you get to resolve with timeouts.
Why omission matters: a surprising number of production incidents look like omission, not like crash. A node is still "up" (it shows as healthy, its process is running) but a subset of its peers can't reach it. Partial partitions are the worst kind, because they confuse failure detectors that were designed for clean crash-stop. If node A can reach B but B can't reach C, whose perception is correct?
Crash-fault-tolerant algorithms generally tolerate omission failures gracefully — a node that can't communicate effectively looks like a dead node to the quorum, and the protocol proceeds without it. The algorithms don't have to special-case omission; they inherit tolerance from their crash model. But the operational reality is rough: partial partitions cause oscillation, leader flapping, and cascading issues that we'll see in Chapter 13.
Byzantine
What it assumes: a faulty node may do anything. Send messages out of order. Lie about its state. Send different things to different peers. Collude with other faulty nodes. Replay old messages. Claim to have signatures it doesn't. Act correctly for a while, build reputation, then betray it.
Where the name comes from: Lamport, Shostak, and Pease's 1982 paper "The Byzantine Generals Problem," which cast the problem as generals besieging a city, some of whom might be traitors. A memorable frame with staying power; we'll dig into the paper in Chapter 7.
Why you might care:
- Software bugs. A corrupted node sending wrong data is, from the protocol's perspective, Byzantine. Cosmic rays, bad memory, disk corruption, kernel bugs — real, occasional, and arbitrarily bad.
- Compromised nodes. If an attacker has root on one of your replicas, they can make it lie. For high-integrity systems (financial, governmental, safety-critical), this is a real threat.
- Low-trust environments. Multi-organization systems where you don't fully trust your co-operators. Cross-bank settlement, government/contractor arrangements, consortium networks.
- Public networks. Blockchains, obviously. But also any system where participants have incentives to cheat.
What Byzantine tolerance costs you:
- Node count. `3f+1` instead of `2f+1`. To tolerate 1 Byzantine failure you need 4 nodes, not 3. (Chapter 7 shows why.)
- Message complexity. Classical BFT protocols (PBFT) use `O(n^2)` messages per decision, vs. `O(n)` for Raft. At `n=4`, the difference is 16 vs. 4; at `n=100`, 10,000 vs. 100. Modern protocols (HotStuff) bring this down, but not for free.
- Cryptography. You need signed messages, not just authenticated ones. MACs aren't enough when nodes may lie about what they received — you need non-repudiable signatures, which are computationally expensive.
- Engineering complexity. BFT systems are harder to implement correctly than CFT systems, and harder to debug when they misbehave.
If you can avoid Byzantine tolerance, you should. If you cannot, the later chapters are for you.
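The node-count and message-count arithmetic above is easy to sanity-check. A small sketch (function names are ours, and the message counts use the same loose big-O-style accounting as the text):

```python
def cft_nodes(f: int) -> int:
    """Crash-fault tolerance: surviving f crashes needs 2f+1 nodes."""
    return 2 * f + 1

def bft_nodes(f: int) -> int:
    """Byzantine tolerance: surviving f traitors needs 3f+1 nodes."""
    return 3 * f + 1

def pbft_messages(n: int) -> int:
    """PBFT-style all-to-all exchange: O(n^2) messages per decision."""
    return n * n

def raft_messages(n: int) -> int:
    """Leader-based fan-out, counted loosely as in the text: O(n)."""
    return n

assert (cft_nodes(1), bft_nodes(1)) == (3, 4)             # 1 failure: 3 vs 4 nodes
assert (pbft_messages(4), raft_messages(4)) == (16, 4)
assert (pbft_messages(100), raft_messages(100)) == (10_000, 100)
```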
What real systems actually assume
A rough survey, as of 2026:
| System | Failure model | Notes |
|---|---|---|
| etcd | Crash-recovery | Raft, fsync before ack. |
| Consul | Crash-recovery | Raft, similar to etcd. |
| ZooKeeper | Crash-recovery | Zab, a Paxos cousin. |
| Kafka (KRaft) | Crash-recovery | Raft variant; replaced ZooKeeper dependency. |
| Spanner | Crash-recovery | Paxos per shard; external consistency via TrueTime. |
| CockroachDB | Crash-recovery | Raft per range. |
| FoundationDB | Crash-recovery | Custom protocol; SMR-style. |
| Hyperledger Fabric | Raft (CFT) by default; BFT variants exist | Permissioned; depends on the ordering service. |
| Diem/Aptos | Byzantine | HotStuff family. |
| Public L1 blockchains | Byzantine + Sybil | Not our subject. |
The vast majority of production systems you will encounter assume crash-recovery and nothing worse. Byzantine-tolerant consensus is rare in mainstream infrastructure — it is reserved for cases where the operational assumption "we trust the operators of each node" no longer holds.
Choosing your failure model
The right question is not "what failures could possibly happen?" but "what failures am I willing to let my algorithm assume away?"
- If you operate all the nodes, in datacenters you control, running binaries you built: crash-recovery is usually sufficient. Corruption happens, but rarely, and checksums at the storage layer catch most of it.
- If you operate the nodes but cross multiple administrative domains (multi-cloud, multi-region with different teams), you may want detection for non-crash failures, but not necessarily BFT consensus. A CFT system plus good monitoring plus checksums usually suffices.
- If nodes are operated by different organizations that don't fully trust each other: BFT starts to earn its cost.
- If participation is open and you can't enumerate participants: you are outside this book's scope. Go build a blockchain.
What to take into the algorithm chapters
When you read Chapter 4 (Paxos) and beyond, keep asking:
- What failure model does this algorithm assume?
- What quorum size does it need, and why?
- What happens if the model is violated — does the algorithm fail safe, or does it go arbitrarily wrong?
These questions are not hostile. Every algorithm is correct under its model. The only bad decision is picking one whose model doesn't match your reality.
Next, Paxos. The algorithm everyone cites, almost nobody implements from scratch, and which is more interesting than its reputation suggests.
Paxos, Carefully
Paxos has a reputation. It is supposed to be the algorithm that everyone cites and nobody understands. Leslie Lamport's original 1989 paper, The Part-Time Parliament, was rejected on grounds of "excessive cuteness" (it was framed as a parable about a fictional Greek island parliament), sat in a drawer for nine years, and finally appeared in 1998. Even then it took the 2001 follow-up, Paxos Made Simple, for most readers to get it.
We're going to treat Paxos with patience. The algorithm is not hard once you untangle the moving parts. We'll split it into its layers, starting with the one-decision core — Single-Decree Paxos or Basic Paxos — and then building up.
The single decision
Forget logs for a moment. Suppose we just want a group of processes to agree on one value. That is Basic Paxos.
Three roles:
- Proposers propose values.
- Acceptors vote on proposals.
- Learners are told which value was chosen.
In practice every process plays all three roles, but the separation is useful. Roles are not a deployment concern; they are a reasoning tool.
We have N acceptors. A quorum is any set of ⌈(N+1)/2⌉ acceptors — a strict majority. The safety of Paxos relies on the fact that any two majorities intersect in at least one acceptor.
A proposal has two parts: a globally ordered ballot number (often called a round or sequence number) and a value. Ballot numbers must be totally ordered and unique per proposer, which is easy to arrange — e.g., (counter, proposer_id) with lexicographic ordering.
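This scheme is worth seeing concretely. Python tuples compare lexicographically, which gives exactly the total order described (a small sketch; the node names are hypothetical):

```python
# Ballots as (counter, proposer_id): totally ordered, unique per proposer.
b1 = (5, "node-a")
b2 = (5, "node-b")   # same counter, different proposer: still distinct and ordered
b3 = (6, "node-a")

assert b1 < b2 < b3   # lexicographic: counter first, proposer_id breaks ties

def next_ballot(highest_seen_counter: int, my_id: str):
    """Pick a ballot guaranteed higher than anything this proposer has seen."""
    return (highest_seen_counter + 1, my_id)

assert next_ballot(5, "node-a") > b2   # (6, "node-a") beats (5, "node-b")
```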
The two-phase protocol
Paxos runs in two phases.
Phase 1: Prepare / Promise — the proposer asks acceptors for permission to propose at some ballot b.
proposer → acceptors: PREPARE(b)
acceptor on receiving PREPARE(b):
if b > any ballot it has previously promised:
promise not to accept any proposal with ballot < b
reply PROMISE(b, prior_accepted_b, prior_accepted_value)
else:
reject (or silently ignore)
A PROMISE includes the highest-ballot proposal the acceptor has already accepted (if any). This is the crux of Paxos's safety: an acceptor, when promising for ballot b, tells the proposer about any earlier value it is committed to.
Phase 2: Accept / Accepted — the proposer, having collected promises from a quorum, chooses a value and asks to accept it.
proposer on receiving PROMISE from a quorum:
if any acceptor reported a prior accepted proposal:
pick the value from the highest-ballot prior proposal
else:
pick any value (e.g., the client's request)
send ACCEPT(b, chosen_value) to acceptors
acceptor on receiving ACCEPT(b, v):
if b ≥ highest promised ballot:
accept(b, v) -- record it as durably accepted
reply ACCEPTED(b, v)
else:
reject
A value is chosen when a quorum of acceptors have ACCEPTED the same proposal. Learners discover that a value is chosen by observing the acceptor replies (either directly or via a distinguished learner).
That's Basic Paxos. Two round-trips, three roles, a majority quorum, and two invariants:
- Promise invariant: an acceptor that has promised ballot `b` will never accept any proposal with ballot `< b`.
- Proposer-chooses-safe-value invariant: if any acceptor in the quorum reports a prior accepted value, the proposer is forced to propose that same value at the new ballot.
Those two invariants together give you safety: no two distinct values can both be chosen.
Why the invariants imply safety
Intuitively: suppose v was chosen at ballot b1. That means some quorum Q1 accepted (b1, v). Now consider a later ballot b2 > b1. For b2 to succeed, the proposer needed to collect promises from a quorum Q2. Since any two quorums intersect, at least one acceptor a is in both Q1 and Q2. That acceptor's promise for b2 included its prior accepted proposal — which was (b1, v) or later. The proposer is forced by the second invariant to propose v (or a value from an even higher ballot, which, recursively by the same argument, is also v).
So once v is chosen, every later successful ballot must propose v. Safety!
This is the heart of Paxos, and it is worth re-reading until it clicks. The safety argument is purely combinatorial — it doesn't depend on timing, failure detectors, or ordering of messages. It depends only on:
- Quorums intersect.
- Acceptors keep their promises.
- Proposers honor the "pick the highest prior value" rule.
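The first of these, quorum intersection, is small enough to verify exhaustively for realistic cluster sizes. A brute-force sketch:

```python
from itertools import combinations

def majorities(n: int):
    """Every strict-majority subset of n acceptors, labeled 0..n-1."""
    q = n // 2 + 1
    return [set(c) for c in combinations(range(n), q)]

# Any two majorities share at least one acceptor: the combinatorial fact
# the whole safety argument rests on, checked by enumeration.
for n in (3, 4, 5, 7):
    ms = majorities(n)
    assert all(a & b for a in ms for b in ms), f"disjoint majorities at n={n}"
```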
Why Paxos survives concurrency
Two proposers can try to propose concurrently. They can interfere — proposer A gets promises for ballot 5, then proposer B gets promises for ballot 7 (which invalidates A's promises), then A gets promises for ballot 9, and so on. Each such interference wastes a round; in pathological adversarial schedules, this can continue forever, producing no decision. That is FLP — precisely the kind of livelock the impossibility result allows.
In practice, we avoid this with a distinguished proposer (leader). We'll get to it when we discuss Multi-Paxos.
Multi-Paxos: chaining decisions
One decision isn't useful; we want a log. The naive approach runs Basic Paxos independently for each log slot. That works — each slot is an instance of Paxos, indexed by slot number. But it's wasteful: every slot pays two round-trips of its own.
Multi-Paxos is the optimization: elect a distinguished leader, have the leader run Phase 1 once for a range of future slots, and then run Phase 2 repeatedly for each command. The leader is the stable proposer; clients send their commands to the leader; the leader assigns slot numbers and runs the Phase 2 step for each.
This reduces the steady-state cost to one round-trip per decision — the leader sends ACCEPT to the quorum, the quorum replies ACCEPTED, the leader commits. Phase 1 is only re-run when the leader changes.
steady state (leader L, followers F1..Fn):
client → L: REQUEST(cmd)
L → F1..Fn: ACCEPT(ballot, slot, cmd)
F1..Fn → L: ACCEPTED(ballot, slot)
L → client, followers: DECIDE(slot, cmd)
The catch, of course, is leader election itself. How do we agree on who the leader is? With a failure detector and... Basic Paxos on a special "leadership" register, or with a lease protocol, or with an external coordination service. In practice this is where most of the subtle bugs hide.
Fast Paxos
Lamport's 2005 Fast Paxos trims the steady-state case to a single round-trip from the client, bypassing the leader for the common case. It does this by letting clients broadcast proposals to acceptors directly, at the cost of needing a larger quorum (3/4 of acceptors rather than majority) for the fast path. If there's contention (two clients try to propose different values concurrently), the system falls back to classical Paxos.
Fast Paxos is clever and rarely implemented. Most Multi-Paxos deployments stick with leader-based for its simplicity. We mention it so you recognize the name.
Paxos as Lamport originally described it
The 1989 paper, The Part-Time Parliament, tells the story of the Greek island of Paxos, whose parliament passed decrees despite legislators wandering in and out for lunch. The math is encoded as parliamentary procedure — ballots, decrees, a Legislature Register.
The parable is charming and was completely ineffective as pedagogy. Nobody could figure out what the actual algorithm was. Lamport himself said in a retrospective: "I thought, and still think, that Paxos is an important algorithm. Since 1980, I had found that the best way to present algorithms clearly was with mathematics. Since math is a universal language..." — missing, as he admits, that most readers were looking for code, not math.
Nine years later, in Paxos Made Simple, he presented it as we have above: proposers, acceptors, learners, two phases. This version clicked. But the damage was done — a generation of engineers had concluded that Paxos was impenetrable. Raft was born as a direct response. We'll see it in the next chapter.
Paxos as it actually gets implemented
The gap between Paxos Made Simple and a running implementation is wide. Here are some of the things Paxos doesn't tell you directly:
- Log compaction. The log grows without bound as decisions accumulate. In practice you take periodic snapshots of the state machine and discard log prefixes older than the snapshot. This is a non-trivial engineering exercise — you have to handle in-flight requests, rejoining nodes that need to catch up, and recovering from a restart mid-compaction.
- Membership changes. Adding or removing a node requires care. The Paxos paper's "α" extension (reconfiguring which acceptors vote on future slots) is correct, but the implementation has pitfalls. Raft devotes a whole section to joint consensus for this reason.
- Failure detection. Who notices when the leader dies? Usually a timeout, but the leader's lease, the election protocol, and the followers' clocks all have to agree roughly enough to avoid flapping.
- Catch-up. A follower returning from a crash needs to learn all decisions it missed. This is a streaming protocol of its own, with flow control and snapshot-versus-log decisions.
- Batching and pipelining. A straightforward Multi-Paxos processes one slot at a time, end-to-end. For throughput, you batch multiple commands into one Accept and/or pipeline consecutive Accepts without waiting for previous ones to commit. Both optimizations are correct under Paxos's invariants but add implementation complexity.
These are not flaws in Paxos — they are engineering problems that every consensus implementation faces. But the original paper hand-waves most of them, which is a significant source of the "Paxos is hard" reputation.
Quorum arithmetic
Sizing is straightforward for crash-fault tolerance:
- Total acceptors: `N`
- Max simultaneous failures tolerated: `f`
- Quorum size: `Q = f + 1`
- Requirement for safety: any two quorums intersect — `2Q > N`, which gives `N ≥ 2f + 1`
So:
| Nodes N | Max failures f | Quorum Q |
|---|---|---|
| 3 | 1 | 2 |
| 5 | 2 | 3 |
| 7 | 3 | 4 |
Five nodes tolerating two failures is the sweet spot for most production systems. Three nodes is common but tolerates only one failure — awkward during maintenance, since routine reboots look like failures.
Even node counts are pointless: four nodes still only tolerate one failure, but require a quorum of three, costing you more latency and acks than three nodes. Always pick an odd count for the voting membership.
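Both claims fall out of two one-line functions (a sketch; the names are ours):

```python
def quorum(n: int) -> int:
    return n // 2 + 1          # smallest strict majority

def failures_tolerated(n: int) -> int:
    return n - quorum(n)       # nodes you can lose and still form a quorum

# Even counts buy nothing: 4 nodes tolerate the same single failure as 3,
# but every decision now needs one more ack.
assert [(n, quorum(n), failures_tolerated(n)) for n in (3, 4, 5, 7)] == \
       [(3, 2, 1), (4, 3, 1), (5, 3, 2), (7, 4, 3)]
```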
Correctness intuitions, not formal proofs
If you want the formal correctness argument, Lamport has written it several times; we recommend Paxos Made Simple followed by The Part-Time Parliament as a readable pair. Here is the intuition at speed:
- Safety (agreement): any two decided values at the same slot must be equal. This holds because of the quorum intersection argument above: once a value is decided, the quorum overlap forces every future successful ballot to re-propose that value.
- Safety (validity): if a value is decided, it was proposed by some proposer. This is immediate from the protocol — acceptors only accept values they receive in an Accept message.
- Liveness (termination): under partial synchrony with a stable leader, every proposal eventually completes. This requires the leader not to be preempted continuously by competing proposers; hence the emphasis on a single stable leader in Multi-Paxos.
What Paxos teaches
Paxos is not just an algorithm; it is a set of ideas that every later algorithm inherits:
- Quorum intersection as the bedrock of safety. Once you see this pattern, you'll notice it everywhere — Raft, PBFT, HotStuff, and beyond.
- Ballot numbers (or views, or terms, or epochs) as a logical clock for leadership. The whole "who is in charge" question, which is hellish in distributed systems, is corralled into a single monotonically increasing integer.
- Separation of agreement from log ordering. Multi-Paxos shows how to reduce a sequence of agreements to a single leader's assignment of slots, with Phase 1 amortized over many decisions.
Raft, which we meet next, is not a different algorithm so much as a different presentation of very similar ideas, engineered for clarity.
A small Python sketch
Here is a minimal single-decree Paxos, stripped of network and failure handling, just to show the shape. Do not put this in production; it is illustrative.
class Acceptor:
def __init__(self):
self.promised_ballot = None # highest ballot we've promised
self.accepted_ballot = None # ballot of most recent acceptance
self.accepted_value = None
def prepare(self, ballot):
if self.promised_ballot is None or ballot > self.promised_ballot:
self.promised_ballot = ballot
return ("promise", self.accepted_ballot, self.accepted_value)
return ("reject",)
def accept(self, ballot, value):
if self.promised_ballot is None or ballot >= self.promised_ballot:
self.promised_ballot = ballot
self.accepted_ballot = ballot
self.accepted_value = value
return ("accepted",)
return ("reject",)
def propose(proposer_ballot, client_value, acceptors):
# Phase 1
promises = []
for a in acceptors:
reply = a.prepare(proposer_ballot)
if reply[0] == "promise":
promises.append(reply)
if len(promises) * 2 <= len(acceptors): # no majority
return None
# Choose value: highest prior accepted, or our own
prior = [(b, v) for (_, b, v) in promises if b is not None]
if prior:
        _, chosen = max(prior, key=lambda bv: bv[0])  # highest ballot wins
else:
chosen = client_value
# Phase 2
accepts = 0
for a in acceptors:
reply = a.accept(proposer_ballot, chosen)
if reply[0] == "accepted":
accepts += 1
if accepts * 2 > len(acceptors):
return chosen # decided
return None
This isn't runnable-distributed — there's no network, no concurrency, no crashes — but it is a faithful single-decree sketch. If you read it twice, the invariants should start to feel inevitable rather than mysterious.
What's next
We've seen consensus the way Lamport saw it: elegant, symmetric, stated in terms of ballots and quorums. Next chapter: the same problem, restated in a way that made a generation of engineers comfortable shipping it. Raft.
Raft: Paxos for Humans
In 2014, Diego Ongaro and John Ousterhout published In Search of an Understandable Consensus Algorithm — better known by its eventual name: Raft. The opening paragraph does something rare for a distributed systems paper: it names its goal as understandability.
That framing was radical. Other consensus algorithms had been optimized for provable correctness, theoretical elegance, or asymptotic performance. Raft optimized for humans reading and implementing it. The authors argued, convincingly, that an algorithm no one can implement correctly is an algorithm no one is actually using.
They were right. Within a few years of the paper, Raft was everywhere — etcd, Consul, CockroachDB, TiKV, RethinkDB, the Kafka controller (eventually), and dozens of smaller projects. Meanwhile Paxos, despite being older and better-studied, continued to live mainly in Google's internal systems and a handful of determined academic implementations.
This chapter walks through Raft the way it should be walked through: not as "Paxos, but easier," but as a coherent design with its own choices.
The three subproblems
Raft decomposes consensus into three subproblems that can be understood (mostly) independently:
- Leader election. Exactly one server is in charge at a time.
- Log replication. The leader replicates its log to followers.
- Safety. A committed entry never gets uncommitted, even across leader changes.
The genius of Raft is that each of these is a clean, intuitive protocol in isolation, and the interaction between them — while subtle — is confined to a few specific rules.
Every Raft server is in one of three states at any time:
- Follower — passive, responds to leaders and candidates.
- Candidate — actively seeking election.
- Leader — handling client requests, replicating the log.
┌─────────┐ timeout ┌───────────┐ majority ┌────────┐
│Follower │─────────▶│ Candidate │──────────▶│ Leader │
└─────────┘ └───────────┘ └────────┘
▲ │ │
│ discover │ │
└────────────────────┴──────────────────────┘
higher term / new leader
Terms: Raft's logical clock
Raft numbers time with terms. Each term begins with an election. If the election succeeds, the winner leads for the rest of that term. If the election fails (split vote), the term ends with no leader and a new term starts.
Every RPC carries a term number. The rule is universal: if a server receives any message with a term higher than its own, it updates its term and becomes a follower. This single rule does an enormous amount of work. It ensures that stale leaders step down as soon as they talk to any node that knows about the new term. It prevents a leader coming back from a long pause from issuing any commands, because its next communication will get a higher-term response.
Terms map directly onto Paxos's ballot numbers and VR's views. Same idea, clearer name.
Leader election
Followers start out receiving heartbeats from the current leader. If a follower goes some election timeout (typically 150–300ms, randomized) without hearing from a leader, it becomes a candidate:
1. currentTerm ← currentTerm + 1
2. state ← candidate
3. vote for self, count = 1
4. reset election timer
5. send RequestVote RPCs to all other servers
Other servers vote according to two rules:
- They haven't already voted for someone else in this term.
- The candidate's log is at least as up-to-date as theirs.
The "up-to-date" check compares last log entries: higher term wins; same term, longer log wins. This is Raft's key safety rule for elections — it ensures a leader always has all committed entries. We'll come back to why this works.
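The comparison is mechanical: compare (lastLogTerm, lastLogIndex) pairs, term first. A sketch (the function name is ours):

```python
def at_least_as_up_to_date(candidate, voter):
    """Raft's election restriction, as a pure comparison.

    `candidate` and `voter` are (last_log_term, last_log_index) pairs.
    Higher last term wins; equal terms, the longer log wins.
    """
    return candidate >= voter   # tuple comparison: term first, then index

assert at_least_as_up_to_date((3, 5), (2, 9))       # higher term beats longer log
assert at_least_as_up_to_date((3, 5), (3, 5))       # identical logs: grant the vote
assert not at_least_as_up_to_date((2, 9), (3, 1))   # stale term loses despite length
```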
If the candidate gets a majority of votes, it becomes leader. If it receives an AppendEntries from a server with a higher or equal term, it steps down. If the election timer runs out (split vote), it starts a new term and tries again.
Random timeouts: the anti-livelock trick
Raft uses randomized election timeouts (e.g., uniform over [150ms, 300ms]) to break symmetry. Without randomization, if all followers notice the leader is gone at the same moment, they'd all start campaigns simultaneously, split the vote, fail, and retry in lockstep — livelock.
Randomization makes it likely that one follower times out noticeably earlier than the others, becomes a candidate, and wins before the others start. Simple; effective.
Log replication
Once elected, the leader receives client commands, appends them to its log, and replicates them to followers via AppendEntries RPCs.
Each log entry contains:
- `term` — the term when the entry was created
- `index` — position in the log, 1-based
- `command` — the operation to apply to the state machine
An entry is committed once the leader has replicated it to a majority. Once committed, the leader can safely apply the command to its state machine and reply to the client. Committed entries are eventually replicated to all followers and applied in order.
The AppendEntries RPC is a clever bit of engineering:
AppendEntries(leaderTerm, leaderId, prevLogIndex, prevLogTerm,
entries[], leaderCommit)
- It carries heartbeats (empty `entries[]`) as well as new entries.
- It tells the follower what should be at the position just before the new entries (`prevLogIndex`, `prevLogTerm`). The follower refuses the RPC if its log doesn't have a matching entry there.
- It carries `leaderCommit`, telling the follower how far the leader has committed, so the follower knows how far to apply.
The prevLogIndex/prevLogTerm check gives Raft the log matching property: if two logs share an entry at the same index with the same term, they are identical for all prior entries. The leader and follower converge by walking backward until they find a match and then replaying forward.
Log matching, visually:
leader ...│ t1,c │ t1,c │ t2,c │ t3,c │ t3,c │ <-- leader
follower ...│ t1,c │ t1,c │ t2,c │ t2,x │ <-- divergence at index 4
AppendEntries with prevLog=(index=3, term=t2): OK, matches.
AppendEntries with prevLog=(index=4, term=t3): follower's entry is
term t2, mismatch. Reject. Leader decrements nextIndex and retries.
The leader maintains a nextIndex for each follower — where to send next. On a rejection, decrement and retry; on success, advance.
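The consistency check and the truncate-and-append step can be sketched from the follower's side, assuming the log is a list of (term, command) pairs, 1-indexed conceptually (the function name is ours):

```python
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side AppendEntries consistency check, sketched.

    `log[i-1]` holds the (term, command) pair for log index i.
    Returns (success, new_log); a real implementation mutates in place.
    """
    # Reject unless our log has a matching entry at prevLogIndex.
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False, log   # leader will decrement nextIndex and retry
    # Matching prefix found: discard any conflicting suffix, then append.
    return True, log[:prev_log_index] + entries

# Divergence at index 4, as in the diagram: follower has t2 where leader has t3.
follower = [(1, "c"), (1, "c"), (2, "c"), (2, "x")]
ok, _ = handle_append_entries(follower, 4, 3, [(3, "c")])
assert not ok                            # mismatch at index 4: reject
ok, new_log = handle_append_entries(follower, 3, 2, [(3, "c"), (3, "c")])
assert ok and new_log == [(1, "c"), (1, "c"), (2, "c"), (3, "c"), (3, "c")]
```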
Safety: the subtle rules
The hard part of Raft isn't the common case; it's ensuring safety across leader changes. Three rules do the work:
Rule 1 (Election Restriction). A candidate only wins if its log is at least as up-to-date as the majority it requests votes from.
This ensures the new leader has all committed entries from previous terms. The proof is quorum-intersection: a committed entry was stored on a majority; any majority that votes for a new leader intersects the committing majority; by the "at least as up-to-date" rule, the new leader's log includes that entry.
Rule 2 (Leader Append-Only). Leaders never overwrite or delete entries in their own log; they only append.
This means a leader's log is a monotonically-growing prefix.
Rule 3 (Don't commit entries from previous terms via counting replicas). This is the subtle one. A leader from term T cannot simply replicate an entry from term T' < T to a majority and declare it committed. Why?
Consider: a leader in term 2 replicates entry X to a majority, crashes before committing. A new leader is elected in term 3. It finds X in its log. If it just counts replicas of X and commits once a majority has it, there's a scenario where a subsequent leader (in term 4) with a longer log but without X could overwrite X — violating safety.
The fix: a leader only commits an entry by counting replicas for an entry from its current term. If there are older-term entries to "rescue," the leader commits them implicitly by committing a new entry from its current term (the commit of a later entry commits everything before it).
In practice, this manifests as: a new leader immediately issues a no-op entry in its own term. Once that no-op is committed, everything before it is committed transitively.
The Raft paper spends pages on this, with a detailed scenario (Figure 8 of the paper). It's the single most important rule that beginners miss. Without it, Raft is subtly broken.
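The commit rule itself is a short loop. A sketch of the leader's side (names are ours, not the paper's), reproducing the term-2/term-3 scenario above:

```python
def advance_commit_index(log, match_index, current_term, commit_index, n):
    """Leader's commit rule, sketched.

    `log` holds the term of each entry, 1-indexed conceptually;
    `match_index` maps each follower to its highest replicated index.
    Only entries from the leader's current term are committed by counting.
    """
    for idx in range(len(log), commit_index, -1):
        if log[idx - 1] != current_term:
            continue                      # Rule 3: never count old-term entries
        replicas = 1 + sum(1 for m in match_index.values() if m >= idx)
        if replicas * 2 > n:              # strict majority, leader included
            return idx                    # committing idx commits all before it
    return commit_index

# 5-node cluster, leader now in term 3. A term-2 entry sits on a majority,
# but it must not be committed by counting replicas...
log = [1, 2]                              # terms of entries 1..2
peers = {"f1": 2, "f2": 2, "f3": 0, "f4": 0}
assert advance_commit_index(log, peers, 3, 0, 5) == 0
# ...until the leader's own-term no-op reaches a majority, which commits
# it and, transitively, everything before it.
log.append(3)
peers = {"f1": 3, "f2": 3, "f3": 0, "f4": 0}
assert advance_commit_index(log, peers, 3, 0, 5) == 3
```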
Membership changes
Changing the cluster configuration (adding or removing servers) is dangerous if done naively. The old and new configurations might have overlapping but non-intersecting majorities — two different leaders could coexist, one winning the old majority, one winning the new.
Raft's approach is joint consensus: temporarily require majorities in both the old and new configurations for any decision. The transition proceeds in two log entries:
- `C_old,new` — a joint configuration containing both member sets. Decisions need a majority of `old` AND a majority of `new`.
- Once `C_old,new` is committed, append `C_new` — the new configuration alone.
During the joint period, no single configuration can decide on its own, so there can be at most one leader. Once C_new commits, the old configuration is abandoned.
Newer Raft implementations often use single-server changes (add or remove one server at a time) for simplicity. Ongaro's dissertation shows this is safe: when two consecutive configurations differ by exactly one member, any majority of the old configuration and any majority of the new one overlap. Most production Raft libraries use this simpler form.
Log compaction
Logs grow; storage is finite. Raft uses snapshots: periodically, the state machine is serialized; everything in the log before the snapshot index can be discarded; the snapshot and a small bit of metadata (lastIncludedIndex, lastIncludedTerm) take their place.
A follower that is too far behind to catch up via the log receives an InstallSnapshot RPC. It installs the snapshot, discards its own log, and starts from there.
Snapshotting is the messiest practical aspect of Raft. You need:
- Consistent snapshots (usually by application-level serialization while buffering new commands).
- Atomic replacement of the snapshot file.
- Correct handling of log truncation across all servers.
- Recovery from mid-snapshot crashes.
There is nothing deep here, but there's a lot of code. Raft implementations are usually 1500-3000 lines; a chunk of that is snapshotting.
Client interaction
Three subtleties:
- Retries and duplicate commands. If a client's RPC to the leader times out, the client may retry. The command might have already been committed. Raft supports linearizability by having clients assign a unique ID per command, and the leader deduplicates based on ID. This is usually handled in the state machine, not Raft itself.
- Read-only queries. Reads that need linearizability can't just hit any follower — followers might be stale. Three common approaches:
- Forward to leader. Every read goes through the leader, which confirms it is still the leader (by exchanging heartbeats with a majority) and then serves the read.
- Leader leases. The leader holds a time-bounded lease; as long as the lease is valid, it can serve reads without a round-trip. Requires synchronized-ish clocks, which introduces its own risks.
- Follower reads with read index. A follower asks the leader for a read index, waits until its log has applied up to that index, then reads. Cheaper for the leader, slower for the follower.
- Leader stickiness. Clients usually cache "the last leader I talked to" and retry there first. If that server is no longer leader, it returns a hint; clients re-resolve.
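The first subtlety, duplicate suppression, usually lives in the state machine rather than in Raft itself. A minimal sketch (class and field names are ours):

```python
class DedupStateMachine:
    """State-machine-level deduplication: each client tags commands with
    (client_id, seq); a retry of an already-applied command returns the
    cached result instead of being applied twice.
    """
    def __init__(self):
        self.kv = {}
        self.last = {}    # client_id -> (seq, cached result)

    def apply(self, client_id, seq, key, value):
        seen = self.last.get(client_id)
        if seen is not None and seen[0] == seq:
            return seen[1]            # duplicate: replay the cached reply
        self.kv[key] = value
        result = ("ok", key, value)
        self.last[client_id] = (seq, result)
        return result

sm = DedupStateMachine()
r1 = sm.apply("c1", 1, "x", 10)
r2 = sm.apply("c1", 1, "x", 10)   # client retry after a timeout
assert r1 == r2 and sm.kv == {"x": 10}
```

A real implementation also bounds the `last` table (e.g., one entry per session, expired with the session) so it doesn't grow without limit.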
Why Raft succeeded where pedagogical Paxos failed
A few reasons, empirical and structural:
- Named subproblems. Leader election, log replication, safety. Engineers can learn one at a time.
- A single canonical algorithm, not a family. Multi-Paxos has dialects; Raft has a reference implementation and most libraries match it closely.
- Strong leader. The leader is authoritative for its term; followers defer to it. This is less symmetric than Paxos but much easier to reason about.
- Concrete RPC interfaces. The paper gives you RequestVote and AppendEntries with field names. Paxos gives you phases. One you can implement directly; the other you have to translate.
- The paper itself is good. It's written like a textbook chapter, not a theorem-proof sequence. Diagrams, pseudocode with line numbers, worked examples.
The downside is that Raft's "strong leader" design can be less efficient than leaderless Paxos variants (e.g., EPaxos) under certain workloads. But the clarity-to-performance tradeoff favored clarity, and clarity won the market.
The subtle things beginners miss
Every Raft implementation review catches the same handful of bugs. Here are the ones you should look for in your own:
- Committing previous-term entries by counting. The commit rule is "current term, majority replicated." Entries from earlier terms piggyback via a current-term entry. Get this wrong and you lose writes.
- Voting without persisting. `currentTerm` and `votedFor` must be persisted to disk before responding to a RequestVote. Otherwise a crashed-and-recovered server can vote twice in the same term.
- Log index off-by-ones. Raft's log is 1-indexed. Snapshot metadata refers to the last included index. Every careful implementation has a set of invariants checked at each operation; every hasty implementation has a bug here.
- Election timer reset conditions. The timer resets on: receiving a valid AppendEntries from the current leader, granting a vote. It does NOT reset on: receiving an AppendEntries with a stale term, receiving a RequestVote you deny. Get this wrong and you get leader flapping.
- Uncommitted-to-committed transitions not atomic w.r.t. state machine. The state machine only applies committed entries, in order, one at a time. A restart must replay from the last applied index. Miss this and you apply commands twice.
Most production implementations have tests for each of these (good ones have Jepsen verdicts). Fewer students' implementations do. This is the gap between "I implemented Raft" and "I shipped Raft."
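Two of these bugs lend themselves to a compact sketch. The handler below is a hypothetical model (state fields, the `persist` stand-in, and the return convention are illustrative, not from any particular library); it shows the persist-before-reply rule and the timer reset conditions together:

```python
class VoterState:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None
        self.disk = {}                 # stand-in for durable storage

    def persist(self):
        # Must complete (fsync, in a real system) BEFORE the reply is sent.
        self.disk = {"current_term": self.current_term,
                     "voted_for": self.voted_for}

def handle_request_vote(state, term, candidate_id, log_ok=True):
    """Returns (vote_granted, reset_election_timer)."""
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None
    grant = (term == state.current_term
             and state.voted_for in (None, candidate_id)
             and log_ok)               # candidate's log at least as up-to-date
    if grant:
        state.voted_for = candidate_id
        state.persist()                # persist BEFORE replying
        return True, True              # granting a vote resets the timer
    return False, False                # a denied vote does NOT reset the timer
```

The critical line is the `persist()` call before the return: reorder it after the reply, and a crash between the two reopens the double-vote window.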
Raft by the numbers
For a 5-node Raft cluster:
- Quorum size: 3
- Maximum failures tolerated (crash-recovery): 2
- Steady-state round trips per write: 1 (leader → followers → leader)
- Messages per write (best case): `2 * (N - 1)` = 8 at N=5
- Disk writes per committed entry: 1 on leader, 1 on each of the replicas (before ack)
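To sanity-check these figures for other cluster sizes, a hypothetical helper (the function name and dict keys are illustrative):

```python
def raft_numbers(n):
    """Back-of-envelope Raft figures for an n-node cluster."""
    return {
        "quorum": n // 2 + 1,                      # majority
        "failures_tolerated": (n - 1) // 2,        # crash-recovery model
        "messages_per_write_best_case": 2 * (n - 1),  # AppendEntries out + acks
    }
```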
You'll find that most Raft systems quote something like "~1ms writes in-cluster, ~5ms cross-AZ, ~50-100ms cross-region" as order-of-magnitude figures. These depend heavily on batching, hardware, and distance; treat them as ranges, not numbers to quote back.
What Raft teaches
- Clarity is not a nice-to-have; it is a property of the design that determines whether an algorithm will be implemented correctly by thousands of engineers who aren't its authors.
- Strong leadership simplifies the common case. Pay the cost at leader change, earn it back every second of steady state.
- Safety reasoning should be reducible to a small number of invariants. Raft has three; Paxos has two; PBFT has roughly four. If you can't list your algorithm's safety invariants in a short paragraph, you probably don't understand it yet.
- Pseudocode with line numbers beats prose descriptions. Always.
Next: Viewstamped Replication. The algorithm that was there first, and that teaches you things neither Paxos nor Raft does.
Viewstamped Replication
Barbara Liskov and Brian Oki published Viewstamped Replication (VR) in 1988, at the same PODC where Dwork, Lynch, and Stockmeyer presented their partial synchrony results. Paxos was a year away from Lamport's first draft. Viewstamped Replication was the first complete, workable, crash-fault-tolerant consensus-based replication protocol.
And then it got quietly forgotten.
It came back, three decades later, when Raft's authors cited it as a significant influence — much more so than Paxos. In 2012, Liskov and Cowling published Viewstamped Replication Revisited (VRR), updating the protocol for modern systems. If you squint, Raft is more a descendant of VRR than of Paxos.
This chapter is a bit shorter than Paxos and Raft, because much of VR will feel familiar. We focus on what VR teaches that the other two don't emphasize: the explicit view change protocol, the primary/backup framing, and the treatment of client requests as first-class concerns.
The setup
VR has N = 2f + 1 replicas, tolerating f failures. At any moment, one replica is the primary; the rest are backups. A view is a period during which a particular primary is in charge. Views are numbered with strictly-increasing view numbers. When a primary fails or is suspected of failing, the remaining replicas execute a view change protocol, electing a new primary and incrementing the view number.
Compare:
| Raft term | VR term | Paxos term |
|---|---|---|
| Term | View | Ballot |
| Leader | Primary | Distinguished proposer |
| Follower | Backup | Acceptor |
| Log | Op-log | Sequence of slot decisions |
| AppendEntries | Prepare | Phase 2 Accept |
These are different names for largely the same mechanism. But VR's choice of vocabulary — primary/backup, view, op-log — makes the protocol read less like agreement and more like replication of a service. That framing, it turns out, is what real systems care about.
Normal operation
A client sends a request to the primary. The primary:
- Assigns the request an op-number (its position in the op-log).
- Appends `(view, op-number, request)` to its log.
- Sends `Prepare(view, op-number, request, commit-number)` to each backup.
Each backup, on receiving a Prepare:
- Verifies the view is current and the op-number is next in sequence (waiting or requesting state transfer if not).
- Appends the entry to its log.
- Persists (with the same fsync caveats from Chapter 3).
- Replies `PrepareOK(view, op-number, replica-id)`.
The primary commits the op once it has PrepareOK from f backups (so f+1 total replicas, counting itself, have the op). It then:
- Updates its commit-number.
- Executes the op against the state machine.
- Replies to the client.
- Piggybacks the new commit-number on the next `Prepare` (or sends an explicit `Commit` message if there's no pipelined traffic).
Backups execute committed ops against their state machines in op-number order.
This is identical in structure to Raft's normal operation: one round-trip to a majority, leader applies, followers apply as they learn the commit.
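The primary's side of this loop can be sketched in a few lines. This is an illustrative model under stated assumptions (message transport, persistence, and out-of-order `PrepareOK`s are elided; class and field names are hypothetical):

```python
class Primary:
    def __init__(self, f):
        self.f = f                     # failures tolerated; N = 2f + 1
        self.view = 0
        self.log = []                  # entries: (view, op_number, request)
        self.commit_number = 0
        self.prepare_oks = {}          # op_number -> set of backup ids

    def on_client_request(self, request):
        op_number = len(self.log) + 1  # the op-log is 1-indexed
        self.log.append((self.view, op_number, request))
        self.prepare_oks[op_number] = set()
        return ("Prepare", self.view, op_number, request, self.commit_number)

    def on_prepare_ok(self, view, op_number, replica_id):
        """Returns True when this op commits."""
        if view != self.view:
            return False               # stale or future view: ignore
        self.prepare_oks[op_number].add(replica_id)
        # f PrepareOKs plus the primary itself means f+1 replicas hold the op.
        # (Only in-order commits are handled in this sketch.)
        if (len(self.prepare_oks[op_number]) >= self.f
                and op_number == self.commit_number + 1):
            self.commit_number = op_number
            return True                # now execute and reply to the client
        return False
```

At f = 2 (a five-replica group), op 1 commits once two distinct backups have acknowledged it.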
Client tables
VR makes explicit something Raft leaves to the application: each client has a monotonically increasing request-number, and the primary keeps a client table mapping each client to its latest request-number and the result of that request. If a client retries a request, the primary can short-circuit — look up the table, return the cached response — without redoing the work.
This matters because consensus without deduplication is not linearizable: a client that times out and retries can cause its command to execute twice. The client table is what makes VR's at-most-once semantics clean. In Raft, you implement this yourself; in VR, it's built in.
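A client table is a few lines of bookkeeping. The sketch below is illustrative rather than VR's exact specification (the `execute` callback stands in for running the op against the state machine):

```python
class ClientTable:
    def __init__(self):
        self.table = {}   # client_id -> (request_number, cached_result)

    def handle(self, client_id, request_number, execute):
        last = self.table.get(client_id)
        if last is not None:
            last_num, cached = last
            if request_number == last_num:
                return cached          # retry: return cached reply, no re-execution
            if request_number < last_num:
                return None            # stale request: drop
        result = execute()             # execute at most once
        self.table[client_id] = (request_number, result)
        return result
```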
View change
The view change is VR's most educational contribution. When backups suspect the primary has failed (by timeout), they initiate a view change:
- Each replica increments its view number, enters `view-change` status, and sends `StartViewChange(new-view, replica-id)` to everyone.
- On receiving `StartViewChange` from `f` other replicas (so `f+1` total in agreement), a replica sends `DoViewChange(new-view, log, old-view, op-number, commit-number, replica-id)` to the proposed new primary — the replica whose id is `new-view mod N`.
- The new primary waits for `DoViewChange` messages from `f` other replicas. It now has the logs from `f+1` replicas (including itself).
- The new primary chooses the log from whichever replica has the highest-view-number entries (ties broken by largest op-number). This is the definitive log.
- The new primary sends `StartView(new-view, log, op-number, commit-number)` to the backups.
- Backups replace their logs with the new primary's, enter the new view, and resume normal operation.
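The new primary's log-selection step reduces to a one-line maximum. The tuple layout here is a hypothetical simplification (VRR's actual messages carry more fields):

```python
def select_definitive_log(do_view_change_msgs):
    """do_view_change_msgs: list of (log, last_view, op_number) tuples,
    gathered from f+1 replicas (including the new primary itself).
    Highest last-view wins; ties broken by the largest op-number."""
    best = max(do_view_change_msgs, key=lambda m: (m[1], m[2]))
    return best[0]
```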
Three properties fall out:
- Safety. Any op that was committed in a prior view was, by the commit rule, replicated to at least `f+1` replicas. Any `DoViewChange`-gathering quorum of `f+1` replicas intersects that quorum in at least one replica. The new primary sees the op and includes it.
- Liveness. Under partial synchrony, view changes eventually succeed: messages arrive, timeouts are bounded, the new primary establishes itself.
- View number monotonicity. A replica in view `v` never regresses. A `Prepare` for view `v' < v` is rejected.
Raft's election + log-completeness check is a streamlined version of this: the election is the StartViewChange + DoViewChange combination, and the "at least as up-to-date" rule is how Raft picks the new leader's log without needing to gather logs from everyone explicitly.
Recovery
When a replica crashes and comes back up, it can't just resume. Its durable state might be stale. VR specifies a recovery protocol:
- The recovering replica picks a fresh nonce `x` and sends `Recovery(replica-id, x)` to all replicas.
- Each other replica responds with `RecoveryResponse(view, x, log, op-number, commit-number)` if it is the primary, or a simpler response with just view and nonce if it is a backup.
- The recovering replica waits for responses from `f+1` replicas (including the current primary's if available). It adopts the current view and, if the primary responded, the log.
- It resumes as a backup.
The nonce prevents an attacker, or simply stale state, from passing off old recovery responses as current. This is a defense against replay; VR takes this sort of thing seriously because it was designed for settings where replicas were, at times, distributed across untrusted administrative boundaries.
State transfer
A replica that falls behind (via a long delay or after recovery) catches up via state transfer. It asks another replica for the missing log entries between its op-number and the current commit-number. The other replica ships them.
This is functionally identical to Raft's catch-up mechanism, but VR makes it explicit in the protocol. It's one more piece you don't have to invent when implementing the algorithm.
What VR gets right that neither Paxos nor Raft emphasizes
Reading the VR papers and comparing them to Paxos and Raft, a few things stand out.
Client semantics are in the protocol. The client table, request-numbers, at-most-once delivery — all baked in. You don't read three VR implementations and see three different deduplication schemes; you see one.
The primary/backup framing is the right mental model. It matches how practitioners think about replicated services. When you say "primary/backup," engineers know roughly what you mean. When you say "proposer/acceptor" or "leader/follower," the meaning is the same but the intuition takes longer to build. VR's naming reduces the activation energy.
View change is treated as a first-class subroutine. The messages are named (StartViewChange, DoViewChange, StartView), the sequencing is explicit, and the correctness rule ("adopt the log from the replica with the highest-view latest op") is clean. Raft has this, but as part of the election plus AppendEntries pipeline; VR separates concerns.
Recovery is specified. The protocol says what a replica coming back from a crash does, step by step, with a nonce against replay. Raft assumes you persist currentTerm/votedFor and can resume; the rest is implicit. VR's explicitness is easier to implement correctly.
Relationship to Raft
Ongaro's dissertation credits VR more than Paxos, and it shows. Raft's:
- Strong-leader design: inherited from VR's primary/backup.
- Log-matching invariant: VR has the equivalent via `DoViewChange` and the primary's log-merge rule.
- Explicit membership change via a reconfiguration entry in the log: VR has an equivalent.
- Catch-up via the same AppendEntries messages that carry new entries: Raft's slight innovation; VR uses separate state transfer.
If you already know Raft, VR reads like a well-documented alternate history. If you are new to both, reading VRR (the 2012 Revisited paper) and then Raft gives you a clearer view of the problem space than either alone.
Why VR was "lost"
A guess, and no more: Paxos had Lamport's reputation behind it, and a paper with his name on the front. Viewstamped Replication had two MIT researchers and an unfamiliar framing (it was originally aimed at replicated services in a slightly different model — replication through agreement on a sequence of views). Paxos was mysterious but famous; VR was concrete but less cited. By the time the community realized that concrete beats mysterious, a decade had passed and the algorithm was "old."
The 2012 VRR paper was partly an acknowledgment that VR deserved re-introduction. It did not have Raft's cultural impact, but it should be read by anyone serious about implementing consensus. It is shorter and more complete than most of the Paxos canon.
What you should take from VR
- There is rarely one "right" algorithm; there are multiple coherent designs that tolerate the same failures. Paxos and VR were contemporaneous, solved the same problem, and pick different emphases.
- The client-facing parts matter. Request numbers, client tables, at-most-once semantics — if these aren't in your algorithm, they'll be in your bug tracker.
- View changes are not scary if you name them. The VR view change is four messages long. Raft's election is also short. The length of these protocols is the length of a careful state diagram, not an impossibility.
- If you ever need to defend a choice of "we used Raft" to a reviewer, the honest answer is: "It's Viewstamped Replication with better docs and a better name." Nobody will be mad.
Next, we turn to the dark side: what happens when nodes don't just crash, but lie.
Byzantine Fault Tolerance, Classical
We have spent six chapters assuming that nodes, when they fail, fail cleanly. They stop. They don't send any more messages. When they come back, they come back honestly with a consistent view of their own past.
Now drop that assumption. What if a node can do anything?
Lie about its log. Send different votes to different peers. Forge messages, if the cryptography allows. Collude with other faulty nodes. Act perfectly normal for an hour and then betray.
This is the Byzantine failure model, named by Leslie Lamport, Robert Shostak, and Marshall Pease in their 1982 paper "The Byzantine Generals Problem." The paper frames the model as a story about Byzantine generals besieging a city, some of whom might be traitors. The story is charming; the math under it is unforgiving.
The generals' problem
Several generals, each commanding a division outside the city, must agree on a single plan: attack or retreat. They communicate by messenger. Some of the generals may be traitors who will try to prevent the loyal ones from reaching agreement. Traitors can:
- Send different plans to different generals.
- Lie about what other generals told them.
- Collaborate with other traitors.
Can the loyal generals agree on a plan, and can they carry out that plan together?
Translated to distributed systems: can correct nodes reach consensus despite some peers behaving arbitrarily?
The core theorem
The paper's main result: 3f + 1 nodes are necessary to tolerate f Byzantine failures.
This is tight — you cannot do it with fewer. The bound applies under a particular communication model (unauthenticated messages — no cryptographic signatures). With signatures, the bound changes, which we'll discuss in a moment.
Let's build intuition for why 3f+1 is needed.
Why 3f+1? (Intuition, not proof)
Suppose we have N nodes, with up to f Byzantine. We want the honest majority to reach agreement despite the liars.
When you want a decision, you gather messages from the group. But:
- Up to `f` nodes might not respond (silent, crashed, partitioned).
- Up to `f` nodes might respond with lies.
So from any quorum Q that you wait for, you subtract:
- `f` nodes that might not be in `Q` at all (they're silent).
- `f` of the nodes in `Q` might be lying.
For the remaining honest-and-responsive nodes in Q to constitute a majority of the honest nodes overall, we need:
|Q| - f > (N - f) / 2
Solving: |Q| > (N + f) / 2, so with integer quorum sizes, |Q| ≥ ⌈(N + f + 1) / 2⌉.
Now for two such quorums to intersect in at least f+1 honest nodes (so that any two quorums agree on the honest majority), we need:
2|Q| - N ≥ f + 1
Combining gives N ≥ 3f + 1. Hence the magic number.
(The careful version of this argument is in the 1982 paper and in Castro and Liskov's PBFT paper. The sketch above is enough to feel why it's not 2f+1.)
Quorum sizes in BFT
With N = 3f + 1:
- Quorum size (for most BFT protocols): `2f + 1`.
- Two quorums intersect in at least `f + 1` nodes, so at least one honest node is in any intersection — enough to ensure consistency.
| f | N | Quorum |
|---|---|---|
| 1 | 4 | 3 |
| 2 | 7 | 5 |
| 3 | 10 | 7 |
The jump from 2f+1 to 3f+1 is expensive. Tolerating one failure goes from 3 nodes (crash) to 4 nodes (Byzantine). Tolerating two failures goes from 5 to 7. The proportional overhead of BFT runs from about a third more hardware at f=1 toward 50% more as f grows, for the same fault tolerance.
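The table follows directly from the two quorum formulas, which a few lines can reproduce (function and key names are illustrative):

```python
def bft_parameters(f):
    """Quorum arithmetic for a BFT system tolerating f Byzantine failures."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    # Any two quorums of size 2f+1 drawn from n = 3f+1 nodes overlap in at
    # least 2*(2f+1) - (3f+1) = f+1 nodes...
    overlap = 2 * quorum - n
    # ...of which at least one is honest, since at most f are Byzantine.
    return {"n": n, "quorum": quorum, "min_overlap": overlap,
            "honest_in_overlap": overlap - f}
```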
Authenticated vs. unauthenticated messages
Lamport, Shostak, and Pease distinguish two message models:
- Unauthenticated (oral) messages. A node receiving a relayed message has no way to verify who originally sent it. A traitor can claim "General A told me X" and there's no way to verify. This is the model in which `3f+1` is necessary.
- Authenticated (signed) messages. Every message is signed by its author. A relayed message can be verified: if General A signed it, no other node can forge it. In this model, `f + 1` nodes suffice for some variants of the problem, though the general case still depends on network assumptions. For the full distributed-consensus problem under partial synchrony, you still need `3f+1` nodes.
In practice, every modern BFT system (PBFT, HotStuff, Tendermint, the whole lineage) uses signatures. The 3f+1 bound still applies to the node count, because it's driven by quorum intersection arguments, not by message authenticity. But signatures change what a Byzantine node can do — it can still lie about what it received, but it can't impersonate other nodes or forge their votes.
This is a significant practical simplification. Without signatures, a Byzantine node can claim "I heard X from A" when A said nothing of the sort, and honest nodes have to untangle that. With signatures, every statement about another node's vote must be backed by that node's actual signature, which a Byzantine node cannot fabricate.
What changes from CFT to BFT
Compare the required machinery:
| Aspect | CFT (Paxos/Raft/VR) | BFT (PBFT and descendants) |
|---|---|---|
| Node count | 2f+1 | 3f+1 |
| Quorum | f+1 | 2f+1 |
| Message authenticity | MACs or TLS suffice | Full digital signatures needed |
| Phase count per decision | 2 (prepare, accept) | 3 (pre-prepare, prepare, commit) or more |
| Why 3 phases | N/A | To prevent equivocation — a primary sending conflicting pre-prepares to different nodes |
| Client confirmation | 1 replica's reply | f+1 matching replies (so at least one honest) |
| Message complexity | O(n) per decision | O(n^2) classical; O(n) with linear BFT |
The three-phase protocol is specifically to handle a Byzantine primary. A CFT leader might crash, but it won't lie about what it proposed. A Byzantine primary might propose X to half the replicas and Y to the other half — equivocation. The extra phase exists to let replicas cross-check each other before committing.
What stays the same
A surprising amount:
- State machine replication as the goal. Same problem, same decomposition.
- View/ballot/term as a logical clock for leadership. PBFT calls it a view, just like VR.
- Leader changes as a first-class subroutine. PBFT's view change is structurally similar to VR's, plus extra cross-checking for Byzantine primaries.
- Commit rules driven by quorum intersection. Everything still grounds in "any two quorums overlap in enough correct nodes."
- Partial synchrony for liveness. BFT also inherits the FLP escape hatch; it's safe under asynchrony, live under partial synchrony.
If you understand Paxos/Raft/VR deeply, BFT feels like "the same thing, but everyone's paranoid."
The variants of the generals' problem
The 1982 paper actually presents several versions, with subtly different requirements:
- Byzantine Agreement (BA). One distinguished commander starts with a value. All loyal lieutenants must agree on the same value. If the commander is loyal, the agreed value must be the commander's.
- Interactive Consistency (IC). Every node starts with a value. All loyal nodes must agree on a vector of values, one per node, such that the entry for each loyal node is that node's input.
- Byzantine Generals Problem (BGP). The general formulation, essentially BA without designating a commander.
For distributed-systems purposes, BA is the foundational case — "the leader has a value, everyone agrees on it (or detects the leader as faulty)." State machine replication is BA applied repeatedly, with the primary as commander.
Where this leaves us
The classical Byzantine results are forty-plus years old. They establish:
- The node count needed (`3f+1`).
- The message authentication model options.
- The formal definition of what "agreement under lying" means.
What they don't give you is a practical algorithm you can run on modern hardware. The 1982 paper's algorithms are pedagogical — they show existence, not performance. For decades after, BFT was "theoretically solved but practically intractable": the message complexity was O(n^{f+1}) or worse, making it useless beyond tiny n.
That stayed the state of the art until 1999, when Miguel Castro and Barbara Liskov published "Practical Byzantine Fault Tolerance", which reduced the cost enough to actually run. PBFT was the breakthrough that made BFT useful, and it is the direct ancestor of nearly every modern BFT system.
That's the next chapter.
PBFT: Practical Byzantine Fault Tolerance
In 1999, Miguel Castro and Barbara Liskov published Practical Byzantine Fault Tolerance at OSDI. The title is a statement. The paper's central claim was that BFT had finally become cheap enough to use — that a BFT-replicated NFS was only about 3% slower than an unreplicated one. For a field that had been assuming BFT was academically interesting but operationally prohibitive, this was a sea change.
Nearly every Byzantine-fault-tolerant system you can name descends from PBFT. Tendermint is PBFT with some changes. HotStuff is PBFT refactored for linear communication. LibraBFT, DiemBFT, AptosBFT — all HotStuff descendants, so all PBFT's grandchildren. Hyperledger Fabric's BFT ordering service is a PBFT variant. If you understand PBFT, you have the backbone of modern BFT.
The setting
- `N = 3f + 1` replicas, numbered `0..N-1`.
- Up to `f` replicas may be Byzantine.
- Messages are authenticated — signatures or MAC-authenticators.
- At any time, one replica is the primary, the rest are backups.
- The primary for view `v` is replica `p = v mod N`.
- A view is a period during which a particular primary is in charge.
Normal operation: three phases
The three-phase protocol is the heart of PBFT. Clients send requests to the primary; the primary drives them through three phases; replicas commit and reply.
client primary backup backup backup
│ │ │ │ │
│──req─▶│ │ │ │
│ │──pre-prepare───▶│─────▶│──┐
│ │ │ │ │
│ │◀────prepare─────┤ │ │ Phase 2: everybody broadcasts
│ │◀────prepare─────┼──────┤ │ to everybody
│ │───prepare──────▶│──────▶│ │
│ │ │ │ │
│ │──────commit─────┼──────▶│
│ │◀─────commit─────┤ │
│ │ │ │
│◀──reply───────────────── │
In more detail:
Phase 1: Pre-prepare. The primary assigns the request a sequence number n and sends PRE-PREPARE(v, n, d, m) to all backups, where v is the view, n the sequence number, d the digest of the request, and m the request itself. The primary is committing to the assignment: "this request goes in slot n of the log."
Phase 2: Prepare. Each backup, if it accepts the pre-prepare, sends PREPARE(v, n, d, i) to all other replicas (including the primary), where i is its own id. A replica prepares a request when it has the pre-prepare plus 2f matching PREPARE messages, so that 2f+1 replicas in total (counting the primary) agree on the (v, n, d) tuple.
Prepare is what ensures no two conflicting requests can be prepared under the same view and sequence number. It needs 2f+1 nodes to agree — a quorum of honest-plus-possibly-lying nodes that, by quorum intersection, share honest witnesses with any other quorum.
Phase 3: Commit. Having prepared, each replica sends COMMIT(v, n, d, i) to all others. A replica commits when it has collected 2f+1 matching COMMIT messages. Once committed locally, it executes the request and replies to the client.
Commit is what ensures a committed request survives view changes. Prepare alone would be enough for agreement within a view, but a Byzantine primary could tell different replicas that different values were prepared. The commit phase forces cross-replica confirmation that a value is not only locally prepared but visible-as-prepared to 2f+1 replicas. Any future view change will see at least f+1 honest replicas with that committed value, and they will propagate it.
Why three phases
Why do we need three phases instead of Raft's two?
Consider a Byzantine primary. In Raft, the leader is trusted: it might crash, but it won't lie. In PBFT, the primary can equivocate — send PRE-PREPARE(v, n, d_1, m_1) to some replicas and PRE-PREPARE(v, n, d_2, m_2) to others, with d_1 ≠ d_2. This is a Byzantine primary trying to get two different values committed for the same slot.
The prepare phase defeats this. A replica only prepares a request if it sees 2f+1 matching prepares — meaning that at least f+1 honest replicas agree on the same digest. Two different digests couldn't both collect 2f+1 prepares, because the quorums overlap in at least f+1 honest replicas, who wouldn't prepare two different digests.
So after prepare, one can say: "at most one digest could have been prepared in view v at slot n."
Why then commit? Because prepare-only is insufficient under view change. Suppose value X is prepared in view 1 — 2f+1 replicas have matching prepares. A view change happens. The new primary gathers state from 2f+1 replicas. Only f+1 might be among those who have the prepared X; an adversary (or just bad luck) could arrange that the gathered 2f+1 includes fewer than f+1 honest-and-with-X replicas, and X gets lost.
Commit forces a second round of confirmation: by the time a replica commits X, it knows that 2f+1 replicas agree X was prepared, and any future view change's gathered state must include at least f+1 honest replicas who saw that agreement. The new primary reconstructs the committed value.
In short: prepare ensures agreement within a view; commit ensures agreement across views.
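The two thresholds can be sketched as per-slot predicates. This is an illustrative model, not PBFT's full state machine (message transport, digests, signature checks, and view changes are elided; names are hypothetical):

```python
class Slot:
    """Vote-tracking state for one (view, sequence-number) slot, f Byzantine."""
    def __init__(self, f):
        self.f = f
        self.pre_prepare = None        # (view, seq, digest) from the primary
        self.prepares = set()          # backup ids with matching PREPARE
        self.commits = set()           # replica ids with matching COMMIT

    def prepared(self):
        # Pre-prepare plus 2f matching prepares: 2f+1 replicas agree within
        # this view, so no conflicting digest can also be prepared here.
        return self.pre_prepare is not None and len(self.prepares) >= 2 * self.f

    def committed_local(self):
        # 2f+1 matching commits: the agreement is visible to enough replicas
        # that any future view change must rediscover it.
        return self.prepared() and len(self.commits) >= 2 * self.f + 1
```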
Client semantics
A client sends its request to the primary (or all replicas, if the primary is suspected faulty). The client waits for f+1 matching replies, so that at least one honest replica vouches for the result. f+1 is the minimum such that, even if f are Byzantine and forge replies, at least one honest-and-correct reply is in the set.
Clients maintain a timestamp (monotonically increasing per client) to deduplicate requests. Replicas keep a per-client table, similar to VR's client table, mapping client id to the latest timestamp executed.
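The client's acceptance rule is a small vote-counting loop. A hypothetical sketch (reply verification and retransmission are elided):

```python
def accept_result(replies, f):
    """replies: list of (replica_id, result) pairs as they arrive.
    Returns a result once f+1 distinct replicas agree on it, so at least
    one honest replica vouches for it. Returns None if no value qualifies."""
    votes = {}
    for replica_id, result in replies:
        votes.setdefault(result, set()).add(replica_id)
        if len(votes[result]) >= f + 1:
            return result
    return None
```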
View change
When backups suspect the primary is faulty (typically via a timeout on a request they forwarded), they initiate a view change.
- Each backup enters `view-change` status and sends `VIEW-CHANGE(v+1, n, C, P, i)` to all replicas, where:
  - `v+1` — the new view.
  - `n` — the highest sequence number the replica has executed (stable checkpoint).
  - `C` — the set of checkpoint certificates (proofs that the state up to `n` is agreed).
  - `P` — for each request prepared since the last stable checkpoint, a certificate of that preparation (a set of `2f` matching `PREPARE`s plus the `PRE-PREPARE`).
- The new primary for `v+1` (which is `(v+1) mod N`) collects `VIEW-CHANGE` messages from `2f` replicas (plus its own, so `2f+1` total).
- The new primary constructs a `NEW-VIEW(v+1, V, O)` message, where:
  - `V` is the set of valid `VIEW-CHANGE` messages it received.
  - `O` is a set of `PRE-PREPARE` messages for sequence numbers to be re-proposed in the new view. For each sequence number `n` in the relevant range, if any of the received `VIEW-CHANGE`s showed a prepare certificate for `n` with digest `d`, the new primary issues `PRE-PREPARE(v+1, n, d)`. For sequence numbers with no prepare certificate, it issues a null pre-prepare (a placeholder no-op).
- The new primary broadcasts `NEW-VIEW` along with all the `PRE-PREPARE`s.
- Backups validate the `NEW-VIEW`: they check that it is consistent with the `VIEW-CHANGE` messages, that the `PRE-PREPARE`s correctly reflect prepared values from prior views, and that all relevant sequence numbers are handled. If valid, they enter `v+1` and proceed with normal operation.
The complexity here is real. View change is the hardest part of PBFT to implement correctly; it has an enormous set of invariants to preserve. The original paper and Castro's thesis spend many pages on it. It is the single place where a Byzantine primary can cause the most havoc, and it's the place where the protocol's correctness is tightest.
Why view change is so hairy
A view change must:
- Ensure every committed request from prior views is preserved.
- Rebuild the log for sequence numbers that were prepared but not committed.
- Not introduce any new conflicts with past prepares.
- Do all this despite the fact that the new primary (and up to `f-1` other replicas) may be Byzantine.
Each received VIEW-CHANGE from a backup carries cryptographic proofs — signed PRE-PREPAREs and PREPAREs — that the new primary must validate. Replicas validate the NEW-VIEW by re-performing the same logic. This is a lot of crypto-checking and state-stitching.
A later revision of the protocol (Castro's dissertation) simplifies and tightens this. Modern descendants have found further simplifications — HotStuff's chained-commit structure, in particular, folds view changes into regular message flow without needing a separate hairy subroutine. We'll meet that in the next chapter.
Checkpoints and garbage collection
The log of requests grows forever without garbage collection. PBFT uses checkpoints: every k sequence numbers (typically every 100 or 128), each replica computes a digest of its state and sends CHECKPOINT(n, d, i). A replica considers a checkpoint stable when it has 2f+1 matching checkpoint messages. Everything before a stable checkpoint can be discarded — including prepared-but-not-committed entries (which, at a stable checkpoint, either committed or will never commit).
Stable checkpoints are important not just for storage but for bounding view change cost: the VIEW-CHANGE message only has to carry prepare certificates since the latest stable checkpoint, not the whole log.
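Checkpoint stabilization is likewise a counting rule. A hypothetical sketch (state-digest computation and actual log truncation are elided):

```python
class CheckpointTracker:
    """Tracks CHECKPOINT votes; f is the number of Byzantine faults tolerated."""
    def __init__(self, f):
        self.f = f
        self.votes = {}                # (seq, digest) -> set of replica ids
        self.stable_seq = 0

    def on_checkpoint(self, seq, digest, replica_id):
        key = (seq, digest)
        self.votes.setdefault(key, set()).add(replica_id)
        # 2f+1 matching CHECKPOINT messages make the checkpoint stable;
        # everything before stable_seq can then be garbage-collected.
        if len(self.votes[key]) >= 2 * self.f + 1 and seq > self.stable_seq:
            self.stable_seq = seq
        return self.stable_seq
```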
The O(n²) problem
PBFT's normal-case message complexity per decision:
- Pre-prepare: primary → `N-1` backups. `N-1` messages.
- Prepare: each of `N` replicas → `N-1` others. `N(N-1)` messages.
- Commit: each of `N` replicas → `N-1` others. `N(N-1)` messages.
Total: O(N^2) per decision. Summing the phases gives (N-1) + 2N(N-1) messages: 27 at N = 4, nearly 20,000 at N = 100, about two million at N = 1000.
This quadratic behavior is tolerable at small scale — BFT systems historically ran with 4, 7, or at most a few dozen nodes. It becomes prohibitive at blockchain scales, where validator sets of 100+ are common. This is exactly what HotStuff addresses.
The quadratic cost comes from the all-to-all broadcasts in prepare and commit. If you could somehow collapse those into linear communication while preserving the safety guarantees, you'd have PBFT with much better scaling. That's what HotStuff does, by using threshold signatures and a leader that aggregates votes.
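The phase counts above sum to a simple closed form, which a two-line function makes concrete (the function name is illustrative; batching is ignored):

```python
def pbft_messages(n):
    """Normal-case PBFT messages per decision for an n-replica cluster."""
    pre_prepare = n - 1                # primary -> each backup
    prepare = n * (n - 1)              # all-to-all
    commit = n * (n - 1)               # all-to-all
    return pre_prepare + prepare + commit
```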
Optimizations in the original paper
Castro and Liskov did not stop at the pedagogical version. The OSDI 1999 paper and Castro's PhD thesis include optimizations that make real PBFT deployments workable:
- MAC-based authentication for normal case. Digital signatures are expensive; MAC vectors are cheap. PBFT uses MACs for pre-prepare/prepare/commit in the normal case and reserves signatures for view change messages (where non-repudiation matters most).
- Batching. The primary batches multiple client requests into a single pre-prepare.
- Read-only optimization. Read-only operations can skip the three-phase protocol — the client sends the read to all replicas, waits for 2f+1 matching replies, and is done. This is safe because reads don't modify state.
- Tentative execution. Replicas execute requests after the prepare phase (tentatively) and send replies to the client. A client waits for 2f+1 matching tentative replies; if it gets them, it has a strong enough certificate to proceed without waiting for commit. If not, it waits for f+1 matching committed replies.
These optimizations cut the latency of normal operations substantially — from 3 phases to "1 phase for reads, 2 phases for common-case writes."
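The client side of these optimizations reduces to counting matching replies against a threshold. A minimal sketch (the `Reply` type and function names are mine, not from the PBFT codebase):

```go
package main

import "fmt"

// Reply is a hypothetical reply a replica sends back to a client.
type Reply struct {
	ReplicaID int
	Value     string
}

// haveCertificate reports whether at least `need` replies carry the
// same value — e.g. need = 2f+1 for the read-only optimization, or
// f+1 when falling back to committed replies.
func haveCertificate(replies []Reply, need int) (string, bool) {
	counts := map[string]int{}
	for _, r := range replies {
		counts[r.Value]++
		if counts[r.Value] >= need {
			return r.Value, true
		}
	}
	return "", false
}

func main() {
	f := 1 // N = 3f+1 = 4 replicas
	replies := []Reply{{0, "x=7"}, {1, "x=7"}, {3, "x=7"}}
	if v, ok := haveCertificate(replies, 2*f+1); ok {
		fmt.Println("read completes with value:", v)
	}
}
```

The thresholds do the safety work: 2f+1 matching replies must include at least f+1 honest replicas, more than any other value can muster.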
Engineering lessons from PBFT
- Byzantine safety is composable. PBFT's three-phase protocol, once you understand it, is a reusable building block. You can run it for each log slot (linearly) or for each round of a pipeline (chained).
- View change is where the bodies are buried. Most PBFT implementation bugs have been found in view change. Anyone implementing BFT should write the most careful tests here.
- Crypto costs matter. The MAC vs. signature choice had large performance consequences. Modern threshold signature schemes (BLS, etc.) have changed this balance again.
- BFT is not binary. "BFT" can mean a 4-node system handling 1 failure or a 100-node system handling 33 failures. The cost structure is very different.
What to take into HotStuff
PBFT's normal-case sequence (pre-prepare, prepare, commit) maps directly onto HotStuff's chained phases (prepare, pre-commit, commit, decide, but pipelined). The reason HotStuff can claim "linear communication" while PBFT cannot is that HotStuff aggregates votes at the leader using threshold signatures, rather than having all replicas gossip with all others.
If PBFT is the O(n²) landmark, HotStuff is the O(n) response. We'll see how next.
HotStuff and the Linear BFT Family
PBFT changed the world by making Byzantine consensus fast enough to use. HotStuff changed it again by making Byzantine consensus scalable enough to use with many replicas.
Maofan Yin, Dahlia Malkhi, Michael Reiter, Guy Gueta, and Ittai Abraham published HotStuff: BFT Consensus in the Lens of Blockchain at PODC 2019. The paper's name is a tell — it was born out of the realization that classical BFT (PBFT with its O(n²) communication) could not keep up with the n that blockchain consortia wanted.
HotStuff's two contributions:
- Linear communication per decision. By having replicas send votes to the leader rather than gossiping with each other, and aggregating those votes with threshold signatures, HotStuff cuts the per-decision message count from O(n²) to O(n).
- Optimistic responsiveness. In partial synchrony, HotStuff progresses as fast as the network allows, without waiting for fixed timeouts in the common case — a property PBFT lacks.
The setting
- N = 3f + 1 replicas, as in PBFT.
- Up to f Byzantine.
- Digital signatures (actually threshold signatures — more on this shortly).
- A rotating leader — each view's leader is different. Unlike PBFT, where the leader stays until suspected faulty, HotStuff rotates the leader every consensus round.
- Messages carry Quorum Certificates (QCs) — threshold-signature aggregations of 2f+1 matching votes.
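The arithmetic of these parameters is worth internalizing; a tiny helper makes the f-to-N relationship explicit (sketch, names mine):

```go
package main

import "fmt"

// bftSizes returns, for a tolerated number of Byzantine faults f, the
// minimum replica count N = 3f+1 and the quorum size 2f+1 used to
// form QCs. Any two quorums intersect in at least f+1 replicas, of
// which at least one is honest.
func bftSizes(f int) (n, quorum int) {
	return 3*f + 1, 2*f + 1
}

func main() {
	for _, f := range []int{1, 3, 33} {
		n, q := bftSizes(f)
		fmt.Printf("f=%2d -> N=%3d, quorum=%3d\n", f, n, q)
	}
}
```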
Quorum certificates and threshold signatures
Core HotStuff move. Instead of every replica broadcasting PREPARE to every other replica (quadratic), every replica sends its signed vote to the leader. The leader collects 2f+1 votes and aggregates them into a Quorum Certificate — a single, constant-size object that is verifiable proof that 2f+1 replicas voted for the same thing.
Threshold signatures make this possible. A (t, n)-threshold signature scheme lets any t of n signers produce a single signature on a message, verifiable with a single public key. The leader, on collecting t = 2f+1 signature shares, combines them into one threshold signature. The resulting QC is one object, one signature-verify away from proof.
Without threshold signatures, a QC would be 2f+1 separate signed votes — O(n) in size, and O(n²) to broadcast. With threshold signatures, a QC is O(1) in size, and broadcast is O(n).
BLS signatures (Boneh-Lynn-Shacham) are the standard choice — their aggregation is particularly clean. Cryptographic cost is nontrivial but amortizable.
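A QC, stripped of the cryptography, is just "proof that 2f+1 distinct replicas signed the same (view, block)". A toy aggregator — the signature share is an opaque placeholder here, and the type names are mine, not from any real HotStuff codebase:

```go
package main

import "fmt"

// Vote is a replica's signed vote for a block in a view. Sig stands in
// for a threshold-signature share; here it is just opaque bytes.
type Vote struct {
	View      int
	BlockHash string
	ReplicaID int
	Sig       []byte
}

// QC is the aggregated certificate: one constant-size object proving
// that a quorum voted for (View, BlockHash).
type QC struct {
	View      int
	BlockHash string
	AggSig    []byte // in a real system, a single threshold signature
}

// aggregate forms a QC once `quorum` distinct replicas have voted for
// the same (view, block). The "aggregation" is a placeholder; a real
// implementation combines BLS signature shares into one signature.
func aggregate(votes []Vote, quorum int) (QC, bool) {
	if len(votes) == 0 {
		return QC{}, false
	}
	seen := map[int]bool{}
	for _, v := range votes {
		if v.View != votes[0].View || v.BlockHash != votes[0].BlockHash {
			continue // ignore votes for other proposals
		}
		seen[v.ReplicaID] = true
	}
	if len(seen) < quorum {
		return QC{}, false
	}
	return QC{View: votes[0].View, BlockHash: votes[0].BlockHash,
		AggSig: []byte("<threshold-sig>")}, true
}

func main() {
	votes := []Vote{
		{5, "b42", 0, nil}, {5, "b42", 1, nil}, {5, "b42", 2, nil},
	}
	if qc, ok := aggregate(votes, 3); ok { // quorum 2f+1 = 3 for f=1
		fmt.Printf("formed QC for %s in view %d\n", qc.BlockHash, qc.View)
	}
}
```

The structural point: after aggregation, only the single QC travels the network, so the broadcast that follows is O(n), not O(n²).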
The basic HotStuff protocol
Basic HotStuff (as opposed to Chained HotStuff) runs in four phases per decision. Each phase is a round-trip between leader and replicas, using QCs to carry proof of the previous phase.
PREPARE          PRE-COMMIT        COMMIT             DECIDE
│                │                 │                  │
└─ leader        └─ leader         └─ leader          └─ replicas
   proposes         sends             sends              execute
   block            prepareQC         precommitQC
Specifically:
- Prepare phase.
  - Leader proposes a new block b that extends the highest prepareQC it has seen.
  - Each replica votes if it is safe to vote (details below).
  - Leader collects 2f+1 votes into prepareQC.
- Pre-commit phase.
  - Leader broadcasts prepareQC.
  - Each replica, upon seeing prepareQC, votes for pre-commit.
  - Leader collects 2f+1 votes into precommitQC.
- Commit phase.
  - Leader broadcasts precommitQC.
  - Each replica votes for commit.
  - Leader collects 2f+1 votes into commitQC.
- Decide phase.
  - Leader broadcasts commitQC.
  - Each replica, upon seeing commitQC, executes the block's commands.
The safety rule for voting is the locked-QC rule: a replica votes for a block b if b extends the block it is locked on (the one with its highest precommitQC) or if b's parent's QC has a higher view than the locked one.
The reason for four phases rather than PBFT's three is subtle. The extra phase gives HotStuff a property called optimistic responsiveness — the protocol advances as fast as the leader can collect votes, without waiting for a fixed timeout, because the additional phase's structure ensures that a new leader can always make progress based on QCs from past views, without waiting for a "view-stable" timer.
Don't worry if this takes a second pass; it is the most debated part of the design.
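The locked-QC rule is compact enough to state in code. A sketch under assumed types — a block carries the QC justifying its parent; the field and function names are mine:

```go
package main

import "fmt"

// QC certifies a block at a given view.
type QC struct {
	View      int
	BlockHash string
}

// Block carries the QC that justifies its parent (its "justify" link).
type Block struct {
	Hash    string
	Parent  string
	Justify QC
}

// extends reports whether b descends from the block with hash anc,
// walking parent links through the given block store.
func extends(b Block, anc string, chain map[string]Block) bool {
	for {
		if b.Hash == anc {
			return true
		}
		p, ok := chain[b.Parent]
		if !ok {
			return false
		}
		b = p
	}
}

// safeToVote is the locked-QC rule: vote for b if it extends the
// locked block, or if b's justify QC is from a higher view than the
// lock (the liveness escape hatch after a bad view).
func safeToVote(b Block, locked QC, chain map[string]Block) bool {
	return extends(b, locked.BlockHash, chain) || b.Justify.View > locked.View
}

func main() {
	chain := map[string]Block{
		"g": {Hash: "g"},
		"a": {Hash: "a", Parent: "g", Justify: QC{View: 1, BlockHash: "g"}},
	}
	b := Block{Hash: "b", Parent: "a", Justify: QC{View: 2, BlockHash: "a"}}
	locked := QC{View: 1, BlockHash: "a"}
	fmt.Println(safeToVote(b, locked, chain)) // b extends the locked block
}
```

The first disjunct preserves safety (never abandon a locked branch lightly); the second preserves liveness (a higher-view QC is proof that a quorum has already moved past your lock).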
Chained HotStuff
Four phases per decision is a lot of round-trips. Chained HotStuff pipelines them. Each view's single round of messages does quadruple duty:
- It is the decide phase for some earlier block b.
- It is the commit phase for a slightly-less-earlier block b'.
- It is the pre-commit phase for the block after that.
- It is the prepare phase for the current block.
One round of message exchange, four blocks worth of progress simultaneously. The net effect: one message round-trip per decision, amortized.
view v: prepare(b4) | pre-commit(b3) | commit(b2) | decide(b1)
view v+1: prepare(b5) | pre-commit(b4) | commit(b3) | decide(b2)
view v+2: prepare(b6) | pre-commit(b5) | commit(b4) | decide(b3)
Each block is decided three views after it is prepared. Latency per decision is higher (measured in views), but throughput is one decision per view. For a system processing many requests, throughput is the relevant metric.
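In chained form, the commit rule becomes: three QCs at consecutive views on one branch commit the oldest of the three blocks. A sketch with assumed types (names mine):

```go
package main

import "fmt"

// Node is a block in the chain together with the view of the QC that
// certifies it.
type Node struct {
	View   int
	Parent *Node
}

// committedAncestor applies the chained commit rule: if b, its
// parent, and its grandparent carry QCs at consecutive views, the
// grandparent (and everything before it) is committed.
func committedAncestor(b *Node) *Node {
	p := b.Parent
	if p == nil || p.Parent == nil {
		return nil
	}
	gp := p.Parent
	if b.View == p.View+1 && p.View == gp.View+1 {
		return gp // three-chain complete: gp commits
	}
	return nil // a view was skipped somewhere; keep waiting
}

func main() {
	b1 := &Node{View: 7}
	b2 := &Node{View: 8, Parent: b1}
	b3 := &Node{View: 9, Parent: b2}
	if c := committedAncestor(b3); c != nil {
		fmt.Println("committed block from view", c.View)
	}
}
```

The consecutive-view requirement is what makes a skipped view (a failed leader) delay, but never corrupt, commitment.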
Rotating leaders
In PBFT, the leader stays in place until it is suspected faulty. HotStuff rotates the leader every view, without waiting for suspicion. Why?
- Fairness. No single replica has disproportionate influence.
- Simplicity. Every view has a clear new leader; view changes and normal operation are unified.
- Liveness. If the current leader is Byzantine or slow, the next view automatically gives someone else the chance to lead. No explicit "is the leader faulty?" decision needed.
The cost: you lose the "stick with a good leader" optimization. In the common case where the current leader is correct and fast, HotStuff still rotates. The linear communication pattern compensates for this by making each leader-round cheap.
View change and liveness
A view change in HotStuff is just "the next view's leader starts proposing." There is no separate view change protocol. If a leader is slow or silent, replicas time out and advance to the next view. The new leader gathers 2f+1 NEW-VIEW messages (each containing the sender's locked QC and prepareQC) and proposes a new block extending the highest safe block.
This unified structure is part of why HotStuff is easier to reason about than PBFT — there's no discontinuity between "normal case" and "view change."
LibraBFT, DiemBFT, AptosBFT
Facebook's Libra project (2019) needed a consensus protocol for its permissioned blockchain with dozens to hundreds of validators. Classical PBFT wouldn't scale; proof-of-stake was overkill for a known validator set. They picked HotStuff and extended it into LibraBFT.
Libra was renamed Diem in 2020 (and DiemBFT with it) and shut down in 2022. Meta transferred the technology to the Diem Association, which sold parts of it off. Meanwhile, several ex-Diem engineers founded Aptos, which ships AptosBFT (another HotStuff descendant, currently part of the Aptos blockchain network). Sui, another ex-Diem offshoot, uses a different consensus design (Narwhal+Bullshark).
Key extensions these projects made:
- Pacemaker. A separate module that handles view timeouts, randomization, and leader selection, decoupled from the core consensus logic. This makes the protocol testable and tunable.
- Pipelined commit rule. Three consecutive chained views of the same branch imply commit of the oldest of those three. Concretely: if a block has a QC that is "grandchild-QC'd" by a block at view v+2, it commits. This simplifies the safety argument.
- Reconfiguration. Validator set changes via epoch boundaries — consensus runs within an epoch, and epoch transitions include a reconfiguration step that atomically swaps validator sets.
- Execution optimizations. Batching, parallel execution of independent transactions, careful memory management. Less "consensus" per se and more "making the practical system fast."
These are not fundamental changes to HotStuff's consensus mechanics. They are the additional engineering it takes to ship a production blockchain using HotStuff.
The tradeoffs
HotStuff vs. PBFT:
| Dimension | PBFT | HotStuff (chained) |
|---|---|---|
| Communication per decision | O(n²) | O(n) |
| Cryptographic ops per decision | Signatures/MACs across O(n²) messages | Threshold signature aggregation, O(n) |
| Leader stability | Sticky (until suspected) | Rotates every view |
| Latency (phases per decision) | 3 | 4 (but pipelined → effective 1 per decision) |
| View change complexity | Separate, intricate protocol | Unified with normal case |
| Implementation lines | Large; VR- and PBFT-like state machine | Modular with pacemaker; surprisingly compact |
| Ecosystem | Handful of research and industry impls | Multiple production deployments (Diem/Aptos, others) |
What you give up:
- Raw throughput under a stable, correct leader. Leader rotation means every view starts with cold caches and fresh leader overhead. A PBFT system with a very reliable primary might outperform HotStuff at small n.
- Simplicity on first read. PBFT's three phases are easier to internalize the first time than HotStuff's four phases plus chaining. HotStuff's elegance is structural but takes a second read.
What you gain:
- Scalability. n can be 100+ without the quadratic cost dominating.
- Uniformity. Normal case and view change share structure.
- Optimistic responsiveness. Common-case progress tracks network speed.
- Optimistic responsiveness. Common-case progress tracks network speed.
The linear-BFT family
HotStuff was not the first attempt at reducing BFT communication, but it was the one that found the right combination of rotating leaders, QCs, and pipelining. Several other proposals in the same family:
- SBFT (Gueta et al., 2019) — uses threshold signatures for vote aggregation; closer to PBFT in structure.
- Tendermint (Buchman, 2016) — PBFT-descended, rotating leader, used in Cosmos. Predates HotStuff. Linear in common case, but has a different liveness property.
- Narwhal and Bullshark (used in Sui, later succeeded there by Mysticeti) — mempool/consensus separation; DAG-based.
- Jolteon / DiemBFT v4 — a HotStuff variant that drops one phase in exchange for a quadratic (PBFT-style) view-change fallback.
The common thread: reduce communication complexity and unify the view change path.
An observation for practitioners
If your system has fewer than, say, 10 replicas, PBFT is probably fine. The O(n²) factor at n=10 is 100 messages per decision, which is still cheap.
If your system has 20+ replicas — permissioned blockchain territory — you want a linear-BFT variant. HotStuff and its descendants are the default choice.
If your system has more than about 200 replicas, classical BFT starts to stress even linear protocols, and you start to see research into randomized, asynchronous, or DAG-based BFT. That's roughly where we are in 2026.
What HotStuff teaches
- Communication complexity is the binding constraint at scale. Safety alone isn't enough — an algorithm has to be cheap enough to run.
- Pipelining amortizes latency. You can have a 4-phase protocol that delivers decisions at a rate of one per phase.
- Threshold signatures are a structural primitive, not just an optimization. They change what collective agreement "looks like" on the wire.
- Unifying normal case and view change makes implementations tractable.
Next, a stranger branch of the tree: randomized consensus, which sidesteps FLP by flipping coins.
Randomized Consensus
So far we have seen two ways to live with FLP. The first: assume partial synchrony, and you get safety always, liveness under well-behaved networks. Paxos, Raft, VR, PBFT, HotStuff all choose this path.
There is another path. Randomization.
Michael Ben-Or published "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols" in 1983. The title is a joke with an edge. FLP (1985, but circulating as a draft) had just shown that deterministic asynchronous consensus was impossible. Ben-Or's response: fine, then let's not be deterministic. Let each process flip a coin when it can't decide. It turns out that is enough.
Randomized consensus sidesteps FLP entirely. It is safe and live (with probability 1) in pure asynchrony. It does not need partial synchrony, timeouts, or leaders. The price is probabilistic — under worst-case adversarial scheduling, expected number of rounds can be exponential in n for the simplest variants. Modern variants dramatically improve this.
You will rarely see pure randomized consensus deployed standalone in mainstream systems. Its ideas, however, are baked into several modern BFT designs, and understanding it opens your conceptual toolkit.
Ben-Or's algorithm
The setting: N processes, up to f Byzantine, proceeding in rounds (each round, every process sends messages, receives messages, and updates its state). The round structure is a conceptual device, not a timing assumption — Ben-Or's algorithm works in full asynchrony, where "rounds" are just labels that group messages.
Each process starts with a bit v ∈ {0, 1} and wants to agree on some single bit. The algorithm runs in phases, each phase has two rounds:
phase k, round 1:
broadcast <PROPOSE, k, v_i>
wait for N-f <PROPOSE, k, *> messages
if > N/2 of them have the same value w:
broadcast <VOTE, k, w>
else:
broadcast <VOTE, k, ⊥> // "no majority"
phase k, round 2:
wait for N-f <VOTE, k, *> messages
if >= f+1 of them are <VOTE, k, w> with w ≠ ⊥:
decide w (and keep sending VOTE for this w)
else if exists one <VOTE, k, w> with w ≠ ⊥:
v_i := w
else:
v_i := flip_coin() // random bit
The algorithm's skeleton:
- Propose what you currently believe.
- If most of your peers agree, you vote for that value.
- Otherwise, adopt any value you saw a vote for; and if nobody in your received quorum voted for a value at all, flip a coin and pick a new value for the next phase.
Why it terminates
The key insight: if all processes happen to flip to the same coin value in the same phase, the protocol decides in the next round. Each process independently has a 50% chance of flipping 0 vs. 1; the chance of all N landing the same way in a given phase is 2 · (1/2)^N = 2^{1-N} for 2 values. So expected rounds to termination is exponential in N.
That's bad for N = 100. It's tolerable for N = 10.
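The arithmetic behind "bad for N = 100, tolerable for N = 10" — per-phase success probability 2^{1-N}, hence expected 2^{N-1} phases — fits in a few lines:

```go
package main

import (
	"fmt"
	"math"
)

// expectedPhases returns the expected number of phases until all N
// independent fair coins land the same way: per-phase success
// probability is 2·(1/2)^N, so the expectation is its inverse, 2^(N-1).
func expectedPhases(n int) float64 {
	return math.Pow(2, float64(n-1))
}

func main() {
	for _, n := range []int{4, 10, 100} {
		fmt.Printf("N=%3d: ~%.3g expected phases\n", n, expectedPhases(n))
	}
}
```

At N = 10 that's 512 phases — slow but finite. At N = 100 it's ~6×10^29, which is why common coins (next section) are not optional at scale.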
Why it's safe
A value can only be decided in round 2 of some phase. For a decision, the decider saw f+1 VOTEs for some value w. These include at least one honest voter. An honest voter votes for w only if it saw a majority of PROPOSEs for w in round 1. Two different decisions in the same phase would require two disjoint majorities of PROPOSEs — impossible. Decisions in subsequent phases are forced to also be w because, once f+1 VOTEs for w are out, any process in round 2 of the next phase sees at least one VOTE for w (by quorum intersection) and either decides w or adopts it as its next v_i.
Safety does not depend on the coin being fair or even random — it holds no matter what the coin does. Randomness only helps liveness.
Common coins: smarter randomness
The exponential expected rounds of Ben-Or is a problem. The fix: instead of each process flipping its own coin, have all processes share a single coin, unpredictable to the adversary but consistent across processes. If all honest processes see the same bit, and that bit was chosen randomly from the adversary's perspective, a decision happens in one round with probability 1/2 — expected rounds become constant, not exponential.
Shared-coin protocols require cryptographic primitives. The common approach, due to Rabin (1983) and Cachin, Kursawe, and Shoup (2000), uses threshold cryptography:
- There is a secret key split into shares, one per process.
- To reveal the round's coin, at least f+1 processes must contribute their shares.
- Until f+1 shares are collected, the coin value is unknown to anyone — including an adversary who controls up to f of the processes.
A process participating in a round commits to its share; the adversary cannot delay the coin reveal without blocking the whole group; when f+1 shares arrive, the coin reveals deterministically to everyone.
Now each round decides with probability 1/2 (or higher). Expected rounds to termination: 2. Practical randomized BFT becomes viable.
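To see why any f+1 shares suffice, here is a toy Shamir-style sketch over a small prime field. This is emphatically not a secure scheme — a real common coin uses distributed key generation and, e.g., threshold BLS — but it shows the mechanism: shares are evaluations of a degree-f polynomial, and any f+1 of them interpolate the same constant term, from which the coin bit is derived.

```go
package main

import "fmt"

const p = 2089 // small prime; the toy field is GF(p)

func modpow(b, e int) int {
	r := 1
	b %= p
	for ; e > 0; e >>= 1 {
		if e&1 == 1 {
			r = r * b % p
		}
		b = b * b % p
	}
	return r
}

func inv(a int) int { return modpow(a, p-2) } // Fermat inverse, p prime

// share evaluates the secret polynomial at x: the share for process x.
func share(coeffs []int, x int) int {
	y, xp := 0, 1
	for _, c := range coeffs {
		y = (y + c*xp) % p
		xp = xp * x % p
	}
	return y
}

// combine interpolates the polynomial at 0 (Lagrange), recovering the
// constant term — the coin seed — from any f+1 points (xs[i], ys[i]).
func combine(xs, ys []int) int {
	s := 0
	for i := range xs {
		num, den := 1, 1
		for j := range xs {
			if j != i {
				num = num * xs[j] % p
				den = den * ((xs[j]-xs[i])%p + p) % p
			}
		}
		s = (s + ys[i]*num%p*inv(den)) % p
	}
	return s
}

func main() {
	// f = 2: degree-2 polynomial; constant term 1234 is the coin seed.
	coeffs := []int{1234, 17, 99}
	shares := map[int]int{}
	for x := 1; x <= 5; x++ {
		shares[x] = share(coeffs, x)
	}
	// Any 3 shares recover the same seed; the coin is its low bit.
	seed := combine([]int{1, 3, 5}, []int{shares[1], shares[3], shares[5]})
	fmt.Println("seed:", seed, "coin:", seed&1)
}
```

With only f shares, every possible constant term remains consistent with the adversary's view — which is exactly the unpredictability the protocol needs.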
Modern randomized BFT
HoneyBadgerBFT (Miller et al., 2016) is the emblematic modern asynchronous randomized BFT protocol. Its design:
- Each process proposes a batch of transactions.
- A Reliable Broadcast (RBC) primitive ensures that every honest process eventually gets every other process's proposal.
- A Binary Byzantine Agreement (BBA) — using Ben-Or-style randomized consensus with a common coin — runs in parallel for each process's batch, deciding whether to include it.
- Batches that pass BBA are combined (in a canonical order) into the final block.
HoneyBadgerBFT was the first asynchronous BFT protocol with throughput practical enough for real deployment. It relies on no timing assumptions for safety or liveness. Its throughput depends on message and cryptographic overhead; early benchmarks showed tens to hundreds of transactions per second per node in WAN deployments, which, for an asynchronous BFT protocol, was a breakthrough.
Successors include Dumbo and DispersedLedger (both improvements on HoneyBadger's latency), and Speeding Dumbo (further optimizations). These papers are active research as of 2026, and the community is converging on designs that are actually practical asynchronous BFT.
Where randomization shows up in mainstream BFT
Even outside of explicit "randomized BFT" papers, randomness appears subtly in production protocols:
- Random leader selection in some HotStuff variants (to prevent adversarial scheduling of leader slots).
- Random tiebreaking in vote collection (to avoid deterministic tie patterns that an adversary could exploit).
- Randomized timeouts in Raft and PBFT (to break livelock among simultaneous candidates).
None of these rise to "the algorithm is randomized" in Ben-Or's sense. But they borrow the key insight: an adversary is weaker when you insert genuine randomness into the protocol.
Why randomized consensus is rare in production
Given that it's theoretically lovely — safe and live without timing assumptions — why isn't randomized consensus the default?
Several reasons:
- Engineering cost. Threshold cryptography is complex, requires distributed key generation, and adds cryptographic overhead. For most deployments, "partial synchrony + occasional view change" is cheaper.
- Operator intuition. Random algorithms are harder to debug. "Why did this view take 14 rounds to terminate?" — because the coin flips landed against you. That explanation is correct but unsatisfying.
- Network reality. Real networks are partially synchronous the vast majority of the time. You pay for the optimization (asynchronous safety) but rarely cash it in. Deterministic protocols with timeouts are simpler and empirically adequate.
- Throughput. Until recently, randomized BFT was orders of magnitude slower than PBFT/HotStuff in the common case. Modern designs are closing this gap, but classical protocols had a long head start.
It's a case of the theoretically preferable solution losing to the empirically good enough one — a common pattern in distributed systems.
What randomized consensus teaches
- FLP isn't load-bearing in practice. You can route around it with randomness just as easily as with timing assumptions. Knowing both routes is clarifying.
- Coins change the adversary's power. A deterministic protocol lets the adversary schedule messages to keep the protocol undecided; a randomized protocol denies the adversary that power (if the coin is truly unpredictable).
- Threshold cryptography is a gift. It makes "shared randomness" a concrete, implementable primitive. It also underlies HotStuff's QC aggregation — the same tool, used for a different purpose.
- Asynchronous safety is a real, valuable property. Systems that are correct in pure asynchrony (not just partial synchrony) fail gracefully under extreme conditions where partially-synchronous systems do not.
A note on proofs and practice
Randomized consensus proofs are delicate. "Terminates with probability 1" is not the same as "terminates in constant rounds." Some protocols terminate in expected constant rounds but have fat tails (high variance). Others terminate in expected-constant and bounded-variance rounds. If you read a randomized BFT paper, look for the probability claim and the bound on the number of rounds for termination with high probability — those are different guarantees, and the paper may be claiming just one.
The textbook treatment lives in Cachin, Guerraoui, and Rodrigues, Introduction to Reliable and Secure Distributed Programming (2nd ed., 2011), Chapter 5. It's the clearest exposition I know.
Closing the consensus-algorithm tour
With randomized consensus, we've now seen the major branches of the family tree:
- Crash-fault tolerant, deterministic, leader-based. Paxos, Raft, VR.
- Byzantine-fault tolerant, deterministic, leader-based, quadratic. PBFT.
- Byzantine-fault tolerant, deterministic, rotating-leader, linear. HotStuff and descendants.
- Byzantine-fault tolerant, randomized, leaderless. Ben-Or; HoneyBadgerBFT and successors.
Each is the right answer to a particular set of assumptions. No algorithm dominates on all dimensions. The rest of the book turns to practice — what these algorithms look like in systems you actually run.
Real Production Systems
The algorithms in the preceding chapters are the theory. Now: what happens when you run one of them for a living?
This chapter walks through the production systems you are most likely to encounter, what algorithm each uses, the choices that shaped the implementation, and what practitioners have learned about operating them. All claims are dated to 2026; consensus systems evolve, and the landscape will have shifted by the time you read this.
etcd
Algorithm: Raft.
etcd is a distributed key-value store written in Go, originally by the CoreOS team and now a CNCF project. It is best known as the backing store for Kubernetes — the cluster's entire desired-state configuration lives in etcd. When etcd is unhappy, your Kubernetes cluster is unhappy.
Implementation choices:
- raft library split out. The go.etcd.io/raft library is independent of etcd itself — it implements only the Raft state machine, leaving I/O, storage, and networking to the caller. This design is what let CockroachDB, TiKV, and dozens of others adopt Raft quickly: they reused etcd's raft library rather than implementing from scratch.
- Multi-raft (in CockroachDB, TiKV) isn't in etcd itself. etcd is single-group Raft; if you want to horizontally scale, you go elsewhere.
- MVCC with persisted log. The append-only log and the MVCC key-value state live together. Reads are linearizable by default (the leader confirms its leadership with a quorum round-trip); a serializable option serves faster but possibly stale reads from any member.
- Compaction. History is retained for a configurable window; old revisions are compacted on a schedule. A running cluster has to keep up with compaction or the log fills disk.
Operational lessons:
- Three-node clusters are fragile. They tolerate one failure, but during maintenance (rolling updates), you have no failure tolerance at all. Five-node clusters are the production norm for anything important. Kubernetes documentation itself recommends five for larger clusters.
- Disk latency is everything. Raft does an fsync per committed entry. If your disk has 10ms commit latency, your writes have 10ms commit latency. SSDs, preferably NVMe. Avoid networked block storage with bad tail latency.
- Don't run on spinning disks. Really.
- Watch carefully at scale. Kubernetes' etcd guidance recommends keeping etcd's data size under a few gigabytes; beyond that, leader elections and snapshots get painful. Periodic defragmentation is part of ops.
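The three-vs-five argument is just quorum arithmetic, worth writing down once:

```go
package main

import "fmt"

// tolerated returns how many members a crash-fault Raft cluster of
// size n can lose while still keeping a quorum of (n/2)+1.
func tolerated(n int) int {
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 5} {
		// During a rolling update, one member is already down.
		fmt.Printf("n=%d: tolerates %d failures; headroom during maintenance: %d\n",
			n, tolerated(n), tolerated(n)-1)
	}
}
```

A three-node cluster has zero headroom the moment one node is down for maintenance; a five-node cluster can still survive one surprise.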
Known incidents: the Kubernetes documentation and various etcd operator guides document etcd failure modes that real clusters hit. Post-mortems on Kubernetes clusters losing etcd quorum are a recurring genre.
Consul
Algorithm: Raft.
Consul by HashiCorp provides service discovery, health checking, KV storage, and service mesh functions. Its central data store is Raft-replicated.
Implementation choices:
- Its own Raft implementation. The hashicorp/raft Go library is independent of etcd's — feature-compatible, different code. HashiCorp's library is also widely used beyond Consul.
- Servers and clients. Consul has two roles: servers (participate in Raft) and clients (agents that connect to servers, do health checks, register services). Most Consul deployments have 3 or 5 servers and many clients.
- Multi-datacenter. Consul supports federation across datacenters, with each datacenter running its own Raft cluster and cross-datacenter gossip for service discovery.
- Network coordinates. An interesting non-Raft bit: Consul maintains a Vivaldi network coordinate for each node, used to estimate RTTs for nearest-neighbor routing. Not consensus, but worth mentioning as an example of the kinds of features that sit on top of consensus.
Operational lessons:
- Leader election is sensitive to network hiccups. Cross-region Consul federations have seen leader flapping under WAN latency variability. The answer is usually "don't stretch a single Raft cluster across regions."
- Gossip isn't consensus. Consul's cross-datacenter membership uses a gossip protocol (Serf), which is eventually consistent — don't treat it as if it were.
ZooKeeper
Algorithm: Zab (ZooKeeper Atomic Broadcast).
Apache ZooKeeper is the elder statesman — released by Yahoo in 2007, with roots in the Chubby lock service work at Google. It predates Raft and was the default "consensus thing" for a decade.
Zab is a Paxos cousin — not quite Paxos, not Raft, but in the same family. Its normal-case operation is a two-phase protocol with a leader; its leader-election protocol is distinct but conceptually similar to view change. Zab's explicit goal is total order on broadcast messages, not general-purpose consensus — though the difference is small.
Implementation choices:
- JVM. ZooKeeper is Java. Bringing a JVM dependency into your infrastructure is a nontrivial choice.
- ZNodes. The data model is a hierarchical namespace of "znodes," each with a small payload. Watches on znodes are a key feature: clients register interest, get notified of changes.
- Sessions and ephemeral nodes. A client holds a session; ephemeral znodes exist only while the session is active. This is how ZooKeeper underlies distributed locking and leader election for other systems.
- Observers. Non-voting replicas that serve reads without participating in consensus. Useful for scaling reads without inflating the quorum.
Operational lessons:
- ZooKeeper was complicated to run. You needed to tune JVM heap, GC, session timeouts, transaction log sync, snapshot timing — and different applications wanted different configurations.
- Clients had bugs that looked like ZooKeeper bugs. A common production pattern: app hits some weird behavior, blames ZooKeeper, actually the client library's session handling.
- It worked. For all the complexity, ZooKeeper clusters ran for years in production. It's not fashionable anymore, but it's not broken.
Kafka: ZooKeeper to KRaft
For its first decade, Apache Kafka used ZooKeeper to store cluster metadata — topic configurations, partition assignments, broker membership. This coupling was a major operational burden: you had to run (and understand) ZooKeeper just to run Kafka.
In 2020, KIP-500 proposed replacing ZooKeeper with a built-in Raft quorum — KRaft. The idea: Kafka already knew how to replicate logs (that's what Kafka is); the ZooKeeper dependency was a historical accident. Using an internal Raft quorum to store metadata eliminates the external dependency.
- KRaft became production-ready around Kafka 3.3 (2022). ZooKeeper-based Kafka was formally deprecated; Kafka 4.0 (2025) removed ZooKeeper support entirely.
- The metadata quorum in KRaft is usually 3 or 5 controllers, separate from the data brokers.
- The transition was multi-year and had extensive compatibility concerns.
The KRaft migration story is an interesting example of a system de-coupling from consensus: Kafka went from "ZooKeeper is a hard dependency" to "we embed our own Raft." Many large users preferred the simpler operational model — one system to understand, not two.
Google Spanner (Paxos + TrueTime)
Spanner is Google's globally-distributed, externally-consistent database. Its consensus mechanism is Paxos — but the interesting bit is the clock.
Spanner's shards are each a Paxos group (typically 5 replicas across regions). Transactions are ordered using TrueTime — an API that returns a time interval [earliest, latest] guaranteed to contain the true current time. Google achieves tight TrueTime bounds (a few milliseconds) with GPS and atomic clocks in every datacenter.
Why this matters: with reliable time, Spanner can commit-wait to ensure that timestamp ordering respects real-time ordering. External consistency (also called "strict serializability") comes out the far side.
Implementation specifics:
- Paxos is used within each shard for replication.
- Two-phase commit across shards (for multi-shard transactions).
- TrueTime is the novel piece — GPS + atomic clocks + a custom API returning intervals, not points.
Spanner's architecture is unusual, and Google made the hardware investment to make it work. Knock-offs (CockroachDB, YugabyteDB, TiDB) use variations of Paxos/Raft without TrueTime-level clocks, trading external consistency for operational simplicity.
CockroachDB
Algorithm: Multi-Raft.
CockroachDB is a SQL database inspired by Spanner but without TrueTime. Instead of TrueTime, it uses HLC (Hybrid Logical Clocks) and a "commit wait" equivalent that is looser than Spanner's.
- Ranges. Data is split into 512 MB (by default) ranges; each range is a Raft group.
- Multi-raft. A single node participates in many Raft groups — thousands, at scale — one per range-replica. The go.etcd.io/raft library is the foundation.
- Leaseholder. Each range has a leaseholder: a replica that serves reads and coordinates writes without a Raft round-trip for every read. Leases are time-bounded.
Engineering lessons:
- Multi-Raft at scale is engineering-heavy. Thousands of Raft groups means thousands of heartbeats, leader elections, snapshots. CockroachDB has invested heavily in heartbeat coalescing and snapshot streaming.
- Leaseholder placement matters. If your reads are cross-region but the leaseholder is somewhere else, you pay WAN latency. Follower reads (with bounded staleness) are an optimization here.
- Range splits and merges. As data grows or shrinks, ranges are dynamically split/merged. This is a consensus-level operation and needs care.
FoundationDB
Algorithm: Custom — related to Paxos, but with a distinct architecture.
FoundationDB, originally a startup, now an Apple project and open source, is an ordered key-value store with strict serializability. Its architecture separates:
- Coordinators — a small, stable set that run consensus over cluster configuration.
- Proxies — accept client writes.
- Resolvers — detect conflicts.
- Log servers — durable log.
- Storage servers — serve reads.
The design emphasizes testability. FoundationDB ships a deterministic simulator that can replay the entire cluster under adversarial timing and failure injection, enabling a level of development confidence unusual in the field.
Aerospike, Cassandra, DynamoDB: the eventually-consistent alternative
Not every distributed database wants strict consistency.
- Cassandra (Dynamo-style, AP) — tunable consistency, eventually consistent by default. No consensus per write in the strict sense; reads and writes use quorum counts to approximate consistency. QUORUM reads plus QUORUM writes give you overlap at the cost of write latency, but still not linearizability.
- DynamoDB (AWS) — publicly describes itself as highly available and partition-tolerant, with a choice of strong or eventual consistency on reads. Internal architecture uses Paxos for leader election of partitions but is not a traditional SMR system.
- Aerospike — focuses on very-low-latency access, with its own consistency mechanisms that can be configured for "strong consistency" (using internal consensus for partition assignment) or for eventual consistency (trading consistency for latency).
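The quorum-overlap arithmetic behind tunable consistency is easy to check by brute force. A small Python sketch (illustrative, not any database's code):

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True iff every possible write set of size w intersects every
    possible read set of size r — the R + W > N condition, checked
    exhaustively over an n-node ring."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_overlap(5, w=3, r=3)       # QUORUM/QUORUM: overlap guaranteed
assert not quorums_overlap(5, w=1, r=1)   # ONE/ONE: a read can miss the write
assert quorums_overlap(5, w=5, r=1)       # ALL/ONE: overlap, but fragile writes
```

Overlap guarantees a read contacts some replica that saw the latest write; without consensus-style ordering on top, that is still weaker than linearizability.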
The tradeoff these systems accept: consistency is something you dial up or down per operation, not something the database guarantees globally. For high-throughput workloads where occasional inconsistencies are manageable by the application, this is a very reasonable choice.
This is not the same tradeoff as "use a Raft-backed database and accept the latency." It is a different model, one where linearizability is not on the table even in the happy path. For a social media feed, fine. For your bank account, probably not.
Observations on production reality
After walking through these systems, a few patterns emerge:
- Nearly all mainstream systems are CFT, not BFT. etcd, Consul, ZooKeeper, Kafka/KRaft, Spanner, CockroachDB, FoundationDB — all crash-fault tolerant, not Byzantine. BFT is reserved for permissioned blockchains (Chapter 12) and some government/defense systems.
- Raft dominates new designs. Among systems designed post-2014, Raft is the default. Paxos variants persist in older systems (ZooKeeper/Zab, Spanner) and in places where the team had a specific reason to go custom.
- Multi-raft is the scaling pattern. Single-group consensus doesn't scale horizontally; partitioning your data and running one consensus group per partition does. The complexity is in managing the many groups coherently.
- Storage and networking matter more than the algorithm. A Raft cluster with bad disks is a slow Raft cluster. A Paxos cluster with saturated NICs is a slow Paxos cluster. The algorithm is often the cheapest part of the system's latency budget.
- Observability tooling lags. Operators of these systems are often flying blind on internal consensus state. Better introspection — "why did leader election take 4 seconds?" — is ongoing work.
We turn next to the systems that call themselves blockchains but look a lot like the systems in this chapter.
Permissioned Blockchains and Where They Fit
Let's be honest about the genre.
A permissioned blockchain is a distributed ledger where participation is restricted — you have to be authorized to run a node or submit transactions. Every participant is known, authenticated, and part of some governance arrangement. The participants may not fully trust each other (otherwise why bother), but they are not anonymous and not pseudonymous. They are named organizations with signed agreements.
This is, mechanically, the classical BFT problem. You have a known set of N nodes, some of which may be Byzantine, and you want to totally order a sequence of transactions. The answer is PBFT, HotStuff, or a close relative. We have met all of them.
The open question is: does calling this system a "blockchain" help, or is it marketing paint on a distributed database?
This chapter tries to answer that with specifics rather than polemic.
The cases
The major permissioned-blockchain platforms, as of 2026:
Hyperledger Fabric
Developed under the Linux Foundation's Hyperledger project, originally contributed by IBM. Fabric's architecture is explicitly modular:
- Peers execute chaincode (smart contracts) and hold ledger state.
- Orderers form a consensus group that orders transactions.
- The consensus algorithm in the ordering service is pluggable — historically Kafka (which is itself not BFT), then Raft (CFT), and more recently BFT variants (SmartBFT, a PBFT descendant).
Why the modularity? Because Fabric's authors recognized that "consensus on transaction order" is the classical SMR problem, and they let you pick your favorite algorithm. The rest of Fabric — smart-contract execution, privacy via channels, membership service providers — is built around the ordering service.
For most Fabric deployments, the chosen ordering algorithm is Raft (CFT). The consortium members trust each other enough that Byzantine tolerance isn't needed; they just want fault tolerance against crashes. This is telling: even in the "permissioned blockchain" space, CFT is often the pragmatic choice.
R3 Corda
Corda is a different shape. Rather than replicating the ledger to every participant, each transaction is only visible to the parties involved. Consensus is reached via notary services — each notary is a Paxos, Raft, or BFT cluster that stamps uniqueness on transactions (preventing double-spending of states).
Corda's insight: in a financial consortium, you don't want every participant to see every transaction. Privacy is a first-class concern. The "blockchain" label is somewhat misleading for Corda — it's really a distributed ledger with pairwise-visible transactions and external uniqueness services. But the notary services are BFT or CFT consensus groups under the hood.
Ethereum's permissioned variants
Quorum (originally by JPMorgan, later acquired by ConsenSys) is a permissioned fork of Ethereum with pluggable consensus — originally Istanbul BFT (IBFT, a PBFT descendant), later updated.
Besu (another Ethereum client, Apache-licensed, maintained as Hyperledger Besu) supports IBFT and QBFT consensus in addition to standard Ethereum PoW/PoS, and appears in various enterprise settings.
These use Ethereum's execution layer (EVM, smart contracts, accounts) but replace the Nakamoto consensus with classical BFT. Why? Because Nakamoto consensus is expensive, slow to finality, and unnecessary when participants are known.
Other deployments
- Interbank networks (e.g., the former Libra/Diem project, various bank consortia) — typically use HotStuff variants.
- Supply chain platforms — often Fabric-based, with chosen consensus depending on trust requirements.
- Tokenized asset platforms — varies widely; some are Fabric, some are Quorum, some are proprietary.
Almost invariably, the consensus algorithm in these systems is one we've met: Raft, PBFT, HotStuff, or a close descendant.
What the blockchain framing adds
Three things that can genuinely matter, even in a permissioned setting:
- Smart contracts / executable state transitions. If the application requires automated execution of business logic with agreement on outcomes, the blockchain model offers a clean abstraction. This is not inherent to consensus — you could bolt smart contracts onto any SMR system — but blockchain platforms come with the tooling.
- Tamper-evident log + cryptographic audit. Blockchains chain blocks via hashes. Even if the consensus algorithm doesn't care about this (a Raft log works fine without hashes), the hash chain is a forensic feature: if you want to prove to a regulator in 2030 that a certain transaction was ordered at a certain time, signed blocks with hash links are a nice artifact.
- Token / asset semantics. If the application is about transferring assets with strict conservation rules, the UTXO or account-based models of blockchain platforms are well-developed. Reimplementing them on a Raft-backed database is possible but reinventing a wheel.
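The tamper-evidence property of a hash-chained log is cheap to demonstrate. A toy Python sketch (illustrative; real platforms additionally sign blocks and use Merkle trees):

```python
import hashlib, json

def chain(entries):
    """Build a hash-chained log: each record embeds the hash of its
    predecessor, so altering any entry invalidates every later link."""
    prev = "0" * 64  # genesis
    log = []
    for e in entries:
        record = {"prev": prev, "entry": e}
        prev = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        log.append((record, prev))
    return log

def verify(log):
    prev = "0" * 64
    for record, digest in log:
        if record["prev"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        if recomputed != digest:
            return False
        prev = digest
    return True

log = chain(["alice->bob:10", "bob->carol:3"])
assert verify(log)
log[0][0]["entry"] = "alice->bob:9999"   # tamper with history
assert not verify(log)                   # the forgery is detectable
```

Note that a plain Raft log could carry the same hash links; the chaining is a forensic feature layered on consensus, not part of consensus itself.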
If none of these three matter for your application, the "blockchain" label is mostly aesthetic. You could replace the system with, say, Consul or etcd plus application logic and get a simpler, faster, cheaper stack.
What the blockchain framing adds that is actually harmful
To be fair in the other direction:
- Vocabulary mismatch with operators. Operators of Raft/Paxos clusters have decades of shared vocabulary, runbooks, debugging tools, monitoring practices. Permissioned-blockchain teams sometimes reinvent these from scratch in blockchain terminology. A "validator timeout" is a leader election timeout; a "stuck block" is a failed view change. Rebranding doesn't help new operators.
- Performance budget for features that are not needed. Hash-chaining every block, signing every transaction, maintaining Merkle trees — these have cost. If your application doesn't need cryptographic audit trails at the block level, you're paying for a feature you don't use.
- Regulatory cargo-culting. "Putting it on the blockchain" sometimes appears in compliance conversations as if it solves problems it doesn't — a blockchain won't prevent a compromised operator from submitting false transactions, just like a database won't.
The honest question
When a colleague proposes a permissioned blockchain, the productive question is: "What does the blockchain framing give us that a distributed database with signed transactions wouldn't?"
Good answers:
- We need to execute smart contracts in a standardized language with tooling our auditors can inspect.
- We need tamper-evident logging for regulatory reasons, with cryptographic proofs verifiable by outside parties.
- We already have blockchain infrastructure and the marginal cost of adding this workload to it is low.
- We expect the set of participants to change over time in ways that benefit from explicit on-chain governance.
Bad answers:
- Blockchains are the future.
- It's a buzzword we need for funding.
- It sounds better in the press release.
Neither list is exhaustive, but the distinction is roughly: does the blockchain framing earn its cost?
A brief note on "DLT"
"Distributed Ledger Technology" (DLT) is the umbrella term favored by people who want the benefits of the blockchain framing without some of its baggage. In practice, DLT includes:
- Public blockchains (Bitcoin, Ethereum) — out of scope here.
- Permissioned blockchains (Fabric, Corda, Quorum) — covered above.
- Distributed databases positioned as DLT (various startups).
The term is broad enough to be not very useful. If someone says "DLT," the right follow-up is "what's your consensus algorithm?" If the answer is Raft, you're running a distributed database. If the answer is HotStuff, you're running a classical BFT system. Call it what it is.
A practical compass
Here's a rough decision procedure:
Is membership open to anyone?
├── Yes → You probably want a public blockchain. Out of this book's scope.
└── No (permissioned)
    │
    Do you trust all operators fully?
    ├── Yes → CFT consensus (Raft, etc). Not a blockchain, just
    │         a distributed database.
    └── No (some operators might be malicious)
        │
        Do you need smart contracts, tamper-evident logs, or
        token semantics as core features?
        ├── Yes → Permissioned blockchain (Fabric, Quorum, etc).
        │         The blockchain framing earns its cost.
        └── No → BFT consensus (PBFT, HotStuff) over whatever
                 data model you already have. The blockchain
                 framing would be overhead.
This is a sketch, not a law. Real decisions involve team expertise, regulatory context, vendor relationships, and operational preferences. But the core question — does the framing earn its cost? — is the useful one.
Fair summary
Permissioned blockchains are a legitimate corner of the distributed systems space. They exist because some applications genuinely benefit from the combination of classical BFT consensus + smart-contract execution + tamper-evident audit trails, wrapped in a consistent platform.
They are also oversold, over-applied, and often confused with public blockchains by stakeholders who don't care about the difference. An engineer who understands classical consensus can cut through this confusion efficiently: ask what algorithm is under the hood, what the trust model is, and what the blockchain framing actually provides. Most of the time, you'll find yourself back in the territory of the previous eleven chapters.
With that out of the way, let's turn to what actually goes wrong in systems using these algorithms.
Liveness, Safety, and the Things That Go Wrong
In distributed systems, correctness has two faces.
- Safety — nothing bad ever happens. No two nodes commit conflicting values; no committed write is lost; no client sees state that violates the contract.
- Liveness — something good eventually happens. Requests eventually get responses; leaders eventually get elected; snapshots eventually get taken.
These are orthogonal. Safety alone is easy (never decide anything, and you never decide wrong). Liveness alone is easy (decide something fast, correctness be damned). Systems are hard because they want both at once.
This chapter is about where real systems fail on each axis — with enough concreteness that you recognize the shapes when you see them in your own logs.
Safety vs. liveness, operationalized
Leslie Lamport, again, gave us the formal distinction:
- A safety property is one whose violation can be observed in a finite execution. If a system ever has two leaders at the same term, that's a safety violation — a specific bad moment in time.
- A liveness property is one whose violation requires an infinite execution to observe. "The cluster eventually elects a leader" — you can't rule it out by looking at any finite run; maybe it just needs more time.
In practice, safety violations are worse. A safety violation is a bug that corrupted state or let inconsistent data through. It cannot be fixed by waiting; someone has to reconcile. A liveness failure is a stall — bad, but recoverable by the system resuming forward motion.
Production systems prioritize safety. An etcd cluster that stalls for 30 seconds during a bad election is annoying; an etcd cluster that lets two clients both win a lock is catastrophic. Every consensus algorithm in this book is built to sacrifice liveness before safety.
The common failure modes
Split brain
Two leaders believe they are in charge simultaneously and both serve writes. Causes:
- Stale leader that hasn't yet discovered a new term / view.
- Network partition where both sides elected a leader.
- Clock skew fooling a leader lease into looking valid when it's not.
All modern consensus algorithms have specific defenses:
- Quorum intersection. A partition can only elect a leader on a side that has a majority. The other side can't. So at most one "real" leader at any term, even if the old one is still alive and unaware.
- Term / view numbers. The stale leader's RPCs are rejected because they carry the old term. Once it hears about the new term, it steps down.
- Lease fences. If a leader relies on a time-bounded lease, the lease expires before a new leader can be elected (by design), so the old one stops serving.
Split brain in a correctly implemented Raft is impossible. Split brain in a Raft that skips fsync or mishandles votes is absolutely possible. This is why the boring implementation details from Chapter 5 matter.
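The term defense fits in a few lines. A toy sketch of Raft's rule (class and method names here are illustrative, not any library's API):

```python
class Node:
    """Sketch of Raft's term rule: an RPC carrying a lower term is
    rejected; a higher term is adopted immediately."""
    def __init__(self):
        self.current_term = 0

    def handle_append_entries(self, leader_term):
        if leader_term < self.current_term:
            # Stale leader: the reply carries our higher term, and
            # on seeing it the old leader steps down.
            return {"term": self.current_term, "success": False}
        self.current_term = leader_term
        return {"term": self.current_term, "success": True}

n = Node()
assert n.handle_append_entries(3)["success"]        # term-3 leader accepted
assert n.handle_append_entries(4)["success"]        # new leader, higher term
reply = n.handle_append_entries(3)                  # deposed leader returns
assert not reply["success"] and reply["term"] == 4  # rejected; must step down
```

Because a majority of nodes learned term 4 before the new leader could serve anything, the stale term-3 leader cannot assemble a quorum anywhere.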
The stale read
Related but distinct. A leader thinks it's still the leader, serves a read from its local state, but has actually been superseded. The returned data is stale.
Defenses:
- Read index. Leader sends a heartbeat to confirm it still holds majority, then serves the read.
- Leader lease + synchronized clocks. Leader holds a lease; during the lease it's safe to read locally. Needs clock safety.
- Read from quorum. Read from majority; take the most recent. (Paxos-style read.)
Choosing here is a latency-vs-consistency dial. Production systems differ.
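The read-index path is worth seeing end to end. A self-contained Python sketch with stub classes (all names here are invented for illustration; real implementations batch heartbeats across many pending reads):

```python
class Peer:
    """Stub follower; ok=False models an unreachable node."""
    def __init__(self, ok=True):
        self.ok = ok
    def heartbeat(self, term):
        return self.ok

class Leader:
    def __init__(self, followers):
        self.term = 5
        self.followers = followers
        self.commit_index = 10   # highest index known committed
        self.applied_index = 8   # state machine lags slightly
    def apply_next(self):
        self.applied_index += 1

def read_index_read(leader, kv_store):
    """Raft's read-index recipe: snapshot the commit index, confirm we
    still hold a majority via one heartbeat round, wait for the state
    machine to apply up to that index, then serve locally — no log
    entry is appended for the read."""
    read_index = leader.commit_index
    acks = 1 + sum(1 for f in leader.followers if f.heartbeat(leader.term))
    if acks <= (1 + len(leader.followers)) // 2:
        raise RuntimeError("lost leadership; client should retry")
    while leader.applied_index < read_index:
        leader.apply_next()
    return kv_store

leader = Leader([Peer(), Peer(), Peer(ok=False), Peer(ok=False)])  # 5 nodes
assert read_index_read(leader, {"lock": "client-42"}) == {"lock": "client-42"}
assert leader.applied_index == 10   # caught up to the recorded commit index
```

The heartbeat round is what makes the read linearizable: a deposed leader cannot gather the acks, so it fails the read instead of returning stale data.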
Leader flapping
Leaders keep getting elected and then losing their position in quick succession. Causes:
- Election timeout tuned too aggressively for the network's variability.
- Partial partitions where different subsets of nodes see different "reachable" leaders.
- GC pauses on leaders (JVM systems especially) making them look dead for a moment, triggering an election, then coming back.
Symptoms: log churn, write latency spikes, reduced throughput. The cluster is technically correct — each elected leader is valid — but each election costs real time and the system never stabilizes.
Fixes:
- Longer election timeouts (at the cost of slower recovery from real failures).
- Leader lease-based heartbeats (makes brief unreachability less likely to cause elections).
- Better GC tuning.
- Network fixes (partial partitions need dedicated effort).
Cascading failure
One node fails; the remaining nodes pile on load; a second node fails under the load; the quorum breaks; the cluster stalls. The classic "thundering herd" variant.
Cascading failures in consensus systems tend to involve:
- Retrying clients hammering the new leader.
- Snapshot transfers overloading the network of a newly-caught-up replica.
- Unbounded memory growth in leader election storms.
Defenses:
- Backpressure. The leader should refuse requests when overloaded, not queue them.
- Rate limiting on catch-up. Snapshot streaming should respect bandwidth budgets.
- Jitter. Clients retrying a failed write should add randomized delay.
- Capacity planning. The cluster should be sized so that N-1 nodes can handle peak load with headroom — not N at redline.
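The jitter defense is usually some flavor of randomized exponential backoff. A sketch of the "full jitter" scheme (one common choice, popularized by an AWS Architecture Blog post; the constants here are illustrative):

```python
import random

def backoff_with_jitter(attempt, base=0.05, cap=5.0):
    """Full-jitter retry delay: exponential backoff, then a uniform
    random draw, so a herd of retrying clients spreads out instead of
    re-arriving in lockstep at the newly elected leader."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

delays = [backoff_with_jitter(a) for a in range(8)]
assert all(0 <= d <= 5.0 for d in delays)
assert backoff_with_jitter(0) <= 0.05     # first retry stays fast
assert backoff_with_jitter(20) <= 5.0     # capped, never unbounded
```

Without the random draw, every client that failed at the same instant retries at the same instant — the exact pile-on that turns one failure into a cascade.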
Clock skew
Spanner makes clock skew a first-class concern. Most other systems assume clocks are "close enough" and are surprised when they aren't.
- NTP skew in most cloud environments is sub-10ms but occasionally jumps.
- Clock jumps happen: VM migrations, NTP corrections, leap-second handling gone wrong.
- Clocks running fast on one node vs. another — rare but real.
Failure modes:
- A leader's lease looks longer than it actually is; a new leader is elected before the old one has stepped down.
- Timeouts fire early or late.
- Audit logs claim impossible orderings.
If your algorithm claims safety without reliance on clocks (Raft/Paxos/etc proper), clock skew should only cause liveness issues — leader flapping, slow elections. If safety depends on clocks, you are one VM migration away from a split brain.
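To see how lease-based designs budget for skew, consider a deliberately simplified drift model (an assumption of this sketch, not any system's documented formula): every clock's rate stays within a factor of (1 ± drift) of true time, and RTTs and pauses are ignored.

```python
def min_safe_wait(lease_duration, drift):
    """Minimum nominal wait before a new leader may serve, in a toy
    model where clock rates are within (1 ± drift) of true time.
    The old leader's slow clock can stretch its real hold to
    lease_duration / (1 - drift); the candidate's fast clock can
    shrink its real wait by (1 + drift) — the wait must cover both."""
    return lease_duration * (1 + drift) / (1 - drift)

# A 10 s lease with clocks trusted to 0.1% drift: a modest margin.
assert abs(min_safe_wait(10.0, drift=0.001) - 10.0 * 1.001 / 0.999) < 1e-9
assert min_safe_wait(10.0, drift=0.001) > 10.0
# With sloppy 5% clocks, the margin balloons past a full extra second.
assert min_safe_wait(10.0, drift=0.05) > 11.0
```

The qualitative lesson survives the simplification: the worse your drift bound, the longer the new leader must sit idle, and if the real drift ever exceeds the assumed bound, the lease is a split-brain waiting to happen.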
Disk failure and "fast" fsync
The quiet safety killer.
- fsync is expensive; under load, implementers are tempted to batch or skip it.
- Cloud block storage (EBS, GCE persistent disks) has its own journey from "application fsync" to "bytes durable," and various failure modes in between.
- A storage device that lies about its sync behavior (some consumer SSDs do this, under the rubric of "write caching") violates the protocol's assumptions.
If a disk returns from fsync before the data is durable, and the node then crashes, the on-disk log is not what the protocol thinks it is. On recovery, the node may contradict its own past actions — a Byzantine failure from the protocol's perspective, even though no node is malicious.
Jepsen has caught several products doing this. The general guidance:
- Use fsync (or fdatasync) religiously on the critical path.
- Understand your storage layer's durability semantics.
- Run Jepsen-like tests against your storage choices.
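The critical-path rule looks like this in practice. A minimal Python sketch of a durable log append (illustrative only; a production log also fsyncs the parent directory after creating the file, and batches syncs across entries to amortize the cost):

```python
import os
import tempfile

def append_durably(fd, record):
    """Append a log record and fsync before acknowledging. Returning
    before fsync completes is exactly the 'fast fsync' lie: the
    protocol believes the entry is durable while it may still sit
    in a volatile cache."""
    os.write(fd, record + b"\n")
    os.fsync(fd)   # do not acknowledge the client before this returns

fd = os.open(os.path.join(tempfile.mkdtemp(), "raft.log"),
             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
append_durably(fd, b"term=7 index=42 op=put(k,v)")
os.close(fd)
```

Even this is only as good as the device underneath: if the drive acknowledges the flush from its write cache, the fsync call succeeds and durability is still a fiction.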
The "I'm healthy" lie
Node A is up but can't reach node B. Node B is up but can't reach A. Both report themselves healthy to external monitoring. The cluster is reporting no failures but not making progress.
This is the partial partition problem. Health checks are usually "am I up and responsive to the check"; they don't tell you "am I able to participate in consensus." Sophisticated monitoring tests consensus progress directly — "is the leader making forward progress?" — which is better, if you have it.
Real post-mortems that illuminate the theory
MongoDB rollback incidents
MongoDB's replication layer has been the subject of several Jepsen reports (jepsen.io/analyses/mongodb-4.2.6 among them) showing scenarios where committed writes could be lost under specific failure patterns and configurations. The fixes in subsequent versions bring it closer to conventional Raft-like behavior, but the history is a good reminder that "we have a replication protocol" is not the same as "it's correct under adversarial failures."
etcd disk latency incidents (Kubernetes outages)
Various Kubernetes outage write-ups attribute cluster-wide failures to etcd struggling under disk latency — slow commits cause slow API server responses, which cause slow controller reconciliations, which cascade into application-layer problems. The root cause in these incidents is rarely a bug in etcd's Raft; it's that the operational assumptions about disk performance were not met. (Look at public Kubernetes post-mortems and vendor blog posts for specific cases.)
The great GitHub MySQL outage of 2018
Not a consensus system exactly, but instructive. A brief network partition triggered an automated failover; the cross-region replication state was not what the failover system assumed; inconsistencies ensued. GitHub's public post-mortem documented 24 hours of degraded service for what started as a 43-second network blip.
Lessons from this kind of incident:
- Automated failover is safety-critical and can make things worse if the model is wrong.
- Cross-region replication is partial synchrony at best, and tooling should know that.
- Post-incident cleanup (reconciling the diverged state) took vastly longer than the triggering event.
Kyle Kingsbury's Jepsen series
Kingsbury's Jepsen testing (jepsen.io) has examined essentially every major distributed system — Riak, MongoDB, Elasticsearch, Aerospike, VoltDB, CockroachDB, YugabyteDB, etcd, Consul, ZooKeeper, Kafka, Redis, and many more — and found consistency bugs in most of them. (FoundationDB is a notable absence: its own deterministic simulation testing was judged more rigorous than anything Jepsen would add.)
The common themes across Jepsen reports:
- Claims of linearizability that are violated under specific timing.
- Default settings that silently lose data under network partitions.
- Retry and deduplication logic with edge cases.
- Storage layer assumptions that break in cloud environments.
Every practitioner should read 5–10 Jepsen reports cover to cover. They teach the gap between "we implemented Raft" and "our system is safe" better than any textbook chapter.
What to take from this chapter
- Safety and liveness are different properties. Safety violations corrupt data; liveness failures stall progress. Systems are designed to sacrifice liveness before safety.
- Most production issues are operational, not algorithmic. The algorithm is usually correct on paper. The deployment, the tuning, and the environment are where the bugs live.
- The same shapes recur. Split brain, stale read, leader flap, cascading failure, clock skew, disk lies. Recognize them early; instrument for them; test for them.
- Jepsen is the practitioner's school. Read reports. Apply the mindset to your own systems.
With safety, liveness, and failure modes in hand, we can finally turn to the practical question: given all this, how do you pick a consensus algorithm?
Choosing a Consensus Algorithm
You've read thirteen chapters of theory and history. Someone walks up to your desk and says: "We need to replicate this service across three datacenters. What algorithm do we use?"
This chapter is the answer. Or rather, it is a procedure for producing an answer that is specific to your situation.
Start with three questions
Before picking an algorithm, answer:
- Who are the participants, and do you trust them?
- What is the blast radius of a failure — how strong does consistency need to be?
- What is the network between participants — LAN, WAN, cross-continent?
These three questions define the shape of the design space. Most bad consensus decisions come from picking an algorithm before answering them.
Who are the participants?
- Single organization, single deployment: you trust the operators. CFT is enough.
- Single organization, multiple teams/regions: you mostly trust the operators but have process isolation concerns. CFT with good monitoring is usually enough.
- Multiple organizations, known and contractually bound: permissioned setting. CFT may still be enough if the contract and incentives align. BFT if individual nodes could be compromised or act adversarially.
- Open participation, pseudonymous: you are outside this book. Go read about Nakamoto consensus, proof-of-stake, and the economic-security literature.
A surprising share of real systems can honestly answer "single organization, single deployment." For those, do not use BFT. It costs more in nodes, messages, and operational complexity without buying real benefit.
Consistency requirements
Strong consistency (linearizability) via consensus is expensive. Decide if you need it:
- Linearizable: operations appear instantaneous, in a real-time-respecting order. The model most "consensus databases" promise. Costs: one round-trip per write, latency pauses during partitions.
- Sequential / serializable: total order, but not necessarily respecting real-time across clients. Cheaper than linearizable.
- Causal: preserves happens-before. Good enough for many collaborative applications (document editing, chat).
- Eventual: reads might be stale; writes eventually propagate; conflicts resolved by CRDTs or application-level rules. Cheapest; highest throughput; hardest to reason about.
If eventual consistency is enough, you may not need consensus at all. Dynamo-style systems (Cassandra, Riak, DynamoDB in default modes) are the alternative. Consensus-based systems trade latency and availability for stronger guarantees. Know which you need.
Network realities
- Within a datacenter / LAN: single-digit ms RTTs, low variance. Raft or Paxos with short timeouts work well.
- Cross-AZ within a region: still single-digit ms, some variance. Fine for most consensus systems.
- Cross-region, same continent: 30–100ms RTT. Consensus still viable but every write pays this once. Consider whether writes need to synchronously hit multiple regions.
- Cross-continent: 100+ ms RTT. Cross-continental consensus for every write is painful. Consider sharding by region, using follower reads, or accepting asynchronous replication.
A five-node cluster where two nodes are in Sydney and three in New York is a cluster where every decision needs a Pacific crossing. Pay attention to where quorums live.
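You can estimate this before deploying anything. A rough Python model (illustrative only; it ignores disk latency, pipelining, and follower fan-out, and the function name is invented):

```python
def commit_latency_ms(leader_rtts_ms):
    """Rough per-write commit latency for a single-leader consensus
    group: the leader waits for acks from a majority, so the write
    completes with the (quorum-1)-th fastest follower round trip
    (the leader's own durable append counts as one vote)."""
    n = len(leader_rtts_ms) + 1            # followers plus the leader
    quorum = n // 2 + 1
    needed_from_followers = quorum - 1
    return sorted(leader_rtts_ms)[needed_from_followers - 1]

# Leader in New York with followers in NY, NY, Sydney, Sydney:
assert commit_latency_ms([1, 1, 210, 210]) == 1     # quorum stays on one coast
# Leader in New York with followers in NY, Sydney, Sydney, Sydney:
assert commit_latency_ms([1, 210, 210, 210]) == 210 # every write crosses the Pacific
```

Same five nodes, same algorithm, two orders of magnitude difference in write latency — purely from where the quorum lives.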
The decision procedure
A rough flowchart:
Do you trust all participants?
Yes: Use CFT.
No: Do you have smart contracts, audit, token semantics?
Yes: Permissioned blockchain (PBFT/HotStuff-backed).
No: BFT consensus directly (PBFT for small n, HotStuff for large n).
In CFT, pick based on team experience and ecosystem:
Team already knows Raft: use Raft. (etcd's raft library, hashicorp/raft.)
Team knows Paxos or has Paxos-shaped code: use Multi-Paxos.
Neither: use Raft.
Does your data need to scale horizontally?
Yes: Multi-raft (partition data, one Raft group per partition).
CockroachDB, TiKV, some in-house patterns.
No: Single-group Raft/Paxos. (etcd, Consul, ZooKeeper-pattern.)
Cross-region?
If yes, and writes need to be linearizable globally:
- Spanner-style (TrueTime + Paxos) if you have the clock infra.
- Otherwise: accept the WAN-RTT write latency, or shard by region.
If yes, but writes are region-local with occasional cross-region reads:
- One consensus group per region, application-level handling.
When crash-fault tolerance is enough
For most working engineers, the honest answer is "CFT is enough." This includes:
- Operational metadata (cluster state, service discovery, feature flags): etcd, Consul, ZooKeeper.
- Traditional databases needing HA: any Raft/Paxos-backed system (CockroachDB, FoundationDB, TiDB, etc).
- Message queues and log systems: Kafka (KRaft), Apache Pulsar (with BookKeeper underneath).
- Distributed locks and leader elections for larger applications: built on top of ZooKeeper/etcd/Consul.
The node count is 2f+1, the algorithm is Raft or Paxos, and the algorithm choice is less important than picking a battle-tested library and operating it well.
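The node-count arithmetic is worth internalizing. A two-line Python sketch of the standard bounds (n ≥ 2f+1 for crash faults, n ≥ 3f+1 for Byzantine faults):

```python
def faults_tolerated(n, byzantine=False):
    """How many simultaneous faulty nodes an n-node group tolerates:
    solve n >= 2f+1 (crash) or n >= 3f+1 (Byzantine) for f."""
    return (n - 1) // (3 if byzantine else 2)

assert faults_tolerated(3) == 1            # the classic minimal CFT cluster
assert faults_tolerated(5) == 2
assert faults_tolerated(6) == faults_tolerated(5)  # even sizes add nothing
assert faults_tolerated(4, byzantine=True) == 1    # BFT needs 4 nodes for f=1
assert faults_tolerated(3, byzantine=True) == 0    # 3 nodes buy nothing in BFT
```

This is also a quick smell test for vendor claims: a "Byzantine fault tolerant" three-node cluster tolerates zero Byzantine faults.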
When Byzantine fault tolerance is actually needed
BFT is justified when at least one of these is true:
- Multi-organization trust boundary. A consortium where individual nodes could be compromised or misbehave independently.
- Regulatory or audit requirements. Where cryptographic non-repudiation of every decision is a feature, not just a side effect.
- High-integrity systems. Financial settlement, government record-keeping, safety-critical coordination, where undetected corruption is unacceptable at any rate.
- Known adversarial environment. Multi-tenant systems where you have reason to believe some participants will act against the group's interest.
BFT is not justified merely because:
- "We might get hacked." (Defense in depth > BFT; BFT doesn't help if the attacker controls f+1 nodes.)
- "We want to be really safe." (CFT is very safe. BFT is safer against specific failure classes that may not apply.)
- "We might deploy on untrusted hardware." (Usually better addressed with attestation and tooling than with BFT.)
- "Blockchain-like, but permissioned." (Maybe. See Chapter 12.)
When eventual consistency beats consensus
Consensus is not the right answer for:
- Shopping cart state. Last-write-wins or merge semantics are fine.
- Social media feeds. Ordering can be approximate; occasional staleness is tolerable.
- Analytics pipelines. Throughput matters more than strict consistency.
- Cache tiers. You can always recompute; staleness is just a cache miss with higher latency.
- Systems where writes conflict rarely. CRDTs and eventual consistency are cheaper and simpler.
A common mistake is to shove all these into a consensus-backed database because the team already runs one. If the semantics don't require consensus, you are paying for consistency you don't use — and the system might be less available for unrelated reasons.
How to evaluate a consensus claim
When you receive a system and a vendor or author claims it "uses Raft" or "achieves linearizability" or "has Byzantine fault tolerance" — how do you evaluate?
A short checklist:
- What is the exact algorithm? Raft / Paxos / VR / PBFT / HotStuff / custom? If custom, is there a paper or spec?
- What is the failure model? Crash, crash-recovery, or Byzantine? Does the deployment reality match?
- What is the quorum size? Does it match the claimed fault tolerance (2f+1 for CFT, 3f+1 for BFT)?
- Does it fsync before acknowledging? If not, the durability claim is probably a lie.
- How does it handle network partitions? Does a minority partition continue to serve reads? Writes? Does it refuse?
- Has it been Jepsen-tested? If yes, what issues were found and have they been fixed? If no, ask why.
- What are the read semantics? Linearizable reads? Bounded-staleness? Best-effort?
- How does it handle clock skew? Are any safety guarantees dependent on clocks?
- What is the leader election / view change strategy? How long can a cluster be leaderless?
- What happens in a rolling upgrade? Can the cluster tolerate one node at a time being down, for the time it takes to upgrade?
If the vendor cannot answer most of these, you are not getting a consensus system — you are getting a product with consensus-shaped marketing. Caveat emptor.
When in doubt
A few defaults that rarely lead you wrong:
- For new CFT work: Raft with a battle-tested library (etcd's raft, hashicorp/raft, or equivalent in your language). Five-node cluster, colocated in one region, multi-AZ. Linearizable reads via leader with read index. fsync on every commit.
- For BFT work: HotStuff or a direct descendant. Don't write your own.
- For eventually consistent work: CRDT-based approach if the semantics admit it; otherwise, a battle-tested AP system (Cassandra, DynamoDB with eventual consistency).
- For cross-region: think twice; is regional sharding viable?
The reason these defaults are safe is that they represent enormous accumulated engineering effort. Choosing them instead of rolling your own isn't a cop-out; it is the right choice for almost everything.
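The fsync-on-commit default deserves one concrete illustration, because it is the item teams most often skip for latency. A minimal sketch of a durable log append (a hypothetical helper; real write-ahead logs also batch, checksum, and rotate):

```python
import os

def append_and_ack(log_path: str, entry: bytes) -> None:
    """Append one log entry durably, then (and only then) acknowledge."""
    with open(log_path, "ab") as log:
        log.write(entry + b"\n")
        log.flush()              # flush the userspace buffer to the OS
        os.fsync(log.fileno())   # force the OS page cache onto stable storage
    # Only after fsync returns is it safe to acknowledge the client.
    # Skip it and "committed" holds only until the next power loss.
```

The flush-then-fsync pair matters: flush alone hands the bytes to the operating system, which may still lose them on a crash; fsync is what makes the durability claim in the checklist honest.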
A final word on consensus and humility
In writing this book, I've come to believe that the main thing distributed systems teach is a kind of humility. Each of these algorithms was designed by smart people over years of careful work. Each has been implemented dozens of times; most implementations had subtle bugs; many had catastrophic ones. The literature is a long chain of "we thought this was right, then we found a case where it wasn't."
If you are building a consensus system from scratch, you are joining this chain. Most likely you will fail in ways you cannot foresee. The antidote is not caution to the point of paralysis but genuine respect for the problem: read carefully, test exhaustively (TLA+, Jepsen, fault injection), and have experienced reviewers catch what you miss.
The algorithms here have earned their place by surviving decades of adversarial scrutiny. Start with one of them.
Further Reading
Every algorithm, theorem, and system in this book came out of someone else's work. Here are the places to go when you want the source material, the fuller treatment, or the currents of ongoing research.
Foundational papers
Consensus and state machine replication
- Leslie Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, 1978. The paper that put logical clocks on the map and framed SMR. Also the one that introduced much of the vocabulary we still use.
- Leslie Lamport, "The Part-Time Parliament," ACM TOCS, 1998. The original Paxos paper. Written as a parable, delightful once you have the context, but famously hard to learn from cold.
- Leslie Lamport, "Paxos Made Simple," SIGACT News, 2001. The gentler presentation. Read this before The Part-Time Parliament.
- Leslie Lamport, "Fast Paxos," Distributed Computing, 2006. The optimization. Usually more interesting to read about than to implement.
- Brian Oki and Barbara Liskov, "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems," PODC 1988. VR, contemporaneous with Paxos.
- Barbara Liskov and James Cowling, "Viewstamped Replication Revisited," MIT-CSAIL-TR-2012-021, 2012. VRR. If you read one VR paper, read this one.
- Diego Ongaro and John Ousterhout, "In Search of an Understandable Consensus Algorithm," USENIX ATC 2014. Raft. The readable paper that changed the field.
- Diego Ongaro, "Consensus: Bridging Theory and Practice," PhD dissertation, Stanford, 2014. Raft in more detail, with the full formal treatment and implementation lessons.
- Fred Schneider, "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial," ACM Computing Surveys, 1990. The SMR survey that taught the community how to think about the pattern.
Impossibility results
- Michael Fischer, Nancy Lynch, and Michael Paterson, "Impossibility of Distributed Consensus with One Faulty Process," JACM, 1985. FLP. The paper itself is short and remarkably readable.
- Eric Brewer, "Towards Robust Distributed Systems," PODC 2000 keynote. The slides that launched CAP.
- Seth Gilbert and Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," SIGACT News, 2002. The formal CAP proof.
- Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer, "Consensus in the Presence of Partial Synchrony," JACM, 1988. Partial synchrony as a model. The backbone of "safe in asynchrony, live in partial synchrony."
Byzantine fault tolerance
- Leslie Lamport, Robert Shostak, and Marshall Pease, "The Byzantine Generals Problem," ACM TOPLAS, 1982. The original.
- Miguel Castro and Barbara Liskov, "Practical Byzantine Fault Tolerance," OSDI 1999. PBFT. Short, dense, worth multiple reads.
- Miguel Castro, "Practical Byzantine Fault Tolerance," PhD dissertation, MIT, 2001. The full treatment, with all the optimizations and engineering details the conference paper elided.
- Maofan Yin et al., "HotStuff: BFT Consensus in the Lens of Blockchain," PODC 2019. HotStuff.
- Shehar Bano et al., "SoK: Consensus in the Age of Blockchains," AFT 2019. A systematic survey of consensus through the blockchain-era lens.
Randomized consensus
- Michael Ben-Or, "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols," PODC 1983. Short, beautiful.
- Michael Rabin, "Randomized Byzantine Generals," FOCS 1983. The shared-coin approach.
- Christian Cachin, Klaus Kursawe, and Victor Shoup, "Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography," PODC 2000. Threshold cryptography for asynchronous BFT.
- Andrew Miller et al., "The Honey Badger of BFT Protocols," CCS 2016. Practical asynchronous BFT.
Production systems papers
- James Corbett et al., "Spanner: Google's Globally-Distributed Database," OSDI 2012. The TrueTime paper.
- Patrick Hunt et al., "ZooKeeper: Wait-free Coordination for Internet-scale Systems," USENIX ATC 2010. ZooKeeper's design and Zab.
- Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," SOSP 2007. The Dynamo paper. The alternate path.
- Jason Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," CIDR 2011. Google's pre-Spanner Paxos-backed store.
Textbooks
- Christian Cachin, Rachid Guerraoui, and Luís Rodrigues, Introduction to Reliable and Secure Distributed Programming, 2nd ed., 2011. The textbook if you want formal treatment. Algorithms specified precisely; correctness proved. Chapter 5 on randomized consensus is especially good.
- Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals, Simulations, and Advanced Topics, 2nd ed., 2004. A classical theory-of-distributed-computing textbook. Good for the formal foundations.
- Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly, 2017. The book to hand to a software engineer who wants one book on this space. Broader than consensus — covers databases, stream processing, batch — but the chapters on consistency, replication, and consensus are the best short treatment in the field.
- Maarten van Steen and Andrew Tanenbaum, Distributed Systems, 4th ed., 2023. A textbook in the classical mold; covers consensus alongside everything else.
- Peter Alvaro's lecture notes and reading lists from the distributed systems courses he has taught, which rotate through current material.
The Jepsen reports
Kyle Kingsbury's Jepsen analyses (jepsen.io/analyses) are essential reading. If you implement or operate distributed systems, read ten of these cover to cover. You will come away with a concrete sense of how claims fail under real pressure.
Recommended starting points:
- The early MongoDB reports (consistency evolution over the years).
- The etcd and Consul reports (what CFT correctness looks like when verified).
- The FoundationDB case (famously never given a full report; Kingsbury noted the authors' own deterministic simulation testing was more rigorous than his harness).
- The Redis reports (what happens when replication is bolted on later).
- The CockroachDB, YugabyteDB, TiDB, and FaunaDB reports (modern consensus databases in the crucible).
Each report follows a similar structure: "here is the system, here is the failure model I tested it under, here is what went wrong, here is the vendor's response." The cumulative effect of reading many of them is a kind of vaccination against overconfidence.
Formal methods
If you want to convince yourself (or others) of a protocol's correctness:
- Leslie Lamport's TLA+ suite — lamport.azurewebsites.net/tla. Specifications in TLA+, model checking with TLC. Nontrivial learning curve, enormous payoff.
- The Raft TLA+ specification — by Diego Ongaro, available with the Raft paper's artifacts. A small, readable reference.
- MongoDB's formal methods work — public write-ups of using TLA+ and model checking to find bugs before shipping. Useful case studies of formal methods in industrial practice.
Courses and lectures
- MIT 6.824, Distributed Systems — pdos.csail.mit.edu/6.824. Lecture videos and labs are on the open internet. The labs walk you through implementing Raft step by step. This is how most of the industry's younger generation actually learned Raft.
- Martin Kleppmann's distributed systems lectures (Cambridge) — on YouTube and github.com/mkleppmann/distsys-class. Excellent, Socratic.
The blogs and writeups
A rotating cast, but a few durable sources:
- Murat Demirbas's blog, Metadata. Paper summaries and commentary from a working distributed systems researcher.
- Aphyr, Kingsbury's personal blog (the Jepsen analyses themselves live at jepsen.io).
- High Scalability archives — historical glimpses of production architectures.
- Google's and Meta's engineering blogs — occasional deep dives on production systems.
Companion volume
How Blockchains Actually Work (Without the Hype) — for the other half of the consensus story, where membership is open and identities are pseudonymous. It covers Nakamoto consensus, proof-of-work, proof-of-stake, and the economic-security tradeoffs that define public blockchains.
Wrap
The literature on consensus is enormous and still growing. If you read only three papers: Lamport's Paxos Made Simple, Ongaro's Raft, and Castro and Liskov's PBFT. If you read only one book: Cachin, Guerraoui, and Rodrigues (for formal) or Kleppmann (for practical). If you read only one series: Jepsen.
Everything else is optional but rewarding. The field is one of the most intellectually honest corners of computer science — every bad paper gets attacked, every good paper gets extended, every algorithm gets tested. Join the conversation by reading widely and implementing carefully.
That's the book. Thank you for reading.
Acknowledgments
This book exists because Georgiy Treyvus, Product Manager at CloudStreet, asked for it. His scoping note set the tone: "Paxos, Raft, PBFT, Viewstamped Replication — compared and contrasted with brutal honesty. Understand the real tradeoffs so you can make informed decisions instead of just copying what Kafka does." That framing is the spine of everything that follows. Thank you, Georgiy.
Georgiy also originated the companion volume, How Blockchains Actually Work (Without the Hype). The two books are a pair, and the pair was his idea before it was anyone's writing.
Everything else — the algorithms, the theorems, the post-mortems — is the work of the researchers, engineers, and reviewers whose names appear throughout these chapters and in the further reading. This book is a reader's-guide to their field.