The Future of Consensus
Predicting the future of any technology field is a reliable way to look foolish in hindsight. Predicting the future of consensus algorithms, where the foundational impossibility result (FLP) is over forty years old and the most widely deployed protocol (Paxos, in various disguises) is over thirty, requires a particular kind of hubris.
Let’s proceed anyway.
The trends we’ll discuss in this chapter aren’t speculative — they’re research directions with working prototypes, some with production deployments. What’s uncertain isn’t whether these ideas work, but whether they’ll achieve the adoption necessary to displace the current generation of consensus protocols. History suggests that most won’t. But the few that do will reshape how we think about agreement in distributed systems.
Disaggregated Consensus: Separating Ordering from Execution
The traditional model of consensus-based state machine replication bundles two concerns: ordering (deciding the sequence of commands) and execution (applying those commands to the state machine). Every replica orders the commands and executes them. This is clean conceptually but wasteful practically — why should every replica burn CPU executing the same computation when only the ordering needs agreement?
Disaggregated consensus separates these concerns. A small consensus group handles ordering (producing a totally ordered log of commands), and a potentially larger set of execution nodes consumes this log and applies the commands. The ordering group doesn’t need to know what the commands mean. The execution nodes don’t need to participate in consensus.
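The split described above can be made concrete with a small sketch. The class names (`OrderingService`, `ExecutionNode`) are illustrative, not from any real system; the point is that the ordering side sequences opaque bytes while execution logic lives entirely outside the consensus group.

```python
class OrderingService:
    """Stands in for a small consensus group: its only job is total order."""
    def __init__(self):
        self.log = []  # totally ordered log of opaque byte strings

    def append(self, command: bytes) -> int:
        self.log.append(command)   # in a real system: replicated via Raft/Paxos
        return len(self.log) - 1   # position in the total order

class ExecutionNode:
    """Consumes the log and applies commands; never participates in consensus."""
    def __init__(self, ordering: OrderingService):
        self.ordering = ordering
        self.applied = 0
        self.state = {}            # a toy key-value state machine

    def catch_up(self):
        while self.applied < len(self.ordering.log):
            key, _, value = self.ordering.log[self.applied].partition(b"=")
            self.state[key] = value  # execution logic lives here, not in the log
            self.applied += 1

svc = OrderingService()
svc.append(b"x=1")
svc.append(b"y=2")

# Two execution nodes reach identical state by replaying the same order.
a, b = ExecutionNode(svc), ExecutionNode(svc)
a.catch_up(); b.catch_up()
assert a.state == b.state == {b"x": b"1", b"y": b"2"}
```

Note that nothing in `OrderingService` knows the commands are key-value writes; swapping in a different execution engine requires no change to the ordering side.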
This pattern has appeared in several forms:
Shared log architectures. Systems like Corfu (from Microsoft Research) and Delos (from Facebook/Meta, built around a “virtual consensus” abstraction) use a shared log as the foundation. The log provides total order. Applications built on top of the log get this ordering “for free” by reading from the log. Different applications can share the same log (amortizing the consensus cost) or use separate logs (isolating failure domains).

Separation in databases. Amazon Aurora separates compute from storage, with the storage layer handling replication. The compute layer doesn’t run a consensus protocol — it writes to the storage layer, which handles durability and replication using a quorum-based protocol. This means Aurora can scale compute independently of storage, and adding a read replica doesn’t affect the consensus protocol.
The Delos approach. Meta’s Delos system takes this further by making the shared log implementation pluggable. The VirtualLog abstraction presents a log API to applications, but the underlying log implementation can be swapped (from a ZooKeeper-based log to a custom NativeLoglet, for example) without changing the application. This is disaggregation not just of ordering and execution, but of the consensus protocol itself.
The appeal of disaggregation is operational: the consensus group can be small (3 or 5 nodes), simple (just ordering bytes, no application logic), and generic (shared across many applications). The execution layer can be scaled independently, can tolerate execution failures without affecting consensus, and can even use different execution engines for different use cases.
The challenge is latency. Adding an indirection layer between clients and the consensus group adds at least one additional network hop. For latency-sensitive applications, this overhead may not be acceptable. But for applications where throughput matters more than single-operation latency, disaggregation is a clear win.
This trend is likely to accelerate. As more systems adopt microservices architectures and as serverless computing grows, the idea of a shared, managed ordering service becomes increasingly natural. Why should each microservice run its own Raft group when they could share an ordering layer?
Hardware-Assisted Consensus
The most exciting (and most uncertain) trend in consensus research is the use of specialized hardware to accelerate consensus protocols. The basic idea: if the bottleneck in consensus is network round-trips and message processing, what if the network infrastructure itself participated in the protocol?
RDMA-Based Consensus
Remote Direct Memory Access (RDMA) allows one machine to read from or write to another machine’s memory without involving the remote CPU. This eliminates the kernel networking stack, context switches, and much of the software overhead of traditional RPC.
RDMA-based consensus protocols (like DARE and Hermes) exploit this by having the leader write directly to followers’ memory. The consensus round-trip becomes:
- Leader writes to followers’ memory via RDMA (one-sided write, no follower CPU involvement)
- Leader polls followers’ memory to check for acknowledgment
- Done
The latency for a single RDMA round-trip is typically 1-3 microseconds, compared to 50-200 microseconds for a traditional TCP round-trip. This means consensus can commit in single-digit microseconds, roughly two orders of magnitude faster than consensus over a conventional kernel networking stack.
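The write-then-poll commit path can be modeled with plain in-process buffers. This is a toy simulation of one-sided semantics, not real RDMA: the point it illustrates is that no follower code runs on the data path.

```python
# Simulated follower memory regions: an entry slot plus an ack flag that the
# remote NIC (not the remote CPU) would set on write completion.
FOLLOWERS = 2
follower_mem = [{"entry": None, "ack": 0} for _ in range(FOLLOWERS)]

def rdma_write(mem, entry):
    # One-sided write: succeeds even if the remote CPU never runs.
    mem["entry"] = entry
    mem["ack"] = 1  # DARE-style designs have the leader read back a completion

def leader_commit(entry):
    for mem in follower_mem:
        rdma_write(mem, entry)
    # Poll until a quorum of ack slots is set (here: all followers).
    while sum(m["ack"] for m in follower_mem) < FOLLOWERS:
        pass
    return True

assert leader_commit(b"op-1")
assert all(m["entry"] == b"op-1" for m in follower_mem)
```

The simulation also hints at the failure-model wrinkle discussed below: the writes “succeed” regardless of whether any follower process is alive to notice them.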
The limitations are significant:
- RDMA requires specialized network hardware (InfiniBand or RoCE-capable NICs). This hardware is common in HPC and cloud data centers but not ubiquitous.
- RDMA-based protocols are limited to a single network domain (typically a single data center). You can’t do RDMA over the public internet.
- The failure model is different. RDMA one-sided writes can succeed even if the remote node has crashed (the NIC performs the write independently of the CPU). Detecting failure requires additional mechanisms.
- The programming model is complex. RDMA requires careful memory management, and bugs in RDMA code can corrupt remote memory silently.
Despite these limitations, RDMA-based consensus is being adopted in production systems where latency matters more than generality. Microsoft’s FaRM (Fast Remote Memory) uses RDMA-based replication for an in-memory key-value store, achieving millions of operations per second with microsecond-level latency.
SmartNIC and Programmable Switch Consensus
Even more exotic: running parts of the consensus protocol on network devices themselves.
NetPaxos (from Dang et al.) implements Paxos’s acceptor logic on a programmable network switch (using P4, a language for programming network data planes). The idea is that the switch, which already sees every packet, can act as the acceptor — stamping each proposal with an acceptance as the packet passes through. This eliminates the round-trip to separate acceptor nodes entirely.
Speculative Paxos (from Ports et al.) uses the network to provide an ordering guarantee. If the network fabric can deliver messages to all replicas in the same order (which some data center network topologies can approximate), then replicas can speculatively execute commands in that order, and the consensus protocol only needs to handle the rare cases where the network order isn’t consistent.
P4xos and related systems push consensus logic into the switch ASIC, achieving consensus in nanoseconds rather than microseconds. The tradeoff: switch memory and computation are extremely limited, so only the critical path of the consensus protocol runs on the switch, with everything else (recovery, membership changes, state transfer) handled in software.
These approaches share a common theme: exploiting the data center network as a computational resource rather than a dumb pipe. The results are impressive — sub-microsecond consensus in the best case. But the limitations are equally significant:
| Approach | Latency Improvement | Limitations |
|---|---|---|
| RDMA one-sided writes | ~100x (tens to hundreds of microseconds → single-digit microseconds) | Special NICs, single DC, complex failure handling |
| SmartNIC offload | ~10-100x | NIC-specific code, limited processing power |
| Programmable switch | ~1000x (nanoseconds) | P4-capable switches, very limited state, vendor-specific |
| Network-ordered consensus | ~10x (eliminates ordering round-trip) | Requires network topology guarantees, fragile |
The question isn’t whether these approaches work — they do, in controlled environments. The question is whether the hardware assumptions will become common enough to make them practical for general use. The trend in data center networking (toward programmable switches, SmartNICs, and RDMA-capable fabrics) suggests yes, but the timeline is uncertain.
FPGAs and Custom ASICs
Some researchers have gone further, implementing consensus logic directly in FPGAs. Consensus in a Box and similar projects demonstrate that the hot path of a consensus protocol can run entirely in hardware, with the FPGA accepting proposals, checking quorums, and producing decisions without software involvement.
This is the extreme end of the hardware spectrum: maximum performance, minimum flexibility. An FPGA-based consensus accelerator can commit in nanoseconds, but changing the protocol requires reprogramming the hardware. For use cases where the consensus protocol is fixed and performance is paramount (high-frequency trading, real-time control systems), this makes sense. For general-purpose distributed systems, it’s overkill.
The Move Toward Shared Log Architectures
The shared log pattern deserves deeper examination because it represents a genuine architectural shift, not just an optimization.
In a shared log architecture, consensus is a service, not a library. Applications don’t embed a consensus protocol — they connect to a log service and append entries. The log service handles ordering, replication, and durability. Applications handle everything else.
This separation has several architectural benefits:
Amortized consensus overhead. If ten applications share one log, the consensus cost (leader election, quorum management, replication) is paid once, not ten times. Each application adds incremental cost for its log entries, but the fixed overhead is shared.
Simplified application development. Application developers don’t need to understand consensus. They need to understand “append to log” and “read from log.” This is a much lower bar and eliminates an enormous class of bugs (incorrect consensus implementations, failure to handle leader changes, etc.).
Flexible consistency. Different applications can read the same log at different points, providing different consistency levels. An application that reads at the tail of the log sees the latest data (strong consistency). An application that reads at a lag sees slightly stale data (eventual consistency). The log itself provides the consistency, and applications choose where to read.
Time travel and debugging. Because the log is the source of truth, you can replay from any point to reconstruct state. This is invaluable for debugging (“what did the system look like at 3:47 AM when the incident started?”) and for building new derived views (materialize a new index by replaying the log).
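Both “flexible consistency” and “time travel” fall out of one property: state is a deterministic function of a log prefix. A minimal sketch (the `replay` helper and the log format are illustrative):

```python
# A totally ordered log of state-machine commands.
log = [("set", "a", 1), ("set", "b", 2), ("set", "a", 3)]

def replay(upto: int) -> dict:
    """Reconstruct state from the log prefix log[:upto]."""
    state = {}
    for op, key, value in log[:upto]:
        if op == "set":
            state[key] = value
    return state

assert replay(len(log)) == {"a": 3, "b": 2}  # tail read: latest state
assert replay(2) == {"a": 1, "b": 2}         # lagging read: stale but consistent
assert replay(1) == {"a": 1}                 # "time travel" to an earlier point
```

A strongly consistent reader replays to the tail; an eventually consistent reader replays to wherever it has reached; a debugger replays to the log position corresponding to 3:47 AM.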
The shared log architecture is essentially the architecture of databases (WAL-based recovery) elevated to the system level. Instead of each database having its own WAL, the entire system shares a distributed WAL. It’s an old idea, but one whose time may have come as the tooling for building on top of logs (Kafka Streams, Apache Flink, materialized views) has matured.
Meta’s Delos, Microsoft’s Tango/Corfu, and various “log-structured everything” proposals all point in this direction. The pattern is also visible in the way CockroachDB and TiDB use Raft groups as per-range logs, with the state machine (SQL engine) consuming from these logs.
Leaderless and Multi-Leader Consensus: The Next Generation
The leader bottleneck — a single node that sequences all operations — remains the most important practical limitation of deployed consensus protocols. Several research directions aim to eliminate or mitigate it.
Beyond EPaxos
EPaxos demonstrated that leaderless consensus is possible, but its complexity has limited adoption. Newer protocols attempt to capture EPaxos’s benefits with simpler designs.
Atlas (Enes et al., 2020) simplifies EPaxos by restricting the dependency tracking to a cleaner model. While EPaxos tracks arbitrary dependencies between commands, Atlas uses a more structured approach that reduces the complexity of the execution ordering algorithm — arguably the hardest part of EPaxos to get right.
Tempo (Enes et al., 2021) goes further, providing leaderless consensus with clock-based ordering. By using loosely synchronized clocks (not for correctness, but for performance — clock skew degrades performance, not safety), Tempo can order most commands without explicit dependency tracking.
These protocols represent a trend toward making leaderless consensus practical rather than merely possible. The question is whether any of them will achieve the implementation maturity and ecosystem support necessary to challenge Raft’s dominance. As of this writing, none has.
Multi-Leader for Geo-Distribution
For geo-distributed systems where a single leader creates unacceptable latency, multi-leader approaches are gaining traction:
WPaxos (Ailijiang et al.) extends Flexible Paxos with object-level leadership. Different objects can have leaders in different regions. When a region’s client accesses an object whose leader is remote, the leader can be migrated to the accessing region. This provides locality for frequently accessed objects while maintaining consensus.
Mencius (Mao et al.) partitions the consensus log among multiple leaders. Leader 1 owns slots 1, 4, 7, …; leader 2 owns slots 2, 5, 8, …; etc. Each leader can independently propose for its slots, providing multi-leader throughput. The challenge is handling “holes” when a leader has nothing to propose for its slot (resolved with no-op messages).
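The slot-partitioning scheme is easy to sketch (using 0-based slots and leader IDs; the `pending` queues and the revocation-free failure handling of real Mencius are simplified away):

```python
# Mencius-style round-robin slot ownership: with n leaders, leader i owns
# slots i, i+n, i+2n, ... An idle leader fills its slots with no-ops so the
# log has no holes and commands can commit in slot order.
N_LEADERS = 3

def slot_owner(slot: int) -> int:
    return slot % N_LEADERS

# Proposals queued at each leader, in local arrival order.
pending = {0: ["a1", "a2"], 1: [], 2: ["c1"]}

log = []
for slot in range(6):
    owner = slot_owner(slot)
    if pending[owner]:
        log.append(pending[owner].pop(0))
    else:
        log.append("no-op")  # idle leader plugs the hole in its slot

assert slot_owner(4) == 1
assert log == ["a1", "no-op", "c1", "a2", "no-op", "no-op"]
```

The cost is visible in the output: a skewed workload (here, leader 1 has nothing to propose) turns a third of the log into no-ops.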
These approaches represent a middle ground between single-leader consensus (simple, high latency for remote clients) and full leaderlessness (low latency, complex). Whether the middle ground is the right tradeoff depends on the workload, which brings us back to the unsatisfying but accurate answer: it depends.
Consensus at the Edge
The edge computing trend — pushing computation closer to users, onto devices at the network edge — presents new challenges for consensus.
Resource constraints. Edge devices may have limited CPU, memory, and storage. Running a full Raft implementation on a device with 64MB of RAM requires careful engineering. Running PBFT is likely impractical.
Intermittent connectivity. Edge devices may lose connectivity to the cloud and to each other. Consensus protocols that require a quorum of always-connected nodes don’t work when connectivity is intermittent.
Heterogeneous nodes. Edge deployments may include devices with very different capabilities — a powerful edge server in a base station alongside resource-constrained IoT sensors. Symmetric consensus protocols (where all nodes play the same role) don’t fit well.
Geo-distribution. Edge nodes are, by definition, distributed across geographic locations. The latency between an edge node in Seattle and an edge node in Miami makes classical consensus impractical for coordination between them.
These constraints push toward a few directions:
Hierarchical consensus. Local consensus among co-located edge nodes (fast, small quorum), with asynchronous replication to a central cloud (for durability and global coordination). This is not new — it’s the multi-master replication model that databases have used for decades — but the edge context adds constraints around resource limits and intermittent connectivity.
CRDTs and eventual consistency. For many edge use cases (counters, sensor aggregation, last-writer-wins registers), eventual consistency with CRDTs is sufficient and avoids the need for consensus entirely. The CRDT model is a natural fit for edge/IoT: each device can operate independently and merge state when connectivity is available.
Lightweight consensus protocols. Research into consensus protocols specifically designed for resource-constrained environments (fewer messages, smaller state, lower CPU requirements) is ongoing. These aren’t fundamentally new algorithms — they’re typically optimized variants of Raft or Paxos — but the optimization target is different (minimize resource usage rather than maximize throughput).
The edge will probably not produce a fundamentally new consensus algorithm. Instead, it will produce new architectures that combine existing consensus protocols (for the parts that need strong consistency) with eventual consistency mechanisms (for the parts that don’t) in configurations that reflect the edge’s unique constraints.
Machine Learning and Consensus
The intersection of machine learning and consensus is nascent but intriguing.
Learned Protocol Configuration
Consensus protocols have many tunable parameters: timeout values, batch sizes, quorum configurations, leader placement. Traditionally, these are set by humans based on experience and benchmarking. ML-based approaches can potentially optimize these parameters automatically.
Timeout tuning. Raft’s election timeout must be long enough to avoid unnecessary elections (which cause disruption) but short enough to detect actual failures promptly. The optimal timeout depends on network conditions, which change over time. An ML model that observes network latency patterns and adjusts timeouts accordingly could improve both stability and failover speed.
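Even without a learned model, the shape of adaptive timeout tuning can be sketched with a classical estimator. The sketch below uses an exponentially weighted moving average of heartbeat latency plus a deviation term (in the spirit of TCP’s RTO estimator); an ML model would replace the fixed smoothing constants with learned ones. All parameter names and values here are illustrative.

```python
class AdaptiveTimeout:
    def __init__(self, alpha=0.125, beta=0.25, k=4.0, floor_ms=150.0):
        self.alpha, self.beta, self.k = alpha, beta, k
        self.floor_ms = floor_ms  # never drop below a safe minimum
        self.srtt = None          # smoothed heartbeat latency
        self.var = 0.0            # smoothed deviation

    def observe(self, sample_ms: float):
        if self.srtt is None:
            self.srtt, self.var = sample_ms, sample_ms / 2
        else:
            self.var = (1 - self.beta) * self.var + self.beta * abs(sample_ms - self.srtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_ms

    def timeout_ms(self) -> float:
        if self.srtt is None:
            return self.floor_ms
        # Timeout sits a few deviations above the smoothed latency.
        return max(self.floor_ms, self.srtt + self.k * self.var)

t = AdaptiveTimeout()
for sample in [10, 12, 11, 50, 13]:  # a latency spike widens the deviation
    t.observe(sample)
assert t.timeout_ms() >= t.floor_ms  # on a fast network the floor dominates
```

The floor matters for safety of availability: an estimator that tracks latency too aggressively downward would trigger elections on every minor hiccup.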
Batch size optimization. The optimal batch size for consensus depends on message sizes, arrival rates, and latency targets. Too small and you waste per-batch overhead. Too large and you increase latency while waiting for the batch to fill. An RL (reinforcement learning) agent could learn the optimal batching strategy online.
Leader placement. In a geo-distributed consensus group, placing the leader near the majority of clients minimizes commit latency. As workload patterns shift (different regions are active at different times of day), the optimal leader placement changes. An ML model could predict workload patterns and trigger proactive leader migration.
Adaptive Protocol Selection
More speculatively: could a system automatically choose between different consensus protocols (or protocol configurations) based on the current workload?
Consider a system that uses EPaxos’s fast path when the workload has low contention (most commands don’t conflict) and falls back to a Raft-like leader-based approach when contention is high (too many commands conflict for EPaxos’s fast path to be effective). The switch point depends on the workload’s conflict rate, which an ML model could estimate in real-time.
This is technically feasible — the challenge is ensuring correctness during transitions. Switching between protocols mid-stream requires careful state synchronization, and any bug in the transition logic could violate safety. The engineering cost of getting this right probably outweighs the performance benefit for most systems today, but as consensus-as-a-service offerings mature, the cost-benefit calculation may shift.
Predictive Failure Detection
A more immediately practical application of ML is improving failure detection. Current consensus protocols use fixed timeouts to detect node failures — if a heartbeat isn’t received within T milliseconds, the node is presumed dead. This is crude: too short a timeout causes false positives (healthy nodes declared dead during a network hiccup), too long a timeout delays failover.
ML models can learn the distribution of heartbeat delays under normal conditions and flag anomalies. If heartbeat latency normally follows a pattern (lower during the day, higher during batch processing at night), an ML model can adjust the effective timeout dynamically. Systems like Microsoft’s Falcon and academic projects on ML-based failure detection have shown promising results — reducing false positive rates while maintaining fast detection of actual failures.
This is one area where ML provides unambiguous value: it’s a pure optimization of existing mechanisms, the safety properties don’t depend on the ML model being correct (a false positive just triggers an unnecessary but safe leader election), and the training data is readily available from production telemetry.
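A simple instance of this idea, predating deep learning, is the phi-accrual style of failure detector (the approach behind the detectors in Cassandra and Akka): learn the distribution of inter-heartbeat gaps and report a suspicion level that grows the longer a heartbeat is overdue relative to that distribution. The sketch below uses a crude normal approximation; production detectors are more careful.

```python
import math
import statistics

class PhiDetector:
    def __init__(self, window=100):
        self.gaps = []    # recent inter-heartbeat gaps
        self.last = None
        self.window = window

    def heartbeat(self, now: float):
        if self.last is not None:
            self.gaps.append(now - self.last)
            self.gaps = self.gaps[-self.window:]
        self.last = now

    def phi(self, now: float) -> float:
        """Higher phi = stronger suspicion that the node is dead."""
        if len(self.gaps) < 2:
            return 0.0
        mean = statistics.mean(self.gaps)
        std = max(statistics.stdev(self.gaps), 1e-3)
        overdue = now - self.last
        # Tail probability of a gap this large under a normal model.
        p_later = 0.5 * math.erfc((overdue - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))

d = PhiDetector()
for t in [0.0, 1.0, 2.0, 3.0, 4.1, 5.0]:  # heartbeats roughly every second
    d.heartbeat(t)
assert d.phi(5.5) < d.phi(9.0)  # suspicion grows as silence lengthens
```

The output is a continuous suspicion level rather than a binary alive/dead verdict, which lets the consensus layer pick its own threshold per decision (aggressive for read re-routing, conservative for triggering an election).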
What ML Won’t Fix
It’s worth being clear about what machine learning cannot do for consensus:
- ML cannot eliminate the fundamental round-trip overhead. No amount of prediction can substitute for actual agreement among nodes.
- ML cannot fix the FLP impossibility result. An asynchronous system with even one faulty process cannot guarantee consensus termination, regardless of how clever the ML model is.
- ML cannot substitute for formal correctness. A consensus protocol must be safe under all executions, not just the ones the ML model has seen in training. Using ML to learn a consensus protocol from scratch (rather than to tune parameters of a correct protocol) is a terrible idea.
The role of ML in consensus is optimization at the margins, not fundamental improvement. The margins can matter — a 20% latency reduction from better timeout tuning is valuable — but they don’t change the game.
The “Death of Consensus” Argument
Every few years, someone publishes a provocative paper or blog post arguing that consensus is obsolete. The argument typically goes: CRDTs and causal consistency can handle most use cases without coordination, so strong consensus is an unnecessary overhead that we should eliminate.
Let’s take this argument seriously.
Where the Argument Is Right
CRDTs (Conflict-Free Replicated Data Types) have genuinely expanded the space of problems solvable without consensus. For data types with commutative, associative, and idempotent operations (counters, sets, LWW-registers, OR-sets, etc.), CRDTs provide eventual convergence without any coordination. This is not a theoretical curiosity — it’s used in production by systems like Riak, Redis, and various collaborative editing applications.
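A minimal example of the idea is the grow-only counter (G-Counter): each node increments its own slot, and merge takes the element-wise maximum. Because merge is commutative, associative, and idempotent, replicas converge regardless of delivery order or duplication.

```python
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's increment count

    def increment(self, n: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Element-wise max: commutative, associative, idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas increment independently, then exchange state in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # convergence without any coordination
a.merge(b)                          # re-delivering the same state is harmless
assert a.value() == 5               # idempotence
```

Note what the structure cannot do: it cannot decrement (that needs a PN-Counter, a pair of G-Counters), and no amount of merging can express “only one of us gets this”, which is exactly the gap the next section turns to.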
Causal consistency (where operations are ordered according to their causal dependencies, but causally independent operations can be ordered arbitrarily) is weaker than linearizability but strong enough for many applications. If you only need “read your own writes” and “monotonic reads,” causal consistency provides this without the latency and availability costs of consensus.
The CALM theorem (Consistency As Logical Monotonicity) provides a formal framework for identifying which computations require coordination and which don’t. Monotonic computations (where adding information never invalidates previous conclusions) can be done without coordination. This is a powerful insight that suggests many programs could be written to avoid consensus entirely.
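The monotonicity intuition can be shown in miniature. Set union is monotonic, so every delivery order yields the same result; a query like “is x absent?” is non-monotonic, because later input can invalidate an answer given earlier. The example is illustrative, not a statement of the theorem.

```python
from itertools import permutations

updates = [{"a"}, {"b"}, {"c"}]

# Monotonic computation: union converges to the same set in every order,
# so no coordination is needed to agree on the outcome.
results = set()
for order in permutations(updates):
    acc = set()
    for u in order:
        acc |= u
    results.add(frozenset(acc))
assert len(results) == 1

# Non-monotonic query: a conclusion drawn before all input has arrived
# ("a is absent") is invalidated by a later update. Acting on such a
# conclusion safely is what requires coordination.
seen = set()
early_conclusion = "a" not in seen  # True at this moment
seen |= {"a"}
assert early_conclusion and "a" in seen  # the early answer no longer holds
```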
Together, these results suggest that the domain of problems requiring strong consensus is smaller than we traditionally assumed. Many systems that use consensus don’t actually need it — they could achieve their requirements with weaker consistency models.
Where the Argument Is Wrong
The argument that consensus is obsolete breaks down in several places:
Leader election and mutual exclusion are inherently non-monotonic. Choosing one leader among many, or granting a lock to one process among many, requires agreement. No CRDT can implement a distributed lock. No eventually consistent system can provide mutual exclusion. If your system needs “exactly one process does this thing at this time,” you need consensus.
Transactions require ordering. If you need atomic multi-key updates (transfer money from account A to account B), you need agreement on the order of operations. CRDTs handle individual operations beautifully but don’t compose into transactions without additional coordination.
Configuration changes require agreement. The membership of a distributed system (which nodes are alive, which are primary, what’s the current schema) must be consistent across nodes. A system where different nodes disagree about who the leader is will behave incorrectly. Ironically, even a system that uses CRDTs for application data needs consensus for its own configuration.
Exactly-once semantics require deduplication state. If you need to ensure an operation is applied exactly once, you need consistent deduplication state. This requires consensus (or something equivalent, like a consistent hash ring backed by consensus).
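The standard construction is to tag each request with a (client id, sequence number) pair and keep the highest applied sequence per client inside the replicated state machine, so every replica makes the same accept/reject decision. A sketch (the class and field names are illustrative):

```python
class DedupStateMachine:
    def __init__(self):
        self.balance = 0
        self.last_seq = {}  # client_id -> highest sequence number applied

    def apply(self, client_id: str, seq: int, delta: int) -> bool:
        # The dedup table lives *inside* replicated state: it is updated by
        # the same committed log entries it guards, so replicas agree on it.
        if seq <= self.last_seq.get(client_id, -1):
            return False  # duplicate, e.g. a client retry after a timeout
        self.last_seq[client_id] = seq
        self.balance += delta
        return True

sm = DedupStateMachine()
assert sm.apply("c1", 0, +100)
assert not sm.apply("c1", 0, +100)  # retried request is ignored, not re-applied
assert sm.apply("c1", 1, -30)
assert sm.balance == 70
```

The table itself must be strongly consistent: if two replicas disagreed about `last_seq`, one of them would apply the retry and the “exactly once” guarantee would silently become “at least once”.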
The Numbers
To put some perspective on the “CRDTs will replace consensus” argument, consider what fraction of a typical system’s operations actually require strong consistency:
| System Type | Operations Requiring Consensus | Operations Suitable for Eventual Consistency |
|---|---|---|
| Social media feed | < 1% (account creation, deactivation) | > 99% (posts, likes, comments, reads) |
| E-commerce | ~5-10% (checkout, inventory decrement) | ~90-95% (browsing, cart updates, recommendations) |
| Banking | ~50-80% (transfers, balance updates) | ~20-50% (statement reads, notifications) |
| Configuration management | ~100% (all writes) | ~0% (all reads need strong consistency) |
| Collaborative editing | ~1-5% (document creation, permissions) | ~95-99% (edit operations via OT/CRDTs) |
For the social media and e-commerce cases, the “death of consensus” argument has merit — most operations don’t need it. For banking and configuration management, consensus remains essential. The future isn’t “consensus vs. no consensus” — it’s knowing which operations fall into which category.
The Realistic Prediction
Consensus won’t die. But its domain will shrink.
The pattern that’s emerging is a layered architecture:
- Consensus layer (small, critical): Handles leader election, membership, configuration, and the small amount of data that requires strong consistency. Runs on a small number of nodes (3-5). Uses Raft or an equivalent protocol.
- Coordination-free layer (large, performance-sensitive): Handles application data using CRDTs, eventual consistency, or causal consistency. Runs on many nodes. No consensus overhead.
- Selective consensus (as needed): Some operations require strong consistency even on application data (e.g., registering a unique username, preventing inventory from dropping below zero). These operations go through the consensus layer on demand.
This is essentially what many modern databases already do (consensus for the metadata layer, weaker consistency for the data layer with optional strong reads). The trend is toward making this architecture more explicit and more configurable.
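The routing decision at the heart of this architecture can be sketched in a few lines. The operation names, the classification set, and the two stub paths below are all hypothetical; the point is that the consensus-requiring minority is identified per operation, not per system.

```python
# Operations that must go through the (slow, strongly consistent) path.
STRONG_OPS = {"register_username", "decrement_inventory", "change_config"}

consensus_log = []  # stands in for a small Raft group
crdt_updates = []   # stands in for a coordination-free replica set

def submit(op: str, payload: dict) -> str:
    if op in STRONG_OPS:
        consensus_log.append((op, payload))  # pays a quorum round-trip
        return "strong"
    crdt_updates.append((op, payload))       # local apply, asynchronous merge
    return "eventual"

assert submit("like_post", {"post": 7}) == "eventual"
assert submit("register_username", {"name": "ada"}) == "strong"
assert len(consensus_log) == 1 and len(crdt_updates) == 1
```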
The Modularity Trend: Consensus as a Pluggable Component
The traditional approach to building a consensus-based system is to deeply integrate the consensus protocol into the application. etcd’s Raft implementation is tightly coupled to etcd’s storage engine. ZooKeeper’s Zab implementation is tightly coupled to ZooKeeper’s data model.
A growing trend is toward modular, composable consensus:
Consensus libraries. Instead of building a complete system, use a consensus library and add your application logic on top. Examples include hashicorp/raft (Go), openraft (Rust), ratis (Java), and dragonboat (Go). These libraries handle leader election, log replication, and snapshot management, exposing a state machine interface that the application implements.
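The contract these libraries expose has a common shape, sketched below in Python for illustration (the real interfaces are in Go, Rust, and Java, and differ in detail — hashicorp/raft, for example, calls the methods `Apply`, `Snapshot`, and `Restore` on its `FSM` interface): the library owns election and replication, and calls into application code only to apply committed entries and to snapshot or restore state.

```python
import json

class KVStateMachine:
    """Application-side logic; the consensus library never inspects it."""
    def __init__(self):
        self.data = {}

    def apply(self, entry: bytes):
        # Called by the library, in log order, once an entry is committed.
        cmd = json.loads(entry)
        self.data[cmd["key"]] = cmd["value"]

    def snapshot(self) -> bytes:
        # Called periodically so the log can be truncated.
        return json.dumps(self.data).encode()

    def restore(self, snap: bytes):
        # Called when a node catches up from a snapshot instead of the log.
        self.data = json.loads(snap)

# Simulate what the library does: commit entries, then apply them in order.
fsm = KVStateMachine()
for entry in [b'{"key": "x", "value": 1}', b'{"key": "y", "value": 2}']:
    fsm.apply(entry)

snap = fsm.snapshot()
fresh = KVStateMachine()
fresh.restore(snap)  # a new or lagging node rebuilds state from a snapshot
assert fresh.data == {"x": 1, "y": 2}
```

The correctness obligations on the application side are exactly the ones the next paragraphs warn about: `apply` must be deterministic, and `snapshot`/`restore` must round-trip the complete state.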
Consensus-as-a-service. Instead of running your own consensus group, use a managed ordering service. This is the shared log approach discussed earlier — the consensus is someone else’s problem, and your application just appends to and reads from a log.
Pluggable consensus in blockchain frameworks. Hyperledger Fabric allows plugging in different consensus protocols (Raft, BFT variants). Cosmos SDK, built on Tendermint, separates the consensus engine from the application via ABCI. This modularity lets the same application run with different consensus backends depending on the deployment requirements.
The modularity trend reduces the barrier to using consensus correctly. If you’re using a well-tested consensus library rather than implementing your own, you inherit the library’s correctness (and its bugs, but at least they’re shared, well-known bugs). If you’re using consensus-as-a-service, you don’t need to understand the protocol at all.
The risk is that modularity can create a false sense of security. A consensus library handles consensus, but it doesn’t handle the interaction between consensus and your application. Incorrect use of the state machine interface, improper handling of leadership changes, or misunderstanding of the consistency guarantees provided by the library can all lead to bugs that the library can’t prevent.
What Problems Remain Unsolved
Despite decades of research, several problems in consensus remain genuinely open:
Optimal Byzantine Fault Tolerance
We know that BFT requires 3f+1 nodes for f Byzantine faults, and we know that HotStuff achieves linear message complexity. But the constant factors in BFT protocols remain high. The throughput gap between CFT and BFT protocols is roughly 10x, even with optimized BFT implementations. Closing this gap — or proving that it can’t be closed — is an open problem.
Threshold signatures (used by HotStuff to aggregate votes) help with message complexity but add CPU overhead. Post-quantum threshold signatures will add even more overhead. Whether hardware acceleration (SmartNICs, FPGAs) can close the gap is an active research question.
Dynamic Membership with Formal Guarantees
Adding and removing nodes from a consensus group while maintaining safety is a solved problem in theory (Raft’s joint consensus, Lamport’s reconfigurable Paxos) but remains fragile in practice. Systems regularly hit edge cases during membership changes — nodes that don’t know they’ve been removed, configurations that briefly lack a quorum, state transfer that races with new proposals.
A consensus protocol that handles dynamic membership as cleanly as it handles normal operation (same safety proofs, same performance characteristics, same implementation simplicity) does not yet exist.
Consensus Under Partial Synchrony with Tight Bounds
The FLP impossibility result tells us that consensus is impossible in a fully asynchronous system. Practical protocols assume partial synchrony — eventually, messages are delivered within a bounded time. But the bound is unknown, and choosing it wrong has consequences (too short: unnecessary leader elections; too long: slow failure detection).
An adaptive protocol that provides optimal performance under current network conditions without requiring any timing assumptions as input is a holy grail that remains out of reach. Machine learning approaches (as discussed above) are a step in this direction but lack formal guarantees.
Consensus at Planetary Scale
Current consensus protocols work well within a data center (sub-millisecond latency) and tolerably across regions (tens to hundreds of milliseconds). But as we push toward truly global systems — satellite networks, interplanetary communication, systems spanning light-seconds of latency — current protocols break down.
The speed of light imposes a hard floor on latency. Earth to Mars is 3-22 minutes one-way. A Raft round-trip of 6-44 minutes is not a practical commit latency. New models of consistency — perhaps based on speculative execution, hierarchical consensus, or fundamentally different assumptions about agreement — will be needed for truly planetary-scale systems. This is admittedly a niche concern today, but it’s a fascinating theoretical question.
Making Consensus Understandable
This might be the most important unsolved problem. Despite Raft’s explicit goal of understandability, consensus algorithms remain difficult for most engineers to understand, implement correctly, and debug. The gap between reading a protocol description and building a correct implementation is enormous.
Better tools (model checkers like TLA+ and formal verification frameworks), better education (interactive visualizations, worked examples), and better abstractions (consensus libraries and services that hide the protocol details) all help. But the fundamental complexity of agreement in the presence of failures is irreducible. You can hide it behind an abstraction, but someone has to understand what’s behind the abstraction when things go wrong.
Formal Verification: Proving Consensus Correct
One trend that deserves dedicated attention is the increasing use of formal verification for consensus implementations — not just protocols on paper, but actual running code.
The State of the Art
The gap between a provably correct protocol and a provably correct implementation has historically been wide. Lamport proved Paxos correct on paper in 1989. The first formally verified implementations of practical consensus protocols appeared in 2015: IronFleet's Paxos-based replicated state machine and Verdi's Raft-based key-value store. That's 26 years between "we know the protocol is correct" and "we can prove the code is correct."
Several projects have pushed this boundary:
IronFleet (Microsoft Research) provided the first formally verified implementation of a practical Paxos-based replicated state machine, including the network marshaling, state transfer, and liveness properties. The verification covered not just the core consensus logic but the implementation code itself, trusting only the Dafny compiler and runtime beneath it. The cost: roughly 4x the development effort compared to an unverified implementation.
Verdi provides a framework for verified distributed systems in Coq, with consensus protocols as a primary use case. Verdi’s approach separates the protocol specification from the system-level concerns (network semantics, failure handling), allowing each to be verified independently.
CockroachDB’s use of TLA+ demonstrates a more pragmatic approach: specify the protocol in TLA+, model-check the specification to find bugs, but don’t formally verify the implementation. This catches protocol-level bugs (of which the CockroachDB team has found several) without the cost of full verification.
The Raft TLA+ specifications (both the original by Ongaro and subsequent refinements) have been used to find bugs in multiple Raft implementations by checking that the implementation’s behavior matches the specification. This isn’t full verification, but it’s far better than testing alone.
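Trace checking of this kind can be surprisingly lightweight. A toy sketch, with an invented event format, that replays events recorded from a running implementation and checks two invariants any Raft specification guarantees (terms never regress at a node; at most one leader per term):

```python
# Replay recorded implementation events and check Raft safety invariants.
# The (node, kind, term) event format is an illustrative assumption, not
# the output of any real tool.

def check_trace(events):
    """events: list of (node, kind, term) tuples, in observed order.
    Returns None if no violation is found, else a description."""
    current_term = {}      # node -> last observed term
    leader_for_term = {}   # term -> node that won that term's election

    for node, kind, term in events:
        # Invariant 1: a node's current term never moves backward.
        if term < current_term.get(node, 0):
            return f"term regressed on {node}"
        current_term[node] = term

        # Invariant 2: at most one leader per term (Election Safety).
        if kind == "become_leader":
            if leader_for_term.setdefault(term, node) != node:
                return f"two leaders in term {term}"
    return None

trace = [
    ("n1", "become_candidate", 2),
    ("n1", "become_leader", 2),
    ("n2", "become_follower", 2),
]
assert check_trace(trace) is None
assert check_trace(trace + [("n3", "become_leader", 2)]) \
    == "two leaders in term 2"
```

This is far weaker than model checking (it only examines executions that actually happened), but it closes part of the gap between a checked specification and an unchecked implementation.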
Why This Matters
Consensus bugs are among the most dangerous bugs in distributed systems because they violate the assumptions that everything else is built on. A bug in the application layer loses one user’s data. A bug in the consensus layer can corrupt the replicated state across all nodes, making the corruption durable and replicated — the opposite of what replication is supposed to do.
Real examples of consensus bugs that formal methods could have caught (or did catch):
| Bug | System | Impact | Found By |
|---|---|---|---|
| Raft pre-vote missing case | Multiple Raft impls | Disrupted clusters after partition healing | Testing + analysis |
| EPaxos execution ordering bug | Original EPaxos paper | Incorrect command ordering | Manual proof review, years after publication |
| ZooKeeper atomic broadcast edge case | ZooKeeper | Potential data inconsistency during leader change | Jepsen testing |
| etcd lease revocation race | etcd | Stale lease could grant access after revocation | Production incident |
| MongoDB replication rollback data loss | MongoDB | Committed writes lost during rollback | Jepsen testing |
The trend toward formal verification in consensus is not about theoretical purity — it’s about preventing these bugs. As consensus-based systems handle increasingly critical data (financial transactions, medical records, infrastructure control), the cost of consensus bugs increases, and the investment in formal verification becomes more justified.
The Practical Compromise
Full formal verification of a consensus implementation remains expensive — roughly 4-10x the development cost. For most teams, this isn’t practical. The practical compromise that’s emerging is a layered approach:
- Formally specify the protocol in TLA+ or a similar specification language
- Model-check the specification to find protocol-level bugs
- Test the implementation against the specification using trace checking or conformance testing
- Chaos-test the deployment using tools like Jepsen, Chaos Monkey, or Toxiproxy
- Monitor invariants in production (e.g., alert if a follower’s committed index exceeds the leader’s — this should never happen)
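The last layer can be as simple as a periodic check over scraped metrics. A hedged sketch of the committed-index invariant from the list above; the metric layout is invented for illustration, and a real monitor would feed an alerting system rather than return strings:

```python
# Periodic invariant check: no follower's commit index may exceed the
# leader's. A violation indicates a consensus-layer bug, not a normal
# operational condition.

def check_commit_invariant(metrics):
    """metrics: {node: {"role": str, "commit_index": int}}, one scrape."""
    leaders = [m for m in metrics.values() if m["role"] == "leader"]
    if len(leaders) != 1:
        return None  # mid-election or partial scrape; skip this round
    leader_commit = leaders[0]["commit_index"]

    alerts = [
        f"{node}: commit_index {m['commit_index']} > leader's {leader_commit}"
        for node, m in metrics.items()
        if m["role"] == "follower" and m["commit_index"] > leader_commit
    ]
    return alerts or None

healthy = {
    "n1": {"role": "leader", "commit_index": 42},
    "n2": {"role": "follower", "commit_index": 42},
    "n3": {"role": "follower", "commit_index": 40},
}
assert check_commit_invariant(healthy) is None
```

Checks like this cost almost nothing to run and turn a "should never happen" assumption into something that pages a human the moment it stops holding.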
Each layer catches a different class of bugs at a different cost. Together, they provide much stronger assurance than testing alone, without the full cost of formal verification.
The Eternal Tension
Throughout this book, we’ve encountered the same tension repeatedly: the tension between theoretical elegance and production engineering.
The theory gives us impossibility results that define what can’t be done (FLP, CAP). It gives us protocols with provable safety and liveness properties (Paxos, PBFT, HotStuff). It gives us lower bounds on message complexity and fault tolerance. This theory is valuable — it prevents us from attempting the impossible and gives us confidence that our protocols are correct.
But the theory doesn’t build systems. The gap between a protocol on paper and a protocol in production is filled with engineering decisions that the theory doesn’t address: how to do state transfer, when to take snapshots, how to handle disk failures, what to do when the clock is wrong, how to upgrade without downtime, how to monitor health, how to debug a stuck consensus group, how to explain to management why the system is unavailable when two out of five nodes are down even though “we have replication.”
This gap is not a failure of theory or of engineering — it’s an inherent consequence of the fact that distributed systems exist at the intersection of mathematics and the physical world. The mathematics is clean. The physical world is not. Consensus algorithms are our best attempt to bridge the two, and the difficulty of that bridge is why they continue to cause agony.
Final Thoughts
When I started writing this book, I expected to arrive at a neat conclusion: here’s the best consensus algorithm, here’s when to use it, here’s how the field will evolve. What I found instead is that consensus — like most problems that have occupied great minds for decades — doesn’t have a neat conclusion. It has tradeoffs, context-dependent recommendations, and a healthy dose of “it depends.”
Here’s what I do believe, after spending far too long thinking about this:
Consensus will remain necessary. The arguments for eventually consistent alternatives are valid but bounded. As long as systems need leaders, locks, transactions, or configuration agreement, they need consensus in some form.
Raft will remain the default. Not because it’s optimal, but because it’s understood. The next decade’s consensus innovations will likely be built as optimizations on top of Raft-like foundations rather than as replacements for them.
Hardware will change the performance picture. RDMA, SmartNICs, and programmable switches will push consensus latency into the microsecond range for systems that can use them. This will enable new use cases but won’t eliminate the fundamental constraints.
The abstraction boundary will move up. Fewer teams will implement consensus protocols directly. More teams will use consensus through libraries, services, and managed offerings. This is unambiguously good — the fewer people who have to worry about the details of leader election, the fewer production incidents caused by getting leader election wrong.
The agony will continue. The gap between understanding consensus in theory and operating it in production will not close. New engineers will continue to discover, with dismay, that “just use Raft” is the beginning of the journey, not the end. They’ll struggle with leader elections, membership changes, state transfer, and the thousand other details that make consensus hard in practice.
And that’s fine. Not every problem should be easy. The agony of consensus algorithms is the price we pay for making multiple computers agree on something, and that capability — fragile, expensive, and maddening as it is — underpins nearly every reliable distributed system in existence. The computers that manage your bank account, route your network traffic, store your data, and coordinate your infrastructure all rely on some form of consensus. The fact that it works at all, given the theoretical impossibility results and the practical engineering challenges, is a minor miracle of computer science.
It’s a miracle that causes a lot of suffering. But it’s a miracle nonetheless.
If this book has done its job, you now understand not just the algorithms themselves, but the landscape in which they operate — the tradeoffs, the failure modes, the gap between theory and practice, and the reasons why every choice involves giving something up. The agony of consensus is not a problem to be solved. It is a condition to be managed, with clear eyes, good tools, and a healthy respect for the difficulty of making machines agree.
Good luck. You’ll need it.