When to Use What (and When to Give Up)
This is the chapter I wish had existed when I started building distributed systems. Not the theory — the theory is in every textbook. What’s missing is someone who’s been through the decision process multiple times saying, plainly, “given your constraints, here’s what to use and here’s why.”
So that’s what this chapter is. A decision framework built from scar tissue.
The Questions You Should Ask First
Before you even look at consensus algorithms, you need to answer some questions about your system. Most teams skip this step, jumping straight to “we’ll use Raft because we’ve heard of it.” That’s not always wrong (Raft is a fine default), but it’s not engineering — it’s reflex.
Question 1: What Is Your Failure Model?
This is the single most important question and the one that narrows the field the fastest.
Crash faults only: Your nodes may crash and restart, but when they’re running, they behave correctly. They don’t lie, they don’t send conflicting messages to different peers, they don’t get compromised. This is the assumption for Paxos, Raft, Zab, VR, EPaxos, Flexible Paxos, and Kafka ISR.
Byzantine faults: Your nodes may behave arbitrarily. They can send wrong data, refuse to participate, collude with other malicious nodes, or actively try to subvert the protocol. This is the assumption for PBFT, HotStuff, and Tendermint.
Here’s the practical test: do all your nodes run the same software, managed by the same team, in infrastructure you control? If yes, crash faults are almost certainly sufficient. If you’re running a multi-organization system, a public blockchain, or anything where a compromised node could lie to its peers, you need Byzantine tolerance.
The honest addendum: even “crash faults only” is an idealization. Disk corruption, memory bit-flips, and kernel bugs can cause nodes to behave incorrectly without crashing. Some teams add application-level checksums on top of a CFT protocol rather than paying the full BFT overhead. This is pragmatic but not formally justified, so don’t put it in your design document unless you’re prepared for that conversation.
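If you do go the pragmatic route, the shape of those application-level checksums is simple: seal each entry with a hash before handing it to the replication layer, and verify before applying. Here's a minimal sketch (the names are mine, not from any particular system):

```python
import hashlib
import json

def seal(entry: dict) -> dict:
    """Attach a SHA-256 checksum computed over the entry's canonical encoding."""
    payload = json.dumps(entry, sort_keys=True).encode()
    return {"entry": entry, "sha256": hashlib.sha256(payload).hexdigest()}

def verify(sealed: dict) -> bool:
    """Recompute the checksum before applying; a mismatch means the entry was
    corrupted somewhere between seal() and here (disk, memory, network)."""
    payload = json.dumps(sealed["entry"], sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest() == sealed["sha256"]
```

This catches silent corruption, not malice: an attacker who can alter the entry can recompute the hash. That's exactly the distinction between this approach and real BFT.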
Question 2: What Is Your Network Topology?
Single data center, low latency (< 1ms between nodes): You’re in the best case. Any protocol will perform well. Leader-based protocols have minimal latency overhead because leader-to-follower round-trips are fast.
Multi-data center, moderate latency (10-100ms between DCs): Leader placement matters enormously. If your leader is in US-East and your client is in US-West, every write pays a cross-country round-trip. Consider protocols that support follower reads (Raft read leases, Zab sync’d reads) or multi-leader approaches. Or accept the latency and put the leader near the majority of clients.
Global, high latency (100-300ms between regions): This is where consensus hurts the most. Two message delays at 200ms per hop means 400ms minimum commit latency. EPaxos can help if conflicts are rare (the nearest replica handles the request). Flexible Paxos with a commit quorum located near the clients can help. But honestly, at this scale, you should strongly consider whether you actually need synchronous consensus or whether eventual consistency with conflict resolution would suffice. We’ll come back to this.
Unreliable network (frequent partitions, packet loss, high jitter): All consensus protocols degrade under adverse network conditions, but some degrade more gracefully than others. Raft’s leader election, based on randomized timeouts, handles partition healing reasonably well. PBFT’s view change under persistent network instability can be painful. Kafka’s ISR model gracefully handles transient slow brokers by shrinking the ISR.
Question 3: What Is Your Read/Write Ratio?
Write-heavy (> 50% writes): The leader is under constant pressure. Throughput is limited by the leader’s ability to replicate. Batching becomes essential. Consider Kafka ISR (designed for this), or partitioning your data so different partitions have different leaders.
Read-heavy (> 90% reads): The consensus protocol is involved in a small fraction of operations. The interesting question is how you serve reads: from the leader only (simple, consistent, bottleneck), from any replica (fast, potentially stale), or from followers with read leases (a compromise). Raft and Zab both support linearizable reads via the leader and eventually consistent reads from followers. If you’re at 99% reads, the consensus algorithm barely matters — your read path architecture matters more.
Mixed with hot keys: The worst case. If a small number of keys receive a disproportionate share of writes, even partitioning doesn’t help (the hot partition’s leader is still a bottleneck). EPaxos can help with the non-hot keys, but the hot keys still serialize. This is a fundamental problem that no consensus algorithm solves — you need application-level solutions (write combining, buffering, CRDTs for commutative operations).
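Write combining, for the record, can be as simple as absorbing commutative updates locally and submitting one merged command per flush interval. A sketch, assuming a `propose` callback (my invention here) that submits a single command to your consensus layer:

```python
from collections import defaultdict

class WriteCombiner:
    """Coalesce commutative updates (here: counter increments) to hot keys
    so that one consensus proposal carries many client operations."""
    def __init__(self, propose):
        self.propose = propose        # submits one command to consensus
        self.pending = defaultdict(int)

    def increment(self, key, delta=1):
        self.pending[key] += delta    # absorbed locally; no consensus round yet

    def flush(self):
        """Call on a timer; one proposal covers the whole interval's writes."""
        batch = dict(self.pending)
        self.pending.clear()
        if batch:
            self.propose(("apply_deltas", batch))
        return batch
```

The tradeoff is explicit: writes absorbed but not yet flushed are lost if the combining node crashes, so this only works for operations where that window is acceptable.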
Question 4: How Many Nodes?
3 nodes (f=1): The minimum for any meaningful fault tolerance. Raft, Zab, and Kafka ISR all work well here. BFT protocols need at least 4 nodes (for f=1), so this rules them out.
5 nodes (f=2): The sweet spot for most deployments. Handles two simultaneous failures, which covers rolling upgrades with one unexpected failure. This is the most common configuration for etcd, ZooKeeper, and most consensus deployments.
7-13 nodes: Uncommon for CFT protocols (diminishing returns, increased replication overhead), but normal for BFT deployments where you need higher fault tolerance.
Hundreds or thousands of nodes: You’re either building a blockchain or doing something unusual. Classical consensus protocols don’t scale to this range. You need either a hierarchical approach (consensus within small groups, coordination between groups) or a protocol specifically designed for large validator sets (e.g., HotStuff variants used in blockchain systems).
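The arithmetic behind these node counts is short enough to keep at hand. CFT protocols tolerate f failures with 2f+1 nodes (any two majority quorums intersect); BFT protocols need 3f+1 (quorums must intersect in at least one honest node):

```python
def cft_nodes(f: int) -> int:
    """Minimum cluster size to tolerate f crash faults (majority quorums)."""
    return 2 * f + 1

def bft_nodes(f: int) -> int:
    """Minimum cluster size to tolerate f Byzantine faults."""
    return 3 * f + 1

def majority(n: int) -> int:
    """Quorum size for a CFT cluster of n nodes."""
    return n // 2 + 1
```

This is why 3 nodes gives you f=1 for Raft but rules out BFT (which needs 4), and why 5 nodes is the CFT sweet spot: f=2, quorum of 3.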
Question 5: Can You Tolerate a Leader Bottleneck?
If yes (which is the case for most systems with moderate throughput requirements), use a leader-based protocol. They’re simpler to implement, simpler to reason about, and simpler to debug.
If no, your options are:
- EPaxos — eliminates the leader but adds complexity in dependency tracking
- Partitioned leader (Kafka model) — different leaders for different partitions
- Read-only replicas with consensus only for writes — offloads reads but writes still go through a leader
- Give up on consensus — use eventual consistency with CRDTs or last-writer-wins
Most teams that think they can’t tolerate a leader bottleneck actually can, once they add batching. A Raft leader on modern hardware with NVMe storage and 10Gbps networking can handle 50,000+ writes per second with batching. If you need more than that on a single consensus group, you likely need to partition regardless.
Question 6: What Are Your Durability Requirements?
This question is less about the consensus protocol and more about how you configure it, but it has important implications.
Every write must be durable before acknowledgment: This is the “correct” mode. Every node must fsync before responding. This is what the proofs assume. Latency cost: 0.05-0.2ms per fsync on NVMe, 5-15ms on spinning disk.
Batch durability (fsync every N ms): Many systems, including Kafka by default, batch fsync calls. This means a crash within the batch window loses data. If you’re okay with this (and for many use cases, you should be), it dramatically improves throughput.
Replication is sufficient (no fsync): If you replicate to three nodes and assume they won’t all crash simultaneously, you can skip fsync entirely and rely on replication for durability. This is formally incorrect (a correlated failure — like a data center power outage — can lose data) but is used in practice by systems that prioritize throughput over absolute durability.
The choice here doesn’t change which protocol you pick, but it changes what performance you get from it. Quoting throughput numbers without specifying durability settings is meaningless.
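To make the batch-durability option concrete, here's a minimal group-commit sketch: appends are cheap buffer writes, and a single fsync makes the whole batch durable at once. Illustrative only; a real log would also frame and checksum each record.

```python
import os

class GroupCommitLog:
    """Append records individually but fsync once per batch. A crash loses at
    most the unsynced tail, in exchange for amortizing the fsync cost across
    every record in the batch."""
    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.unsynced = 0

    def append(self, record: bytes):
        """Buffered write; NOT durable until the next commit()."""
        self.f.write(record + b"\n")
        self.unsynced += 1

    def commit(self) -> int:
        """Make everything appended so far durable with one fsync.
        Returns the number of records covered by this single fsync."""
        self.f.flush()
        os.fsync(self.f.fileno())
        n, self.unsynced = self.unsynced, 0
        return n
```

If an fsync costs 0.1ms and a batch covers 100 records, the per-record durability cost drops to a microsecond, which is where the throughput numbers in vendor benchmarks come from.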
The Decision Tree
Here’s the decision framework, presented as a series of branching decisions. Follow the path that matches your requirements.
START: Do you need Byzantine fault tolerance?
│
├── YES: Are you building a blockchain or multi-organization system?
│ │
│ ├── YES (public blockchain): How many validators?
│ │ ├── < 100: Tendermint or HotStuff variant
│ │ └── > 100: HotStuff variant (linear message complexity essential)
│ │
│ ├── YES (private/consortium): How many organizations?
│ │ ├── < 10: PBFT is viable, Tendermint also works
│ │ └── > 10: HotStuff or Tendermint
│ │
│ └── NO (single org, but paranoid about compromised nodes):
│ └── Consider adding integrity checks on top of a CFT protocol
│ instead of paying the full BFT tax. It's not formally
│ sound but it's practical.
│
└── NO: What's your primary use case?
│
├── Metadata store / coordination service (low volume, strong consistency)
│ └── Use an existing system:
│ ├── etcd (Raft-based, good K8s integration)
│ ├── ZooKeeper (Zab-based, mature, Java ecosystem)
│ └── Consul (Raft-based, service mesh features)
│ DON'T build your own.
│
├── Distributed lock / leader election
│ └── Use an existing system:
│ ├── etcd with client library
│ ├── ZooKeeper with recipes
│ └── Redis (if you accept the caveats about Redlock)
│ DEFINITELY don't build your own.
│
├── Replicated database (state machine replication)
│ │
│ ├── Single-region:
│ │ ├── < 50K writes/sec: Raft (most tooling, best understood)
│ │ └── > 50K writes/sec: Partition across multiple Raft groups
│ │
│ └── Multi-region:
│ ├── Can tolerate leader in one region: Raft with geo-aware leader
│ ├── Need low-latency writes everywhere: EPaxos or Flexible Paxos
│ └── Can tolerate eventual consistency: CRDTs or last-writer-wins
│
├── Message queue / event streaming
│ └── Use Kafka or a Kafka-like system (Redpanda, Pulsar)
│ The ISR model is designed for this. Classical consensus
│ is overkill for append-only logs with consumer groups.
│
├── Replicated log / write-ahead log
│ ├── Raft (simplest mental model: the log IS the consensus)
│ └── Multi-Paxos (if you need gap tolerance for performance)
│
└── Something else:
└── Start with Raft. Seriously. You can always switch later
(you won't switch later, but the option is comforting).
Specific Recommendations
Let’s go through common scenarios with concrete recommendations.
Scenario: You’re Building a Metadata Store
Use case: Storing configuration, service discovery information, cluster membership, leader election state. Low write volume (tens to hundreds of writes per second), strong consistency required, high read volume.
Recommendation: Don’t build one. Use etcd or ZooKeeper.
If you’re in the Kubernetes ecosystem, etcd is the obvious choice — it’s what Kubernetes itself uses, it has good client libraries, and there’s a large community that has collectively discovered and fixed most of the operational foot-guns.
If you’re in the Java/JVM ecosystem, ZooKeeper is battle-tested at massive scale (it runs the configuration layer at LinkedIn, Twitter/X, and hundreds of other companies). Its API is quirky (ephemeral nodes, sequential znodes, watches with session semantics), but the recipes built on top of it (distributed locks, leader election, group membership) work.
When to break this recommendation: When your requirements include very low latency (sub-millisecond), very high read throughput (millions/sec), or you’re in an environment where running etcd or ZooKeeper is operationally impractical (embedded systems, resource-constrained environments). In these cases, embedding a Raft library (like hashicorp/raft in Go or openraft in Rust) into your application might be justified. But understand that you’re taking on the maintenance burden of a consensus-based system, and that burden is significant.
Scenario: You’re Building a Distributed Lock Service
Use case: Mutual exclusion across distributed processes. Correctness is critical — two processes holding the same lock simultaneously means data corruption.
Recommendation: Use etcd or ZooKeeper leases/ephemeral nodes.
Please do not build your own distributed lock using Redis. I know about Redlock. Martin Kleppmann wrote a detailed analysis of why Redlock doesn’t provide the guarantees it claims, and Salvatore Sanfilippo (Redis’s creator) wrote a rebuttal, and Kleppmann wrote a rebuttal to the rebuttal, and the fundamental issue remains: Redis is not a consensus system, and bolting lock semantics onto a non-consensus system requires assumptions about timing that don’t hold in practice.
If you need a distributed lock and you need it to actually be correct (not “correct unless a GC pause happens at the wrong time”), use something backed by a real consensus protocol.
When to break this recommendation: When a lock violation is annoying but not catastrophic. If the worst case of two holders is “we process a message twice” rather than “we corrupt financial records,” then Redlock-style approaches or even optimistic locking with CAS operations might be good enough. “Good enough” is underrated in engineering.
An important nuance: distributed locks have a fencing problem that most implementations ignore. Even with a consensus-backed lock, a process can acquire the lock, pause (GC, swap, CPU scheduling), and resume after the lock has expired and been acquired by another process. The original holder doesn’t know the lock has expired. The solution is fencing tokens — monotonically increasing tokens issued with each lock acquisition, which downstream resources check before accepting operations. Without fencing tokens, your distributed lock is a suggestion, not a guarantee. ZooKeeper’s sequential znodes provide this naturally. Most other lock implementations require you to build it yourself.
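The fencing check itself is tiny; the hard part is making every downstream resource actually perform it. A sketch of the resource side (names are mine):

```python
class StaleToken(Exception):
    pass

class FencedStore:
    """A downstream resource that checks fencing tokens: it remembers the
    highest token it has seen and rejects writes carrying anything lower."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key, value):
        if token < self.highest_token:
            # A client that paused and resumed after losing its lock lands here.
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value
```

The failure it prevents: client A acquires the lock with token 33 and pauses; client B acquires it with token 34 and writes; A resumes and tries to write with token 33, and the store rejects it instead of silently clobbering B's work.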
Scenario: You’re Building a Replicated Database
Use case: A database where writes are replicated to multiple nodes for fault tolerance, and clients need linearizable (or at least serializable) reads.
Recommendation: Raft, embedded via a library.
This is the use case that CockroachDB, TiDB, and YugabyteDB all chose Raft for. The reasons are practical:
- Raft has the best documentation and most reference implementations of any consensus protocol
- The log-based model maps directly to write-ahead logging, which every database already has
- Per-range Raft groups (CockroachDB’s approach) let you scale by partitioning while keeping per-partition consensus simple
- Leader leases enable low-latency reads without consensus
If you’re building a multi-region database and can’t tolerate the latency of going to a single leader, consider EPaxos — but only if you have the engineering depth to implement it correctly and the workload characteristics (low conflict rate) to benefit from it.
When to break this recommendation: If your database can tolerate eventual consistency for reads (with read-your-writes or causal consistency guarantees), you may not need consensus for the read path at all. Use consensus for writes and a gossip-based or anti-entropy protocol for read replicas. This is essentially what Amazon DynamoDB does — strong consistency is available but costs more because it goes through the consensus path, while eventually consistent reads go to any replica.
Scenario: You’re Building a Message Queue
Use case: Durable message delivery with ordering guarantees, consumer groups, replay capability.
Recommendation: Use Kafka. Or Redpanda. Or Pulsar. Do not build your own.
The temptation to build “a lightweight Kafka” is one of the most dangerous impulses in distributed systems engineering. You start with “we just need a simple pub/sub with persistence” and end up spending two years reimplementing consumer group coordination, exactly-once semantics, and partition rebalancing.
Kafka’s ISR model is specifically designed for the append-only, high-throughput, per-partition-ordering use case. It’s not academically elegant, but it’s operationally proven at scales that would make most consensus protocols weep.
When to break this recommendation: When you need stronger ordering guarantees than per-partition ordering (total order across all messages), when your messages are very small and Kafka’s per-message overhead is significant, or when you’re in an environment where running a Kafka cluster is impractical (embedded, edge, very small scale). In the last case, consider an embedded Raft log.
Scenario: You’re Building a Blockchain
Use case: A distributed ledger with Byzantine fault tolerance, potentially with untrusted participants.
Recommendation: Tendermint (for application-specific chains) or a HotStuff variant (for high-throughput chains).
Tendermint’s ABCI (Application BlockChain Interface) is a clean separation between consensus and application logic — you implement a state machine, Tendermint handles the consensus. This is genuinely well-designed and saves you from implementing BFT consensus yourself.
For permissionless blockchains with large validator sets, you need something with linear message complexity — HotStuff or its descendants. PBFT’s quadratic overhead makes it impractical beyond ~20 validators.
When to break this recommendation: When your “blockchain” is actually a permissioned system among a small number of known organizations. In that case, ask yourself whether you really need a blockchain or whether a replicated database with audit logging would suffice. The answer is usually the latter, but admitting that doesn’t generate as much funding.
Scenario: You’re Building a Geo-Distributed System
Use case: A system spanning multiple geographic regions, where users in each region expect low-latency access.
Recommendation: It depends on your consistency requirements, and this is where you need to be brutally honest with yourself.
Here’s the latency reality:
| Source → Destination | Round-trip Latency |
|---|---|
| Same data center | 0.1-0.5 ms |
| Same region, different AZ | 1-3 ms |
| US-East ↔ US-West | 60-80 ms |
| US-East ↔ EU-West | 80-100 ms |
| US-East ↔ Asia-Pacific | 150-250 ms |
A consensus commit with two message delays at 200ms round-trip latency is 400ms minimum. That’s visible to users. Your options:
- Accept the latency. Put the leader in the region with the most users, accept that other regions pay cross-region latency for writes. Use follower reads (with stale reads) for the read path. This is the simplest option and often the right one.
- Use EPaxos or Flexible Paxos. EPaxos can commit locally for non-conflicting operations. Flexible Paxos can be configured with commit quorums biased toward specific regions. Both reduce latency for some operations at the cost of complexity.
- Use per-region consensus with cross-region reconciliation. Run independent consensus groups in each region and reconcile asynchronously. This requires application-level conflict resolution and is essentially eventual consistency with extra steps.
- Use CRDTs or eventual consistency. If your data model permits it (counters, sets, LWW registers), skip consensus for the write path and use CRDTs. Convergence is guaranteed without coordination. This is the approach used by Riak, Redis CRDTs, and many mobile/edge systems.
- Give up on geo-distribution. Run everything in one region. Use a CDN for read-heavy content. Accept that users far from the region see higher latency. This is what most systems actually do, and there’s no shame in it.
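To make the CRDT option concrete, the classic grow-only counter (G-counter) shows why CRDTs converge without coordination: each replica increments only its own slot, and merge takes the per-slot maximum, an operation that is commutative, associative, and idempotent. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Replicas can increment concurrently and merge
    in any order, any number of times, and still converge to the same value."""
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.slots = {}   # replica_id -> count contributed by that replica

    def increment(self, n: int = 1):
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter"):
        """Per-slot max: duplicated or reordered merges cannot overcount."""
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.slots.values())
```

No quorum, no leader, no round-trips on the write path; the cost is that only data models expressible this way (counters, sets, registers with a defined merge) get the benefit.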
Scenario: You Need Exactly-Once Semantics
Use case: Each client operation must be applied exactly once, even in the presence of retries, leader changes, and network duplicates.
Recommendation: Build idempotency into your application layer, regardless of which consensus protocol you use.
No consensus protocol gives you exactly-once semantics out of the box. They give you at-most-once (if you don’t retry) or at-least-once (if you do retry). Exactly-once requires the application to deduplicate, typically by assigning unique IDs to operations and tracking which IDs have been applied.
Kafka’s exactly-once semantics (introduced in KIP-98, refined over several releases) is the most mature implementation of this in the consensus-adjacent space. It works through idempotent producers (per-partition sequence numbers) and transactional writes (two-phase commit across partitions). It took the Kafka team years to get right, which should calibrate your expectations for implementing it yourself.
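The application-level deduplication described above usually reduces to a table of applied operation IDs updated atomically with the state itself; if the two can diverge, a crash between them reintroduces duplicates. A sketch, with an in-memory dict standing in for durable state:

```python
class IdempotentStateMachine:
    """Deduplicate retried operations by client-assigned ID. The applied-IDs
    table and the state live in one object, standing in for 'updated in the
    same transaction', so replaying a log entry is a no-op."""
    def __init__(self):
        self.balance = 0
        self.applied = {}   # op_id -> result returned to the client

    def apply(self, op_id: str, amount: int) -> int:
        if op_id in self.applied:
            # Retry or replayed log entry: return the cached result,
            # do not apply the effect a second time.
            return self.applied[op_id]
        self.balance += amount
        self.applied[op_id] = self.balance
        return self.balance
```

In a real system the applied-IDs table also needs expiry (per-client sliding windows, as in Kafka's per-partition producer sequence numbers) or it grows without bound.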
The “Just Use etcd/ZooKeeper” Advice
You’ll hear this advice a lot: “just use etcd” or “just use ZooKeeper.” It’s often good advice. Let’s examine when it is and when it isn’t.
When It’s Good Advice
- You need a coordination primitive (locks, leader election, configuration) and your write volume is low (< 1000 writes/sec)
- You’re already running etcd (e.g., for Kubernetes) and adding another use case is operationally simple
- Your data fits in memory (ZooKeeper keeps its entire data tree in RAM; etcd memory-maps its backing store, so the working set should fit in memory)
- You need strong consistency and don’t want to think about consensus protocol details
When It’s Lazy Advice
- Your data is large (GB+). etcd and ZooKeeper are not databases — they’re coordination services. etcd caps request sizes (1.5MB by default) and recommends keeping the total data store under 8GB.
- Your write volume is high. etcd’s write throughput is typically 10,000-20,000 operations per second, a ceiling imposed by Raft’s single-leader design and the write characteristics of its bbolt backend.
- You need per-key watches at massive scale. ZooKeeper’s watch mechanism doesn’t scale well to millions of watches, and etcd’s watch revision model has its own scaling constraints.
- You need complex queries. These are key-value stores with range scans, not SQL databases.
- You need different consistency levels for different operations. etcd and ZooKeeper give you strong consistency for everything, which is great for correctness and wasteful for operations that don’t need it.
The Hidden Costs
Even when “just use etcd” is the right answer, the hidden costs are worth acknowledging:
Operational overhead. Running a consensus-based system requires monitoring (leader elections, follower lag, disk usage, snapshot frequency), capacity planning (etcd performance degrades as data size grows), and upgrade procedures (rolling upgrades of a Raft cluster require care).
Client complexity. Using etcd or ZooKeeper correctly requires understanding sessions, leases, watches, and their failure modes. A ZooKeeper client that doesn’t handle session expiry correctly can hold a lock it no longer owns. An etcd client that doesn’t handle lease revocation can operate on stale data.
Blast radius. If your coordination service goes down, everything that depends on it goes down. etcd is a single point of failure for Kubernetes, and a struggling etcd cluster can take down an entire Kubernetes deployment. This is not hypothetical — it happens with distressing regularity.
When to Give Up on Consensus
Sometimes the right answer is: don’t use consensus at all.
Give Up and Use Eventual Consistency When:
- Your data model supports it. If your operations are commutative (counters), idempotent (sets), or have a natural conflict resolution (last-writer-wins with timestamps), you don’t need consensus. CRDTs formalize this — they guarantee convergence without coordination.
- Availability matters more than consistency. Consensus requires a majority quorum. If a network partition isolates a minority, that minority is unavailable. If your system must remain available during arbitrary partitions (e.g., mobile apps, edge devices, multi-region systems where “just wait for the partition to heal” isn’t acceptable), eventual consistency is the only option. This is the CAP theorem, and no amount of clever protocol design changes it.
- Latency is more important than ordering. If serving a stale read in 1ms is better than serving a fresh read in 100ms, you don’t need consensus for reads. Many systems use consensus for writes but serve reads from local replicas without coordination.
- Your operations are naturally partitioned and independent. If user A’s data never interacts with user B’s data, you can use per-user consensus (or per-user single-node ownership) without global consensus. This is the fundamental insight behind sharding, and it’s more widely applicable than people think.
Give Up and Use a Single Node When:
I’m serious about this one. The most underrated distributed systems strategy is “don’t distribute.”
A single modern server with:
- 128 cores
- 1TB RAM
- NVMe storage doing 1M+ IOPS
- 100Gbps networking
…can handle more load than most systems will ever see. A well-optimized single-node database can serve 100,000+ write transactions per second. If your system’s total throughput requirement is below this (and most are), a single node with good backups provides:
- Perfect consistency — no consensus needed, one node is the source of truth
- Lowest possible latency — no network hops for commits
- Simplest possible operations — no quorum management, no leader election, no split-brain
- Recovery via restore from backup — your RTO is “how fast can you restore a snapshot and replay a WAL,” which for most systems is minutes, not hours
The downsides are real: no automatic failover (you need a human or a script to promote a standby), and if the node truly dies (disk failure, motherboard failure), you lose data since the last backup. But for many systems, “5 minutes of downtime during failover + potential loss of a few seconds of data” is an acceptable tradeoff for eliminating all consensus complexity.
The psychological barrier to this approach is that it feels insufficiently distributed. We’ve been trained to believe that single points of failure are always unacceptable. But a single point of failure with 99.99% uptime (52 minutes of downtime per year) and a clear recovery procedure is often better than a distributed system with 99.9% uptime (8.7 hours of downtime per year, distributed across a dozen confusing partial-failure scenarios that your on-call engineer has never seen before).
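Those downtime figures are plain arithmetic, worth rerunning against your own SLO before you pay for another nine:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

# 0.9999 -> ~52.6 minutes/year; 0.999 -> ~526 minutes (~8.8 hours)/year
```

The point of running the numbers is the comparison in the paragraph above: a simple system that reliably hits 99.99% beats a complex one that nominally targets more but delivers 99.9% in practice.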
Give Up and Use a Managed Service
The most underrated option of all: let someone else suffer.
| Need | Managed Service | What You Get |
|---|---|---|
| Metadata/coordination | Amazon DynamoDB (strong consistency mode), Google Cloud Spanner, Azure Cosmos DB | Consensus-backed storage without managing consensus |
| Message queue | Amazon Kinesis, Google Pub/Sub, Azure Event Hubs, Confluent Cloud (managed Kafka) | Kafka-like semantics without Kafka operations |
| Distributed lock | Amazon DynamoDB (conditional writes), Google Cloud Spanner (transactions) | Lock semantics via transactional storage |
| Replicated database | Google Cloud Spanner, CockroachDB Cloud, Amazon Aurora | Consensus-backed SQL without the agony |
| Blockchain | Hyperledger on managed services, various BaaS offerings | BFT consensus without the BFT operations |
The cost of a managed service is money and vendor lock-in. The cost of running your own consensus system is engineering time, on-call burden, and career years lost to debugging leader elections at 3 AM. For most organizations, the managed service is cheaper.
This isn’t a cop-out — it’s a recognition that implementing and operating consensus-based systems is genuinely hard, and that unless consensus is your core business, your engineering effort is better spent on the things that differentiate your product.
Common Mistakes in Protocol Selection
Before we close with the final framework, let’s catalog the mistakes I’ve seen teams make when choosing a consensus protocol. Each of these has cost real engineering time and real production incidents.
Mistake 1: Choosing Based on Microbenchmarks
“EPaxos is 2x faster than Raft in this benchmark, so we should use EPaxos.” The benchmark used 0-byte payloads, a conflict-free workload, and three nodes in the same rack. Your workload has 4KB payloads, 15% key contention, five nodes across two data centers, and a state machine that takes 2ms to apply a command. The protocol isn’t your bottleneck, and the benchmark doesn’t represent your workload.
Mistake 2: Over-Engineering the Fault Model
“We need Byzantine fault tolerance because what if a node gets hacked?” If all your nodes run the same software, managed by the same team, in the same cloud account, a compromised node has bigger implications than consensus failure. You probably need better infrastructure security, not BFT. The 3x overhead of BFT (in both nodes and messages) is a steep price for a threat model that doesn’t match your actual risks.
Mistake 3: Under-Engineering the Fault Model
The opposite mistake: “crash faults are fine because our nodes never behave byzantinely.” But then you run on cloud VMs with ephemeral storage, and a VM migration causes data corruption that your CFT protocol interprets as valid. Or you run on machines with faulty ECC memory, and bit-flips cause state divergence between replicas. The crash-fault model assumes that non-crashed nodes are correct. If your environment doesn’t guarantee this, you need additional safeguards (checksums, validation, periodic consistency checks) even with a CFT protocol.
Mistake 4: Ignoring Operational Complexity
“We’ll implement Multi-Paxos because it has better theoretical properties than Raft.” Have you considered who will operate it? Who will debug it at 3 AM? Who will explain to the next team member how it works? The team that can’t explain their consensus protocol to a new hire has a protocol they can’t safely operate.
Mistake 5: Premature Distribution
“We need consensus because we need high availability.” Do you? What’s your actual uptime requirement? If 99.9% is sufficient (8.7 hours of downtime per year), a single node with automated failover to a standby might achieve that. Consensus gives you sub-second failover, which is necessary for 99.99%+ uptime but overkill for 99.9%.
Mistake 6: Forgetting About Reads
Teams spend months optimizing the write path (consensus protocol, replication, durability) and then serve all reads from the leader. If your workload is 95% reads, the read path is 20x more important than the write path. Linearizable reads from the leader, read leases from followers, stale reads from any replica — these are the decisions that determine your system’s practical performance, and they’re largely orthogonal to the choice of consensus protocol.
Mistake 7: Assuming the Paper Is Complete
Every consensus protocol paper omits details that are essential for production. Snapshotting, state transfer, log compaction, membership changes, client interaction, exactly-once semantics — some papers address some of these, but no paper addresses all of them. Budget 2-3x the implementation effort you estimate from reading the paper. If you’re new to consensus implementation, budget 5x.
The Final Framework
If you’ve made it this far and still aren’t sure, here’s the simplest possible decision framework:
1. Can you avoid distribution entirely? Use a single node with backups.
2. Can you use a managed service? Do that.
3. Can you use an existing system (etcd, ZooKeeper, Kafka)? Do that.
4. Must you implement consensus yourself? Use Raft. Not because it’s the best protocol for every situation, but because it has the most documentation, the most reference implementations, the most battle-tested libraries, and the most community knowledge about what goes wrong. The theoretical advantages of other protocols rarely outweigh the practical advantages of Raft’s ecosystem.
5. Does Raft not work for your specific requirements? Now, and only now, should you consider alternatives:
   - Need BFT → Tendermint or HotStuff
   - Need leaderless → EPaxos
   - Need tunable quorums → Flexible Paxos
   - Need high-throughput append-only log → Kafka ISR model
The fact that step 5 exists for a handful of scenarios doesn’t change the reality that most teams should stop at steps 1 through 4. The agony of consensus is real, but it’s optional for most of us. The trick is knowing whether you’re one of the few who genuinely need to experience it firsthand — or whether you can learn from those who already have and use their implementations instead.
A Decision Checklist
For the engineer who needs to make a decision by the end of the week, here’s the checklist version:
- Have I confirmed that I actually need distributed consensus? (Not caching, not pub/sub, not eventual consistency — actual consensus?)
- Have I checked whether a managed service solves my problem?
- Have I checked whether an existing open-source system (etcd, ZooKeeper, Kafka) solves my problem?
- Have I documented my failure model (crash vs. Byzantine)?
- Have I measured my expected write throughput and confirmed it exceeds what a single node can handle?
- Have I measured my latency requirements and confirmed they’re compatible with consensus round-trips?
- Have I considered how I’ll handle the operations burden (monitoring, upgrades, debugging)?
- Have I identified who on my team can debug consensus issues in production?
- Have I allocated time for testing with a tool like Jepsen or at least a chaos testing framework?
- Have I accepted that this will be harder than I think?
If you can check all of these boxes, you’re better prepared than 90% of teams that embark on building consensus-based systems. If you can’t check the last one, go back and check it. It will be harder than you think. It always is.