Failure Models and Assumptions
Every distributed systems paper begins with a model. This is not academic throat-clearing — the model is the contract between you and reality. Get it wrong, and your beautifully proven protocol provides guarantees about a universe that does not exist. Get it right, and you have a fighting chance.
The model specifies two things: what kinds of failures can occur, and what timing assumptions the system provides. These two axes — failure model and synchrony model — form a grid, and your position on that grid determines which consensus protocols are available to you, how many nodes you need, and how much performance you sacrifice.
Most engineers pick a model implicitly, by choosing a consensus library and hoping the library’s authors made reasonable assumptions. This chapter is about making those assumptions explicit, because the assumptions will bite you eventually, and it is better to know where the teeth are.
The Failure Model Spectrum
Failures range from polite to adversarial. The more adversarial the failure model, the more expensive your protocol.
Crash-Stop Failures
A crash-stop process executes correctly until it fails, at which point it stops forever. It does not send corrupted messages. It does not come back to life with amnesia. It simply ceases to exist.
// A crash-stop process
procedure CRASH_STOP_NODE():
while true:
msg = receive()
response = process(msg) // always correct
send(response) // always correct
// At some point, this loop simply stops executing.
// No partial messages. No corruption. Just silence.
This is the gentlest failure model, and it is the one assumed by the original Paxos paper and by Raft. It is also the least realistic. Real nodes crash and come back. Disks persist data across restarts. Networks drop some messages but not others.
The appeal of crash-stop is that it is simple to reason about. A node is either correct (executing the protocol faithfully) or dead (not participating at all). There is no middle ground, no weird states, no partial failures. Protocols designed for crash-stop failures are elegant. They are also insufficient for real systems, which is why nobody actually deploys them — they deploy crash-recovery protocols instead.
Crash-Recovery Failures
A crash-recovery process can crash and later resume execution. The critical question is: what does it remember?
If the process has stable storage (disk) that survives crashes, it can recover its state and rejoin the protocol. If not, it is equivalent to a new process that knows nothing — potentially dangerous if other nodes think it still holds promises from before the crash.
// A crash-recovery process with stable storage
procedure CRASH_RECOVERY_NODE():
state = recover_from_disk() // may be empty on first boot
while true:
msg = receive()
response, new_state = process(msg, state)
write_to_disk(new_state) // MUST be durable before sending
fsync() // yes, really
state = new_state
send(response)
// May crash at any point. After crash:
// - state on disk is consistent (due to fsync before send)
// - any message sent was backed by durable state
// - any unsent message is safe to re-derive from disk state
The fsync() before send() ordering is critical, and getting it wrong is one of the most common bugs in consensus implementations. If you send a response before the state is durable, a crash loses the state but the recipient already acted on the response. The protocol’s invariants are violated.
This is the model that practical consensus implementations use. Raft’s persistent state — currentTerm, votedFor, and the log — must be durable before any messages are sent. The Raft paper says this clearly. Many implementations get it wrong anyway, either by batching fsync calls (correct but adds latency) or by skipping them entirely (fast but will corrupt data on power loss).
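To make the ordering concrete, here is a minimal Python sketch of the persist-then-send discipline. The function names (`persist_state`, `handle_message`) and the JSON-on-disk format are illustrative, not taken from any real implementation:

```python
import json
import os

def persist_state(path, state):
    """Write state to a temp file, fsync, then atomically rename.

    The fsync before rename ensures the bytes are on disk; the
    rename ensures recovery never sees a half-written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX

def handle_message(path, state, msg, send):
    # 1. Compute the new state and the response.
    new_state = dict(state, last_applied=msg["id"])
    response = {"ack": msg["id"]}
    # 2. Make the new state durable FIRST.
    persist_state(path, new_state)
    # 3. Only now is it safe to let the response escape.
    send(response)
    return new_state
```

A production implementation would also fsync the containing directory after the rename; this sketch omits that for brevity.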
The fsync tax. A single fsync to a spinning disk takes 5-10ms. To an SSD, 0.1-1ms. To an NVMe drive, 10-100 microseconds. This is the floor on your consensus latency. Every consensus decision requires at least one fsync on the leader and one on each follower in the majority. Many production systems “optimize” by disabling fsync and relying on battery-backed write caches or UPS systems. This works until it does not.
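You can measure the fsync floor on your own hardware with a few lines of Python. This is a rough probe, not a rigorous benchmark; dedicated tools like fio control for caching, queue depth, and write patterns:

```python
import os
import statistics
import tempfile
import time

def fsync_latency_ms(samples=50):
    """Time repeated 4 KiB append+fsync cycles, as a consensus log would."""
    fd, path = tempfile.mkstemp()
    try:
        block = b"x" * 4096
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # the durability point -- and the latency floor
            timings.append((time.perf_counter() - start) * 1000)
        return statistics.median(timings)
    finally:
        os.close(fd)
        os.unlink(path)
```

On a laptop SSD this typically reports well under a millisecond; on network-attached cloud storage it can be an order of magnitude higher.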
Omission Failures
An omission failure occurs when a process fails to send or receive a message it should have. The process itself continues executing correctly — it just misses some messages.
// A process with omission failures
procedure OMISSION_NODE():
while true:
msg = receive() // might silently drop the message
if msg != NULL:
response = process(msg) // always correct if msg received
send(response) // might silently fail to deliver
Omission failures model real network behavior: packets get dropped, TCP connections silently break, firewalls eat messages without sending RSTs. A process experiencing omission failures is harder to detect than a crashed process because it is still alive and partially functional.
Send-omission and receive-omission are sometimes distinguished. A process that can neither send nor receive is equivalent to a crashed process. A process that can send but not receive (or vice versa) creates interesting asymmetries: it may act on stale information while still appearing alive to its peers.
In practice, omission failures are often handled by treating them as crash failures with extra steps. If a node misses enough heartbeats (because it is not receiving them, or its responses are not being delivered), the rest of the cluster treats it as crashed. This is conservative but safe.
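The "treat it as crashed" policy is usually implemented as a missed-heartbeat counter. A minimal sketch in Python — the class name, the API, and the threshold of 3 are illustrative choices, not a standard:

```python
class HeartbeatMonitor:
    """Suspects (never 'knows') that a peer has failed after
    `threshold` consecutive missed heartbeat intervals."""

    def __init__(self, peers, threshold=3):
        self.missed = {p: 0 for p in peers}
        self.threshold = threshold

    def heartbeat_received(self, peer):
        self.missed[peer] = 0  # any sign of life resets suspicion

    def interval_elapsed(self):
        """Call once per heartbeat interval; returns peers newly suspected."""
        suspected = []
        for peer in self.missed:
            self.missed[peer] += 1
            if self.missed[peer] == self.threshold:
                suspected.append(peer)
        return suspected
```

Note that a node suffering receive-omission failures looks exactly like a crashed node to this monitor, which is the point: the conservative response is the same either way.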
Byzantine Failures
A Byzantine process can do anything: send contradictory messages to different nodes, lie about its state, selectively delay messages, or behave correctly for years before turning malicious at the worst possible moment.
// A Byzantine process (from the protocol's perspective)
procedure BYZANTINE_NODE():
while true:
msg = receive()
// May do any of the following:
// - Process correctly and respond honestly
// - Respond with fabricated data
// - Send different responses to different nodes
// - Respond to some nodes and not others
// - Respond after an arbitrary delay
// - Coordinate with other Byzantine nodes
// - Do nothing
response = ADVERSARY_CHOOSES()
send(response)
Byzantine fault tolerance is the strongest (and most expensive) failure model. We will cover it in depth in Chapter 4. For now, the key fact is: tolerating f Byzantine faults requires 3f+1 nodes, compared to 2f+1 for crash faults. Those extra f nodes are the price of not trusting anyone.
Most production distributed systems do not use Byzantine fault tolerance, and for good reason. Your servers are not actively trying to sabotage each other. (If they are, you have bigger problems than consensus.) Byzantine tolerance is primarily relevant in blockchain systems, multi-tenant systems with untrusted participants, and safety-critical systems where hardware bit-flips could corrupt behavior.
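The sizing arithmetic is worth internalizing. A tiny helper pair (hypothetical, for illustration only) that computes minimum cluster and quorum sizes under each model:

```python
def cluster_size(f, byzantine=False):
    """Minimum nodes to tolerate f faults: 3f+1 Byzantine, else 2f+1."""
    return 3 * f + 1 if byzantine else 2 * f + 1

def quorum_size(f, byzantine=False):
    """Votes needed per decision: 2f+1 of 3f+1, or a majority f+1 of 2f+1."""
    return 2 * f + 1 if byzantine else f + 1
```

Tolerating two faults takes 5 nodes with crash failures but 7 with Byzantine failures, and each decision needs 3 versus 5 votes.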
The Failure Model Comparison
| Property | Crash-Stop | Crash-Recovery | Omission | Byzantine |
|---|---|---|---|---|
| Nodes for f faults | 2f+1 | 2f+1 | 2f+1 | 3f+1 |
| Can node recover? | No | Yes (with stable storage) | N/A (still running) | N/A |
| Message integrity | Guaranteed | Guaranteed | Guaranteed | Not guaranteed |
| Node behavior | Correct or silent | Correct, silent, or recovering | Correct but lossy | Arbitrary |
| Real-world match | Power loss, kill -9 | Server reboot, process restart | Network issues, GC pauses | Bugs, hackers, bit-flips |
| Protocol complexity | Low | Medium | Medium | High |
| Performance overhead | Low | Medium (fsync costs) | Low | High (extra rounds, signatures) |
| Common protocols | Basic Paxos, Raft (simplified) | Multi-Paxos, Raft (production) | Paxos with retransmission | PBFT, HotStuff, Tendermint |
The Synchrony Spectrum
Orthogonal to the failure model is the timing model: what assumptions does the protocol make about how long things take?
Synchronous Model
In a synchronous system, there exist known upper bounds:
- Message delivery: Every message is delivered within delta time units.
- Processing time: Every computation step completes within phi time units.
- Clock drift: All clocks advance at the same rate (or within a known bound of each other).
These bounds are known to all participants and are never violated.
// In a synchronous system, this is safe:
procedure SYNC_FAILURE_DETECTION(node_j):
send PING to node_j
start timer(2 * delta + phi) // round-trip + processing
if receive PONG from node_j before timer expires:
node_j is alive
else:
node_j has DEFINITELY crashed // this conclusion is sound
The power of synchrony is that you can make definitive conclusions from the absence of messages. If a response does not arrive within the known bound, the sender has failed. No ambiguity.
Synchronous consensus is relatively straightforward. The classic algorithm by Dolev and Strong achieves Byzantine consensus in f+1 synchronous rounds using digital signatures, and f+1 rounds is optimal: it is a lower bound even for plain crash failures (proved by Fischer and Lynch, 1982). Without signatures, Byzantine consensus is possible at all only when n ≥ 3f+1.
The problem: real systems are not synchronous. Your “known upper bound” is a fiction. GC pauses, page faults, noisy neighbors on shared hardware, congested network links — any of these can blow past your assumed bounds. And when they do, a synchronous protocol makes incorrect conclusions (“that node is dead”) that can violate safety.
The Jepsen testing project has found numerous real-world bugs caused by exactly this: systems that assume synchrony and break when the assumption is violated. Clock skew, in particular, is a perennial source of bugs in protocols that use timestamps for ordering.
Asynchronous Model
An asynchronous system makes no timing assumptions. Messages are eventually delivered, but there is no bound on delivery time. Processes take steps, but there is no bound on the time between steps.
// In an asynchronous system, this is the best you can do:
procedure ASYNC_FAILURE_DETECTION(node_j):
send PING to node_j
// wait... how long?
// If no response after T seconds:
// - Maybe node_j crashed
// - Maybe the network is slow
// - Maybe node_j is in a GC pause
// - Maybe our PING was delayed
// - We genuinely cannot tell
return MAYBE_FAILED // the best we can offer
The asynchronous model is brutally honest about what you can determine. You can never conclude that a node has failed — only that you have not heard from it recently. This makes failure detection fundamentally unreliable in asynchronous systems.
And, as the FLP impossibility result shows (Chapter 3), deterministic consensus is impossible in a purely asynchronous system, even with just one crash failure. Not “hard” — impossible.
Partially Synchronous Model
The partially synchronous model, introduced by Dwork, Lynch, and Stockmeyer in 1988, provides the practical middle ground. It comes in two flavors:
Flavor 1: Unknown bound. There exists a fixed upper bound delta on message delivery, but this bound is not known to the participants. The protocol must be correct regardless of delta’s value, but may use it implicitly (e.g., by assuming that eventually messages arrive before the timeout fires).
Flavor 2: Global Stabilization Time (GST). The system is asynchronous until some unknown time GST, after which it becomes synchronous with known bounds. The protocol must be safe at all times (even before GST) but is only required to make progress (liveness) after GST.
// Partially synchronous failure detection
procedure PARTIAL_SYNC_FAILURE_DETECTION(node_j):
timeout = INITIAL_TIMEOUT
while true:
send PING to node_j
start timer(timeout)
if receive PONG from node_j before timer:
node_j is SUSPECTED_ALIVE
timeout = max(timeout / 2, MIN_TIMEOUT) // decrease timeout
else:
node_j is SUSPECTED_FAILED // not certain!
timeout = timeout * 2 // back off: maybe we were too aggressive
// Continue pinging — it might come back
The key insight is that the protocol never makes safety-critical decisions based on timeout conclusions. Timeouts are used for liveness only — to trigger leader elections, retransmissions, and view changes. Safety is ensured by quorum-based voting that does not depend on timing.
This is the model used by Paxos, Raft, PBFT, and virtually every practical consensus protocol. They guarantee:
- Safety always. Even during arbitrary asynchrony (before GST), no two nodes decide different values.
- Liveness after GST. Once the network stabilizes, the protocol will eventually make progress.
In practice, “after GST” means “when the network is behaving reasonably.” During a network partition or severe congestion, the system stalls but does not produce incorrect results. When the network heals, it resumes. This is exactly the behavior you want from a distributed database or coordination service.
The Synchrony Comparison
| Property | Synchronous | Partially Synchronous | Asynchronous |
|---|---|---|---|
| Timing bounds | Known and fixed | Unknown or eventual | None |
| Failure detection | Perfect | Eventually accurate | Impossible |
| Deterministic consensus | Possible | Possible (after GST) | Impossible (FLP) |
| Safety guarantee | Depends on bound correctness | Always | Always |
| Liveness guarantee | Always (if bounds hold) | After GST | Not guaranteed |
| Real-world match | LAN (sometimes) | Most networks (usually) | Adversarial networks |
| Round complexity (crash) | f+1 rounds | Unbounded (but finite) | N/A |
Network Partitions
A network partition divides the nodes into two or more groups that can communicate within their group but not across groups. Partitions are the most common real-world failure that consensus must handle, and they are frequently misunderstood.
What Partitions Look Like
Before partition:
A <---> B <---> C <---> D <---> E
All nodes can reach all other nodes.
During partition:
A <---> B <---> C D <---> E
{A, B, C} can communicate. {D, E} can communicate.
No messages cross the partition boundary.
Asymmetric partition:
A ----> B (A can send to B)
A <-/-- B (B cannot send to A)
This happens more often than you'd think.
Asymmetric partitions are particularly nasty. Node B receives messages from A and believes A is alive. Node A never hears from B and suspects B is dead. If A is the leader, it might step down (believing it has lost its majority), while B still thinks A is leading. The system can oscillate between states as the asymmetry plays out.
How Consensus Handles Partitions
The majority quorum mechanism is the primary defense against partitions:
procedure PARTITION_SAFE_DECISION(value):
// To commit a value, need acks from a majority
acks = {self}
broadcast (PROPOSE, value) to all nodes
while |acks| <= N / 2:
msg = receive_with_timeout()
if msg is ACK:
acks = acks + {msg.sender}
if timeout:
// Cannot reach majority — stall, do not decide
retry or wait
// If we reach here, we have a majority
commit(value)
During a partition, at most one side has a majority. The minority side stalls. This is the correct behavior: it is better to be unavailable than inconsistent.
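The reason this works is pure counting: two disjoint groups cannot both exceed N/2, so two sides of a partition can never both commit. A quick Python check of the invariant (the helper name is illustrative):

```python
from itertools import combinations

def majority_sides(n, partition):
    """Given a partition of n nodes into disjoint groups, return the
    groups holding a strict majority. Pigeonhole: at most one."""
    return [g for g in partition if len(g) > n / 2]

# Exhaustively check every 2-way split of a 5-node cluster.
nodes = set(range(5))
for k in range(6):
    for side in combinations(nodes, k):
        split = [set(side), nodes - set(side)]
        assert len(majority_sides(5, split)) <= 1
```

The same argument extends to any number of groups: majorities from two different groups would have to share a node, and partitioned groups share none.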
But partitions interact with leader election in subtle ways:
- Leader is on the majority side: system continues operating. The minority side cannot elect a new leader (no majority). Everything is fine.
- Leader is on the minority side: the minority side stalls (leader cannot commit, no majority for acks). The majority side eventually times out and elects a new leader. This works, but there is a window where the old leader might still be trying to commit entries that will never succeed. Raft handles this with term numbers — the old leader’s proposals are from an older term and will be rejected by nodes that have moved on.
- Leader is on the partition boundary (asymmetric): this is where things get ugly. The leader might be able to send to some nodes but not receive from them, or vice versa. Different nodes have different views of whether the leader is alive. Multiple leader elections can trigger in rapid succession, leading to livelock if timeouts are not carefully randomized.
Message Delays, Reordering, and Duplication
Real networks mangle messages in every conceivable way.
Message Delays
TCP provides reliable, ordered delivery within a connection. But TCP connections break and are re-established. And at the application level, a node might batch messages, introduce processing delays, or buffer responses. The end-to-end delay between “application sends” and “application receives” is variable and unbounded.
Consensus protocols must be correct under arbitrary delays. This means:
- A message from round 1 might arrive after messages from round 5.
- A response to a proposal might arrive after the proposer has moved on to a new proposal.
- An “I voted for you” message might arrive after a different leader has already been elected.
Every message in a consensus protocol must carry enough context (round number, term, ballot, epoch — different protocols use different names) for the recipient to determine whether the message is still relevant.
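Concretely, every message handler begins with a relevance check. A Python sketch using Raft-style terms (the field names and return values are illustrative):

```python
def handle(msg, state):
    """Drop messages from older terms; step down on newer terms."""
    if msg["term"] < state["term"]:
        return "ignored-stale"        # late arrival from a superseded round
    if msg["term"] > state["term"]:
        state["term"] = msg["term"]   # adopt the newer term
        state["role"] = "follower"    # and abandon any leadership claim
    return "processed"
```

This two-line preamble is what lets the rest of the handler assume it is operating on current-term messages, no matter how the network delayed or reordered them.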
Message Reordering
Even though TCP preserves order within a connection, consensus protocols use multiple connections (one to each peer). Messages sent to different peers travel different network paths and arrive in different orders. And if a node uses UDP or a custom transport, all ordering bets are off.
// This is unsafe:
procedure UNSAFE_LEADER():
send (PREPARE, ballot=5) to all followers
send (ACCEPT, ballot=5, value="X") to all followers
// A follower might receive ACCEPT before PREPARE!
// This is safe:
procedure SAFE_LEADER():
send (PREPARE, ballot=5) to all followers
wait for PROMISE responses from majority
// Only then:
send (ACCEPT, ballot=5, value="X") to all followers
The SAFE_LEADER version works not because messages cannot be reordered, but because the ACCEPT is only sent after receiving PROMISE responses, which creates a causal ordering. The follower must have processed the PREPARE (and sent PROMISE) before the leader sends ACCEPT, so even with reordering, the follower has already processed PREPARE by the time it needs to process ACCEPT.
This is a general pattern in consensus protocols: causal ordering is established by waiting for responses, not by assuming network ordering.
Message Duplication
Networks can duplicate messages. TCP usually prevents this, but at-least-once delivery semantics at the application level (retries after timeout) can cause duplicates. Consensus protocols must be idempotent: processing the same message twice must not change the outcome.
// Idempotent vote handling
procedure HANDLE_VOTE_REQUEST(ballot, candidate):
if ballot < current_ballot:
ignore // stale message, possibly duplicate
else if ballot == current_ballot and voted_for == candidate:
send VOTE_GRANTED // duplicate request, safe to re-ack
else if ballot == current_ballot and voted_for != candidate:
send VOTE_DENIED // already voted for someone else
else: // ballot > current_ballot
current_ballot = ballot
voted_for = candidate
persist(current_ballot, voted_for)
send VOTE_GRANTED
Which Model Matches Your System?
This is the practical question, and the answer is almost always: crash-recovery failures in a partially synchronous network.
Here is why:
Your servers crash and restart. They do not crash and stay dead forever (crash-stop). They have disks. They have persistent state. When they come back, they need to rejoin the protocol with their previous promises intact. This is crash-recovery.
Your servers are not malicious. They are running your code, in your datacenter, on your hardware. They might have bugs (which can look Byzantine), but the threat model of arbitrary adversarial behavior does not apply. Byzantine tolerance costs you an extra f nodes and significant performance overhead. Unless you are building a blockchain or a system with mutually untrusting participants, you do not need it.
Your network is eventually reliable. It drops packets sometimes. It has variable latency. Occasionally, a switch dies and creates a partition. But eventually, the network heals. This is partial synchrony. You do not have guaranteed bounds (synchronous), but you are not in a perpetual adversarial network (asynchronous).
There are exceptions:
- Safety-critical systems (aircraft, medical devices, nuclear reactors) might use synchronous models with validated timing bounds on dedicated hardware.
- Blockchain and cryptocurrency systems use Byzantine models because participants are untrusted.
- Systems crossing trust boundaries (multi-cloud, federated systems) might need Byzantine tolerance for the cross-boundary communication, even if each individual cluster uses crash-recovery.
How Failure Models Affect Protocol Behavior
Let us trace through the same simple protocol under different failure models to see how the model changes everything.
The Protocol: Simple Majority Vote
// A leader tries to commit a value
procedure COMMIT(value):
send (PROPOSE, value) to all followers
acks = {self}
while |acks| <= N/2:
response = receive()
if response == ACK:
acks = acks + {response.sender}
broadcast (COMMITTED, value) to all
Under Crash-Stop
Node C crashes before receiving the PROPOSE. The leader never receives C’s ACK, but still reaches a majority with {A, B, D, E} (in a 5-node cluster). Node C never recovers. The remaining four nodes have a consistent view. Simple, clean.
Leader A: PROPOSE("X") --> B, C, D, E
Node C: ** CRASH ** (never receives)
Leader A: receives ACK from B, D, E (majority = 3 of 5)
Leader A: COMMITTED("X") --> B, D, E
Result: A, B, D, E agree on "X". C is gone forever.
No complications. C’s crash is permanent and clean. The system shrinks by one node.
Under Crash-Recovery
Node C crashes, but comes back 30 seconds later. Now what?
Leader A: PROPOSE("X") --> B, C, D, E
Node C: ** CRASH ** (after receiving PROPOSE but before sending ACK)
Leader A: receives ACK from B, D, E (majority). Commits.
Leader A: COMMITTED("X") --> B, D, E
Node C: ** RECOVERS **
Node C: reads disk... did I ack anything? What is the current state?
Node C must recover its state from disk. If it wrote the PROPOSE to disk before crashing, it knows about “X” and can catch up. If not, it is missing a committed entry and needs to be brought up to date. The protocol must have a mechanism for C to learn about committed values it missed.
In Raft, this is handled by the leader sending missing log entries to recovered followers. In Paxos, it is handled by the recovered node running the Paxos protocol for each missing slot (or by state transfer from another node).
The fsync ordering matters here. If C had acknowledged the PROPOSE before fsyncing and then crashed, the leader counted C’s ack toward the majority. On recovery, C has no record of its acknowledgment. If the leader’s majority depended on C’s ack (e.g., C was the deciding vote), and the leader also crashes, a new leader might not find a majority that accepted “X” and could choose a different value. Safety violated.
Under Omission Failures
Node C is alive but its network interface is dropping packets.
Leader A: PROPOSE("X") --> B, C, D, E
Node C: receives PROPOSE, sends ACK (but ACK is dropped by network)
Leader A: receives ACK from B, D, E (majority). Commits.
Leader A: COMMITTED("X") --> B, C, D, E
Node C: might or might not receive COMMITTED
Node C: believes it acked but never sees commit confirmation
C is in an awkward state: it accepted the proposal but does not know if it was committed. It cannot unilaterally decide “X” (maybe the leader chose a different value after another round). It must wait for more information, or proactively ask the leader for the current state.
Under omission failures, protocols need more aggressive retransmission and state synchronization. A node that suspects it is experiencing omission failures (because it is not receiving expected messages) should request retransmission from multiple peers, not just the leader.
Under Byzantine Failures
Node C is compromised and actively trying to break the protocol.
Leader A: PROPOSE("X") --> B, C, D, E
Node C: receives PROPOSE("X")
Node C: sends ACK("X") to Leader A (plays along)
Node C: sends PROPOSE("Y") to Nodes D and E (pretending to be leader)
Node C: sends NACK to Node B (trying to slow down the real commit)
In a crash-failure protocol, this causes chaos. Nodes D and E might accept “Y” from the fake leader. The protocol has no defense against forged messages.
A Byzantine fault tolerant protocol handles this through:
- Digital signatures: C cannot forge messages from A because it does not have A’s private key.
- Quorum intersection: The protocol requires 2f+1 out of 3f+1 nodes to agree. Even if f nodes are Byzantine, the remaining 2f+1 honest nodes’ quorums overlap sufficiently.
- View change protocol: If the leader is Byzantine, honest nodes can detect misbehavior and trigger a leader change.
The cost: more nodes, more messages per round, cryptographic operations on every message, and significantly more complex protocol logic.
The “Eventually Synchronous” Sweet Spot
Let us be concrete about why partial synchrony is the right model for most systems.
Consider a 5-node Raft cluster running across three availability zones in a cloud provider. Under normal conditions:
- Intra-zone message latency: 0.1-1ms
- Cross-zone message latency: 1-5ms
- fsync latency: 0.1-1ms (SSD)
- Leader heartbeat interval: 150ms
- Election timeout: 300-500ms (randomized)
This system is effectively synchronous 99.9% of the time. Messages arrive in under 5ms. Heartbeats arrive well before the election timeout. Consensus decisions complete in under 10ms.
But 0.1% of the time:
- A GC pause stalls a node for 200ms, causing it to miss heartbeats.
- A cross-zone link drops packets for 2 seconds during a routing reconvergence.
- An NVMe drive stalls for 500ms due to wear leveling.
During these events, the system triggers unnecessary leader elections, commits stall, and latency spikes. But — and this is the crucial part — no incorrect decisions are made. The system’s safety relies on quorum voting, not on timing. The timing assumptions only affect liveness: the system might stall during the disruption, but it never produces inconsistent results.
When the disruption ends (the network stabilizes, the GC pause completes, the drive finishes wear leveling), the system resumes normal operation automatically. No manual intervention, no data reconciliation, no split-brain.
This is partial synchrony in action, and it is why every serious consensus implementation targets this model.
Common Misconceptions
“Crash failures are enough because our servers are reliable.” Your servers are reliable 99.99% of the time. The consensus protocol exists for the 0.01%. And “server” is not the only failure domain — NICs, switches, power supplies, cables, hypervisors, kernels, and your own application code all fail in ways that consensus must handle.
“We don’t need to worry about Byzantine failures because our nodes are trusted.” Mostly true, but bugs can cause Byzantine-like behavior. A bug that causes a node to send different values to different peers is, from the protocol’s perspective, a Byzantine failure. If you have ever seen a bug where a serialization library produces different output on different platforms, you have seen a non-malicious Byzantine failure.
“Our network is synchronous because we use TCP.” TCP provides reliable delivery, not bounded-time delivery. A TCP connection can be alive but stalled for arbitrary periods due to congestion control, buffer bloat, or retransmission backoff. And TCP connections break, requiring re-establishment, during which messages are delayed.
“Partial synchrony means the network is usually good.” Not quite. Partial synchrony means there exists a time after which the network behaves synchronously. It says nothing about how long the asynchronous period lasts or how long the synchronous period lasts. The guarantee is existential, not probabilistic.
“We can detect failures with heartbeats.” You can detect suspected failures with heartbeats. In an asynchronous system, you cannot distinguish a crashed node from a slow one. Your heartbeat timeout is a guess — make it too short and you get false positives (unnecessary elections, wasted work); make it too long and you get slow failure detection (long unavailability windows).
Practical Recommendations
- Assume crash-recovery with persistent state. Your protocol must handle nodes that crash and rejoin. Every promise, vote, and log entry must be durable before being acted upon.
- Assume partial synchrony. Never make safety depend on timing. Use timeouts for liveness only.
- Design for network partitions. Assume your network will partition. Test under partition conditions. Jepsen-style testing is not optional — it is how you find the bugs that only manifest under failure.
- Use Byzantine tolerance only when you need it. The overhead is real: 3f+1 instead of 2f+1 nodes, more message rounds, cryptographic overhead, and dramatically more complex code. Most internal systems do not need it.
- Validate your assumptions. If your protocol assumes fsync is durable, test it. (It is not always durable — some filesystems lie, some disks have buggy firmware, and virtual machines add another layer of uncertainty.) If your protocol assumes clocks are roughly synchronized, measure the actual skew. If your protocol assumes a maximum message size, enforce it.
- Test the transitions. The steady state is easy. The transitions — node joins, node leaves, leader changes, partition forms, partition heals — are where bugs live. Every combination of “these nodes are up, these are down, this link works, that one does not” is a potential test case. You will not test all of them. Test the ones that have caused outages before.
The failure model is not a theoretical exercise. It is the foundation on which your system’s correctness rests. Choose it deliberately, validate it continuously, and be prepared for reality to exceed your model’s assumptions — because it will.