Flexible Paxos and Quorum Relaxation
The Insight That Was Hiding in Plain Sight
In 2016, Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman published a paper that made the distributed systems community collectively facepalm. The paper, “Flexible Paxos: Quorum Intersection Revisited,” presented an insight so simple, so obvious in retrospect, that you have to wonder how it escaped notice for nearly three decades.
Here it is: in Paxos, Phase 1 (Prepare) quorums and Phase 2 (Accept) quorums do not need to be the same size. They do not even need to be majorities. The only requirement is that every Phase 1 quorum intersects with every Phase 2 quorum.
That is it. That is the whole insight.
Classic Paxos uses majority quorums for both phases because majorities trivially intersect — any two majorities of the same set must share at least one member. But majority-majority is merely the most symmetric configuration that satisfies the intersection requirement. It is not the only one, and it is not always the best one.
The implications are significant. By adjusting the quorum sizes for each phase, you can trade Phase 1 availability for Phase 2 availability, or optimize for read-heavy versus write-heavy workloads, or reduce the number of replicas that need to participate in the common-case operation (Phase 2, since Phase 1 only runs during leader election).
How this remained unformalized for decades is a question that says something uncomfortable about how we read our own foundational papers.
The Mathematics of Quorum Intersection
Let us be precise. Consider a system with N replicas. Define:
- Q1 = the set of valid Phase 1 (Prepare) quorums
- Q2 = the set of valid Phase 2 (Accept) quorums
Classic Paxos requirement: |q1| > N/2 and |q2| > N/2 for all q1 in Q1 and q2 in Q2.
Flexible Paxos requirement: For all q1 in Q1 and for all q2 in Q2, q1 intersection q2 is non-empty.
That is: q1 ∩ q2 ≠ ∅ for every valid pair (q1, q2).
Why does this work? Recall what the quorum intersection actually provides in Paxos:
- Phase 1 (Prepare): A proposer contacts a Phase 1 quorum to discover any previously accepted values. The proposer must learn about any value that might have been chosen in Phase 2.
- Phase 2 (Accept): A proposer contacts a Phase 2 quorum to get its proposed value accepted. If a Phase 2 quorum accepts a value, that value is potentially chosen.
The intersection ensures that if a value v was accepted by a Phase 2 quorum q2, then any future Phase 1 quorum q1 will contain at least one member of q2 — and that member will report that it accepted v. This is what prevents conflicting values from being chosen.
The size of the quorums does not matter. What matters is intersection.
The Simple Quorum Arithmetic
For the simple case where all Phase 1 quorums have size Q1 and all Phase 2 quorums have size Q2, the intersection requirement reduces to:
Q1 + Q2 > N
This is because if Q1 + Q2 > N, then by the pigeonhole principle, any Q1-sized subset and any Q2-sized subset of N elements must overlap.
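The pigeonhole argument can be confirmed by brute force. This sketch enumerates every pair of quorums for small N and checks that intersection holds exactly when Q1 + Q2 > N (the function name is my own):

```python
from itertools import combinations

def quorums_intersect(n, q1_size, q2_size):
    """Brute-force check: every Q1-sized and Q2-sized subset must overlap."""
    replicas = range(n)
    return all(set(a) & set(b)
               for a in combinations(replicas, q1_size)
               for b in combinations(replicas, q2_size))

# The bound is tight: intersection holds exactly when Q1 + Q2 > N.
for n in range(1, 8):
    for q1 in range(1, n + 1):
        for q2 in range(1, n + 1):
            assert quorums_intersect(n, q1, q2) == (q1 + q2 > n)
```

Note that the check is an equivalence, not just an implication: whenever Q1 + Q2 <= N, two disjoint quorums exist, so the bound cannot be weakened.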
Classic Paxos sets Q1 = Q2 = floor(N/2) + 1, giving Q1 + Q2 = N + 2 for even N and N + 1 for odd N, in either case comfortably greater than N. But consider the alternatives for N = 10:
| Configuration | Q1 (Phase 1) | Q2 (Phase 2) | Q1 + Q2 | Phase 1 fault tolerance | Phase 2 fault tolerance |
|---|---|---|---|---|---|
| Classic | 6 | 6 | 12 | 4 failures | 4 failures |
| Write-optimized | 9 | 2 | 11 | 1 failure | 8 failures |
| Read-optimized | 2 | 9 | 11 | 8 failures | 1 failure |
| Balanced-asymmetric | 7 | 4 | 11 | 3 failures | 6 failures |
| Extreme write | 10 | 1 | 11 | 0 failures | 9 failures |
The “Write-optimized” configuration is remarkable: Phase 2 (the common case during normal operation) requires only 2 out of 10 replicas. This means you can tolerate 8 replica failures during steady-state operation. The cost is that Phase 1 (leader election) requires 9 out of 10 replicas — you can tolerate at most 1 failure during a leader election. But leader elections are rare, so this may be an excellent trade.
The “Extreme write” row is degenerate but instructive: if Phase 1 requires all 10 replicas, Phase 2 requires only 1. This means the leader can commit decisions unilaterally — no replication needed during normal operation. Of course, any single failure makes leader election impossible, so this is only useful in very specific scenarios.
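The fault-tolerance columns in the table fall directly out of N minus the quorum size. A small sketch (the helper name is mine) reproduces the rows above:

```python
def fault_tolerance(n, q1, q2):
    """Failures each phase tolerates, for simple size-based quorums."""
    assert q1 + q2 > n, "quorums would not intersect"
    return n - q1, n - q2  # (Phase 1 tolerance, Phase 2 tolerance)

# The table's rows for N = 10:
assert fault_tolerance(10, 6, 6) == (4, 4)    # Classic
assert fault_tolerance(10, 9, 2) == (1, 8)    # Write-optimized
assert fault_tolerance(10, 10, 1) == (0, 9)   # Extreme write
```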
Why This Took Decades to Formalize
Lamport’s original Paxos paper uses majority quorums. The “Paxos Made Simple” paper uses majority quorums. Every textbook uses majority quorums. The majority requirement is stated so confidently and so consistently that it reads as fundamental rather than as a design choice.
But go back and read the proof carefully. The proof never actually requires majorities. It requires intersection. Majorities are presented as the way to achieve intersection, not as the requirement itself. The conflation of mechanism (majorities) with requirement (intersection) persisted for decades.
There are a few reasons this happened:
Symmetry is natural. When you are designing a protocol, having the same quorum size for both phases is the obvious default. It is symmetric, easy to reason about, and easy to explain. Asymmetric quorums feel like they need justification.
The proof sketches are brief. Lamport’s proofs are famously concise. The majority requirement is stated as a lemma, and the lemma is proved in a sentence or two. Nobody stopped to ask whether a weaker condition would suffice because the stronger condition was easy to verify and clearly correct.
Practice preceded theory. Systems like ZooKeeper, etcd, and Consul all use majority quorums because that is what the papers said to use. By the time someone thought to question the requirement, an entire ecosystem was built on the assumption.
It did not matter enough. For most practical deployments (3 or 5 nodes in a single datacenter), the distinction between majority quorums and flexible quorums is irrelevant. The optimization only matters at larger scales or in geo-distributed settings — exactly the scenarios that became more common in the 2010s.
Flexible Paxos: Modified Protocol
The modifications to standard Paxos are minimal. The protocol logic is identical; only the quorum sizes change.
// Configuration
Structure FlexiblePaxosConfig:
    N: Integer        // total number of replicas
    Q1_size: Integer  // Phase 1 quorum size
    Q2_size: Integer  // Phase 2 quorum size
    // INVARIANT: Q1_size + Q2_size > N

Procedure Propose(config, value):
    // Phase 1: Prepare — contact Q1_size replicas
    ballot = NextBallot()
    // Send Prepare to ALL replicas, wait for Q1_size responses
    SendToAll(Prepare{ballot: ballot})
    promises = WaitForResponses(count = config.Q1_size)

    // Find highest-numbered accepted value, if any
    highest_accepted = None
    for each promise in promises:
        if promise.accepted_value is not None:
            if highest_accepted is None or promise.accepted_ballot > highest_accepted.accepted_ballot:
                highest_accepted = promise
    if highest_accepted is not None:
        value = highest_accepted.accepted_value  // must propose this value

    // Phase 2: Accept — contact Q2_size replicas
    SendToAll(Accept{ballot: ballot, value: value})
    accepts = WaitForResponses(count = config.Q2_size)
    if |accepts| >= config.Q2_size:
        // Value is chosen
        SendToAll(Decide{value: value})

Procedure HandlePrepare(msg):
    if msg.ballot > promised_ballot:
        promised_ballot = msg.ballot
        Reply(Promise{
            ballot: msg.ballot,
            accepted_ballot: accepted_ballot,
            accepted_value: accepted_value
        })
    else:
        Reply(Nack{ballot: promised_ballot})

Procedure HandleAccept(msg):
    if msg.ballot >= promised_ballot:
        promised_ballot = msg.ballot
        accepted_ballot = msg.ballot
        accepted_value = msg.value
        Reply(Accepted{ballot: msg.ballot})
    else:
        Reply(Nack{ballot: promised_ballot})
The code is virtually identical to standard Paxos. The only difference is in the WaitForResponses calls: Phase 1 waits for Q1_size responses, and Phase 2 waits for Q2_size responses.
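To make the safety mechanism tangible, here is a toy in-memory sketch (the Acceptor class and its field names are my own, not from the paper). It shows the core guarantee with Q1 = 9, Q2 = 2 on ten replicas: a value accepted by any Phase 2 quorum is discovered by any Phase 1 quorum:

```python
import random

class Acceptor:
    """Single-decree acceptor state, kept in memory for illustration."""
    def __init__(self):
        self.promised = -1
        self.accepted_ballot = -1
        self.accepted_value = None

    def prepare(self, ballot):
        # Promise to ignore lower ballots; report any accepted value.
        if ballot > self.promised:
            self.promised = ballot
            return (self.accepted_ballot, self.accepted_value)
        return None  # nack

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False

N, Q1, Q2 = 10, 9, 2
acceptors = [Acceptor() for _ in range(N)]

# Suppose "v1" was accepted by a minimal Phase 2 quorum: just 2 acceptors.
for a in acceptors[:Q2]:
    assert a.accept(1, "v1")

# Any later proposer's Phase 1 quorum (any 9 of 10) must overlap those 2,
# so Phase 1 always discovers "v1".
q1_quorum = random.sample(acceptors, Q1)
promises = [a.prepare(2) for a in q1_quorum]
seen = {v for b, v in promises if v is not None}
assert seen == {"v1"}
```

The final assertion holds no matter which 9 acceptors the sample picks, which is exactly the intersection property at work.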
Multi-Paxos with Flexible Quorums
The real payoff comes when you apply Flexible Paxos to Multi-Paxos, where Phase 1 is run once during leader election and Phase 2 is run for every command.
Procedure MultiPaxosLeaderElection(config):
    // Phase 1 for all future slots — requires Q1 quorum
    ballot = NextBallot()
    SendToAll(Prepare{ballot: ballot})
    promises = WaitForResponses(count = config.Q1_size)

    // Process promises for all slots that have accepted values
    for each promise in promises:
        for each (slot, accepted_ballot, accepted_value) in promise.accepted_slots:
            UpdateSlot(slot, accepted_ballot, accepted_value)

    // Now we are the leader — Phase 2 uses Q2 quorum
    self.is_leader = true
    self.leader_ballot = ballot

Procedure MultiPaxosReplicate(config, slot, command):
    // Only Phase 2 — runs for every command
    assert self.is_leader
    SendToAll(Accept{
        ballot: self.leader_ballot,
        slot: slot,
        value: command
    })
    accepts = WaitForResponses(count = config.Q2_size)
    if |accepts| >= config.Q2_size:
        CommitSlot(slot, command)
With a write-optimized configuration (say N=10, Q1=9, Q2=2), every command only needs one additional replica’s acknowledgment to commit. Leader election is painful (requires 9 out of 10), but it is rare. The steady-state performance is dramatically better.
Practical Applications
WAN-Optimized Deployments
Consider five datacenters: New York, London, Tokyo, Sydney, and Sao Paulo. With classic Paxos (majority = 3), every write must wait for acknowledgment from the leader’s datacenter plus two others. If the leader is in New York, a write must traverse at least to London and one other DC.
With Flexible Paxos (Q1 = 4, Q2 = 2), steady-state writes only need the leader plus one other replica. The leader in New York needs only London’s acknowledgment (the closest replica) for each write. Leader election requires four out of five replicas, but that is acceptable for a rare operation.
The latency improvement for the common case:
Classic (N=5, Q2=3):
    Write latency = RTT to 2nd-closest replica from leader
    Example: Leader in NYC, must reach London AND Tokyo
    Latency ~= max(NYC->London, NYC->Tokyo) ~= 180ms

Flexible (N=5, Q1=4, Q2=2):
    Write latency = RTT to closest replica from leader
    Example: Leader in NYC, must reach London only
    Latency ~= NYC->London ~= 75ms

Improvement: 2.4x for this configuration
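The arithmetic behind this comparison can be sketched in a few lines. The RTT figures below are rough assumptions chosen to match the example (only the London and Tokyo numbers appear above; the rest are made up for illustration):

```python
# Illustrative round-trip times from a leader in NYC (rough assumptions, ms).
rtt_from_nyc = {"London": 75, "Tokyo": 180, "Sydney": 200, "SaoPaulo": 190}

def write_latency(q2_size):
    """Commit latency = RTT to the (q2_size - 1)-th closest follower;
    the leader's own acknowledgment counts toward the quorum for free."""
    return sorted(rtt_from_nyc.values())[q2_size - 2]

classic = write_latency(3)   # leader + 2 remote acks -> 2nd-closest RTT
flexible = write_latency(2)  # leader + 1 remote ack  -> closest RTT
assert (classic, flexible) == (180, 75)
assert round(classic / flexible, 1) == 2.4
```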
Read-Heavy Workloads
For read-heavy workloads where reads must be linearizable (via Paxos-based reads or read leases), you might want the opposite trade: small Phase 1 quorums (for fast reads that require a prepare-like check) and large Phase 2 quorums.
Actually, let me be more careful here. The relationship between Flexible Paxos quorums and read optimizations depends on the specific read protocol. If reads go through the leader (which holds a lease), the Flexible Paxos configuration does not directly affect read performance. But if you implement reads as separate Paxos instances (Phase 1 to discover the latest committed value), then a small Q1 directly speeds up reads.
Tuning for Failure Probability
In practice, simultaneous failure of multiple replicas is rare. If your failure model says “at most 1 replica fails simultaneously with high probability,” you can set Q1 = N and Q2 = 1 (or close to it). The system operates with minimal replication overhead in the common case and relies on the assumption that catastrophic multi-failure scenarios are rare enough to accept the risk.
This is a legitimate engineering trade. The question is not “is this safe?” but “is the failure probability acceptable for my use case?” A system that commits with Q2 = 2 out of 10 can tolerate 8 simultaneous failures during steady state but only 1 failure during leader election. If your risk analysis says multi-failure during the brief leader election window is acceptable, this is a valid configuration.
Generalized Quorum Systems
Flexible Paxos uses simple quorum systems (any subset of the right size). But the intersection requirement enables much richer quorum structures.
Grid Quorums
Arrange N = r x c replicas in a grid with r rows and c columns. Define:
- Phase 1 quorum: one full column plus one replica from every other column
- Phase 2 quorum: one full row
For a 3x3 grid (9 replicas):
- Phase 1 quorum size: 3 + 2 = 5
- Phase 2 quorum size: 3
Intersection is guaranteed because a full row and a full column in a grid always share exactly one cell. Any set containing a full column intersects with any set containing a full row.
Grid arrangement (9 replicas):
         Col1  Col2  Col3
    Row1:  A     B     C
    Row2:  D     E     F
    Row3:  G     H     I

Phase 2 quorum (Row 1): {A, B, C}
Phase 1 quorum (Col 2 + one from each other column): {B, E, H, A, I}
Intersection: {B} (and possibly more)
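The row-column intersection guarantee is easy to verify mechanically. This sketch labels the replicas as in the diagram and checks every row against every column:

```python
from itertools import product

# 3x3 grid of replicas, labeled as in the diagram above.
grid = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]

rows = [set(r) for r in grid]        # Phase 2 quorums: one full row
cols = [set(c) for c in zip(*grid)]  # the full-column core of each Phase 1 quorum

# A full row and a full column always share exactly one cell, so any
# Phase 1 quorum containing a full column intersects every Phase 2 quorum.
for row, col in product(rows, cols):
    assert len(row & col) == 1
```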
Grid quorums are useful when replicas have different latency characteristics. You can arrange the grid so that Phase 2 quorums (rows) correspond to geographically close replicas, minimizing common-case latency.
Hierarchical Quorums
Organize replicas into groups (e.g., by datacenter). Define quorums as “a majority of groups, with a majority within each chosen group.”
For 3 datacenters with 3 replicas each (9 total):
- Phase 2 quorum: 2 out of 3 DCs, with 2 out of 3 replicas in each chosen DC = 4 replicas
- Phase 1 quorum: must intersect with every such Phase 2 quorum
This structure lets you reason about failures at the datacenter level (losing an entire DC) rather than the individual replica level.
Structure HierarchicalQuorumConfig:
    groups: List<List<ReplicaId>>  // replicas grouped by datacenter
    group_quorum: Integer          // how many groups needed
    intra_group_quorum: Integer    // how many replicas within a group

Function IsPhase2Quorum(replicas, config):
    groups_satisfied = 0
    for each group in config.groups:
        replicas_in_group = |replicas INTERSECT group|
        if replicas_in_group >= config.intra_group_quorum:
            groups_satisfied++
    return groups_satisfied >= config.group_quorum

// Phase 1 quorum must intersect with every possible Phase 2 quorum
// This requires careful calculation based on the specific configuration
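One way to do that calculation is exhaustive enumeration, which is cheap at this scale. A sketch for the 3-datacenter example (replica names and helper functions are mine): it enumerates every minimal Phase 2 quorum and tests a candidate Phase 1 quorum against all of them:

```python
from itertools import combinations, product

groups = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]
GROUP_QUORUM, INTRA_QUORUM = 2, 2  # 2 of 3 DCs, 2 of 3 replicas each

def minimal_phase2_quorums():
    """Every minimal Phase 2 quorum: pick 2 DCs, then 2 replicas in each."""
    for chosen in combinations(groups, GROUP_QUORUM):
        per_group = [list(combinations(g, INTRA_QUORUM)) for g in chosen]
        for picks in product(*per_group):
            yield set().union(*map(set, picks))

def is_valid_phase1(candidate):
    # Checking minimal quorums suffices: larger quorums are supersets.
    return all(candidate & q2 for q2 in minimal_phase2_quorums())

# Two complete DCs work: any choice of 2-of-3 DCs must include one of them.
assert is_valid_phase1({"A1", "A2", "A3", "B1", "B2", "B3"})
# One complete DC does not: a Phase 2 quorum can use the other two DCs.
assert not is_valid_phase1({"A1", "A2", "A3"})
```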
Weighted Quorums
Assign weights to replicas (e.g., based on reliability, proximity, or capacity). Define quorums as any set whose total weight exceeds a threshold.
- Phase 2 threshold: W2
- Phase 1 threshold: W1
- Requirement: W1 + W2 > total weight
This is useful when replicas are not equally valuable. A replica in the same datacenter as most clients might receive a higher weight, reducing the number of remote replicas needed for a quorum.
Structure WeightedQuorumConfig:
    weights: Map<ReplicaId, Integer>
    total_weight: Integer      // sum of all weights
    phase1_threshold: Integer  // W1
    phase2_threshold: Integer  // W2
    // INVARIANT: phase1_threshold + phase2_threshold > total_weight

Function IsQuorum(replicas, threshold, config):
    weight_sum = 0
    for each r in replicas:
        weight_sum += config.weights[r]
    return weight_sum >= threshold
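To make the weighted arithmetic concrete, here is a small sketch with made-up weights (the replica names and values are illustrative assumptions). Two heavy "local" replicas and three light remote ones, with W1 + W2 exceeding the total weight:

```python
# Made-up weights: two heavy local replicas, three light remote ones.
weights = {"local1": 3, "local2": 3, "remote1": 1, "remote2": 1, "remote3": 1}
total_weight = sum(weights.values())  # 9

W1, W2 = 7, 3
assert W1 + W2 > total_weight  # the weighted intersection invariant

def is_quorum(replicas, threshold):
    return sum(weights[r] for r in replicas) >= threshold

# A single heavy replica already forms a Phase 2 quorum...
assert is_quorum({"local1"}, W2)
# ...and every Phase 1 quorum must include it: everything else sums to 6 < 7.
assert not is_quorum(set(weights) - {"local1"}, W1)
assert is_quorum({"local1", "local2", "remote1"}, W1)
```

The weight-based pigeonhole works just like the size-based one: any Phase 1 set and Phase 2 set whose weights sum to more than the total must share a replica.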
The Raft Connection
Raft, as originally specified, uses majority quorums for both leader election (RequestVote, analogous to Phase 1) and log replication (AppendEntries, analogous to Phase 2). Can we apply Flexible Paxos to Raft?
Yes, with caveats.
Raft’s leader election is not exactly Paxos Phase 1. In Paxos, Phase 1 discovers previously accepted values. In Raft, RequestVote serves a dual purpose: it elects a leader and ensures the leader has the most up-to-date log (via the log completeness check). Applying flexible quorums to Raft requires ensuring that the election quorum (Q1) intersects with every replication quorum (Q2) to maintain the Log Matching Property.
The arithmetic is the same: Q1_election + Q2_replication > N. But the implementation requires modifying Raft’s election and commitment logic, which is not trivial given that Raft’s design prioritizes simplicity. Adding flexible quorums to Raft somewhat defeats the purpose of choosing Raft in the first place.
That said, some Raft implementations (notably those used in CockroachDB’s evaluations) have experimented with flexible quorums. The results confirm the theoretical predictions: steady-state latency improves at the cost of longer or less fault-tolerant elections.
Why Adoption Has Been Slow
Given that Flexible Paxos is a strict generalization that subsumes classic Paxos as a special case, why is it not universally adopted?
Configuration complexity. With classic Paxos, the user specifies N and the system computes the quorum size automatically (floor(N/2) + 1). With Flexible Paxos, the user must choose Q1 and Q2, which requires understanding the performance and availability implications. Most operators do not want to think about this.
Existing implementations. Changing the quorum logic in a production consensus system is a high-risk modification. The intersection invariant must be maintained during configuration changes, including during the configuration change itself. This is subtle enough that most teams decide the risk is not worth the benefit.
Marginal benefit for small clusters. For N = 3 (the most common deployment), the options are:
- Classic: Q1 = Q2 = 2
- Write-optimized: Q1 = 3, Q2 = 1
Q2 = 1 means no replication during normal operation — a single replica failure loses data. Q1 = 3 means all replicas must participate in leader election. For most teams running 3-node clusters, neither alternative is attractive.
For N = 5:
- Classic: Q1 = Q2 = 3
- Write-optimized: Q1 = 4, Q2 = 2
- Read-optimized: Q1 = 2, Q2 = 4
Here the trade-offs become more interesting, but N = 5 is already less common in practice.
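Enumerating the valid configurations makes the contrast between N = 3 and N = 5 visible at a glance; a quick sketch:

```python
def valid_configs(n):
    """All (Q1, Q2) size pairs with Q1 + Q2 > n (the intersection requirement)."""
    return [(q1, q2)
            for q1 in range(1, n + 1)
            for q2 in range(1, n + 1)
            if q1 + q2 > n]

# N = 3 offers little room beyond classic (2, 2):
assert valid_configs(3) == [(1, 3), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
# N = 5 opens up an asymmetric space, including (4, 2) and (2, 4):
assert (4, 2) in valid_configs(5) and (2, 4) in valid_configs(5)
assert len(valid_configs(5)) == 15
```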
Mental model mismatch. Engineers reason about consensus in terms of “majority.” The majority rule is simple, intuitive, and easy to check. Flexible quorums require reasoning about two different quorum sizes and their intersection, which is harder to hold in your head during an incident at 3 AM.
Safety during reconfiguration. Changing quorum sizes at runtime — say, moving from (Q1=4, Q2=2) to (Q1=3, Q2=3) — requires ensuring that the old and new configurations’ quorums intersect correctly during the transition. This is a generalization of Raft’s joint consensus, but more complex because you have two quorum sizes to change instead of one.
// Safe reconfiguration requires:
//   Old Q1 intersects Old Q2 (existing invariant)
//   New Q1 intersects New Q2 (new invariant)
//   During the transition:
//     Phase 1 quorums must satisfy BOTH old Q1 and new Q1
//     Phase 2 quorums must satisfy BOTH old Q2 and new Q2
// This ensures safety regardless of which configuration is "active"

Procedure ReconfigureQuorums(old_config, new_config):
    // Use joint quorums during transition
    joint_q1 = max(old_config.Q1_size, new_config.Q1_size)
    joint_q2 = max(old_config.Q2_size, new_config.Q2_size)

    // This is conservative — joint quorums satisfy both configurations
    // but may temporarily reduce availability
    assert joint_q1 + old_config.Q2_size > old_config.N
    assert joint_q1 + new_config.Q2_size > new_config.N
    assert old_config.Q1_size + joint_q2 > old_config.N
    assert new_config.Q1_size + joint_q2 > new_config.N

    // Step 1: Switch to joint quorums
    ApplyConfig(joint_q1, joint_q2)
    // Step 2: Commit a marker entry under joint quorums
    CommitMarker("reconfiguration in progress")
    // Step 3: Switch to new quorums
    ApplyConfig(new_config.Q1_size, new_config.Q2_size)
    // Step 4: Commit a marker entry under new quorums
    CommitMarker("reconfiguration complete")
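The joint-quorum invariants are cheap to sanity-check numerically. A sketch, assuming the replica set itself stays fixed across the change (the helper name is mine):

```python
def joint_quorums(n, old, new):
    """Joint (Q1, Q2) sizes satisfying both configurations at once.

    old/new are (Q1, Q2) pairs; assumes the replica set is unchanged.
    """
    jq1, jq2 = max(old[0], new[0]), max(old[1], new[1])
    # The joint Phase 1 size must intersect both configs' Phase 2 quorums,
    # and vice versa; taking max() reduces each check to Qi + Qj > n.
    for q1, q2 in [(jq1, old[1]), (jq1, new[1]), (old[0], jq2), (new[0], jq2)]:
        assert q1 + q2 > n
    return jq1, jq2

# Moving from write-optimized (4, 2) back to classic (3, 3) on N = 5:
assert joint_quorums(5, (4, 2), (3, 3)) == (4, 3)
```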
Connection to FPaxos in Practice
Despite slow direct adoption, the ideas from Flexible Paxos have influenced several systems:
CockroachDB has explored flexible quorums in its Raft-based replication layer, particularly for geo-distributed deployments where minimizing Phase 2 quorum size can significantly reduce write latency.
Windows Azure Storage uses a witness-based replication scheme that is essentially flexible quorums in disguise: a write is durable once it reaches the primary plus one secondary, but reconfiguration requires all replicas.
Bookkeeper (used in Apache Pulsar) uses a quorum system where writes require Qw acknowledgments and reads require Qr, with Qw + Qr > N. This is exactly the Flexible Paxos insight applied to a storage system, though it was developed independently.
The pattern is clear: the idea of flexible quorums is spreading, even if the specific Flexible Paxos protocol is not being adopted wholesale.
The Deeper Lesson
Flexible Paxos teaches us something important about distributed systems research: sometimes the biggest insights are the simplest. The quorum intersection requirement was always the fundamental invariant. Majority quorums were always just one way to achieve it. But the field was so focused on building on top of the majority assumption that nobody thought to question it.
This should make us wonder: what other simplifying assumptions in our foundational protocols are unnecessarily restrictive? What other degrees of freedom are we leaving on the table because “that is how it has always been done”?
The answer, almost certainly, is “several.” The history of distributed systems is a history of slowly relaxing unnecessary constraints while preserving essential safety properties. Flexible Paxos is one such relaxation. There will be more.
In the meantime, the next time someone tells you that consensus requires a majority, you can smile and say, “Well, actually…” Just be prepared for the blank stares. Some insights, no matter how simple, take decades to sink in.