EPaxos and Leaderless Consensus

The Tyranny of the Leader

Every consensus protocol we have examined so far shares a common architectural assumption: someone has to be in charge. Multi-Paxos has its distinguished proposer. Raft has its leader. ZAB has its primary. Even Viewstamped Replication, despite its name suggesting something more democratic, funnels all decisions through a single node.

This works. It simplifies reasoning, reduces conflicts, and makes implementation tractable. It also creates a bottleneck that, in geo-distributed deployments, becomes the kind of performance problem that makes you question your career choices.

Consider a five-node cluster spread across US-East, US-West, Europe, Asia, and South America. With Raft, the leader sits in one of these regions. Every write must travel to the leader, then the leader must replicate to a majority. If the leader is in US-East, a client in Asia pays the Asia-to-US-East round trip plus the replication latency. The theoretical minimum of two message delays becomes, in practice, a transcontinental odyssey.
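A back-of-envelope calculation makes the cost concrete. Every round-trip time below is an illustrative assumption, not a measurement:

```python
# Rough write latency (ms) for a Raft client far from the leader.
# All RTTs here are assumed, illustrative numbers.
def raft_write_latency(client_to_leader_rtt, follower_rtts):
    # With N=5 the leader counts itself toward the majority of 3,
    # so it commits after the 2nd-fastest of its 4 followers acks.
    majority_ack = sorted(follower_rtts)[1]
    return client_to_leader_rtt + majority_ack

# Client in Asia, leader in US-East (assumed 180 ms RTT); followers in
# US-West, Europe, Asia, South America (assumed 70/90/180/150 ms RTTs).
print(raft_write_latency(180, [70, 90, 180, 150]))  # 270
```

Under these assumed numbers the Asia client pays 270 ms per write, most of it spent just reaching the leader.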

The dream of leaderless consensus is simple: any node can propose a command, and the system will figure out a consistent ordering without routing everything through a single point. EPaxos — Egalitarian Paxos, published by Iulian Moraru, David Andersen, and Michael Kaminsky in 2013 — is the most ambitious attempt at realizing this dream.

It is also, as we shall see, a cautionary tale about the distance between a brilliant idea and a correct implementation.

Why Leaders Are a Bottleneck

Before we dive into EPaxos, let us be precise about what the leader bottleneck actually costs us.

Latency asymmetry. In a geo-distributed Multi-Paxos deployment with the leader in region A, clients in region A enjoy low-latency writes (one local round trip plus replication). Clients in a distant region endure the full cross-region penalty. This asymmetry is not just annoying — it can violate SLA requirements for globally distributed applications.

Throughput ceiling. The leader must process every proposal, serialize it into the log, and coordinate replication. A single node’s CPU, memory bandwidth, and network capacity bound the system’s throughput. You can shard, of course, but then you are no longer solving the same problem.

Failover latency. When the leader fails, the system goes through an election. During this period — which can range from hundreds of milliseconds to several seconds depending on timeout configuration — the system is unavailable for writes. In a leaderless protocol, a single node failure does not create a global availability gap.

Load imbalance. The leader does more work than followers. It must handle client requests, manage the log, send AppendEntries, process responses, and advance the commit index. Followers mostly just respond to RPCs. This asymmetry in resource utilization is wasteful.

The appeal of leaderless consensus, then, is a system where any replica can handle any client request with optimal latency — one round trip to a fast-path quorum — and where load is naturally distributed across all replicas.

EPaxos: The Core Idea

EPaxos begins with a deceptively simple insight: if two commands do not interfere with each other — that is, they operate on different keys or different state — then their relative order does not matter. Only conflicting commands need to be ordered consistently across replicas.

This is the key departure from leader-based protocols, which impose a total order on all commands regardless of whether ordering is necessary. EPaxos imposes a total order only on commands that must be ordered (those that conflict) and allows non-conflicting commands to be executed in any order.
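In code, the interference relation is usually a simple predicate. A minimal sketch, assuming commands are (kind, key) pairs — a toy shape, not the paper's:

```python
# Two commands interfere iff they touch the same key and at least one
# of them is a write. Reads on the same key commute; anything on
# different keys commutes.
def conflicts(a, b):
    kind_a, key_a = a
    kind_b, key_b = b
    if key_a != key_b:
        return False
    return kind_a == "write" or kind_b == "write"

assert conflicts(("write", "x"), ("read", "x"))       # must be ordered
assert not conflicts(("write", "x"), ("write", "y"))  # commute
assert not conflicts(("read", "x"), ("read", "x"))    # reads commute
```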

The mechanism for achieving this is dependency tracking. When a replica proposes a command, it collects information about which other commands the new command depends on — that is, which existing commands it conflicts with. These dependencies form a directed graph, and the execution order is determined by topologically sorting this graph (with a specific tie-breaking rule for cycles).

The Instance Space

EPaxos organizes commands into an instance space indexed by (replica, instance_number). Each replica R maintains its own monotonically increasing instance counter. When replica R wants to propose a command, it assigns it to instance (R, i) for the next available i.

This is different from Multi-Paxos, where there is a single global log with a single sequence of slots. In EPaxos, each replica has its own “column” of instances, and the execution order is determined by the dependency graph, not by slot position.

Structure Instance:
    command: Command
    deps: Set<(ReplicaId, InstanceNumber)>  // dependencies
    seq: Integer                             // sequence number for ordering
    status: {PreAccepted, Accepted, Committed, Executed}
    ballot: BallotNumber

// Each replica maintains:
Structure ReplicaState:
    id: ReplicaId
    instances: Map<(ReplicaId, InstanceNumber), Instance>
    next_instance: Map<ReplicaId, InstanceNumber>  // next available instance per replica
    committed_up_to: Map<ReplicaId, InstanceNumber>
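The per-replica instance counter above can be sketched in a few lines, assuming instance IDs are (replica_id, instance_number) tuples. Names here are illustrative, not from a real implementation:

```python
from itertools import count

# Each replica owns its own column of the instance space and advances
# its counter independently of every other replica.
class InstanceCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self._counter = count(1)

    def next_instance(self):
        # Assign the next instance ID in this replica's column.
        return (self.replica_id, next(self._counter))

r0, r1 = InstanceCounter(0), InstanceCounter(1)
print(r0.next_instance(), r0.next_instance(), r1.next_instance())
# (0, 1) (0, 2) (1, 1) -- columns never collide across replicas
```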

The Fast Path

The fast path is the common case — when there are no conflicts or when all replicas in the fast quorum agree on the same set of dependencies. It completes in a single round trip.

Procedure ProposeCommand(command):
    // Step 1: Leader assigns instance, computes initial dependencies
    inst_num = next_instance[self.id]++
    deps = {}
    seq = 0

    // Find all instances in our log that conflict with this command
    for each (replica, inst) in instances:
        if instances[(replica, inst)].command conflicts_with command:
            deps.add((replica, inst))
            seq = max(seq, instances[(replica, inst)].seq + 1)

    instance = Instance{
        command: command,
        deps: deps,
        seq: seq,
        status: PreAccepted,
        ballot: current_ballot
    }
    instances[(self.id, inst_num)] = instance

    // Step 2: Send PreAccept to fast quorum
    // Fast quorum (basic EPaxos): 2F replicas including the leader,
    // where N = 2F + 1. For N = 5 that is 4 replicas, so the leader
    // waits for 3 matching PreAccept replies from the other replicas.
    // (The optimized variant shrinks this to F + floor((F+1)/2).)
    replies = SendToFastQuorum(PreAcceptMessage{
        instance: (self.id, inst_num),
        command: command,
        deps: deps,
        seq: seq,
        ballot: current_ballot
    })

    // Step 3: Check if all replies agree on deps and seq
    all_agree = true
    for each reply in replies:
        if reply.deps != deps or reply.seq != seq:
            all_agree = false
            break

    if all_agree:
        // FAST PATH: commit directly
        instance.status = Committed
        SendToAll(CommitMessage{
            instance: (self.id, inst_num),
            command: command,
            deps: deps,
            seq: seq
        })
    else:
        // SLOW PATH: need another round
        GoToSlowPath(inst_num, replies)
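The quorum arithmetic is worth pinning down once, since the formulas are easy to mangle. A small helper, assuming N = 2F + 1 replicas and the paper's basic and optimized fast-quorum sizes:

```python
# Quorum sizes for EPaxos with N = 2F + 1 replicas. Both fast-quorum
# counts include the command leader itself.
def quorum_sizes(n):
    f = (n - 1) // 2
    return {
        "majority": f + 1,                   # classic Paxos quorum
        "fast_basic": 2 * f,                 # basic EPaxos fast quorum
        "fast_optimized": f + (f + 1) // 2,  # optimized variant
    }

print(quorum_sizes(5))
# {'majority': 3, 'fast_basic': 4, 'fast_optimized': 3}
```

For N = 5 the basic fast quorum is 4 replicas including the leader, which is the "3 matching replies" the proposing replica waits for.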

Handling PreAccept on a Replica

When a replica receives a PreAccept message, it must do its own dependency computation. This is where the subtlety begins.

Procedure HandlePreAccept(msg):
    // Compute our own view of dependencies, starting from the leader's.
    // Note: copy, do not alias -- we must not mutate the message.
    local_deps = copy(msg.deps)
    local_seq = msg.seq

    for each (replica, inst) in instances:
        if instances[(replica, inst)].command conflicts_with msg.command:
            local_deps.add((replica, inst))
            local_seq = max(local_seq, instances[(replica, inst)].seq + 1)

    // Store the instance
    instances[(msg.sender, msg.inst_num)] = Instance{
        command: msg.command,
        deps: local_deps,
        seq: local_seq,
        status: PreAccepted,
        ballot: msg.ballot
    }

    // Reply with our computed deps and seq
    Reply(PreAcceptReply{
        deps: local_deps,
        seq: local_seq,
        instance: (msg.sender, msg.inst_num)
    })

The critical point: if a replica has seen commands that the proposing replica has not, it will include additional dependencies. If all replicas in the fast quorum agree on the same dependencies (including any additions), the fast path succeeds. If they disagree — because different replicas have seen different sets of commands — the slow path is needed.
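Both sides run the same dependency scan; they diverge only because their instance maps differ. A toy sketch, where dict-shaped instances and an equality conflict relation are assumptions for brevity:

```python
# The same scan run by the leader (ProposeCommand) and by each
# PreAccept handler: collect conflicting instance IDs and bump seq.
def compute_deps(instances, command, conflicts):
    deps, seq = set(), 0
    for inst_id, inst in instances.items():
        if conflicts(inst["command"], command):
            deps.add(inst_id)
            seq = max(seq, inst["seq"] + 1)
    return deps, seq

same_key = lambda a, b: a == b  # toy conflict relation: equal keys
leader_view = {(1, 1): {"command": "x", "seq": 0}}
peer_view = {(1, 1): {"command": "x", "seq": 0},
             (2, 1): {"command": "x", "seq": 1}}  # peer saw one more

assert compute_deps(leader_view, "x", same_key) == ({(1, 1)}, 1)
assert compute_deps(peer_view, "x", same_key) == ({(1, 1), (2, 1)}, 2)
```

The peer's extra instance (2, 1) is exactly the kind of added dependency that breaks fast-path agreement and forces the slow path.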

The Slow Path

The slow path adds one more round of communication. The proposing replica takes the union of all dependencies reported by the fast quorum replicas and runs a Paxos-like Accept phase.

Procedure GoToSlowPath(inst_num, preaccept_replies):
    // Union all dependencies from all replies and our own
    merged_deps = instances[(self.id, inst_num)].deps
    merged_seq = instances[(self.id, inst_num)].seq

    for each reply in preaccept_replies:
        merged_deps = merged_deps UNION reply.deps
        merged_seq = max(merged_seq, reply.seq)

    // Update our instance
    instances[(self.id, inst_num)].deps = merged_deps
    instances[(self.id, inst_num)].seq = merged_seq
    instances[(self.id, inst_num)].status = Accepted

    // Phase 2: Accept (classic Paxos majority quorum)
    replies = SendToMajority(AcceptMessage{
        instance: (self.id, inst_num),
        command: instances[(self.id, inst_num)].command,
        deps: merged_deps,
        seq: merged_seq,
        ballot: current_ballot
    })

    // If majority accepts, commit
    if MajorityAccepted(replies):
        instances[(self.id, inst_num)].status = Committed
        SendToAll(CommitMessage{
            instance: (self.id, inst_num),
            command: instances[(self.id, inst_num)].command,
            deps: merged_deps,
            seq: merged_seq
        })

The slow path requires two round trips total (PreAccept + Accept), which is the same as standard Paxos. So in the worst case, EPaxos is no worse than Multi-Paxos. In the common case (no conflicts), it completes in one round trip from any replica — a genuine improvement for geo-distributed systems.

Execution Ordering: Where the Fun Really Begins

Getting commands committed is only half the battle. The other half — and this is where EPaxos gets genuinely tricky — is determining the execution order.

Each committed instance has a set of dependencies. These form a directed graph. To execute commands consistently across all replicas, every replica must compute the same execution order from this graph.

The algorithm proceeds as follows:

  1. Build the dependency graph for committed instances.
  2. Find strongly connected components (SCCs) using Tarjan’s algorithm.
  3. Execute SCCs in reverse topological order.
  4. Within each SCC, break ties using the sequence number (seq) and instance ID.

Tarjan’s Algorithm in EPaxos

The use of Tarjan’s algorithm is not arbitrary — it is necessary because the dependency graph can contain cycles. Command A might depend on command B (because replica 1 saw B before A’s PreAccept), while command B depends on A (because replica 2 saw A before B’s PreAccept). When dependencies are unioned in the slow path, both dependency edges survive, creating a cycle.

Procedure ExecuteCommands():
    // Build dependency graph from committed instances
    graph = BuildDependencyGraph()

    // Find SCCs using Tarjan's algorithm
    sccs = TarjanSCC(graph)

    // sccs is returned in reverse topological order by Tarjan's
    for each scc in sccs:
        // Sort instances within SCC by (seq, replica_id, instance_number)
        sorted_instances = Sort(scc, key = (inst.seq, inst.replica_id, inst.inst_num))

        for each instance in sorted_instances:
            if instance.status != Executed:
                // Must wait until all dependencies outside this SCC are executed
                for each dep in instance.deps:
                    if dep not in scc:
                        WaitUntilExecuted(dep)

                Execute(instance.command)
                instance.status = Executed
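The tie-breaking sort inside an SCC is a plain lexicographic sort on (seq, replica_id, instance_number). A toy illustration, with dict-shaped instances assumed for brevity:

```python
# Every replica sorts an SCC's members by the same key, so every
# replica executes the cycle's commands in the same order.
scc = [{"seq": 2, "rid": 1, "inst": 3},
       {"seq": 1, "rid": 2, "inst": 1},
       {"seq": 1, "rid": 0, "inst": 9}]
ordered = sorted(scc, key=lambda i: (i["seq"], i["rid"], i["inst"]))
print([i["rid"] for i in ordered])  # [0, 2, 1]
```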

Structure TarjanState:
    index: Integer = 0
    stack: Stack<InstanceId>
    on_stack: Set<InstanceId>
    indices: Map<InstanceId, Integer>
    lowlinks: Map<InstanceId, Integer>
    sccs: List<List<InstanceId>>

Procedure TarjanSCC(graph):
    state = TarjanState{}

    for each node in graph:
        if node not in state.indices:
            StrongConnect(state, graph, node)

    return state.sccs

Procedure StrongConnect(state, graph, v):
    state.indices[v] = state.index
    state.lowlinks[v] = state.index
    state.index++
    state.stack.push(v)
    state.on_stack.add(v)

    for each w in graph.successors(v):
        if w not in state.indices:
            // w has not been visited
            StrongConnect(state, graph, w)
            state.lowlinks[v] = min(state.lowlinks[v], state.lowlinks[w])
        else if w in state.on_stack:
            state.lowlinks[v] = min(state.lowlinks[v], state.indices[w])

    if state.lowlinks[v] == state.indices[v]:
        // v is the root of an SCC
        scc = []
        repeat:
            w = state.stack.pop()
            state.on_stack.remove(w)
            scc.append(w)
        until w == v
        state.sccs.append(scc)
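The pseudocode above translates almost line for line into a runnable sketch, which is handy for sanity-checking the SCC order:

```python
# A runnable Python version of the Tarjan pseudocode above. SCCs come
# out in reverse topological order, which is exactly the execution
# order EPaxos wants: dependencies before dependents.
def tarjan_scc(graph):
    index = [0]
    indices, lowlinks = {}, {}
    stack, on_stack, sccs = [], set(), []

    def strong_connect(v):
        indices[v] = lowlinks[v] = index[0]
        index[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in indices:
                strong_connect(w)
                lowlinks[v] = min(lowlinks[v], lowlinks[w])
            elif w in on_stack:
                lowlinks[v] = min(lowlinks[v], indices[w])
        if lowlinks[v] == indices[v]:
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for node in graph:
        if node not in indices:
            strong_connect(node)
    return sccs

# A and B depend on each other (a cycle from unioned deps); C depends on A.
deps = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(tarjan_scc(deps))  # [['B', 'A'], ['C']]
```

The A/B cycle — the kind produced by unioned dependencies — comes out as a single SCC, emitted before C, which depends on it.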

The Execution Blocking Problem

Here is a practical headache the paper does not dwell on: to execute an instance, you must first know the status of all its dependencies. If a dependency is not yet committed — perhaps the proposing replica is slow, or the PreAccept messages have not arrived — you are stuck. You cannot execute the instance, and you cannot execute anything that depends on it.

In the worst case, a single slow instance can block a cascade of executions. The solution is an explicit prepare: if you need to know whether instance (R, i) is committed and you have not heard, you run a Paxos Prepare (Phase 1) round on that instance to force a decision. This recovery protocol adds significant implementation complexity.

Procedure WaitUntilCommitted(instance_id):
    if instances[instance_id].status >= Committed:
        return  // already committed

    // Try to learn the committed value
    // Option 1: ask the owner replica
    reply = Ask(instance_id.replica, StatusRequest{instance: instance_id})
    if reply.status >= Committed:
        instances[instance_id] = reply.instance
        return

    // Option 2: run explicit Prepare (Paxos Phase 1) to recover
    RunExplicitPrepare(instance_id)

This recovery path is essentially a full Paxos round, which means that in the presence of failures or slow replicas, EPaxos execution can be significantly delayed. The paper presents this as a straightforward extension; implementers describe it as a major source of bugs.

The Correctness Story: A Humbling Episode

In 2020 — seven years after publication — Sutra et al. published a paper titled “On the correctness of Egalitarian Paxos” that identified a bug in the EPaxos execution algorithm. The specific issue was in how dependencies were handled during recovery after a replica failure.

The problem was subtle. Consider this scenario:

  1. Replica 1 proposes command A at instance (1, 1).
  2. Replica 2 proposes command B at instance (2, 1), with B conflicting with A.
  3. A’s PreAccept reaches some replicas but not others before replica 1 fails.
  4. A recovery procedure is initiated for instance (1, 1).

The bug manifested when the recovery procedure could commit instance (1, 1) with a set of dependencies that was inconsistent with what had been committed at other instances. This could lead to different replicas computing different execution orders — a violation of the fundamental safety property.

The fix required modifications to the recovery protocol, adding additional checks to ensure dependency consistency. The corrected protocol, sometimes called EPaxos*, is what practitioners should implement.

This episode is worth dwelling on for a moment. EPaxos was published at SOSP 2013, one of the most prestigious systems conferences. It was peer-reviewed, formally described, and accompanied by a proof sketch. And it still had a correctness bug that went undetected for seven years.

This is not an indictment of the authors — it is an indictment of the inherent difficulty of getting leaderless consensus right. The state space explosion that comes from allowing any replica to propose, combined with the dependency tracking mechanism, creates a protocol whose correctness is extraordinarily hard to verify by inspection.

Performance: The Geo-Distributed Sweet Spot

EPaxos was designed for one specific scenario: geo-distributed deployments where the leader bottleneck creates unacceptable latency asymmetry.

Non-conflicting commands, fast path (common case):

Client    Replica_R    Replica_A    Replica_B    Replica_C    Replica_D
  |           |            |            |            |            |
  |--Cmd----->|            |            |            |            |
  |           |--PreAcc--->|            |            |            |
  |           |--PreAcc---------------->|            |            |
  |           |--PreAcc----------------------------->|            |
  |           |--PreAcc-------------------------------------------------->|
  |           |<--OK-------|            |            |            |
  |           |<--OK--------------------|            |            |
  |           |<--OK---------------------------------|            |
  |           |  (3 matching replies = fast quorum for N=5)       |
  |<--Done----|            |            |            |            |
  |           |--Commit--->|            |            |            |
  |           |--Commit---------------->|            |            |
  |           |--Commit----------------------------->|            |
  |           |--Commit-------------------------------------------------->|

Total latency: one round trip to the nearest fast-quorum replicas. For a 5-node cluster, we need 3 out of 4 replicas to respond identically. In a geo-distributed deployment, this means the latency is determined by the third-closest replica, not by the leader’s location.
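Concretely, with some illustrative peer RTTs from a replica in Europe (all numbers assumed):

```python
# Fast-path latency is the RTT to the slowest peer whose reply the
# leader still needs. For N=5 the leader waits for 3 matching replies,
# so latency is the RTT to its 3rd-closest peer.
def fast_path_latency(peer_rtts_ms, replies_needed=3):
    return sorted(peer_rtts_ms)[replies_needed - 1]

# Peers in US-East, US-West, Asia, South America (assumed ms RTTs):
print(fast_path_latency([90, 140, 230, 200]))  # 200
```

The two farthest peers never matter on the fast path, which is why the leader's location stops dominating latency.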

Conflicting commands, slow path:

Client    Replica_R    Replica_A    Replica_B
  |           |            |            |
  |--Cmd----->|            |            |
  |           |--PreAcc--->|            |
  |           |--PreAcc---------------->|
  |           |<-PreAccOK--|  (different deps!)
  |           |<-PreAccOK---------------|
  |           |            |            |
  |           |  (deps disagree, go to slow path)
  |           |            |            |
  |           |--Accept--->|            |
  |           |--Accept---------------->|
  |           |<--AccOK----|            |
  |           |<--AccOK-----------------|
  |<--Done----|            |            |
  |           |--Commit--->|            |
  |           |--Commit---------------->|
Total latency: two round trips. Same as Multi-Paxos.

Comparison with Multi-Paxos

| Metric                    | Multi-Paxos                           | EPaxos (no conflict)                    | EPaxos (conflict)                 |
|---------------------------|---------------------------------------|-----------------------------------------|-----------------------------------|
| Round trips               | 1 (from leader)                       | 1 (from any replica)                    | 2 (from any replica)              |
| Optimal client latency    | Near-zero (if co-located with leader) | Near-zero (always, via nearest replica) | One RTT to farthest quorum member |
| Worst-case client latency | 2x cross-region RTT                   | 1x cross-region RTT                     | 2x cross-region RTT               |
| Throughput bottleneck     | Leader node                           | None (distributed)                      | Conflict-heavy keys               |
| Message complexity        | O(N) per command                      | O(N) per command                        | O(N) per command                  |
| Implementation complexity | Moderate                              | Very high                               | Very high                         |

The latency advantage is most dramatic for the common case of non-conflicting commands in geo-distributed settings. If your workload is 95% non-conflicting and your nodes are spread across continents, EPaxos can roughly halve your average latency compared to Multi-Paxos.

If your workload is heavily conflicting, or if your nodes are in the same datacenter (where cross-node latency is microseconds, not milliseconds), the advantage largely disappears and you are left paying the complexity tax for no benefit.

The Conflict Rate Problem

EPaxos’s fast path depends on low conflict rates. But what counts as a conflict?

In the original formulation, two commands conflict if they access the same key and at least one is a write. This means that a hot key — one that receives a disproportionate share of writes — will push commands to the slow path frequently.

Worse, the conflict check happens at the instance level, not the command level. If replica A has seen 1000 commands since the last time it synchronized with replica B, then any new command from B that conflicts with any of those 1000 commands will generate different dependency sets. The more out-of-sync replicas are, the more conflicts occur, even if the actual command conflict rate is low.

This creates an unfortunate feedback loop: high load leads to more in-flight commands, which leads to more dependency divergence, which leads to more slow-path executions, which increases latency, which increases the number of in-flight commands.

In practice, workloads with even moderate contention on popular keys spend a surprising amount of time on the slow path.
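A crude model shows why. If a fraction p of traffic hits one hot key and there are c commands in flight (both numbers assumed, and independence is a simplification), the chance a new hot-key command conflicts with at least one of them is:

```python
# Probability of at least one conflict among c in-flight commands,
# assuming each independently hits the hot key with probability p.
# This is a rough illustration, not a workload model.
def conflict_probability(p_hot, in_flight):
    return 1 - (1 - p_hot) ** in_flight

print(round(conflict_probability(0.01, 100), 3))  # 0.634
```

Even 1% hot-key traffic yields a roughly 63% conflict chance once 100 commands are in flight — exactly the feedback loop described above.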

Atlas, Caesar, and the Leaderless Zoo

EPaxos inspired a family of leaderless consensus protocols, each trying to address different limitations.

Atlas (2020) simplifies EPaxos while keeping dependency tracking. Its fast quorums scale with the number of failures f you actually want to tolerate (size ⌊N/2⌋ + f) rather than growing with N, and its recovery protocol is substantially simpler. Atlas can also take the fast path even when replies disagree, as long as each reported dependency is backed by enough members of the fast quorum, which makes it far more tolerant of moderate conflict rates than EPaxos.

Caesar (2017) introduces a technique for handling conflicts more gracefully. When conflicts are detected, Caesar uses a “wait-free” mechanism that avoids the slow path in more cases. The key insight is that if the conflicting commands can be ordered by their timestamps, no additional round trip is needed — the timestamp ordering is sufficient. Caesar only falls back to the slow path when commands have identical timestamps (rare in practice).

Mencius (2008, predating EPaxos) takes a different approach entirely. Instead of dependency tracking, Mencius pre-assigns log slots to replicas in a round-robin fashion. Replica 0 owns slots 0, 3, 6, …; replica 1 owns slots 1, 4, 7, …; and so on. Each replica runs Paxos independently for its own slots. This provides load balancing without the complexity of dependency tracking, but it means every replica must participate in every “round” — a slow replica slows everyone down.
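Mencius's slot assignment is just modular arithmetic; a one-line sketch:

```python
# Round-robin slot ownership: replica r owns every slot s with
# s % n_replicas == r. A sketch of the scheme described above.
def slot_owner(slot, n_replicas):
    return slot % n_replicas

print([slot_owner(s, 3) for s in range(7)])  # [0, 1, 2, 0, 1, 2, 0]
```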

Tempo (2021) attempts to achieve the best of both worlds: leaderless operation with simpler execution ordering. Tempo uses a clock-based approach where each command is assigned a timestamp, and execution order follows timestamp order. The protocol ensures that conflicting commands always receive ordered timestamps by having replicas propose timestamps and taking the maximum.

Comparative Summary

| Protocol    | Leader? | Fast-path RTTs  | Handles conflicts?       | Execution ordering            |
|-------------|---------|-----------------|--------------------------|-------------------------------|
| Multi-Paxos | Yes     | 1 (from leader) | N/A (total order)        | Log order                     |
| EPaxos      | No      | 1               | Slow path (2 RTTs)       | Dependency graph + Tarjan     |
| Atlas       | No      | 1               | Better than EPaxos       | Dependency graph (simplified) |
| Caesar      | No      | 1               | Timestamp ordering       | Timestamp-based               |
| Mencius     | No      | 1               | N/A (pre-assigned slots) | Slot order                    |

When Leaderless Consensus Actually Helps

After all this complexity, let us be honest about when leaderless consensus is worth the trouble.

It helps when:

  • Nodes are geo-distributed across multiple regions (cross-region latency >> intra-region latency).
  • The workload has low to moderate conflict rates.
  • Latency symmetry matters — all clients should see similar latency regardless of their region.
  • You have a team capable of implementing, testing, and debugging a protocol of this complexity.

It does not help when:

  • All nodes are in the same datacenter (the leader bottleneck is negligible).
  • The workload is heavily conflicting (you will spend most of your time on the slow path).
  • You need simplicity and auditability (Raft is dramatically easier to understand and verify).
  • Your team is not prepared for the implementation complexity.

It is complexity for complexity’s sake when:

  • You could achieve the same result with Multi-Paxos and client-side routing to the nearest leader in a multi-group setup.
  • Your actual bottleneck is not consensus latency but application logic, storage I/O, or network bandwidth.
  • You are building a prototype or a system that will be maintained by a small team.

Implementation Reality Check

The distance between the EPaxos paper and a production implementation is vast. Here are the things the paper does not adequately address:

Garbage collection. The instance space grows without bound. You need a mechanism to prune old instances once they have been executed by all replicas. This requires its own protocol — essentially a distributed garbage collection scheme with its own consistency requirements.

Snapshotting and recovery. When a new replica joins or a failed replica recovers, it needs to reconstruct the current state. With Multi-Paxos, this is relatively straightforward: transfer the log and snapshot. With EPaxos, you need to transfer the dependency graph, the execution state, and all un-garbage-collected instances across all replica columns.

Configuration changes. Adding or removing replicas changes the quorum sizes, including the fast-path quorum. EPaxos’s quorum requirements are more complex than Raft’s joint consensus, and the paper does not provide a complete configuration change protocol.

Read leases and linearizable reads. In Multi-Paxos, the leader can serve linearizable reads locally (with a lease or by confirming leadership). In EPaxos, there is no leader. Linearizable reads require either running a full consensus round or implementing a read protocol that checks with a quorum, which partly negates the latency advantage.

Testing. The state space of EPaxos is enormous. Every possible interleaving of PreAccept, Accept, and Commit messages across all replicas, combined with the dependency tracking, creates a combinatorial explosion. Jepsen-style testing is essential but not sufficient. Model checking (TLA+ or similar) is practically mandatory.

A Final Assessment

EPaxos is a genuinely brilliant protocol. The insight that non-conflicting commands can be fast-path committed from any replica in a single round trip is both correct and useful. The dependency tracking mechanism is elegant.

But brilliance and practicality are different things. The protocol’s complexity — particularly the execution ordering algorithm, the recovery protocol, and the correctness bugs found years after publication — means that for most practitioners, the safer choice is a well-implemented Multi-Paxos or Raft with thoughtful deployment topology.

If you are operating at a scale where geo-distributed consensus latency is your actual bottleneck, and you have a team with deep distributed systems expertise, EPaxos (or one of its descendants like Tempo or Atlas) may be worth the investment. For everyone else, the agony is not worth the optimization.

The leaderless dream is real. The leaderless implementation is a nightmare. Choose your suffering accordingly.