Raft: Paxos for Humans (Mostly)

In 2014, Diego Ongaro and John Ousterhout published “In Search of an Understandable Consensus Algorithm,” and the world of distributed systems let out a collective sigh of relief. Finally, someone had said the quiet part out loud: Paxos was too hard to understand, and this was a problem.

Raft was designed with a single overriding goal: understandability. Not performance. Not generality. Not minimality. Understandability. Ongaro and Ousterhout’s thesis was that if you cannot understand a consensus algorithm, you cannot implement it correctly, and if you cannot implement it correctly, it doesn’t matter how elegant the theory is. This is a radical claim in a field that prizes theoretical elegance, and the fact that it needed to be made tells you something about the state of the field in 2014.

The result is a protocol that is genuinely easier to understand than Paxos. It is also a protocol that is genuinely harder to understand than its proponents sometimes suggest. This chapter covers both halves of that truth.

Design Philosophy: Decomposition Over Minimality

Raft’s key design decision was to decompose consensus into three relatively independent subproblems:

  1. Leader election — How do you pick a leader?
  2. Log replication — How does the leader replicate its log to followers?
  3. Safety — How do you ensure that the log stays consistent?

Paxos, by contrast, interleaves these concerns in a way that is theoretically minimal but pedagogically opaque. Raft’s decomposition means you can understand each piece independently and then see how they fit together. This is not just a pedagogical trick — it also makes the implementation modular in a way that Paxos is not.

The other major design decision was to reduce the state space wherever possible. Raft eliminates log gaps (unlike Multi-Paxos), uses randomized timeouts instead of a separate leader election protocol, and enforces the election restriction so that an elected leader's log already contains every committed entry: the leader never needs to learn about committed entries it doesn't have, and log entries only ever flow from leader to followers. Each of these constraints reduces the number of cases the implementation must handle.

Terms and Roles

Raft divides time into terms, each identified by a monotonically increasing integer. Each term begins with an election and (if the election succeeds) is followed by normal operation under a single leader. Terms act as a logical clock — they tell you whether the information you’re looking at is current or stale.
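The term-as-logical-clock rule reduces to a three-way comparison that every RPC handler performs. A minimal sketch (the function name is illustrative, not from the chapter):

```python
def classify_message(msg_term: int, current_term: int) -> str:
    """Classify an incoming message's term relative to ours.

    Raft's rule: a lower term means the sender is stale, so reject it
    (and include our term in the reply so the sender can catch up);
    a higher term means WE are stale, so adopt the term and revert to
    follower; an equal term means the message is current.
    """
    if msg_term < current_term:
        return "reject_stale"
    if msg_term > current_term:
        return "step_down"
    return "current"

# A server at term 5 handling messages from terms 3, 7, and 5:
print(classify_message(3, 5))  # reject_stale
print(classify_message(7, 5))  # step_down
print(classify_message(5, 5))  # current
```

Every handler in this chapter (RequestVote, AppendEntries, and their responses) opens with exactly this comparison.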

There are three roles:

  • Leader — Handles all client requests, replicates log entries, sends heartbeats.
  • Follower — Passive. Responds to requests from leaders and candidates. If it doesn’t hear from a leader for a while, it becomes a candidate.
  • Candidate — A follower that is trying to become the leader by running an election.

Every server starts as a follower. This is the steady state for most servers most of the time.

Leader Election

Raft’s leader election uses randomized timeouts; the mechanism is one of the protocol’s genuine contributions to the field.

The Mechanism

Each follower maintains an election timer. When the timer expires without hearing from a leader (via heartbeat or AppendEntries), the follower:

  1. Increments its current term.
  2. Transitions to candidate state.
  3. Votes for itself.
  4. Sends RequestVote to all other servers.

class RaftNode:
    // Persistent state (MUST survive restarts)
    persistent:
        current_term: int = 0
        voted_for: NodeId = null
        log: List<LogEntry> = []

    // Volatile state
    volatile:
        commit_index: int = 0
        last_applied: int = 0
        role: Leader | Follower | Candidate = Follower
        election_timer: Timer

    function on_election_timeout():
        self.role = Candidate
        self.current_term += 1
        self.voted_for = self.my_id
        persist(self.current_term, self.voted_for)

        self.votes_received = {self.my_id}  // Vote for self

        last_log_index = len(self.log)
        last_log_term = self.log[last_log_index].term if self.log else 0

        for server in self.all_servers:
            if server != self.my_id:
                send(server, RequestVote {
                    term: self.current_term,
                    candidate_id: self.my_id,
                    last_log_index: last_log_index,
                    last_log_term: last_log_term
                })

        // Reset election timer with new random timeout
        self.reset_election_timer()

Voting Rules

A server grants its vote if and only if:

  1. The candidate’s term is at least as large as the voter’s current term.
  2. The voter hasn’t already voted for someone else in this term.
  3. The candidate’s log is at least as up-to-date as the voter’s log.

Rule 3 is the election restriction and is crucial for safety. “At least as up-to-date” means: the candidate’s last log entry has a higher term than the voter’s last log entry, OR the terms are equal and the candidate’s log is at least as long.

function on_request_vote(msg):
    if msg.term < self.current_term:
        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: false
        })
        return

    if msg.term > self.current_term:
        self.step_down(msg.term)  // Become follower

    // Check if we can grant the vote
    can_vote = (self.voted_for == null or self.voted_for == msg.candidate_id)

    // Check log up-to-date-ness
    my_last_index = len(self.log)
    my_last_term = self.log[my_last_index].term if self.log else 0

    log_ok = (msg.last_log_term > my_last_term or
              (msg.last_log_term == my_last_term and
               msg.last_log_index >= my_last_index))

    if can_vote and log_ok:
        self.voted_for = msg.candidate_id
        persist(self.current_term, self.voted_for)
        self.reset_election_timer()  // Grant implies reset

        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: true
        })
    else:
        reply(msg.from, VoteResponse {
            term: self.current_term,
            vote_granted: false
        })

function step_down(new_term):
    self.current_term = new_term
    self.role = Follower
    self.voted_for = null
    persist(self.current_term, self.voted_for)
    self.reset_election_timer()

Why Randomized Timeouts Work

The election timeout is chosen randomly from a range, typically [150ms, 300ms]. This randomization serves two purposes:

  1. Breaks symmetry. Without randomization, all followers would time out simultaneously and split the vote. With randomization, one follower typically times out first and wins the election before others even start.

  2. Avoids livelock. If two candidates keep splitting the vote, the random timeouts ensure that eventually one will start its election early enough to win before the other starts.

This is a much simpler solution to the leader election problem than Paxos’s approach (which doesn’t really have one) or VR’s deterministic rotation (which requires a separate mechanism to skip over failed leaders). The downside is that it’s probabilistic — in theory, you could get unlucky and have repeated split votes. In practice, this essentially never happens because the probability drops exponentially with each round.
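The split-vote risk can be put in numbers with a quick Monte Carlo sketch (the function name and the 10ms broadcast-time figure are illustrative assumptions, not from the chapter):

```python
import random

def split_vote_probability(n_servers=5, timeout_range=(150, 300),
                           broadcast_time=10, trials=100_000, seed=0):
    """Estimate the chance that two election timers fire within one
    broadcast time of each other (a window in which votes can split).

    Model: all timers fresh, drawn uniformly from timeout_range; the
    first candidate wins unless a rival fires before it can collect
    a majority of votes (~one broadcast_time later).
    """
    rng = random.Random(seed)
    lo, hi = timeout_range
    collisions = 0
    for _ in range(trials):
        fires = sorted(rng.uniform(lo, hi) for _ in range(n_servers))
        if fires[1] - fires[0] < broadcast_time:
            collisions += 1
    return collisions / trials

p = split_vote_probability()
print(f"P(split in one round) ~= {p:.2f}")
```

With these parameters the single-round split probability is noticeable (roughly 0.3), but rounds are independent, so the chance of k consecutive splits decays geometrically; that is the sense in which repeated split votes "essentially never" happen.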

Handling the Election Response

function on_vote_response(msg):
    if msg.term > self.current_term:
        self.step_down(msg.term)
        return

    if self.role != Candidate or msg.term != self.current_term:
        return  // Stale response

    if msg.vote_granted:
        self.votes_received.add(msg.from)

        if len(self.votes_received) > len(self.all_servers) / 2:
            // Won the election!
            self.become_leader()

function become_leader():
    self.role = Leader

    // Initialize leader state
    for server in self.all_servers:
        self.next_index[server] = len(self.log) + 1
        self.match_index[server] = 0

    // Send initial empty AppendEntries (heartbeat) to assert leadership
    self.send_heartbeats()

    // Optionally: append a no-op entry to commit entries from previous terms
    // (This is an important optimization discussed later)

Log Replication

Once a leader is elected, it handles all client requests. Each request is appended to the leader’s log and replicated to followers via AppendEntries RPCs.

AppendEntries: The Workhorse

AppendEntries serves double duty: it replicates log entries AND serves as a heartbeat (when sent with no entries). The leader sends it periodically to all followers.

function leader_send_append_entries(follower_id):
    prev_log_index = self.next_index[follower_id] - 1
    prev_log_term = self.log[prev_log_index].term if prev_log_index > 0 else 0

    // Entries to send: everything from next_index onward
    entries = self.log[self.next_index[follower_id]:]

    send(follower_id, AppendEntries {
        term: self.current_term,
        leader_id: self.my_id,
        prev_log_index: prev_log_index,
        prev_log_term: prev_log_term,
        entries: entries,
        leader_commit: self.commit_index
    })

The Consistency Check

Each AppendEntries message includes prev_log_index and prev_log_term — the index and term of the log entry immediately preceding the new entries. The follower checks that its log matches at this point. If it doesn’t, the follower rejects the AppendEntries, and the leader decrements next_index and retries.

This is the log matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. It’s enforced inductively: the base case is the empty log (trivially matching), and each AppendEntries extends the induction by checking the predecessor.

function on_append_entries(msg):
    // Reset election timer — we heard from a leader
    self.reset_election_timer()

    if msg.term < self.current_term:
        reply(msg.from, AppendEntriesResponse {
            term: self.current_term,
            success: false,
            match_index: 0
        })
        return

    if msg.term > self.current_term:
        self.step_down(msg.term)

    self.role = Follower
    self.current_leader = msg.leader_id

    // Consistency check
    if msg.prev_log_index > 0:
        if msg.prev_log_index > len(self.log):
            // We don't have the predecessor entry
            reply(msg.from, AppendEntriesResponse {
                term: self.current_term,
                success: false,
                // Optimization: tell leader our log length
                // so it can skip ahead
                match_index: len(self.log)
            })
            return

        if self.log[msg.prev_log_index].term != msg.prev_log_term:
            // Predecessor entry exists but has wrong term — conflict
            // Optimization: find the first index of the conflicting term
            // and tell the leader to skip back to there
            conflict_term = self.log[msg.prev_log_index].term
            first_index = msg.prev_log_index
            while first_index > 1 and self.log[first_index - 1].term == conflict_term:
                first_index -= 1

            // Delete the conflicting entries
            self.log = self.log[:msg.prev_log_index]

            reply(msg.from, AppendEntriesResponse {
                term: self.current_term,
                success: false,
                match_index: first_index - 1
            })
            return

    // Append new entries (overwriting any conflicting entries)
    for i, entry in enumerate(msg.entries):
        index = msg.prev_log_index + 1 + i
        if index <= len(self.log):
            if self.log[index].term != entry.term:
                // Conflict — truncate and append
                self.log = self.log[:index]
                self.log.append(entry)
            // else: already have this entry, skip
        else:
            self.log.append(entry)

    persist(self.log)

    // Update commit index
    if msg.leader_commit > self.commit_index:
        self.commit_index = min(msg.leader_commit, len(self.log))
        self.apply_committed_entries()

    reply(msg.from, AppendEntriesResponse {
        term: self.current_term,
        success: true,
        match_index: msg.prev_log_index + len(msg.entries)
    })

Leader Handling of Responses

function on_append_entries_response(msg, from):
    if msg.term > self.current_term:
        self.step_down(msg.term)
        return

    if self.role != Leader or msg.term != self.current_term:
        return  // Stale

    if msg.success:
        self.next_index[from] = msg.match_index + 1
        self.match_index[from] = msg.match_index
        self.maybe_advance_commit_index()
    else:
        // Follower's log was inconsistent — back up
        self.next_index[from] = max(1, msg.match_index + 1)
        // Retry immediately
        self.leader_send_append_entries(from)

The Commit Mechanism

A log entry is committed when the leader has replicated it to a majority of servers. But there’s a critical subtlety: a leader can only commit entries from its own term.

This is the “Figure 8 problem” from the Raft paper, and it catches almost everyone off guard. Consider this scenario:

Time:  T1    T2    T3    T4    T5
S1:   [1]   [1,2] [1,2] [1,2] -- crashes --
S2:   [1]   [1,2] [1,2]  ...  [1,2,4]  -- becomes leader term 4
S3:   [1]   [1]   [1,3]  ...  [1,3]    -- was leader term 3
S4:   [1]   [1]   [1]    ...  [1,2,4]
S5:   [1]   [1]   [1]    ...  [1,2,4]

If S1 was leader in term 2 and replicated entry 2 only to S2 before crashing, and S3 became leader in term 3 but appended entry 3 only to its own log before crashing, then S2 can win the term-4 election and replicate the term-2 entry to S4 and S5. At that instant, entry 2 sits on a majority (S2, S4, S5), but it is not safe to treat it as committed: if S2 crashed right then, S3 could still win a later election (its last log term, 3, beats their 2) and overwrite entry 2 with entry 3. Only once S2 replicates an entry from its own term 4 to a majority, as the diagram's final column shows, does entry 2 become durably committed.

The fix: a leader never commits entries from previous terms directly. It only commits them indirectly, by committing a new entry from its own term. Once a current-term entry is committed at a given index, all preceding entries are also committed (by the log matching property).

function maybe_advance_commit_index():
    // Find the highest index replicated to a majority
    for n in range(len(self.log), self.commit_index, -1):
        if self.log[n].term == self.current_term:  // CRITICAL: only current term
            count = 1  // Count self
            for server in self.all_servers:
                if server != self.my_id and self.match_index[server] >= n:
                    count += 1

            if count > len(self.all_servers) / 2:
                self.commit_index = n
                self.apply_committed_entries()
                return

function apply_committed_entries():
    while self.last_applied < self.commit_index:
        self.last_applied += 1
        result = self.state_machine.apply(self.log[self.last_applied].command)

        // If we're the leader, respond to the client
        if self.role == Leader:
            respond_to_client(self.log[self.last_applied].client_info, result)

Safety: Intuition for the Proof

Raft’s safety property (the paper calls it State Machine Safety) is: if a server has applied a log entry at a given index, no other server will ever apply a different entry at that index.

The proof relies on two properties:

Leader Completeness. If a log entry is committed in a given term, that entry is present in the logs of all leaders of higher-numbered terms.

Why? Because:

  1. A committed entry is on a majority of servers.
  2. A new leader must receive votes from a majority.
  3. Any two majorities overlap.
  4. The election restriction ensures the new leader’s log is at least as up-to-date as any voter’s.
  5. Therefore, the new leader has the committed entry.

Log Matching. If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.

Why? Because AppendEntries checks the predecessor’s index and term before appending. This check creates an inductive chain: if entry i matches, then entry i-1 was checked when entry i was replicated, so entry i-1 matches, and so on back to the beginning.

Together, these two properties ensure that once a value is committed, it is permanent and agreed upon by all future leaders.

The Things That Make Raft NOT as Easy as Advertised

The core protocol — leader election and log replication — is genuinely simpler than Paxos. But building a production Raft system requires solving several problems that the paper either addresses briefly or punts on entirely.

Membership Changes

The original Raft paper proposed joint consensus for membership changes: when changing from configuration C_old to C_new, the system transitions through an intermediate configuration C_old,new that requires majorities from BOTH configurations.

Time:   ─────────────────────────────────────────>
Config: C_old ──> C_old,new ──> C_new
                  │              │
            Committed here   Committed here

Joint consensus is safe but complex. You need to handle the case where the leader is not in the new configuration, where the joint configuration spans a leader change, and where a new server needs to catch up before it can vote.

Because of this complexity, the Raft authors later proposed single-server changes: only add or remove one server at a time. This is simpler because any two majorities of configurations that differ by one server will overlap.

function add_server(new_server):
    // First, bring the new server up to date
    while not is_caught_up(new_server):
        send_snapshot_or_entries(new_server)
        wait(CATCHUP_CHECK_INTERVAL)

    // Propose the configuration change as a regular log entry
    new_config = self.current_config + {new_server}
    self.replicate(ConfigChange { config: new_config })

    // Configuration takes effect as soon as the entry is appended
    // (not when it's committed!)

function remove_server(old_server):
    new_config = self.current_config - {old_server}
    self.replicate(ConfigChange { config: new_config })

    // If we're removing ourselves, step down after committing
    if old_server == self.my_id:
        self.step_down_after_commit()

The single-server approach seems simpler, but it has its own subtleties. You cannot have two pending configuration changes at the same time (you must wait for each to commit). And there’s a tricky edge case when removing a server that is the current leader.
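The majority-overlap argument behind single-server changes is small enough to brute-force. This sketch (helper names hypothetical) checks every single-server addition and removal against every pair of majorities:

```python
from itertools import combinations

def majorities(servers):
    """All majority quorums of a server set."""
    n = len(servers)
    need = n // 2 + 1
    return [set(c) for k in range(need, n + 1)
            for c in combinations(sorted(servers), k)]

def single_change_quorums_overlap(old):
    """Verify that every majority of C_old intersects every majority
    of C_new, for every configuration reachable by adding or removing
    exactly one server."""
    old = set(old)
    candidates = [old | {max(old) + 1}]          # one addition
    candidates += [old - {s} for s in old]       # each removal
    for new in candidates:
        if not new:
            continue
        for q1 in majorities(old):
            for q2 in majorities(new):
                if not (q1 & q2):
                    return False
    return True

# Exhaustively verified for 2- to 7-node clusters:
for n in range(2, 8):
    assert single_change_quorums_overlap(range(n))
print("single-server changes preserve quorum overlap")
```

Changing two servers at once has no such guarantee, which is exactly why two configuration changes must never be in flight simultaneously.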

Log Compaction

The log grows without bound. You need to compact it. Raft describes snapshotting: periodically, take a snapshot of the state machine and discard log entries up to the snapshot point.

function take_snapshot():
    if self.last_applied - self.last_snapshot_index < SNAPSHOT_THRESHOLD:
        return

    snapshot = Snapshot {
        last_included_index: self.last_applied,
        last_included_term: self.log[self.last_applied].term,
        state: serialize(self.state_machine),
        config: self.current_config
    }

    write_snapshot_to_disk(snapshot)

    // Discard log entries up to the snapshot. (Real implementations
    // keep indexing the log by absolute position, offset by
    // last_snapshot_index; this slice glosses over that bookkeeping.)
    self.log = self.log[self.last_applied + 1:]
    self.last_snapshot_index = snapshot.last_included_index
    self.last_snapshot_term = snapshot.last_included_term

When a leader needs to send entries to a follower that has fallen behind the snapshot point, it sends the snapshot instead:

function send_install_snapshot(follower_id):
    send(follower_id, InstallSnapshot {
        term: self.current_term,
        leader_id: self.my_id,
        last_included_index: self.last_snapshot_index,
        last_included_term: self.last_snapshot_term,
        data: self.snapshot_data,
        // May be chunked for large snapshots
        offset: 0,
        done: true
    })

The follower replaces its state with the snapshot and discards its entire log up to the snapshot point. This is simple in concept but complex in implementation — snapshotting a large state machine while continuing to process requests requires copy-on-write semantics or a quiescent period, and transferring a multi-gigabyte snapshot is a non-trivial network operation.

The Pre-Vote Protocol

Raft has a problem: a partitioned server keeps incrementing its term (because its election timer keeps firing and it keeps starting elections it can never win). When the partition heals, this server has a very high term number. When it contacts the cluster, other servers see the high term and step down from leadership, causing unnecessary leader changes.

The pre-vote protocol (proposed in Ongaro’s dissertation but not in the original paper) addresses this. Before starting a real election, a candidate sends a “pre-vote” request that doesn’t increment the term. Other servers respond based on whether they would vote for this candidate — but they don’t actually record a vote or change their term. Only if the pre-vote succeeds does the candidate proceed with a real election.

function on_election_timeout_with_prevote():
    // Phase 0: Pre-vote
    pre_vote_term = self.current_term + 1  // Hypothetical next term

    for server in self.all_servers:
        if server != self.my_id:
            send(server, PreVote {
                term: pre_vote_term,
                candidate_id: self.my_id,
                last_log_index: len(self.log),
                last_log_term: self.log[-1].term if self.log else 0
            })

    // Collect pre-votes
    pre_vote_responses = wait_for_responses(ELECTION_TIMEOUT)

    if count_granted(pre_vote_responses) + 1 > len(self.all_servers) / 2:  // +1 counts our own pre-vote
        // Pre-vote succeeded — proceed with real election
        self.start_real_election()
    else:
        // Pre-vote failed — we're probably partitioned
        self.reset_election_timer()  // Try again later

function on_pre_vote(msg):
    // Respond based on whether we WOULD vote, but don't record anything
    would_vote = (msg.term >= self.current_term and
                  is_log_up_to_date(msg.last_log_index, msg.last_log_term) and
                  (self.current_leader == null or
                   now() - self.last_heartbeat > ELECTION_TIMEOUT))

    reply(msg.from, PreVoteResponse {
        term: msg.term,
        vote_granted: would_vote
    })
    // NOTE: we do NOT update current_term or voted_for

Pre-vote is now considered essential for production Raft implementations. etcd added it. CockroachDB uses it. TiKV uses it. Its absence from the original paper is one of those cases where the protocol works fine in the common case but has a sharp edge in a scenario production systems hit regularly: a network partition that heals.

Learner Nodes (Non-Voting Members)

When adding a new server to a Raft cluster, the new server has an empty log and needs to catch up. During catchup, it can’t usefully participate in consensus — it would just slow things down. Worse, if you add it to the configuration immediately, it changes the majority requirement, potentially making the cluster unable to commit new entries.

Learner nodes (also called non-voting members or observers) are servers that receive log replication but don’t vote in elections or count toward the commit quorum. They’re used to stage new servers until they’re caught up.

function add_server_with_learner(new_server):
    // Step 1: Add as learner (non-voting)
    self.learners.add(new_server)

    // Step 2: Replicate log to learner (same as a follower)
    while not is_caught_up(new_server):
        send_append_entries(new_server)
        wait(CHECK_INTERVAL)

    // Step 3: Promote to voting member
    self.learners.remove(new_server)
    new_config = self.current_config + {new_server}
    self.replicate(ConfigChange { config: new_config })

Read-Only Operations

Linearizable reads are harder than they look. The naive approach — just read from the leader — is unsafe because the leader might be partitioned and a new leader might have been elected.

Raft offers two approaches:

ReadIndex. The leader records its commit index, confirms it’s still the leader by sending heartbeats to a majority, and then serves the read once the recorded commit index has been applied. (A new leader must first commit an entry from its own term, typically a no-op, before its commit index is guaranteed to be current.)

function linearizable_read(query):
    if self.role != Leader:
        redirect_to_leader(query)
        return

    read_index = self.commit_index

    // Confirm we're still leader
    heartbeat_acks = send_heartbeats_and_wait()
    if len(heartbeat_acks) < majority:
        // We might not be leader anymore
        return Error("not leader")

    // Wait for state machine to catch up
    wait_until(self.last_applied >= read_index)

    return self.state_machine.query(query)

Lease-based reads. If the leader has received heartbeat responses from a majority within the last election timeout period, it assumes it’s still the leader and serves reads locally. This is faster but depends on bounded clock drift, which not everyone is comfortable assuming.
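A minimal sketch of the lease check, assuming a known clock-drift bound (the names and the drift model are illustrative, not from any particular implementation):

```python
def lease_still_valid(last_quorum_ack: float, now: float,
                      election_timeout: float,
                      max_clock_drift: float) -> bool:
    """Can the leader serve a local read without contacting a majority?

    Safe only if no other server could have won an election since the
    last majority acknowledgment. The lease window is shrunk by the
    assumed drift bound; if clocks can drift more than max_clock_drift,
    the lease is unsound and reads can violate linearizability.
    """
    lease_duration = election_timeout * (1 - max_clock_drift)
    return now - last_quorum_ack < lease_duration

# A leader that heard from a majority 100ms ago, with a 150ms minimum
# election timeout and an assumed 1% drift bound:
print(lease_still_valid(0.0, 0.100, 0.150, 0.01))  # True
print(lease_still_valid(0.0, 0.160, 0.150, 0.01))  # False
```

The entire safety argument hangs on `max_clock_drift` actually holding in production, which is precisely the assumption that makes some operators prefer ReadIndex despite its extra round trip.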

Real Implementations

etcd/raft

etcd’s Raft implementation (in Go) is probably the most widely used Raft library. It’s used by etcd itself, Kubernetes (via etcd), CockroachDB, and TiKV.

Key characteristics:

  • Implements the core Raft protocol faithfully.
  • Adds pre-vote, learner nodes, and leader transfer.
  • Does NOT implement the transport layer — it’s a library that produces messages, and the application is responsible for sending them. This is an excellent design decision that makes it adaptable to different network stacks.
  • Does NOT implement persistence — the application provides a storage interface. Same rationale.

CockroachDB

CockroachDB uses etcd/raft but adds significant extensions:

  • Range-level Raft. Each data range (a contiguous keyspace) is a separate Raft group. A single CockroachDB cluster might have tens of thousands of Raft groups.
  • Multi-Raft. To avoid the overhead of thousands of independent Raft groups each sending their own heartbeats, CockroachDB batches Raft messages between nodes.
  • Joint consensus. CockroachDB uses joint consensus for membership changes rather than single-server changes.
  • Epoch-based leases. Range leases are based on epochs rather than wall-clock time, avoiding clock-dependency issues.

TiKV

TiKV (the storage engine for TiDB) also uses etcd/raft with its own extensions:

  • Batching and pipelining. TiKV aggressively batches Raft messages and pipelines requests.
  • Async apply. The state machine application is asynchronous — committed entries are applied in a separate thread from the Raft protocol thread. This improves throughput but requires careful handling of read requests.
  • Multi-Raft with region-based partitioning, similar to CockroachDB.

Performance Comparison with Multi-Paxos

In steady state (stable leader, no failures):

Metric                         Raft                              Multi-Paxos
Messages per write             2(n-1)                            2(n-1)
Round trips per write          1                                 1
Fsyncs per write               1 (leader) + f (followers)        Same
Read latency (linearizable)    1 RTT (ReadIndex) or 0 (lease)    Same (lease-based)
Leader change latency          ~election timeout (~300ms)        ~Phase 1 RTT (~2ms)

The one notable difference is leader change latency. Raft’s randomized election timeout means there’s a 150-300ms delay before a new leader is elected after a failure. Multi-Paxos can elect a new leader in a single round trip (the Phase 1 Prepare/Promise), which might be only a few milliseconds in a LAN.

In practice, this difference rarely matters because leader failures are (should be) rare events. But in systems that are extremely sensitive to failover latency, Multi-Paxos has an advantage.
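The failover gap can be put in numbers. Under the simplifying assumption that the surviving followers' timers were freshly reset when the leader died, the expected wait before any election even starts is the expected minimum of n uniform draws:

```python
def expected_first_timeout(n_followers: int,
                           lo: float = 0.150, hi: float = 0.300) -> float:
    """Expected time until the first of n fresh election timers fires,
    for timers drawn uniformly from [lo, hi] seconds.

    E[min of n iid U(lo, hi)] = lo + (hi - lo) / (n + 1).
    This is a lower-bound model: in practice a timer may have been
    reset partway through, adding up to one heartbeat interval.
    """
    return lo + (hi - lo) / (n_followers + 1)

# 4 surviving followers in a 5-node cluster with 150-300ms timeouts:
print(round(expected_first_timeout(4), 3))  # 0.18 => ~180ms before the election starts
```

Add the vote-collection round trip on top of that, and Raft's typical failover sits around 200ms, versus the few milliseconds of a Multi-Paxos Phase 1.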

Why Raft Won the Mindshare War

Raft’s dominance in the consensus algorithm mindshare is not primarily a technical achievement. The protocol is good — clean, well-specified, and practical. But its success is primarily a pedagogical and community achievement.

The paper is readable. It’s 18 pages, clearly structured, with detailed examples and figures. The authors explicitly optimized for understandability and it shows.

The visualization. The Raft visualization (thesecretlivesofdata.com/raft) is one of the best algorithm visualizations ever created. It lets you interactively step through leader election, log replication, and failure scenarios. This single resource has probably done more for consensus algorithm education than any paper.

Reference implementations. The Raft paper was accompanied by reference implementations, and the clear specification encouraged many more. The Raft website lists over 60 implementations in various languages.

Timing. Raft arrived at a time when the industry desperately needed an understandable consensus algorithm. Docker and Kubernetes were emerging, etcd needed a consensus protocol, and the distributed database movement was accelerating. Raft was the right protocol at the right time.

Explicit system design. Unlike Multi-Paxos, Raft specifies a complete system: leader election, log replication, safety, membership changes, log compaction. You can implement Raft from the paper alone (with effort). You cannot implement Multi-Paxos from the papers alone (without also inventing significant parts of the system yourself).

The combination of these factors created a self-reinforcing cycle: more people understood Raft, so more people implemented it, so more production systems used it, so more blog posts were written about it, so more people learned it. Paxos, for all its theoretical depth, could not compete with this cycle.

Where Raft Falls Short

Raft is not perfect. The design choices that make it understandable also constrain it:

Strong leader. All writes go through the leader. In a geo-distributed deployment, this means all writes incur the latency to the leader’s region. Leaderless Paxos variants like EPaxos (Chapter 13) can commit writes at any replica.

No log gaps. The contiguous log simplifies reasoning, but it forces strictly in-order commit: entry i+1 cannot be committed or applied before entry i, so one slow or oversized entry delays everything behind it. Out-of-order commit, which some Paxos variants permit, does not fit the model. This is rarely a problem in practice but is a real limitation.

Leader bottleneck. The leader must send AppendEntries to every follower and process every response, so its network bandwidth and CPU saturate before the followers’ do. The cost grows linearly with cluster size, which is one reason Raft clusters are usually kept to 3, 5, or 7 voting members.

Rigid term structure. Raft’s term-based reasoning is clean but inflexible. Certain optimizations that are natural in Multi-Paxos (like out-of-order commits or flexible quorums) don’t fit naturally into Raft’s model.

These limitations are real but usually acceptable. For most systems, the benefits of understandability and implementation quality outweigh the theoretical performance advantages of more flexible protocols.

The Honest Assessment

Raft is not “Paxos for humans.” It is a well-designed consensus protocol with excellent documentation that solves the same problem as Multi-Paxos with similar performance. It makes some design choices that simplify understanding at the cost of flexibility, and it was accompanied by an unprecedented pedagogical effort that made it accessible to a broad audience.

If you are building a new system that needs consensus, Raft is almost certainly the right choice. Not because it’s the best consensus algorithm (there is no “best”), but because it has the largest community, the most reference implementations, the most operational experience, and the most educational resources. In distributed systems, being well-understood is a feature that trumps almost every theoretical advantage.

Paxos is more general. VR is more complete. EPaxos is more flexible. But Raft is the one your team can implement correctly, debug effectively, and operate confidently. In the agony of consensus algorithms, that might be the thing that matters most.