Raft: Paxos for Humans (Mostly)
In 2014, Diego Ongaro and John Ousterhout published “In Search of an Understandable Consensus Algorithm,” and the world of distributed systems let out a collective sigh of relief. Finally, someone had said the quiet part out loud: Paxos was too hard to understand, and this was a problem.
Raft was designed with a single overriding goal: understandability. Not performance. Not generality. Not minimality. Understandability. Ongaro and Ousterhout’s thesis was that if you cannot understand a consensus algorithm, you cannot implement it correctly, and if you cannot implement it correctly, it doesn’t matter how elegant the theory is. This is a radical claim in a field that prizes theoretical elegance, and the fact that it needed to be made tells you something about the state of the field in 2014.
The result is a protocol that is genuinely easier to understand than Paxos. It is also a protocol that is genuinely harder to understand than its proponents sometimes suggest. This chapter covers both halves of that truth.
Design Philosophy: Decomposition Over Minimality
Raft’s key design decision was to decompose consensus into three relatively independent subproblems:
- Leader election — How do you pick a leader?
- Log replication — How does the leader replicate its log to followers?
- Safety — How do you ensure that the log stays consistent?
Paxos, by contrast, interleaves these concerns in a way that is theoretically minimal but pedagogically opaque. Raft’s decomposition means you can understand each piece independently and then see how they fit together. This is not just a pedagogical trick — it also makes the implementation modular in a way that Paxos is not.
The other major design decision was to reduce the state space wherever possible. Raft eliminates log gaps (unlike Multi-Paxos), uses randomized timeouts instead of a separate leader election protocol, and enforces the leader completeness property — the elected leader always has the most complete log, so the leader never needs to learn about committed entries it doesn’t have. Each of these constraints reduces the number of cases the implementation must handle.
Terms and Roles
Raft divides time into terms, each identified by a monotonically increasing integer. Each term begins with an election and (if the election succeeds) is followed by normal operation under a single leader. Terms act as a logical clock — they tell you whether the information you’re looking at is current or stale.
There are three roles:
- Leader — Handles all client requests, replicates log entries, sends heartbeats.
- Follower — Passive. Responds to requests from leaders and candidates. If it doesn’t hear from a leader for a while, it becomes a candidate.
- Candidate — A follower that is trying to become the leader by running an election.
Every server starts as a follower. This is the steady state for most servers most of the time.
Leader Election
Raft’s leader election uses randomized timeouts, and it is one of the protocol’s genuine contributions to the field.
The Mechanism
Each follower maintains an election timer. When the timer expires without hearing from a leader (via heartbeat or AppendEntries), the follower:
- Increments its current term.
- Transitions to candidate state.
- Votes for itself.
- Sends RequestVote to all other servers.
class RaftNode:
// Persistent state (MUST survive restarts)
persistent:
current_term: int = 0
voted_for: NodeId = null
log: List<LogEntry> = []
// Volatile state
volatile:
commit_index: int = 0
last_applied: int = 0
role: Leader | Follower | Candidate = Follower
election_timer: Timer
function on_election_timeout():
self.role = Candidate
self.current_term += 1
self.voted_for = self.my_id
persist(self.current_term, self.voted_for)
self.votes_received = {self.my_id} // Vote for self
last_log_index = len(self.log)
last_log_term = self.log[last_log_index].term if self.log else 0
for server in self.all_servers:
if server != self.my_id:
send(server, RequestVote {
term: self.current_term,
candidate_id: self.my_id,
last_log_index: last_log_index,
last_log_term: last_log_term
})
// Reset election timer with new random timeout
self.reset_election_timer()
Voting Rules
A server grants its vote if and only if:
- The candidate’s term is at least as large as the voter’s current term.
- The voter hasn’t already voted for someone else in this term.
- The candidate’s log is at least as up-to-date as the voter’s log.
The third rule is the election restriction, and it is crucial for safety. “At least as up-to-date” means: the candidate’s last log entry has a higher term than the voter’s last log entry, OR the terms are equal and the candidate’s log is at least as long.
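The comparison is small enough to isolate as a pure function. A runnable sketch of the rule just stated (the function name and parameters are mine, not the paper's):

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Election restriction: is the candidate's log at least as
    up-to-date as mine? Compare last terms first, then lengths."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# A higher last term beats a longer log:
assert log_is_up_to_date(3, 2, 2, 10) is True
# With equal last terms, the candidate's log must be at least as long:
assert log_is_up_to_date(2, 5, 2, 5) is True
assert log_is_up_to_date(2, 4, 2, 5) is False
```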
function on_request_vote(msg):
if msg.term < self.current_term:
reply(msg.from, VoteResponse {
term: self.current_term,
vote_granted: false
})
return
if msg.term > self.current_term:
self.step_down(msg.term) // Become follower
// Check if we can grant the vote
can_vote = (self.voted_for == null or self.voted_for == msg.candidate_id)
// Check log up-to-date-ness
my_last_term = self.log[-1].term if self.log else 0
my_last_index = len(self.log)
log_ok = (msg.last_log_term > my_last_term or
(msg.last_log_term == my_last_term and
msg.last_log_index >= my_last_index))
if can_vote and log_ok:
self.voted_for = msg.candidate_id
persist(self.current_term, self.voted_for)
self.reset_election_timer() // Grant implies reset
reply(msg.from, VoteResponse {
term: self.current_term,
vote_granted: true
})
else:
reply(msg.from, VoteResponse {
term: self.current_term,
vote_granted: false
})
function step_down(new_term):
self.current_term = new_term
self.role = Follower
self.voted_for = null
persist(self.current_term, self.voted_for)
self.reset_election_timer()
Why Randomized Timeouts Work
The election timeout is chosen randomly from a range, typically [150ms, 300ms]. This randomization serves two purposes:
- Breaks symmetry. Without randomization, all followers would time out simultaneously and split the vote. With randomization, one follower typically times out first and wins the election before others even start.
- Avoids livelock. If two candidates keep splitting the vote, the random timeouts ensure that eventually one will start its election early enough to win before the other starts.
This is a much simpler solution to the leader election problem than Paxos’s approach (which doesn’t really have one) or VR’s deterministic rotation (which requires a separate mechanism to skip over failed leaders). The downside is that it’s probabilistic — in theory, you could get unlucky and have repeated split votes. In practice, this essentially never happens because the probability drops exponentially with each round.
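The "essentially never happens" claim is easy to sanity-check with a quick simulation: estimate how often the earliest timeout fails to beat the second-earliest by at least one message delay. A sketch — the message delay and trial count are illustrative assumptions:

```python
import random

def split_vote_probability(n_servers: int, lo_ms: float, hi_ms: float,
                           msg_delay_ms: float, trials: int = 100_000) -> float:
    """Estimate the chance that the first candidate to time out cannot
    collect votes before a rival also times out (a contested election)."""
    random.seed(0)  # deterministic for reproducibility
    collisions = 0
    for _ in range(trials):
        timeouts = sorted(random.uniform(lo_ms, hi_ms) for _ in range(n_servers))
        if timeouts[1] - timeouts[0] < msg_delay_ms:
            collisions += 1
    return collisions / trials

p = split_vote_probability(n_servers=5, lo_ms=150, hi_ms=300, msg_delay_ms=5)
assert 0.0 < p < 0.35   # single contested rounds happen occasionally...
assert p ** 3 < 0.05    # ...but repeated ones are exponentially unlikely
```

Each election round redraws the timeouts independently, which is why the probability of k consecutive split votes falls off roughly as p^k.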
Handling the Election Response
function on_vote_response(msg):
if msg.term > self.current_term:
self.step_down(msg.term)
return
if self.role != Candidate or msg.term != self.current_term:
return // Stale response
if msg.vote_granted:
self.votes_received.add(msg.from)
if |self.votes_received| > len(self.all_servers) / 2:
// Won the election!
self.become_leader()
function become_leader():
self.role = Leader
// Initialize leader state
for server in self.all_servers:
self.next_index[server] = len(self.log) + 1
self.match_index[server] = 0
// Send initial empty AppendEntries (heartbeat) to assert leadership
self.send_heartbeats()
// Optionally: append a no-op entry to commit entries from previous terms
// (This is an important optimization discussed later)
Log Replication
Once a leader is elected, it handles all client requests. Each request is appended to the leader’s log and replicated to followers via AppendEntries RPCs.
AppendEntries: The Workhorse
AppendEntries serves double duty: it replicates log entries AND serves as a heartbeat (when sent with no entries). The leader sends it periodically to all followers.
function leader_send_append_entries(follower_id):
prev_log_index = self.next_index[follower_id] - 1
prev_log_term = self.log[prev_log_index].term if prev_log_index > 0 else 0
// Entries to send: everything from next_index onward
entries = self.log[self.next_index[follower_id]:]
send(follower_id, AppendEntries {
term: self.current_term,
leader_id: self.my_id,
prev_log_index: prev_log_index,
prev_log_term: prev_log_term,
entries: entries,
leader_commit: self.commit_index
})
The Consistency Check
Each AppendEntries message includes prev_log_index and prev_log_term — the index and term of the log entry immediately preceding the new entries. The follower checks that its log matches at this point. If it doesn’t, the follower rejects the AppendEntries, and the leader decrements next_index and retries.
This is the log matching property: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. It’s enforced inductively: the base case is the empty log (trivially matching), and each AppendEntries extends the induction by checking the predecessor.
function on_append_entries(msg):
// Reset election timer — we heard from a leader
self.reset_election_timer()
if msg.term < self.current_term:
reply(msg.from, AppendEntriesResponse {
term: self.current_term,
success: false,
match_index: 0
})
return
if msg.term > self.current_term:
self.step_down(msg.term)
self.role = Follower
self.current_leader = msg.leader_id
// Consistency check
if msg.prev_log_index > 0:
if msg.prev_log_index > len(self.log):
// We don't have the predecessor entry
reply(msg.from, AppendEntriesResponse {
term: self.current_term,
success: false,
// Optimization: tell leader our log length
// so it can skip ahead
match_index: len(self.log)
})
return
if self.log[msg.prev_log_index].term != msg.prev_log_term:
// Predecessor entry exists but has wrong term — conflict
// Optimization: find the first index of the conflicting term
// and tell the leader to skip back to there
conflict_term = self.log[msg.prev_log_index].term
first_index = msg.prev_log_index
while first_index > 1 and self.log[first_index - 1].term == conflict_term:
first_index -= 1
// Delete the conflicting entries
self.log = self.log[:msg.prev_log_index]
reply(msg.from, AppendEntriesResponse {
term: self.current_term,
success: false,
match_index: first_index - 1
})
return
// Append new entries (overwriting any conflicting entries)
for i, entry in enumerate(msg.entries):
index = msg.prev_log_index + 1 + i
if index <= len(self.log):
if self.log[index].term != entry.term:
// Conflict — truncate and append
self.log = self.log[:index]
self.log.append(entry)
// else: already have this entry, skip
else:
self.log.append(entry)
persist(self.log)
// Update commit index
if msg.leader_commit > self.commit_index:
self.commit_index = min(msg.leader_commit, len(self.log))
self.apply_committed_entries()
reply(msg.from, AppendEntriesResponse {
term: self.current_term,
success: true,
match_index: msg.prev_log_index + len(msg.entries)
})
Leader Handling of Responses
function on_append_entries_response(msg, from):
if msg.term > self.current_term:
self.step_down(msg.term)
return
if self.role != Leader or msg.term != self.current_term:
return // Stale
if msg.success:
self.next_index[from] = msg.match_index + 1
self.match_index[from] = msg.match_index
self.maybe_advance_commit_index()
else:
// Follower's log was inconsistent — back up
self.next_index[from] = max(1, msg.match_index + 1)
// Retry immediately
self.leader_send_append_entries(from)
The Commit Mechanism
A log entry is committed when the leader has replicated it to a majority of servers. But there’s a critical subtlety: a leader can only commit entries from its own term.
This is the “Figure 8 problem” from the Raft paper, and it catches almost everyone off guard. Consider this scenario:
S1: [1,2]    leader in term 2; replicated entry 2 (term 2) to S2, then crashed
S2: [1,2]
S3: [1,3]    leader in term 3; appended entry 3 (term 3) only locally, then crashed
S4: [1]
S5: [1]
Now S1 returns as leader in term 4 and continues replicating the term-2 entry at index 2, reaching S4. At this point a majority (S1, S2, S4) holds that entry — but declaring it committed would be unsafe. If S1 crashes again, S3 can still win the election in term 5: its last log term (3) is higher than everyone else’s (2), so S2, S4, and S5 must grant their votes. As leader, S3 then overwrites index 2 everywhere with its term-3 entry — destroying an entry that a majority held and that a leader had treated as committed.
The fix: a leader never commits entries from previous terms directly. It only commits them indirectly, by committing a new entry from its own term. Once a current-term entry is committed at a given index, all preceding entries are also committed (by the log matching property).
function maybe_advance_commit_index():
// Find the highest index replicated to a majority
for n in range(len(self.log), self.commit_index, -1):
if self.log[n].term == self.current_term: // CRITICAL: only current term
count = 1 // Count self
for server in self.all_servers:
if server != self.my_id and self.match_index[server] >= n:
count += 1
if count > len(self.all_servers) / 2:
self.commit_index = n
self.apply_committed_entries()
return
function apply_committed_entries():
while self.last_applied < self.commit_index:
self.last_applied += 1
result = self.state_machine.apply(self.log[self.last_applied].command)
// If we're the leader, respond to the client
if self.role == Leader:
respond_to_client(self.log[self.last_applied].client_info, result)
Safety: Intuition for the Proof
Raft’s safety property is: if a log entry is committed at a given index, no other server will ever have a different entry at that index.
The proof relies on two properties:
Leader Completeness. If a log entry is committed in a given term, that entry is present in the logs of all leaders of higher-numbered terms.
Why? Because:
- A committed entry is on a majority of servers.
- A new leader must receive votes from a majority.
- Any two majorities overlap.
- The election restriction ensures the new leader’s log is at least as up-to-date as any voter’s.
- Therefore, the new leader has the committed entry.
Log Matching. If two logs contain an entry with the same index and term, the logs are identical in all preceding entries.
Why? Because AppendEntries checks the predecessor’s index and term before appending. This check creates an inductive chain: if entry i matches, then entry i-1 was checked when entry i was replicated, so entry i-1 matches, and so on back to the beginning.
Together, these two properties ensure that once a value is committed, it is permanent and agreed upon by all future leaders.
The Things That Make Raft NOT as Easy as Advertised
The core protocol — leader election and log replication — is genuinely simpler than Paxos. But building a production Raft system requires solving several problems that the paper either addresses briefly or punts on entirely.
Membership Changes
The original Raft paper proposed joint consensus for membership changes: when changing from configuration C_old to C_new, the system transitions through an intermediate configuration C_old,new that requires majorities from BOTH configurations.
Time:   ─────────────────────────────────────────>
Config:  C_old ──> C_old,new ──> C_new
                   │             │
           C_old,new entry   C_new entry
           committed here    committed here
Joint consensus is safe but complex. You need to handle the case where the leader is not in the new configuration, where the joint configuration spans a leader change, and where a new server needs to catch up before it can vote.
Because of this complexity, the Raft authors later proposed single-server changes: only add or remove one server at a time. This is simpler because any two majorities of configurations that differ by one server will overlap.
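The overlap claim can be verified by brute force for small clusters — a sketch (the function names are mine):

```python
from itertools import combinations

def majorities(servers: frozenset) -> list:
    """All quorums of majority size or larger for a configuration."""
    need = len(servers) // 2 + 1
    return [set(c) for k in range(need, len(servers) + 1)
            for c in combinations(sorted(servers), k)]

def overlap_always(old: frozenset, new: frozenset) -> bool:
    """Does every majority of `old` intersect every majority of `new`?"""
    return all(q1 & q2 for q1 in majorities(old) for q2 in majorities(new))

old = frozenset({"a", "b", "c"})
# Differing by ONE server: every pair of majorities overlaps, so leaders
# of the old and new configurations can never commit independently.
assert overlap_always(old, old | {"d"}) is True
# Differing by TWO servers, the guarantee breaks — e.g. {a,b} is a
# majority of the old config, {c,d,e} of the new, and they are disjoint.
assert overlap_always(old, old | {"d", "e"}) is False
```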
function add_server(new_server):
// First, bring the new server up to date
while not is_caught_up(new_server):
send_snapshot_or_entries(new_server)
wait(CATCHUP_CHECK_INTERVAL)
// Propose the configuration change as a regular log entry
new_config = self.current_config + {new_server}
self.replicate(ConfigChange { config: new_config })
// Configuration takes effect as soon as the entry is appended
// (not when it's committed!)
function remove_server(old_server):
new_config = self.current_config - {old_server}
self.replicate(ConfigChange { config: new_config })
// If we're removing ourselves, step down after committing
if old_server == self.my_id:
self.step_down_after_commit()
The single-server approach seems simpler, but it has its own subtleties. You cannot have two pending configuration changes at the same time (you must wait for each to commit). And there’s a tricky edge case when removing a server that is the current leader.
Log Compaction
The log grows without bound. You need to compact it. Raft describes snapshotting: periodically, take a snapshot of the state machine and discard log entries up to the snapshot point.
function take_snapshot():
if self.last_applied - self.last_snapshot_index < SNAPSHOT_THRESHOLD:
return
snapshot = Snapshot {
last_included_index: self.last_applied,
last_included_term: self.log[self.last_applied].term,
state: serialize(self.state_machine),
config: self.current_config
}
write_snapshot_to_disk(snapshot)
// Discard log entries up to the snapshot
self.log = self.log[self.last_applied + 1:]
self.last_snapshot_index = snapshot.last_included_index
self.last_snapshot_term = snapshot.last_included_term
When a leader needs to send entries to a follower that has fallen behind the snapshot point, it sends the snapshot instead:
function send_install_snapshot(follower_id):
send(follower_id, InstallSnapshot {
term: self.current_term,
leader_id: self.my_id,
last_included_index: self.last_snapshot_index,
last_included_term: self.last_snapshot_term,
data: self.snapshot_data,
// May be chunked for large snapshots
offset: 0,
done: true
})
The follower replaces its state with the snapshot and discards its entire log up to the snapshot point. This is simple in concept but complex in implementation — snapshotting a large state machine while continuing to process requests requires copy-on-write semantics or a quiescent period, and transferring a multi-gigabyte snapshot is a non-trivial network operation.
The Pre-Vote Protocol
Raft has a problem: a partitioned server keeps incrementing its term (because its election timer keeps firing and it keeps starting elections it can never win). When the partition heals, this server has a very high term number. When it contacts the cluster, other servers see the high term and step down from leadership, causing unnecessary leader changes.
The pre-vote protocol (proposed in Ongaro’s dissertation but not in the original paper) addresses this. Before starting a real election, a candidate sends a “pre-vote” request that doesn’t increment the term. Other servers respond based on whether they would vote for this candidate — but they don’t actually record a vote or change their term. Only if the pre-vote succeeds does the candidate proceed with a real election.
function on_election_timeout_with_prevote():
// Phase 0: Pre-vote
pre_vote_term = self.current_term + 1 // Hypothetical next term
for server in self.all_servers:
if server != self.my_id:
send(server, PreVote {
term: pre_vote_term,
candidate_id: self.my_id,
last_log_index: len(self.log),
last_log_term: self.log[-1].term if self.log else 0
})
// Collect pre-votes
pre_vote_responses = wait_for_responses(ELECTION_TIMEOUT)
if count_granted(pre_vote_responses) + 1 > len(self.all_servers) / 2:
// Pre-vote succeeded — proceed with real election
self.start_real_election()
else:
// Pre-vote failed — we're probably partitioned
self.reset_election_timer() // Try again later
function on_pre_vote(msg):
// Respond based on whether we WOULD vote, but don't record anything
would_vote = (msg.term >= self.current_term and
is_log_up_to_date(msg.last_log_index, msg.last_log_term) and
(self.current_leader == null or
now() - self.last_heartbeat > ELECTION_TIMEOUT))
reply(msg.from, PreVoteResponse {
term: msg.term,
vote_granted: would_vote
})
// NOTE: we do NOT update current_term or voted_for
Pre-vote is now considered essential for production Raft implementations. etcd added it. CockroachDB uses it. TiKV uses it. Its absence from the original paper is one of those cases where the protocol works fine in the common case but has a sharp edge in a scenario that’s not uncommon in production (network partitions).
Learner Nodes (Non-Voting Members)
When adding a new server to a Raft cluster, the new server has an empty log and needs to catch up. During catchup, it can’t usefully participate in consensus — it would just slow things down. Worse, if you add it to the configuration immediately, it changes the majority requirement, potentially making the cluster unable to commit new entries.
Learner nodes (also called non-voting members or observers) are servers that receive log replication but don’t vote in elections or count toward the commit quorum. They’re used to stage new servers until they’re caught up.
function add_server_with_learner(new_server):
// Step 1: Add as learner (non-voting)
self.learners.add(new_server)
// Step 2: Replicate log to learner (same as a follower)
while not is_caught_up(new_server):
send_append_entries(new_server)
wait(CHECK_INTERVAL)
// Step 3: Promote to voting member
self.learners.remove(new_server)
new_config = self.current_config + {new_server}
self.replicate(ConfigChange { config: new_config })
Read-Only Operations
Linearizable reads are harder than they look. The naive approach — just read from the leader — is unsafe because the leader might be partitioned and a new leader might have been elected.
Raft offers two approaches:
ReadIndex. The leader records its commit index, confirms it’s still the leader by sending heartbeats to a majority, and then serves the read once the recorded commit index has been applied.
function linearizable_read(query):
if self.role != Leader:
redirect_to_leader(query)
return
read_index = self.commit_index
// Confirm we're still leader
heartbeat_acks = send_heartbeats_and_wait()
if |heartbeat_acks| < majority:
// We might not be leader anymore
return Error("not leader")
// Wait for state machine to catch up
wait_until(self.last_applied >= read_index)
return self.state_machine.query(query)
Lease-based reads. If the leader has received heartbeat responses from a majority within the last election timeout period, it assumes it’s still the leader and serves reads locally. This is faster but depends on bounded clock drift, which not everyone is comfortable assuming.
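A sketch of the lease check under an assumed drift bound — the constants and names here are illustrative, not from any particular implementation:

```python
CLOCK_DRIFT_BOUND = 1.1    # assumed max relative clock-rate skew (illustrative)
ELECTION_TIMEOUT_S = 0.3   # must match the cluster's election timeout

def lease_valid(heartbeat_sent_at: float, now: float) -> bool:
    """Can the leader serve a local read without contacting a quorum?

    The lease is measured from when the heartbeat round was SENT (not
    when the acks arrived) and shrunk by the assumed drift bound: while
    it holds, no rival can have been elected, so a local read is safe.
    """
    lease_expiry = heartbeat_sent_at + ELECTION_TIMEOUT_S / CLOCK_DRIFT_BOUND
    return now < lease_expiry

t0 = 100.0                                  # quorum heartbeat round sent at t0
assert lease_valid(t0, t0 + 0.10) is True   # inside the lease: read locally
assert lease_valid(t0, t0 + 0.30) is False  # expired: fall back to ReadIndex
```

The division by the drift bound is the conservative step: a leader whose clock runs fast must still expire its lease before any correctly-timed follower could start an election.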
Real Implementations
etcd/raft
etcd’s Raft implementation (in Go) is probably the most widely used Raft library. It’s used by etcd itself, Kubernetes (via etcd), CockroachDB, and TiKV.
Key characteristics:
- Implements the core Raft protocol faithfully.
- Adds pre-vote, learner nodes, and leader transfer.
- Does NOT implement the transport layer — it’s a library that produces messages, and the application is responsible for sending them. This is an excellent design decision that makes it adaptable to different network stacks.
- Does NOT implement persistence — the application provides a storage interface. Same rationale.
CockroachDB
CockroachDB uses etcd/raft but adds significant extensions:
- Range-level Raft. Each data range (a contiguous keyspace) is a separate Raft group. A single CockroachDB cluster might have tens of thousands of Raft groups.
- Multi-Raft. To avoid the overhead of thousands of independent Raft groups each sending their own heartbeats, CockroachDB batches Raft messages between nodes.
- Joint consensus. CockroachDB uses joint consensus for membership changes rather than single-server changes.
- Epoch-based leases. Range leases are based on epochs rather than wall-clock time, avoiding clock-dependency issues.
TiKV
TiKV (the storage engine for TiDB) also uses etcd/raft with its own extensions:
- Batching and pipelining. TiKV aggressively batches Raft messages and pipelines requests.
- Async apply. The state machine application is asynchronous — committed entries are applied in a separate thread from the Raft protocol thread. This improves throughput but requires careful handling of read requests.
- Multi-Raft with region-based partitioning, similar to CockroachDB.
Performance Comparison with Multi-Paxos
In steady state (stable leader, no failures):
| Metric | Raft | Multi-Paxos |
|---|---|---|
| Messages per write | 2(n-1) | 2(n-1) |
| Round trips per write | 1 | 1 |
| Fsyncs per write | 1 (leader) + f (followers) | Same |
| Read latency (linearizable) | 1 RTT (ReadIndex) or 0 (lease) | Same (lease-based) |
| Leader change latency | ~election timeout (~300ms) | ~Phase 1 RTT (~2ms) |
The one notable difference is leader change latency. Raft’s randomized election timeout means there’s a 150-300ms delay before a new leader is elected after a failure. Multi-Paxos can elect a new leader in a single round trip (the Phase 1 Prepare/Promise), which might be only a few milliseconds in a LAN.
In practice, this difference rarely matters because leader failures are (should be) rare events. But in systems that are extremely sensitive to failover latency, Multi-Paxos has an advantage.
Why Raft Won the Mindshare War
Raft’s dominance in the consensus algorithm mindshare is not primarily a technical achievement. The protocol is good — clean, well-specified, and practical. But its success is primarily a pedagogical and community achievement.
The paper is readable. It’s 18 pages, clearly structured, with detailed examples and figures. The authors explicitly optimized for understandability and it shows.
The visualization. The Raft visualization (thesecretlivesofdata.com/raft) is one of the best algorithm visualizations ever created. It lets you interactively step through leader election, log replication, and failure scenarios. This single resource has probably done more for consensus algorithm education than any paper.
Reference implementations. The Raft paper was accompanied by reference implementations, and the clear specification encouraged many more. The Raft website lists over 60 implementations in various languages.
Timing. Raft arrived at a time when the industry desperately needed an understandable consensus algorithm. Docker and Kubernetes were emerging, etcd needed a consensus protocol, and the distributed database movement was accelerating. Raft was the right protocol at the right time.
Explicit system design. Unlike Multi-Paxos, Raft specifies a complete system: leader election, log replication, safety, membership changes, log compaction. You can implement Raft from the paper alone (with effort). You cannot implement Multi-Paxos from the papers alone (without also inventing significant parts of the system yourself).
The combination of these factors created a self-reinforcing cycle: more people understood Raft, so more people implemented it, so more production systems used it, so more blog posts were written about it, so more people learned it. Paxos, for all its theoretical depth, could not compete with this cycle.
Where Raft Falls Short
Raft is not perfect. The design choices that make it understandable also constrain it:
Strong leader. All writes go through the leader. In a geo-distributed deployment, this means all writes incur the latency to the leader’s region. Leaderless protocols like EPaxos (Chapter 13) can commit writes at any replica.
No log gaps. The contiguous log simplifies reasoning, but it forces entries to commit and apply strictly in order: a later entry cannot be committed until every earlier entry is, whereas Multi-Paxos can decide slots independently and out of order. This is rarely a problem in practice but is a real structural limitation.
Leader bottleneck. In a large cluster (5+ nodes), the leader must send AppendEntries to all followers and process all responses. The leader’s network bandwidth and CPU become the bottleneck before the followers’.
Rigid term structure. Raft’s term-based reasoning is clean but inflexible. Certain optimizations that are natural in Multi-Paxos (like out-of-order commits or flexible quorums) don’t fit naturally into Raft’s model.
These limitations are real but usually acceptable. For most systems, the benefits of understandability and implementation quality outweigh the theoretical performance advantages of more flexible protocols.
The Honest Assessment
Raft is not “Paxos for humans.” It is a well-designed consensus protocol with excellent documentation that solves the same problem as Multi-Paxos with similar performance. It makes some design choices that simplify understanding at the cost of flexibility, and it was accompanied by an unprecedented pedagogical effort that made it accessible to a broad audience.
If you are building a new system that needs consensus, Raft is almost certainly the right choice. Not because it’s the best consensus algorithm (there is no “best”), but because it has the largest community, the most reference implementations, the most operational experience, and the most educational resources. In distributed systems, being well-understood is a feature that trumps almost every theoretical advantage.
Paxos is more general. VR is more complete. EPaxos is more flexible. But Raft is the one your team can implement correctly, debug effectively, and operate confidently. In the agony of consensus algorithms, that might be the thing that matters most.