Viewstamped Replication: The One Nobody Reads
In the pantheon of consensus algorithms, Viewstamped Replication occupies a peculiar position: it is older than Paxos (in some formulations), more practical than Paxos (in its original description), more complete than Paxos (as a system design), and less famous than Paxos by approximately three orders of magnitude.
Brian Oki and Barbara Liskov published “Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems” in 1988 — a year before Lamport’s first submission of “The Part-Time Parliament.” The protocol describes a complete replicated state machine: normal operation, leader failure, leader election, and recovery of crashed replicas. It does all of this in a single coherent paper, without fictional Greek islands, without separating the work into a dozen under-specified follow-up papers, and without leaving the reader to invent their own membership change protocol.
And yet, if you ask a room of distributed systems engineers what consensus protocol they know, ninety percent will say Paxos, nine percent will say Raft, and the remaining one percent will say “Viewstamped what?”
This chapter is for the one percent, and for the ninety-nine percent who should know better.
Historical Context: The Naming War
Why did Paxos win the naming war? Several reasons, none of them technical:
Lamport’s stature. By the time Paxos was published (1998, though written in 1989), Lamport was already a towering figure in distributed systems. His name carried weight. Oki and Liskov were respected, but Paxos became associated with Lamport’s broader body of work on distributed systems, including logical clocks and the Byzantine generals problem.
Paxos came to symbolize the problem. Through historical accident and persistent citation, “Paxos” became synonymous with “consensus protocol” in the way that “Kleenex” became synonymous with “tissue.” When people said “Paxos,” they often meant “any leader-based crash-fault-tolerant consensus protocol.”
Google. When Google published papers about Chubby, Spanner, and Megastore in the 2000s, they described their systems as using Paxos. This cemented Paxos as the industrial standard. If Google had happened to describe their systems as using Viewstamped Replication, the landscape would look very different.
The TCS/systems divide. Paxos was embraced by the theoretical computer science community, which values elegance and minimality. VR was designed by systems researchers, who value completeness and practicality. The TCS community writes more papers, and those papers get cited more, creating a self-reinforcing citation network.
The result is that Viewstamped Replication is the Betamax of consensus protocols: arguably better in several ways, but history doesn’t care about “arguably better.”
The VR System Model
VR assumes:
- A set of 2f+1 replicas, tolerating f crash failures.
- Asynchronous network with reliable delivery (messages may be delayed but not lost). In practice, VR implementations add retransmission, but the protocol description assumes eventual delivery.
- Non-Byzantine faults only: replicas either follow the protocol or crash.
- Replicas have persistent storage (for recovery).
The system provides a replicated state machine service to clients. Clients send requests to the primary (leader), and the primary coordinates replication before responding.
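The quorum arithmetic implied by this model can be made concrete. A small Python sketch (illustrative only; the protocol itself is language-agnostic):

```python
def quorum_size(f: int) -> int:
    """Smallest majority of n = 2f + 1 replicas: f + 1."""
    n = 2 * f + 1
    return n // 2 + 1

# Tolerating f = 2 crash failures requires n = 5 replicas,
# and any f + 1 = 3 of them form a quorum.
assert quorum_size(2) == 3

# Two quorums of size f + 1 out of 2f + 1 replicas always intersect
# in at least one replica: the basis of VR's safety argument.
def quorums_intersect(f: int) -> bool:
    n = 2 * f + 1
    q = quorum_size(f)
    return 2 * q > n

assert all(quorums_intersect(f) for f in range(1, 10))
```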
Core Concepts
VR organizes time into views. Each view has a designated primary (the leader). The primary for view v is typically replica v mod n, where n is the number of replicas. When the primary fails, the system moves to a new view with a new primary. This deterministic leader selection is a notable simplification — there’s no election campaign, no randomized timeout. Everyone knows who the leader should be for any given view number.
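This selection rule is small enough to state in one line. A Python sketch:

```python
def primary_for_view(view_number: int, num_replicas: int) -> int:
    """VR's round-robin rule: the primary for view v is replica v mod n."""
    return view_number % num_replicas

# With n = 3 replicas, leadership rotates deterministically as views advance:
assert [primary_for_view(v, 3) for v in range(5)] == [0, 1, 2, 0, 1]
```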
Each request is assigned a view-stamp: a pair (view-number, op-number). The view-number identifies which primary issued the request, and the op-number is the sequence number within that view. View-stamps are totally ordered: compare first by view-number, then by op-number.
struct ViewStamp:
    view_number: int
    op_number: int

    function compare(other: ViewStamp) -> int:
        if self.view_number != other.view_number:
            return self.view_number - other.view_number
        return self.op_number - other.op_number
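In a language with lexicographic tuple comparison, the same total order falls out for free. A Python sketch:

```python
from typing import NamedTuple

class ViewStamp(NamedTuple):
    view_number: int
    op_number: int

# Tuples compare field by field: first view-number, then op-number.
a = ViewStamp(view_number=1, op_number=9)
b = ViewStamp(view_number=2, op_number=1)
c = ViewStamp(view_number=2, op_number=5)

assert a < b < c                      # a higher view wins regardless of op-number
assert sorted([c, a, b]) == [a, b, c]
```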
Normal Operation
Normal operation in VR is clean and efficient. Here is the full protocol for processing a client request:
Client Primary (P) Backup 1 (B1) Backup 2 (B2)
| | | |
|-- Request(op, client-id, req-num) ->| |
| | | |
| |-- Prepare(v, op-num, op, commit-num) ->|
| |-- Prepare(v, op-num, op, commit-num) ---------->|
| | | |
| |<-- PrepareOK(v, op-num, replica-id) ---|
| |<-- PrepareOK(v, op-num, replica-id) ------------|
| | | |
| | (Got f PrepareOKs — can commit) |
| | | |
|<-- Reply(v, req-num, result) --------| |
| | | |
Let’s walk through each step in detail.
Step 1: Client Sends Request
The client sends Request(operation, client-id, request-number) to the primary. The request-number is a monotonically increasing number that the client assigns to each request. This is critical for exactly-once semantics — the primary uses the (client-id, request-number) pair to detect duplicate requests.
Step 2: Primary Prepares
The primary assigns the next op-number, appends the operation to its log, and sends Prepare(view-number, op-number, operation, commit-number) to all backups.
The commit-number is piggybacked on the Prepare message — it tells the backups the latest operation that has been committed (accepted by f+1 replicas). This is how backups learn about commits without a separate message.
Step 3: Backups Respond
Each backup, upon receiving the Prepare:
- Checks that the view-number matches its current view (otherwise discards the message).
- Checks that the op-number is the expected next op-number (VR requires contiguous logs — no gaps).
- Appends the operation to its log.
- Updates its commit-number based on the piggybacked value from the primary.
- Sends PrepareOK(view-number, op-number, replica-id) to the primary.
Step 4: Primary Commits
When the primary receives PrepareOK from f backups (plus itself, that’s f+1 total), the operation is committed. The primary:
- Applies the operation to the state machine.
- Sends Reply(view-number, request-number, result) to the client.
- Increments its commit-number (which will be sent with the next Prepare).
class VRPrimary:
    state:
        view_number: int
        op_number: int = 0
        commit_number: int = 0
        log: List<LogEntry> = []   // 1-indexed by op-number in this pseudocode
        client_table: Map<ClientId, (RequestNumber, Reply)>

    function on_client_request(op, client_id, request_num):
        // Check for duplicate request
        if client_id in self.client_table:
            last_req, last_reply = self.client_table[client_id]
            if request_num <= last_req:
                // Duplicate — resend cached reply
                send_to_client(client_id, last_reply)
                return

        // Assign op-number and append to log
        self.op_number += 1
        entry = LogEntry {
            view: self.view_number,
            op_num: self.op_number,
            operation: op,
            client_id: client_id,
            request_num: request_num,
            acks: {}   // backups that have acknowledged this entry
        }
        self.log.append(entry)

        // Send Prepare to all backups, carrying the client's identity and
        // request number so backups can maintain their own client tables
        for backup in self.backups:
            send(backup, Prepare {
                view: self.view_number,
                op_num: self.op_number,
                operation: op,
                client_id: client_id,
                request_num: request_num,
                commit_num: self.commit_number
            })

    function on_prepare_ok(view, op_num, from_replica):
        if view != self.view_number:
            return // Stale message

        self.log[op_num].acks.add(from_replica)

        // Check if we can commit this and any subsequent entries
        while self.commit_number < self.op_number:
            next = self.commit_number + 1
            if |self.log[next].acks| + 1 >= f + 1: // +1 for self
                self.commit_number = next
                result = self.state_machine.apply(self.log[next].operation)

                // Update client table and respond
                entry = self.log[next]
                reply = Reply {
                    view: self.view_number,
                    request_num: entry.request_num,
                    result: result
                }
                self.client_table[entry.client_id] = (entry.request_num, reply)
                send_to_client(entry.client_id, reply)
            else:
                break // Can't commit yet — waiting for acks
Backup Logic
class VRBackup:
    state:
        view_number: int
        op_number: int = 0
        commit_number: int = 0
        log: List<LogEntry> = []   // 1-indexed by op-number in this pseudocode
        client_table: Map<ClientId, (RequestNumber, Reply)>

    function on_prepare(view, op_num, operation, client_id, request_num, commit_num):
        if view != self.view_number:
            return // Wrong view — ignore

        if op_num != self.op_number + 1:
            // Gap in log — we missed messages
            // Request state transfer from primary
            self.request_state_transfer()
            return

        // Append to log, keeping the client info so the entry
        // can be replied to correctly after a view change
        self.op_number = op_num
        self.log.append(LogEntry {
            view: view,
            op_num: op_num,
            operation: operation,
            client_id: client_id,
            request_num: request_num
        })

        // Apply any newly committed operations
        while self.commit_number < commit_num:
            self.commit_number += 1
            result = self.state_machine.apply(self.log[self.commit_number].operation)

            // Update client table
            entry = self.log[self.commit_number]
            self.client_table[entry.client_id] = (entry.request_num, result)

        // Acknowledge
        send(primary, PrepareOK {
            view: self.view_number,
            op_num: op_num,
            replica_id: self.my_id
        })
View Changes: When the Primary Fails
This is where VR shines. The view change protocol is arguably the clearest leader-change protocol in the consensus literature. It’s specified completely, with no hand-waving about “use a failure detector” or “left as an exercise.”
Triggering a View Change
A backup suspects the primary has failed when it hasn’t heard from it for a timeout period. When this happens:
- The backup increments its view-number to v+1.
- It stops accepting Prepare messages from the old primary.
- It sends a
StartViewChange(v+1, replica-id)message to all other replicas.
Phase 1: StartViewChange
When a replica receives f StartViewChange messages for view v+1 from f replicas other than itself, it sends a DoViewChange message to the new primary (replica (v+1) mod n) containing:
- Its log
- Its current view-number
- The view-number of the last view in which its status was normal (its “last normal view”)
- Its op-number and commit-number
function on_start_view_change(new_view, from_replica):
    if new_view < self.view_number or
       (new_view == self.view_number and self.status == NORMAL):
        return // Stale: we have already moved past this view

    self.status = VIEW_CHANGE
    self.view_number = new_view
    self.start_view_change_count[new_view].add(from_replica)

    // Also count our own StartViewChange, and broadcast it
    // once per view (not once per received message)
    if self.my_id not in self.start_view_change_count[new_view]:
        self.start_view_change_count[new_view].add(self.my_id)
        broadcast(StartViewChange {
            view: new_view,
            replica_id: self.my_id
        })

    if |self.start_view_change_count[new_view]| >= f + 1: // self plus f others
        // Send DoViewChange to the new primary
        new_primary = new_view mod self.num_replicas
        send(new_primary, DoViewChange {
            view: new_view,
            log: self.log,
            last_normal_view: self.last_normal_view,
            op_number: self.op_number,
            commit_number: self.commit_number,
            replica_id: self.my_id
        })
Phase 2: DoViewChange
The new primary collects f+1 DoViewChange messages (including its own) and selects the “best” log — the one from the replica with the highest last-normal-view, breaking ties by highest op-number. This is the log that becomes the new primary’s log.
function on_do_view_change(msg):
    self.do_view_change_msgs[msg.view].append(msg)

    if |self.do_view_change_msgs[msg.view]| >= f + 1:
        // Select the best log: highest last-normal-view,
        // ties broken by highest op-number
        best = null
        for dvc in self.do_view_change_msgs[msg.view]:
            if best == null or
               dvc.last_normal_view > best.last_normal_view or
               (dvc.last_normal_view == best.last_normal_view and
                dvc.op_number > best.op_number):
                best = dvc

        // Install the best log
        self.log = best.log
        self.op_number = best.op_number
        self.view_number = msg.view
        self.last_normal_view = msg.view
        self.status = NORMAL

        // Determine the highest commit number from all DoViewChange messages
        max_commit = max(dvc.commit_number for dvc in self.do_view_change_msgs[msg.view])

        // Apply any operations committed in earlier views but not yet executed here
        while self.commit_number < max_commit:
            self.commit_number += 1
            self.state_machine.apply(self.log[self.commit_number].operation)

        // Announce the new view to all replicas
        broadcast(StartView {
            view: msg.view,
            log: self.log,
            op_number: self.op_number,
            commit_number: self.commit_number
        })
Phase 3: StartView
When backups receive a StartView message, they:
- Update their log to match the new primary’s log.
- Update their view-number, op-number, and commit-number.
- Apply any newly committed operations.
- Send PrepareOK messages for any uncommitted operations in the new log (so the new primary can commit them).
- Resume normal operation in the new view.
function on_start_view(msg):
    self.log = msg.log
    self.view_number = msg.view
    self.op_number = msg.op_number
    self.last_normal_view = msg.view
    self.status = NORMAL

    // Apply any newly committed operations
    while self.commit_number < msg.commit_number:
        self.commit_number += 1
        self.state_machine.apply(self.log[self.commit_number].operation)

    // Send PrepareOK for uncommitted operations
    // so the new primary can commit them
    new_primary = msg.view mod self.num_replicas
    for i in range(msg.commit_number + 1, msg.op_number + 1):
        send(new_primary, PrepareOK {
            view: msg.view,
            op_num: i,
            replica_id: self.my_id
        })
The view change completes in about three one-way message delays in the common case: StartViewChange messages fan out, DoViewChange messages converge on the new primary, and StartView fans out again. The total number of messages is O(n^2) in the worst case, because every replica broadcasts StartViewChange to every other replica in Phase 1; Phases 2 and 3 are O(n), since the new primary collects the DoViewChange messages and issues a single StartView broadcast.
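The per-phase message counts can be tallied explicitly. A sketch (worst case; real implementations add heartbeats and retransmissions):

```python
def view_change_messages(n: int) -> dict:
    """Worst-case message counts for each view-change phase with n replicas."""
    return {
        # Phase 1: every replica broadcasts StartViewChange to the other n-1.
        "start_view_change": n * (n - 1),
        # Phase 2: up to n-1 replicas send DoViewChange to the new primary.
        "do_view_change": n - 1,
        # Phase 3: the new primary broadcasts StartView to the other n-1.
        "start_view": n - 1,
    }

counts = view_change_messages(5)
assert counts["start_view_change"] == 20  # the O(n^2) term
assert counts["do_view_change"] == 4
assert counts["start_view"] == 4
```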
Recovery: Bringing Back a Crashed Replica
VR also specifies how a replica recovers after a crash. This is another area where VR is more complete than Paxos, which simply says “read your state from disk and resume.”
VR’s recovery protocol ensures that a recovering replica gets a consistent state even if its disk is unreliable (or even if it has no disk at all — the VR Revisited paper discusses diskless operation).
function recover():
    // Generate a unique nonce to prevent replay
    nonce = generate_random_nonce()

    // Ask all replicas for their current state
    broadcast(Recovery { replica_id: self.my_id, nonce: nonce })

    // Wait for f+1 responses echoing our nonce,
    // including one from the current primary
    responses = collect_recovery_responses(nonce)

    // The primary's response includes the full log and state
    primary_response = find_primary_response(responses)
    self.view_number = primary_response.view
    self.log = primary_response.log
    self.op_number = primary_response.op_number
    self.commit_number = primary_response.commit_number
    self.last_normal_view = primary_response.view

    // Replay committed operations to rebuild the state machine
    for i in range(1, self.commit_number + 1):
        self.state_machine.apply(self.log[i].operation)

    self.status = NORMAL
The nonce is important: it prevents a recovering replica from accepting stale Recovery responses from a previous recovery attempt. Without it, a replica that crashes and recovers twice might accept a response from the first recovery during the second recovery, potentially getting stale state.
VR vs. Multi-Paxos: A Detailed Comparison
Let us be direct: VR and Multi-Paxos are more similar than different. Both are leader-based protocols that replicate a log of commands. Both require a majority quorum for progress. Both handle leader failure through a view change / leader election mechanism.
The differences are in the details, and they matter:
| Aspect | Multi-Paxos | Viewstamped Replication |
|---|---|---|
| Leader selection | Unspecified (whoever wins Phase 1) | Deterministic: replica v mod n for view v |
| Log gaps | Allowed | Not allowed (contiguous log) |
| Normal operation messages | Accept / Accepted | Prepare / PrepareOK (different names, same semantics) |
| Leader change | Run Phase 1 with higher proposal number | View change protocol with explicit phases |
| Recovery | Read from disk and resume | Explicit recovery protocol with nonce |
| Specification completeness | Protocol only | Complete system (including client interaction) |
| Commit notification | Unspecified | Piggybacked on next Prepare |
| Membership changes | Exercise for reader | Discussed (briefly) in VR Revisited |
The Naming Confusion
The names are unfortunate. VR calls its normal-operation messages “Prepare” and “PrepareOK,” which in Paxos terminology means something completely different (Phase 1). VR’s “Prepare” is equivalent to Paxos Phase 2 “Accept.” This naming collision has confused every person who has tried to learn both protocols, and it will continue to confuse people until the heat death of the universe.
For clarity, here is the mapping:
| VR Term | Paxos Equivalent |
|---|---|
| Prepare (normal operation) | Accept (Phase 2) |
| PrepareOK | Accepted |
| StartViewChange + DoViewChange | Prepare + Promise (Phase 1) |
| StartView | (New leader announces itself — no direct equivalent) |
| View | Ballot / Proposal number |
| Primary | Leader / Distinguished proposer |
What VR Gets Right
Deterministic leader selection. VR’s “replica v mod n” approach eliminates the need for a separate leader election protocol. Everyone knows who the leader should be. If the leader fails, everyone knows who the next leader should be. This removes an entire class of bugs related to multiple nodes believing they are the leader simultaneously.
There is a downside: if the designated leader for view v+1 is also down, the system must quickly detect this and move to view v+2. In practice, this means the view change protocol needs good timeout tuning — too aggressive and you skip over healthy leaders, too conservative and you waste time trying to contact dead ones.
No log gaps. VR’s insistence on a contiguous log (like Raft, and unlike Multi-Paxos) simplifies everything downstream: commit tracking, state machine application, snapshotting, follower catchup. You never need to fill gaps with no-ops, because gaps don’t exist.
Integrated client interaction. VR specifies how clients interact with the system, including exactly-once semantics via the client table. This is a crucial part of any real system that Paxos papers don’t address.
Complete recovery protocol. VR specifies how to bring back a crashed replica. Paxos says “read from disk.” VR says “here is a protocol that works even if the disk is unreliable.”
The VR Revisited Paper
In 2012, Liskov published “Viewstamped Replication Revisited,” which updated the original protocol with several improvements:
- Cleaner presentation. The original 1988 paper was written in a different academic style and used different terminology. The revisited version is more accessible.
- Explicit state transfer. The revisited version clearly specifies how a lagging replica catches up, including snapshot-based state transfer.
- Reconfiguration. The revisited version includes a reconfiguration protocol for changing the replica set. This uses an “epoch” mechanism where the old configuration and new configuration coordinate through a special reconfiguration request.
- Optimizations. The revisited version discusses batching, pipelining, and other performance optimizations.
The VR Revisited paper is, in some ways, the paper that Multi-Paxos never got — a complete, self-contained description of a replicated state machine protocol with all the practical details included. It is 20 pages long and covers everything from normal operation to recovery to reconfiguration. Compare this with Multi-Paxos, which is spread across half a dozen papers, none of which tells the complete story.
Performance Analysis
In normal operation (stable leader, no failures), VR and Multi-Paxos have identical performance characteristics:
- Message complexity per operation: 2(n-1) messages — the primary sends Prepare to n-1 backups and receives n-1 PrepareOK responses. (In practice, you only need f responses, so you might send to all but only wait for f.)
- Round trips: 1 (Prepare/PrepareOK, equivalent to Accept/Accepted).
- Fsyncs per operation: 1 on the primary, 1 on each backup that persists before responding.
- Latency: Network RTT + fsync time.
View changes add latency:
- StartViewChange phase: 1 message delay (fan-out to all replicas).
- DoViewChange phase: 1 message delay (converge on new primary).
- StartView phase: 1 message delay (fan-out from new primary).
- Total view change latency: Approximately 3 message delays + processing time.
This is comparable to Raft’s leader election latency (1-2 election timeouts + message delays) and Multi-Paxos’s Phase 1 latency (1 round trip for Prepare/Promise).
Implementation Considerations
If you’re considering implementing VR, here are the things the paper glosses over:
Timeout Tuning
The view change is triggered by a timeout. How long should this timeout be? Too short and you get false positives — you change views when the primary is just slow, not dead. Too long and the system is unavailable for the entire timeout period after a real failure.
The standard approach is:
- Heartbeat interval: 50-100ms
- Election timeout: 3-10x the heartbeat interval
- Randomize the election timeout to avoid synchronized view changes
But “3-10x” is a wide range, and the right value depends on your network characteristics, your latency requirements, and how much you trust your failure detector. In a LAN environment, heartbeats every 50ms and a 300ms election timeout work well. In a WAN environment, you might need 500ms heartbeats and 5-second election timeouts.
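A minimal sketch of the randomized-timeout scheme described above (the constants are illustrative, not prescriptive):

```python
import random

def election_timeout_ms(heartbeat_ms: float, lo_mult: float = 3, hi_mult: float = 10) -> float:
    """Pick a randomized election timeout in [lo_mult, hi_mult] x heartbeat.

    Randomization staggers replicas so they don't all start a view
    change at the same instant after a primary failure.
    """
    return heartbeat_ms * random.uniform(lo_mult, hi_mult)

# With 50ms heartbeats, the timeout lands somewhere in [150ms, 500ms].
t = election_timeout_ms(50)
assert 150 <= t <= 500
```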
Exactly-Once Semantics
VR’s client table provides exactly-once semantics: if a client retries a request, the primary detects the duplicate and returns the cached response. But the client table grows without bound unless you have a mechanism to garbage-collect old entries.
The standard approach is to require clients to include the request number of their last completed request in each new request, allowing the server to discard table entries for older requests. This works, but it requires careful coordination between client and server, and it breaks down if clients crash and restart with a new identity.
function gc_client_table():
    // client_latest_ack[c] is the highest request-number that client c
    // has reported completing (piggybacked on each new request)
    for client_id, (req_num, reply) in self.client_table:
        // The client has seen this reply, so no retry can arrive for it
        if req_num <= client_latest_ack[client_id]:
            delete self.client_table[client_id]
State Transfer
When a new replica joins or a crashed replica recovers far behind the current state, it needs a full state transfer. VR Revisited describes this but doesn’t specify the transfer mechanism in detail.
In practice, you need:
- A consistent snapshot of the state machine
- The log suffix from the snapshot point to the current op-number
- A way to send this over the network without blocking normal operation
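The shape of the transferred state can be sketched as follows (the type and field names are illustrative assumptions, not part of the VR specification):

```python
from dataclasses import dataclass, field

@dataclass
class StateTransfer:
    """What a lagging replica needs: a snapshot plus the log suffix after it."""
    snapshot_op_number: int                          # op-number the snapshot reflects
    snapshot: bytes                                  # serialized state machine at that point
    log_suffix: list = field(default_factory=list)   # log entries after the snapshot

    def last_op_number(self) -> int:
        return self.snapshot_op_number + len(self.log_suffix)

# A replica that snapshotted at op 100 and ships three later entries
# brings the receiver up to op 103.
xfer = StateTransfer(snapshot_op_number=100, snapshot=b"...",
                     log_suffix=["op101", "op102", "op103"])
assert xfer.last_op_number() == 103
```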
This is the same problem as Raft’s InstallSnapshot and Multi-Paxos’s snapshot transfer. Everyone solves it, everyone finds it more annoying than expected.
Disk Persistence
The original VR paper assumes replicas can operate without disk persistence, relying on the recovery protocol to rebuild state from other replicas. This is attractive — fsyncs are expensive and eliminating them from the critical path improves latency dramatically.
However, diskless operation has a major limitation: if f+1 replicas fail simultaneously (even briefly), the system can lose committed data. With disk persistence, replicas can recover independently by reading their state from disk. Without it, they depend on other replicas being available.
In practice, most VR implementations use disk persistence for the log (just like Multi-Paxos and Raft), and use the recovery protocol as a backup for cases where the disk is corrupted or the replica is being replaced entirely.
Why VR Deserves More Attention
It is fashionable to say that Raft made consensus understandable. And Raft did excellent work on pedagogy — the paper is well-written, the visualization tools are helpful, and the design was explicitly optimized for comprehensibility.
But VR Revisited, published two years before the Raft paper, is also comprehensible. It’s also a complete system design. It also avoids log gaps. It also has a clean leader change protocol. The main differences between VR and Raft are:
- Raft uses randomized election timeouts; VR uses deterministic leader selection.
- Raft’s leader completeness property (the leader always has the most complete log) is enforced during election; VR transfers the best log during view change.
- Raft was accompanied by an excellent pedagogy campaign including videos, visualizations, and a reference implementation.
Point 3 is probably the most important. VR Revisited is a good paper. Raft is a good paper with a marketing campaign. In academia, as in industry, marketing matters.
If you are building a new system and choosing a consensus protocol, VR is a legitimate option. It has a smaller community and fewer reference implementations than Raft, which is a real practical disadvantage. But the protocol itself is sound, well-specified, and battle-tested (it’s the ancestor of protocols used in several production systems, even if they don’t acknowledge the lineage).
A Note on the Broader VR Family
VR influenced several later protocols:
- PBFT (1999). Castro and Liskov’s Practical Byzantine Fault Tolerance extends VR’s view change mechanism to handle Byzantine faults. The view change protocol in PBFT is recognizably a descendant of VR’s.
- Zab (2011). ZooKeeper Atomic Broadcast (Chapter 9) shares several design choices with VR, including the primary-backup model with ordered broadcasts and a view-based leader change mechanism.
- Raft (2014). Raft’s design is more similar to VR than to Paxos, particularly in its insistence on a contiguous log and its clean separation of leader election from normal operation.
The irony is rich: VR’s ideas live on in protocols that are far more famous than VR itself. It is the Velvet Underground of consensus protocols — not many people implemented it directly, but everyone who did went on to build something influential.
Summary
Viewstamped Replication is a complete, practical, well-specified consensus protocol that predates Paxos and anticipates Raft. Its obscurity is a historical accident, not a reflection of its technical merit. If you are studying consensus algorithms, reading VR Revisited alongside the Raft paper is illuminating — the similarities highlight what is fundamental about leader-based consensus, and the differences highlight design choices that are matters of taste rather than correctness.
The main lesson of VR’s story is that in distributed systems, as in life, being first and being right is no guarantee of being remembered. You also have to be named after a Greek island, apparently.