Virtual Consensus and Log-Based Architectures
Just Use a Log
There is a certain kind of distributed systems architect who, when confronted with any problem, will say: “Just use a log.” Need to replicate state? Write changes to a log and replay them. Need to coordinate services? Put coordination messages in a log. Need to build a database? It is a log with indexes on top. Need to make breakfast? Write the recipe to a log and have your toaster subscribe.
This sounds glib, but there is a deep truth underneath the meme. A log — specifically, a durable, replicated, totally ordered log — is equivalent to consensus. If you have a replicated log, you can build any replicated state machine on top of it. If you have consensus, you can build a replicated log. The two are formally equivalent.
Virtual consensus takes this observation to its logical conclusion: factor out the consensus into a shared log service, and build everything else — state machines, coordination services, databases — as consumers of that log. The consensus happens once, at the log level, and everything above it is “just” deterministic replay.
The idea is elegant, practical, and — in the right contexts — genuinely transformative. It is also, as with most things in distributed systems, rather harder than it sounds.
The Shared Log Abstraction
A shared log provides a simple API:
Interface SharedLog:
    // Append an entry to the log. Returns the position (offset) assigned.
    // This is the ONLY operation that requires consensus.
    Function Append(entry: bytes) -> LogPosition

    // Read the entry at a given position.
    // This is a simple read — no consensus needed.
    Function Read(position: LogPosition) -> bytes

    // Get the current tail of the log (the next position to be written).
    Function GetTail() -> LogPosition

    // Read a range of entries.
    Function ReadRange(start: LogPosition, end: LogPosition) -> List<bytes>
The critical insight: only Append requires consensus (to establish the total order). Read, ReadRange, and GetTail are simple storage operations that can be served from any replica that has the data.
This separation is powerful because it means the consensus protocol — the hard, expensive, latency-sensitive part — is invoked only when new data enters the system. Reading existing data is cheap and can be parallelized.
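To make the separation concrete, here is a toy, single-process sketch of the interface in Python. The names are illustrative, and a `threading.Lock` stands in for the one place a real implementation needs consensus; a deterministic state machine is then "just" replay:

```python
import threading

class InMemorySharedLog:
    """Toy stand-in for the SharedLog interface above. In a real system,
    Append would go through a replicated consensus protocol; here a lock
    models that single serialization point."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()   # stand-in for consensus on Append

    def append(self, entry: bytes) -> int:
        with self._lock:                # only Append coordinates
            self._entries.append(entry)
            return len(self._entries) - 1

    def read(self, position: int) -> bytes:
        return self._entries[position]  # plain read, no coordination

    def get_tail(self) -> int:
        return len(self._entries)

    def read_range(self, start: int, end: int):
        return self._entries[start:end]

def replay_counter(log: InMemorySharedLog) -> int:
    """A state machine built on the log: deterministic replay of entries."""
    total = 0
    for pos in range(log.get_tail()):
        total += int(log.read(pos))
    return total
```

Any consumer that replays the same prefix of the log computes the same state, which is the property everything later in this chapter builds on.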
Corfu: The Original Shared Log
Corfu (published by Balakrishnan et al. at NSDI 2012) was the first system to seriously explore the shared log abstraction as a building block for distributed systems.
Corfu’s architecture separates the log into two planes:
The sequencer is a single node responsible for issuing log positions. When a client wants to append, it first asks the sequencer for the next available position. The sequencer is essentially a counter — it returns the next position and increments.
The storage nodes store the actual log entries. The log is striped across multiple storage nodes for parallelism. Each log position maps to a specific storage node (using a simple hash or a configurable mapping).
// Corfu architecture
Structure CorfuClient:
    sequencer: SequencerClient
    storage_nodes: List<StorageNodeClient>
    layout: LayoutMap  // maps log positions to storage nodes

Procedure Append(client, entry):
    // Step 1: Get the next log position from the sequencer
    position = client.sequencer.GetToken()
    // Step 2: Write to the appropriate storage node
    storage_node = client.layout.GetNode(position)
    success = storage_node.Write(position, entry)
    if not success:
        // Position was already written (race with another client).
        // Back off and retry with a fresh position.
        return Retry(client, entry)
    return position

Procedure Read(client, position):
    storage_node = client.layout.GetNode(position)
    return storage_node.Read(position)
Wait — where is the consensus? The sequencer is a single node. If it fails, the system is unavailable for appends. Is this not the leader bottleneck all over again?
Yes and no. The sequencer is not a consensus protocol. It is a simple counter. Its failure handling is straightforward: a new sequencer is elected (using a separate consensus protocol, like Paxos, for the election), the new sequencer reads the storage nodes to find the current tail of the log, and it resumes issuing positions.
The key design choice is that the sequencer does not store any durable state — it is a soft-state cache of the current log position. This means failover is fast (discover the tail from storage nodes, start issuing positions) and the sequencer is not a durability bottleneck.
The consensus for durability happens implicitly: once a client writes an entry to a storage node at a position, that entry is durable. The total order is established by the sequencer’s position assignment. As long as the sequencer assigns unique, monotonically increasing positions, the log is totally ordered.
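The soft-state design makes sequencer failover easy to sketch. The Python below is an illustrative model (the class names and `max_written` query are assumptions, not Corfu's actual API): a newly elected sequencer scans the storage nodes for the highest written position and resumes counting from there.

```python
class StorageNodeStub:
    """Hypothetical storage node that can report its highest written
    position, which is all a recovering sequencer needs."""
    def __init__(self):
        self.cells = {}  # position -> entry (write-once)

    def max_written(self) -> int:
        return max(self.cells) if self.cells else -1

class Sequencer:
    """Corfu-style soft-state counter: no durable state of its own, so
    recovery is just rediscovering the tail from the storage nodes."""
    def __init__(self, storage_nodes):
        # Failover path: find the current tail, then resume issuing tokens.
        tail = max((n.max_written() for n in storage_nodes), default=-1) + 1
        self.next_position = tail

    def get_token(self) -> int:
        pos = self.next_position
        self.next_position += 1
        return pos
```

Because the counter is reconstructed from storage, a sequencer crash loses no data; at worst it hands out a position that was already written, and the write-once storage layer rejects the duplicate.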
Corfu’s Storage Layer
Each storage node manages a simple write-once address space:
Structure StorageNode:
    log: Map<LogPosition, bytes>
    trimmed_positions: Set<LogPosition>

Procedure HandleWrite(position, entry):
    if position in log:
        Reply(ErrorAlreadyWritten)
    else if position in trimmed_positions:
        Reply(ErrorTrimmed)
    else:
        log[position] = entry
        Reply(OK)

Procedure HandleRead(position):
    if position in log:
        Reply(OK, log[position])
    else if position in trimmed_positions:
        Reply(ErrorTrimmed)
    else:
        Reply(ErrorNotWritten)
The write-once semantics are crucial. If two clients somehow get the same position from the sequencer (which should not happen but might during sequencer failover), only one will succeed. The other will see ErrorAlreadyWritten and must retry with a new position.
For durability, Corfu replicates each log entry to multiple storage nodes using chain replication. The client writes to the head of the chain, which forwards to the next node, and so on. The write is considered durable when the tail of the chain acknowledges.
// Chain replication for durability
Procedure AppendWithChainReplication(client, entry):
    position = client.sequencer.GetToken()
    // Get the replication chain for this position
    chain = client.layout.GetChain(position)
    // Write to the head of the chain — it propagates to the tail
    head = chain[0]
    success = head.ChainWrite(position, entry, chain)
    if not success:
        return Retry(client, entry)
    return position
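The node-side half of that protocol can be sketched as a recursive forward along the chain. This Python model is an assumption about how `ChainWrite` behaves, not Corfu's actual implementation; it combines write-once cells with head-to-tail propagation:

```python
class ChainNode:
    """One storage node in a replication chain with write-once cells."""
    def __init__(self):
        self.cells = {}  # position -> entry

    def chain_write(self, position: int, entry: bytes, rest_of_chain) -> bool:
        if position in self.cells:
            return False                  # write-once violated: reject
        self.cells[position] = entry
        if rest_of_chain:                 # forward toward the tail
            head, *rest = rest_of_chain
            return head.chain_write(position, entry, rest)
        return True                       # tail reached: write is durable
```

Only when the tail acknowledges does the write count as durable, which is what gives chain replication its simple read rule (read from the tail, which has everything that is committed).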
Tango: Virtual State Machines on a Shared Log
Tango (Balakrishnan et al., SOSP 2013) builds on Corfu to create virtual replicated state machines. The idea: instead of running a separate consensus protocol for each replicated object, put all operations into the shared log and have each object replay the log to update its state.
Structure TangoObject:
    oid: ObjectId
    state: Any            // the application-level state
    version: LogPosition  // last log position applied to this object
    log_client: CorfuClient

Procedure Apply(tango_obj, operation):
    // Step 1: Append the operation to the shared log
    entry = LogEntry{
        object_id: tango_obj.oid,
        operation: operation
    }
    position = tango_obj.log_client.Append(Serialize(entry))
    // Step 2: Replay the log up to the new position to update state
    Sync(tango_obj, position)
    return tango_obj.state

Procedure Sync(tango_obj, up_to_position):
    // Read and apply all log entries from our current version to up_to_position
    for pos in range(tango_obj.version + 1, up_to_position + 1):
        entry = tango_obj.log_client.Read(pos)
        log_entry = Deserialize(entry)
        if log_entry.object_id == tango_obj.oid:
            // This entry is for our object — apply it
            tango_obj.state = ApplyOperation(tango_obj.state, log_entry.operation)
        // Entries for other objects are skipped
    tango_obj.version = up_to_position
This is remarkable. Multiple independent objects — a key-value store, a counter, a queue — can all be replicated by sharing a single log. The objects do not know about each other. They do not need their own consensus protocols. They just read the log.
Even more remarkably, this gives you cross-object transactions for free. If you want to atomically update objects A and B, you append a single log entry that contains both operations. When each object replays the log, they both apply the transaction. Because the log is totally ordered, all replicas see the transaction at the same position and apply it atomically.
// Cross-object transaction using the shared log
Procedure TransactionalUpdate(log_client, operations: List<(ObjectId, Operation)>):
    entry = TransactionEntry{
        operations: operations
    }
    position = log_client.Append(Serialize(entry))
    // Each object, when it replays the log, will see this entry
    // and apply the operation intended for it
    return position

// Modified Sync to handle transactions
Procedure SyncWithTransactions(tango_obj, up_to_position):
    for pos in range(tango_obj.version + 1, up_to_position + 1):
        entry = tango_obj.log_client.Read(pos)
        parsed = Deserialize(entry)
        if parsed is TransactionEntry:
            for (oid, operation) in parsed.operations:
                if oid == tango_obj.oid:
                    tango_obj.state = ApplyOperation(tango_obj.state, operation)
        else if parsed is LogEntry:
            if parsed.object_id == tango_obj.oid:
                tango_obj.state = ApplyOperation(tango_obj.state, parsed.operation)
    tango_obj.version = up_to_position
The Catch: Replay Cost
The elegance of Tango comes with a cost: every object must replay the entire log, filtering for entries relevant to it. If the log has a million entries and only 1% are relevant to your object, you are reading 990,000 irrelevant entries.
Tango addresses this with stream multiplexing: the log supports multiple named streams, and each object is assigned to a stream. The log maintains per-stream indexes, so an object can efficiently read only the entries in its stream.
// Stream-aware shared log
Interface StreamAwareLog:
    Function Append(stream_id: StreamId, entry: bytes) -> LogPosition
    Function ReadStream(stream_id: StreamId, from: LogPosition) -> List<(LogPosition, bytes)>

    // For cross-stream transactions: append to multiple streams atomically
    Function AppendMultiStream(stream_ids: List<StreamId>, entry: bytes) -> LogPosition
This works, but it introduces complexity. The stream index must be maintained consistently. Cross-stream transactions must be visible to all involved streams. And the space overhead of the index itself can be significant.
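A minimal model of the mechanism, in Python, shows both the benefit and the overhead. This is a sketch under simplifying assumptions (a single process, an in-memory index), not Tango's actual stream implementation; note how a multi-stream append occupies one log position but is indexed under every involved stream:

```python
from collections import defaultdict

class StreamMultiplexedLog:
    """One totally ordered log plus a per-stream index of positions."""
    def __init__(self):
        self.entries = []               # the single shared log
        self.index = defaultdict(list)  # stream_id -> sorted positions

    def append(self, stream_ids, entry: bytes) -> int:
        # One position, indexed under every stream: this is what makes a
        # cross-stream transaction visible to all of its participants.
        pos = len(self.entries)
        self.entries.append(entry)
        for sid in stream_ids:
            self.index[sid].append(pos)
        return pos

    def read_stream(self, stream_id, from_pos: int = 0):
        # Read only this stream's entries, skipping everything else.
        return [(p, self.entries[p])
                for p in self.index[stream_id] if p >= from_pos]
```

The index is exactly the "space overhead" the text mentions: it grows with the log, and it must be updated atomically with the append or readers will miss entries.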
The Relationship to Kafka
If this all sounds familiar, it should. Kafka is, at its core, a shared log. Producers append to topics (which are logs). Consumers read from topics. The ordering within a partition is total. Multiple applications can consume the same topic independently.
The differences between Kafka and Corfu are architectural, not conceptual:
| Property | Corfu | Kafka |
|---|---|---|
| Ordering | Global total order (single log) | Per-partition order |
| Sequencing | Dedicated sequencer node | Partition leader |
| Replication | Chain replication | ISR (pull-based) |
| Storage | Write-once address space | Append-only segments |
| Primary use case | Building replicated state machines | Message passing and stream processing |
| Client model | Library (in-process) | Service (over network) |
Kafka trades global total order for partitioned parallelism. This makes it horizontally scalable (different partitions can be on different brokers) but means you cannot do cross-partition transactions without external coordination. Corfu provides global order but concentrates sequencing in a single node, which limits throughput.
The log-centric philosophy is the same. The trade-offs reflect different priorities.
Total Order Broadcast and Its Equivalence to Consensus
At this point, we should make explicit what has been implicit throughout this chapter: total order broadcast (TOB) and consensus are equivalent problems. Given a solution to one, you can build a solution to the other.
Consensus from TOB: To run consensus, broadcast your proposed value via TOB. All nodes receive all proposals in the same order. The first proposal received is the decided value.
TOB from Consensus: To broadcast a message with total order, use consensus to agree on the next message to deliver. Run a consensus instance for each position in the delivery sequence.
A shared log is total order broadcast. Append broadcasts a message. The log position is the total order. Every consumer sees the same messages in the same order.
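One direction of the reduction fits in a few lines of Python. This is an illustrative sketch (the class names are mine, and real proposers would run on separate machines talking to a replicated log): to decide a value, append your proposal; the first entry in log order is the decision everyone reads.

```python
class TinyLog:
    """Stand-in for a totally ordered, shared log."""
    def __init__(self):
        self.entries = []

    def append(self, entry: bytes) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1

    def read(self, position: int) -> bytes:
        return self.entries[position]

class LogBackedConsensus:
    """Consensus from total order broadcast: all proposers append, and
    the proposal at position 0 is the decided value for everyone."""
    def __init__(self, log: TinyLog):
        self.log = log

    def propose(self, value: bytes) -> bytes:
        self.log.append(value)   # broadcast the proposal via the log
        return self.log.read(0)  # first delivered proposal wins
```

Every proposer gets the same answer because they all read the same position of the same ordered log, which is the equivalence in miniature.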
This equivalence is why “just use a log” is both profound and hand-wavy. It is profound because it reduces every distributed coordination problem to log appending. It is hand-wavy because log appending is consensus — you have not eliminated the hard problem, you have just given it a different name and a cleaner interface.
The value of the log abstraction is not that it makes consensus easy. It is that it makes consensus reusable. Solve consensus once, in the log, and every application built on top gets consensus for free.
CorfuDB and VMware’s NSX
CorfuDB is the open-source descendant of Corfu, developed within VMware. It was used in production as the metadata store for VMware NSX (the network virtualization platform).
The architecture is a practical realization of the Tango vision: CorfuDB provides a replicated object framework where application objects are backed by a shared log. The framework handles serialization, log replay, snapshotting, and transaction support.
// CorfuDB-style replicated object
@CorfuObject
Structure NetworkPolicy:
    rules: List<FirewallRule>
    version: Integer

    @Mutator
    Procedure AddRule(rule):
        // This method is automatically logged to the shared log
        self.rules.append(rule)
        self.version++

    @Accessor
    Function GetRules():
        // This method triggers a log sync before reading
        return self.rules

    @TransactionalMutator
    Procedure ReplaceAllRules(new_rules):
        // This runs within a transaction
        self.rules = new_rules
        self.version++
The VMware NSX use case is illustrative: network policies need to be replicated across multiple controllers consistently. Using a shared log means all controllers see the same sequence of policy changes and apply them in the same order. The log provides both the replication mechanism and the consistency guarantee.
In practice, CorfuDB encountered the usual challenges of log-based systems: garbage collection (old log entries need to be trimmed), checkpointing (to avoid replaying the entire log on startup), and handling of slow consumers (who fall too far behind the log tail).
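The two standard mitigations can be sketched together in Python. This is a hedged illustration of the general pattern, not CorfuDB's code: a checkpoint folds a log prefix into a snapshot so startup replays only the suffix, and a trim point is safe only at the minimum of all consumer positions.

```python
class MiniLog:
    """Minimal ordered log for the sketch below."""
    def __init__(self):
        self.entries = []

    def append(self, entry: bytes):
        self.entries.append(entry)

    def get_tail(self) -> int:
        return len(self.entries)

    def read(self, pos: int) -> bytes:
        return self.entries[pos]

class CheckpointedReplica:
    """Folds applied entries into a snapshot (here: a running sum), so
    recovery replays from snapshot_version, not from position 0."""
    def __init__(self):
        self.snapshot = 0
        self.snapshot_version = -1  # last position folded into the snapshot

    def checkpoint(self, log: MiniLog):
        for pos in range(self.snapshot_version + 1, log.get_tail()):
            self.snapshot += int(log.read(pos))
        self.snapshot_version = log.get_tail() - 1

def safe_trim_point(consumer_positions) -> int:
    # Entries at or below the slowest consumer's position are the only
    # ones no longer needed by anyone; this is why trimming requires
    # tracking consumer progress.
    return min(consumer_positions)
```

A single slow consumer pins the trim point for the whole log, which is exactly the slow-consumer problem the paragraph above describes.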
Delos: Virtual Consensus at Meta
Delos, developed at Meta (formerly Facebook), takes the virtual consensus idea one step further. It asks: what if the consensus implementation itself could be swapped out behind the log API?
The architecture has three layers:
- VirtualLog: A log API that applications program against.
- Loglet: A pluggable implementation of a log segment. Different loglets can use different consensus protocols (or no consensus at all).
- MetaStore: A small, reliable store that tracks which loglet is responsible for which range of the log.
// Delos architecture
Interface Loglet:
    Function Append(entry: bytes) -> LogPosition
    Function Read(position: LogPosition) -> bytes
    Function Seal() -> LogPosition  // prevent further appends, return final position

Structure VirtualLog:
    metastore: MetaStore
    active_loglet: Loglet
    sealed_loglets: List<(LogPositionRange, Loglet)>

Procedure Append(vlog, entry):
    // Append to the currently active loglet
    try:
        position = vlog.active_loglet.Append(entry)
        return vlog.TranslateToGlobalPosition(position)
    catch LogletSealedException:
        // The active loglet was sealed — switch to a new one
        vlog.SwitchLoglet()
        return Append(vlog, entry)  // retry with the new loglet

Procedure Read(vlog, global_position):
    // Find which loglet contains this position
    loglet = vlog.FindLoglet(global_position)
    local_position = vlog.TranslateToLocalPosition(global_position)
    return loglet.Read(local_position)
Procedure SwitchLoglet(vlog):
    // Seal the current loglet so it accepts no further appends
    old_loglet = vlog.active_loglet
    final_pos = old_loglet.Seal()
    // Record the sealed loglet's range in the metastore
    sealed_range = (StartPositionOf(old_loglet), final_pos)
    vlog.metastore.RecordSealedLoglet(old_loglet, sealed_range)
    vlog.sealed_loglets.append((sealed_range, old_loglet))
    // Create a new loglet (possibly using a different implementation!)
    vlog.active_loglet = CreateLoglet(vlog.metastore.GetActiveLogletConfig())
The brilliance of Delos is the Seal() operation. When a loglet is sealed, no more entries can be appended to it. The loglet becomes read-only. A new loglet is created for future appends. This allows:
- Live migration between consensus protocols. Seal the old loglet (backed by, say, Paxos), create a new one (backed by Raft), and continue. Applications see a seamless log.
- Heterogeneous storage. Old loglets can be backed by cheap, slow storage. The active loglet uses fast, expensive storage. Historical reads go to the cold tier; current appends go to the hot tier.
- Testing and experimentation. You can run a new consensus implementation for a subset of the log (a single loglet), compare its behavior with the production implementation, and switch back if problems arise.
- Reconfiguration without downtime. Changing the replica set? Seal the old loglet, create a new one with the new replica set, done.
Delos in Practice
Meta uses Delos for several internal control plane services. The initial deployment used a simple loglet backed by a ZooKeeper ensemble. Later deployments used custom loglet implementations optimized for Meta’s infrastructure.
The practical benefits reported by Meta:
- Faster iteration. New consensus implementations can be deployed behind the VirtualLog API without changing application code.
- Easier testing. The loglet interface is small and well-defined, making it easier to test new implementations in isolation.
- Graceful degradation. If a loglet implementation has a bug, it can be sealed and replaced, limiting the blast radius.
- Separation of concerns. Application developers think in terms of log operations. Infrastructure engineers think in terms of loglet implementations. The VirtualLog bridges the gap.
Why “Just Use a Log” Is Both Profound and Hand-Wavy
Let us be honest about the limitations of the log-based approach.
The Profundity
The shared log abstraction genuinely simplifies distributed system design. Instead of every application implementing its own replication protocol, they all share a single, well-tested, well-understood log. The benefits are real:
- Correctness by construction. If the log is correct (totally ordered, durable, replicated), then any deterministic state machine built on top is correct. You prove the log correct once; applications are correct by construction.
- Composability. Multiple objects can share a log, and cross-object transactions are straightforward. This is much harder with per-object consensus protocols.
- Operational simplicity. One system to monitor, tune, and debug, rather than N separate consensus implementations.
The Hand-Waviness
The log is a bottleneck. A single totally ordered log has a throughput limit determined by the consensus protocol and the sequencer. For write-heavy workloads, this bottleneck is real. Partitioning the log (like Kafka does) restores throughput but sacrifices global total order — the very thing the log was supposed to provide.
Replay is not free. Building state by replaying a log means that startup time grows with log length. Snapshotting mitigates this but adds complexity. A consumer that falls behind must catch up, which can take significant time for high-volume logs.
The log grows. An append-only log grows without bound. Garbage collection, compaction, and trimming are necessary and non-trivial. When can you safely trim old entries? Only when all consumers have processed them — which requires tracking consumer progress, which requires coordination.
Determinism is hard. Log-based state machine replication requires that all replicas execute the same operations and produce the same results. This means operations must be deterministic. No random numbers, no system clock reads, no external I/O during replay. Enforcing determinism in application code is a constant battle.
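A common mitigation, sketched below in Python, is to capture every nondeterministic input at append time so that replay becomes a pure function of the log entry. The function names and the constant seed and timestamp are placeholder values for illustration, not any particular system's format.

```python
import random

def bad_handler(state: int, _entry: dict) -> int:
    # Nondeterministic: each replica draws its own random number during
    # replay, so replicas diverge.
    return state + random.randint(1, 6)

def make_entry(op: str) -> dict:
    # Capture nondeterministic inputs (seed, clock reading) when the
    # operation is appended, not when it is replayed.
    return {"op": op, "seed": 42, "timestamp": 1700000000.0}

def good_handler(state: int, entry: dict) -> int:
    # Deterministic: the randomness is seeded from the log entry itself,
    # so every replica computes the same result.
    rng = random.Random(entry["seed"])
    return state + rng.randint(1, 6)
```

The same trick applies to timestamps (read the clock at append time and log the value) and to external I/O (perform it once, log the result, replay the logged result).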
Latency characteristics are not always great. Every operation goes through the log: append, wait for commit, sync, apply. For read-heavy workloads, this adds unnecessary latency. Optimizations like read leases or local read paths are necessary but add complexity.
// The hidden costs of log-based replication
Procedure HandleClientRequest(request):
    // Step 1: Append to the log (consensus latency)
    position = log.Append(Serialize(request))
    // Latency: network RTT to a quorum + disk fsync

    // Step 2: Wait for the local replica to catch up to this position
    WaitUntil(local_state.version >= position)
    // Latency: depends on how far behind we are

    // Step 3: Read the result from local state
    result = local_state.GetResult(position)
    return result

// Total latency = consensus + replay catch-up + local read
// vs. direct state machine replication, where latency = consensus + apply
// The catch-up step is the extra cost of indirection through the log
Why This Approach Has Not Taken Over the World
Despite the elegance of virtual consensus and log-based architectures, they remain niche. Most distributed systems use embedded consensus (Raft in etcd, Paxos in Spanner, ZAB in ZooKeeper) rather than building on a shared log. Why?
Coupling vs. decoupling. A shared log introduces a dependency between all services that use it. If the log is down, everything is down. If the log is slow, everything is slow. Embedded consensus has failure isolation — a problem with etcd does not affect your database.
The abstraction is leaky. Applications need to understand log positions, replay, snapshotting, and trimming. The log API is simple, but using it correctly requires understanding the distributed systems underneath. The abstraction saves implementation effort but not understanding effort.
Existing ecosystem. etcd, ZooKeeper, and Consul provide battle-tested coordination services with rich APIs (watches, ephemeral nodes, transactions). A shared log provides a lower-level primitive that requires more work to turn into a usable coordination service.
Performance trade-offs. A general-purpose shared log is optimized for no specific workload. A purpose-built consensus protocol (like Raft in CockroachDB, tuned for the database’s access patterns) can outperform a general-purpose log for its specific use case.
Organizational factors. In most organizations, each team manages its own infrastructure. Sharing a log across teams creates coordination overhead (who manages it? who pays for it? whose SLA applies?). This is an organizational problem, not a technical one, but organizational problems kill more architectures than technical ones.
The Spectrum of Log Usage
In practice, systems exist on a spectrum from “embedded consensus” to “fully virtual consensus”:
Embedded Consensus            Hybrid                 Virtual Consensus
        |                        |                          |
      etcd                     Kafka                      Delos
  CockroachDB                  Pulsar                 Corfu/Tango
   ZooKeeper             (log for data,           (everything via log)
(consensus tightly        consensus for
 coupled to the             metadata)
 application)
Most production systems are in the hybrid zone. Kafka uses a log for data but consensus (KRaft) for metadata. Pulsar uses BookKeeper (a log) for data but ZooKeeper (consensus) for coordination. Even CockroachDB, which uses Raft for replication, structures its storage as a log (the Raft log) that is replayed into a state machine (RocksDB/Pebble).
The fully virtual approach (Delos, Corfu) is most viable for control plane services where the workload is metadata-heavy, the throughput requirements are moderate, and the benefits of swappable consensus implementations are high.
A Practical Assessment
The log-based architecture is a genuine contribution to how we think about distributed systems. The insight that consensus can be factored out into a reusable service, and that the log is the right abstraction for that service, has influenced the design of many systems even if they do not use a shared log directly.
For practitioners, the takeaways are:
- Think in terms of logs. Even if you are using Raft or Paxos directly, understanding your system as a replicated log with state machines on top clarifies the architecture.
- Consider the shared log when building control planes. For metadata management, configuration storage, and coordination, a shared log can simplify the architecture significantly. Delos’s success at Meta is evidence that this works at scale.
- Do not use a shared log for everything. High-throughput data paths are better served by purpose-built protocols (like Kafka’s ISR for message streaming, or CockroachDB’s Raft for database replication). The generality of the shared log comes at a performance cost.
- The VirtualLog pattern is powerful. Even if you do not use Delos specifically, the idea of separating the log abstraction from the consensus implementation — allowing the implementation to be swapped, upgraded, or reconfigured without affecting applications — is a design pattern worth adopting.
The shared log is not the answer to every distributed systems problem. But it is a remarkably good answer to the question “how do I factor consensus out of my application logic?” — and that question comes up more often than you might think.
In the end, the log is not magic. It is consensus with a better API. And sometimes, a better API is exactly what you need.