
CRDTs: Avoiding Consensus Entirely

The Nuclear Option

Every chapter in this book so far has been about getting nodes to agree. Paxos, Raft, ZAB, EPaxos — all of them are mechanisms for forcing agreement on a single value, a single order, a single truth. The protocols differ in elegance, performance, and how much suffering they inflict on implementers, but they share a common assumption: agreement is necessary.

CRDTs ask a heretical question: what if it is not?

Conflict-Free Replicated Data Types are data structures designed so that concurrent updates by different replicas always converge to the same state, without any coordination. No leader. No quorums. No message ordering. No consensus. Each replica processes updates locally, syncs with other replicas whenever it can, and everyone eventually ends up in the same state.

This sounds too good to be true. It is — for general computation. But for a specific and surprisingly useful class of problems, CRDTs deliver on their promise. The trick is understanding exactly how narrow that class is, and how much design contortion is required to stay within it.

The Mathematical Foundation

CRDTs are built on a simple algebraic structure: the join-semilattice.

A join-semilattice is a set S with a binary operation (called join, written as ⊔ or merge) that is:

  1. Commutative: a ⊔ b = b ⊔ a
  2. Associative: (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)
  3. Idempotent: a ⊔ a = a

These three properties are what make CRDTs work:

  • Commutativity means the order in which updates arrive does not matter. Replica A can receive update X before Y, while replica B receives Y before X, and they will reach the same state.
  • Associativity means grouping does not matter. You can merge updates pairwise in any order.
  • Idempotency means receiving the same update twice is harmless. No need for exactly-once delivery — at-least-once is sufficient.

If your state and merge operation form a join-semilattice, convergence is guaranteed by mathematics, not by protocol. This is not a probabilistic guarantee or a best-effort one — it is a theorem.
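The three laws are easy to spot-check mechanically. A minimal Python sketch (illustrative names, not a library API) verifies them for per-key max, the join used by the counters later in this chapter:

```python
# Spot-check the three semilattice laws for per-key max merge.
# A property check on sample states, not a proof.
def merge(a: dict, b: dict) -> dict:
    keys = a.keys() | b.keys()
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

x = {"r1": 3, "r2": 1}
y = {"r2": 5, "r3": 2}
z = {"r1": 1, "r3": 7}

assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
```

In a serious implementation you would run the same three assertions as a property-based test over randomly generated states.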

The catch — and there is always a catch — is that your state must only ever move “up” in the lattice. The lattice has a partial order, and every merge operation produces a result that is ≥ both inputs. This means CRDTs can only represent monotonically growing information.

You can count up. You can add to a set. You can record that an event happened. What you cannot do — at least not directly — is count down, remove from a set, or un-happen an event. Every CRDT that appears to support these operations does so through a clever encoding that transforms “removal” into “addition of a tombstone” or similar tricks. We will see the consequences.

State-Based vs. Operation-Based CRDTs

There are two flavors of CRDTs, corresponding to two different ways of achieving convergence:

State-based CRDTs (CvRDTs) synchronize by shipping their entire state to other replicas. The receiving replica merges the received state with its local state using the lattice join. As long as every replica eventually communicates its state to every other replica (directly or transitively), convergence is guaranteed.

Operation-based CRDTs (CmRDTs) synchronize by shipping individual operations. Each operation is broadcast to all replicas, and each replica applies it locally. For this to work, concurrent operations must commute (so delivery order between them does not matter), and the delivery layer must deliver each operation to every replica exactly once (operations are generally not idempotent, so duplicates corrupt state), typically in causal order.

In theory, the two approaches are equivalent — anything expressible as a CvRDT can be expressed as a CmRDT and vice versa. In practice, they have different trade-offs:

| Property | State-based (CvRDT) | Operation-based (CmRDT) |
|---|---|---|
| Network payload | Entire state (can be large) | Individual operation (small) |
| Delivery requirement | Eventual, any order; duplicates harmless | Exactly once per replica, typically causal order |
| Merge complexity | Must implement a lattice join | Must ensure operations commute |
| Metadata overhead | State carries all metadata | Metadata lives mostly in the delivery layer |
| Network efficiency | Poor for large states | Good (small messages) |

Most practical systems use state-based CRDTs with delta-state optimizations (shipping only the changes since last sync rather than the full state). This gives the small message size of operation-based CRDTs with the simpler delivery requirements of state-based ones.
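A delta-state sketch in Python (hypothetical names, not a library API) shows the idea: each update returns a small delta, and deltas merge with the same join as full states, so the convergence argument is unchanged:

```python
# Delta-state counter sketch: increment returns a small delta that can
# be shipped instead of the full state. Deltas and full states merge
# with the same join, so convergence still follows from the lattice.
def increment(state: dict, replica: str) -> dict:
    state[replica] = state.get(replica, 0) + 1
    return {replica: state[replica]}  # the delta: just the changed entry

def merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

a, b = {}, {}
increment(a, "A")            # a = {"A": 1}
delta = increment(a, "A")    # a = {"A": 2}, delta = {"A": 2}
b = merge(b, delta)          # ship only the delta, not all of a
assert b == {"A": 2}
```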

The CRDT Zoo: Concrete Examples

G-Counter (Grow-only Counter)

The simplest useful CRDT. Each replica maintains a separate counter, and the global count is the sum of all replica counters.

Structure GCounter:
    counts: Map<ReplicaId, Integer>  // one counter per replica

Function Increment(gcounter, replica_id):
    gcounter.counts[replica_id] = gcounter.counts.get(replica_id, 0) + 1

Function Value(gcounter):
    return Sum(gcounter.counts.values())

Function Merge(a: GCounter, b: GCounter):
    result = GCounter{}
    for each replica_id in Union(a.counts.keys(), b.counts.keys()):
        result.counts[replica_id] = max(
            a.counts.get(replica_id, 0),
            b.counts.get(replica_id, 0)
        )
    return result

Why does this work? Each replica only increments its own entry. The merge takes the maximum of each entry. Since each entry only grows, max produces the correct count of increments from each replica. The sum of maxima gives the total count.

This is a legitimate join-semilattice: merge is commutative (max is), associative, and idempotent (max(x, x) = x).

The limitation: You can only count up. A “page view counter” works perfectly. A “users currently online” counter does not.
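The pseudocode above translates almost directly into a runnable sketch. An illustrative Python version (names are ours, not a standard API), showing two replicas incrementing independently, syncing in opposite orders, and converging:

```python
# Two replicas increment independently, merge in opposite orders,
# and still converge on the same total. Illustrative sketch.
class GCounter:
    def __init__(self):
        self.counts = {}  # one entry per replica

    def increment(self, replica_id):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

a, b = GCounter(), GCounter()
a.increment("A"); a.increment("A")   # replica A counts two events
b.increment("B")                     # replica B counts one event
a.merge(b); b.merge(a)               # sync in either order
assert a.value() == b.value() == 3
```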

PN-Counter (Positive-Negative Counter)

To support both increment and decrement, use two G-Counters: one for increments (P) and one for decrements (N). The value is P - N.

Structure PNCounter:
    p: GCounter  // positive counts
    n: GCounter  // negative counts

Function Increment(pncounter, replica_id):
    Increment(pncounter.p, replica_id)

Function Decrement(pncounter, replica_id):
    Increment(pncounter.n, replica_id)

Function Value(pncounter):
    return Value(pncounter.p) - Value(pncounter.n)

Function Merge(a: PNCounter, b: PNCounter):
    return PNCounter{
        p: Merge(a.p, b.p),
        n: Merge(a.n, b.n)
    }

This works because both P and N are G-Counters (monotonically increasing), so the overall structure is still a join-semilattice. The value can go down, but the underlying state only grows.

The catch: The counter can go negative. If two replicas concurrently decrement a counter that is at zero, the result is -2. There is no way to enforce a “minimum value of zero” constraint without coordination. This is fundamental — enforcing invariants across replicas requires consensus.
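A short Python sketch (illustrative, not a real API) of exactly this failure: a counter replicated everywhere at value one, decremented concurrently on two replicas:

```python
# Invariant violation demo: both replicas see value 1, both decrement,
# and the merged counter is -1. Nothing in the CRDT can prevent this.
def value(p, n):
    return sum(p.values()) - sum(n.values())

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

p = {"A": 1}                 # one increment, fully replicated
n_a, n_b = {"A": 0}, {"B": 0}
n_a["A"] += 1                # replica A: sees value 1, decrements
n_b["B"] += 1                # replica B: concurrently sees 1, decrements
n = merge(n_a, n_b)          # after sync, both decrements survive
assert value(p, n) == -1     # the "never below zero" invariant is gone
```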

LWW-Register (Last-Writer-Wins Register)

A register that holds a single value. Concurrent writes are resolved by timestamp: the write with the highest timestamp wins.

Structure LWWRegister:
    value: Any
    timestamp: Timestamp

Function Write(register, value, timestamp):
    if timestamp > register.timestamp:
        register.value = value
        register.timestamp = timestamp

Function Read(register):
    return register.value

Function Merge(a: LWWRegister, b: LWWRegister):
    if a.timestamp > b.timestamp:
        return a
    else if b.timestamp > a.timestamp:
        return b
    else:
        // Tie-breaking: use some deterministic rule
        // (e.g., lexicographic comparison of values, or replica ID)
        return DeterministicTieBreak(a, b)

LWW-Register is the most widely used CRDT and also the most philosophically questionable. It “resolves” conflicts by throwing away all but one of the conflicting writes. Whether the “last” write is actually the one the user intended to keep depends on clock accuracy, which in a distributed system is… well, let us just say it is an area of active prayer.

The uncomfortable truth: LWW-Register does not resolve conflicts. It hides them. If two users concurrently edit a document — one writing “yes” and the other writing “no” — LWW will silently pick one and discard the other. No user is notified. No merge is attempted. The loser’s write simply vanishes.

For some use cases (caching, session storage, “last configuration wins”), this is fine. For others (collaborative editing, financial records), it is a disaster wearing a formal proof.
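The silent discard fits in a few lines of Python. The `(value, timestamp, replica_id)` encoding and the tuple tie-break are illustrative assumptions, not a standard:

```python
# LWW merge sketch: two concurrent writes, one silently vanishes.
# A register is a (value, timestamp, replica_id) tuple; ties on
# timestamp are broken deterministically by replica_id.
def merge(a, b):
    return a if (a[1], a[2]) >= (b[1], b[2]) else b

write_a = ("yes", 1000, "A")   # user A writes "yes" at t=1000
write_b = ("no",  1000, "B")   # user B writes "no" concurrently
merged = merge(write_a, write_b)
assert merged[0] == "no"       # B wins the replica-id tie-break;
                               # A's write is gone, and nobody is told
```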

OR-Set (Observed-Remove Set)

The OR-Set is where CRDTs start getting clever — and where the metadata overhead starts getting real.

The problem with sets is removal. Adding to a set is monotonic (the set grows). Removing is not (the set shrinks). And if one replica adds an element while another concurrently removes it, which operation wins?

The OR-Set’s answer: add wins over concurrent remove. More precisely, a remove only removes copies of the element that the removing replica has observed. If another replica concurrently adds the same element, that add is not removed because the removing replica had not observed it.

Structure ORSet:
    // Each element is tagged with a unique identifier
    // The set contains (element, unique_tag) pairs
    entries: Set<(Element, UniqueTag)>

Function Add(orset, element, replica_id):
    tag = GenerateUniqueTag(replica_id)  // e.g., (replica_id, counter++)
    orset.entries.add((element, tag))

Function Remove(orset, element):
    // Remove all entries with this element that we can currently see
    to_remove = {(e, tag) in orset.entries where e == element}
    orset.entries = orset.entries - to_remove

Function Contains(orset, element):
    return exists (e, tag) in orset.entries where e == element

Function Elements(orset):
    return {e for (e, tag) in orset.entries}

Function Merge(a: ORSet, b: ORSet):
    // Union of all entries
    // Entries that were removed on one side but not added on the other
    // stay removed. Entries that were independently added survive.
    return ORSet{entries: a.entries UNION b.entries}
    // NOTE: this simplified version doesn't handle removes correctly
    // The actual implementation needs to track removed tags separately

Wait — the merge function above is wrong, or at least incomplete. This is the part that papers tend to gloss over. The actual implementation needs to handle the case where one replica has removed an entry (e, tag) while another replica still has it. A simple union would resurrect removed elements.

The correct implementation typically uses one of these approaches:

  1. Tombstones: Keep a separate set of removed (element, tag) pairs. An element is in the set if it has at least one tag that is not in the tombstone set.

  2. Causal context tracking: Each replica tracks a causal context (a vector clock or similar), and the merge operation uses this context to determine whether an absence on one side means “never added” or “was added and then removed.”

// More realistic OR-Set with tombstones
Structure ORSetWithTombstones:
    entries: Set<(Element, UniqueTag)>     // added entries
    tombstones: Set<(Element, UniqueTag)>  // removed entries

Function Add(orset, element, replica_id):
    tag = GenerateUniqueTag(replica_id)
    orset.entries.add((element, tag))

Function Remove(orset, element):
    observed = {(e, tag) in orset.entries where e == element}
    orset.entries = orset.entries - observed
    orset.tombstones = orset.tombstones UNION observed

Function Contains(orset, element):
    active = orset.entries - orset.tombstones
    return exists (e, tag) in active where e == element

Function Merge(a: ORSetWithTombstones, b: ORSetWithTombstones):
    return ORSetWithTombstones{
        entries: (a.entries UNION b.entries),
        tombstones: (a.tombstones UNION b.tombstones)
    }

Function Elements(orset):
    active = orset.entries - orset.tombstones
    return {e for (e, tag) in active}

Now you can see the metadata problem: the tombstone set grows without bound. Every element that is ever removed leaves a tombstone that must be retained forever (or at least until all replicas have observed the removal and a garbage collection protocol has run). For a set with high churn — elements frequently added and removed — the tombstone set can dwarf the actual data.
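The add-wins behavior the tombstone design buys can be seen in a few lines of Python (tags and helper names are illustrative): replica A removes an element while replica B concurrently re-adds it, and the merge keeps B's copy because A's tombstone only covers the tag A had observed.

```python
# Add-wins demo: A removes "x" while B concurrently re-adds it.
# The tombstone only covers the tag A observed, so B's add survives.
def elements(entries, tombstones):
    return {e for (e, tag) in entries - tombstones}

shared = {("x", ("A", 1))}                 # both replicas start with this
a_entries, a_tombs = set(shared), set()
b_entries, b_tombs = set(shared), set()

a_tombs |= {("x", ("A", 1))}               # A removes the copy it observed
a_entries -= {("x", ("A", 1))}
b_entries |= {("x", ("B", 1))}             # B concurrently adds "x" again

entries = a_entries | b_entries            # state-based merge: union both
tombs = a_tombs | b_tombs
assert elements(entries, tombs) == {"x"}   # the unobserved add survives
```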

What Can You Actually Build With CRDTs?

This is the question that separates CRDT enthusiasts from CRDT practitioners. The theory is beautiful. The implementations are… educational. The question is what real systems you can build.

Things CRDTs Handle Well

Counters and accumulators. Page views, like counts, vote tallies (where the total only matters, not per-user votes), distributed metrics collection. G-Counters and PN-Counters are simple, efficient, and genuinely useful.

Grow-only sets. Event logs, “users who have visited this page,” sets of tags where removal is not needed. These are trivial CRDTs (set union is a join) and are widely used.

Last-writer-wins registers. User profile fields, configuration values, session data — anything where “most recent write wins” is an acceptable conflict resolution policy.

Collaborative text editing. This is the marquee use case. CRDTs like RGA (Replicated Growable Array), LSEQ, and the data structures behind Automerge and Yjs provide real-time collaborative editing without a central server. Each user’s edits are represented as CRDT operations (insert character at position, delete character at position), and all users converge to the same document.

This is genuinely impressive and practically useful. Google Docs uses OT (Operational Transformation), which requires a central server. CRDT-based editors like those built on Automerge or Yjs can work peer-to-peer. The trade-off is metadata overhead and complexity.

Things CRDTs Handle Poorly

Anything with constraints. “The counter must not go below zero.” “The set must contain at most 10 elements.” “These two fields must be consistent with each other.” CRDTs cannot enforce cross-replica invariants because enforcing invariants requires coordination — which is exactly what CRDTs are designed to avoid.

Total ordering of events. CRDTs provide causal ordering at best. If you need a total order (a global sequence of events that all replicas agree on), you need consensus. CRDTs and total order are fundamentally incompatible.

Transactions. “Transfer $100 from account A to account B” requires both the debit and credit to be atomic. This is a coordination problem. CRDTs for individual accounts (PN-Counters) will let the debit happen without the credit, or vice versa.

Authorization and access control. “User X is no longer an admin” needs to be enforced consistently across all replicas. If one replica has processed the revocation and another has not, the non-updated replica will still grant admin access. This is a consistency problem that CRDTs cannot solve.

Real-World Usage

Riak

Riak was one of the first production databases to offer CRDT data types. It supports counters, sets, maps, registers, and flags as first-class data types. Under the hood, it uses state-based CRDTs with delta propagation.

Riak’s experience validated the theory but also exposed the practical challenges. The metadata overhead for OR-Sets and maps was significant. Garbage collection of tombstones required careful coordination. And users frequently tried to use CRDTs for things they were not designed for, leading to subtle data integrity issues.

Riak is largely defunct now, but its CRDT implementation was a valuable proof-of-concept for the industry.

Redis CRDT Types

Redis Enterprise (the commercial version) includes CRDT-based data types for active-active geo-replication. Counters, strings (LWW), sets, sorted sets, and other Redis data types are replicated across datacenters using CRDT semantics.

The Redis implementation is pragmatic: it uses LWW for most things and provides “add wins” semantics for sets. It does not expose the full theoretical CRDT taxonomy to users — it just makes Redis data types work across datacenters without explicit conflict resolution.

Automerge and Yjs

Automerge and Yjs are JavaScript libraries for building collaborative applications. They use CRDTs (or CRDT-like structures) to enable real-time collaborative editing of JSON documents, text, and other data structures.

These libraries represent the state of the art in CRDT-based collaborative editing. They handle text insertion and deletion, rich text formatting, and nested data structures. The performance has improved dramatically in recent years (Automerge 2.0 was a ground-up rewrite focused on performance).

The metadata overhead is still significant. A document that is a few kilobytes of text can have megabytes of CRDT metadata, especially if it has a long editing history. Compaction helps but does not eliminate the problem.

// Simplified collaborative text editing with a CRDT
// Based on RGA (Replicated Growable Array) concepts

Structure TextCRDT:
    // Each character has a unique ID and a reference to the character it was inserted after
    characters: List<CharEntry>

Structure CharEntry:
    id: (ReplicaId, SequenceNumber)  // globally unique
    value: Character                  // the actual character (or TOMBSTONE)
    after: CharEntryId               // inserted after this character
    visible: Boolean                  // false if deleted

Function InsertAfter(text, position_id, character, replica_id):
    new_id = (replica_id, next_sequence_number++)
    entry = CharEntry{
        id: new_id,
        value: character,
        after: position_id,
        visible: true
    }
    // Insert into list at correct position
    // (after position_id, using ID ordering to break ties with concurrent inserts)
    InsertInOrder(text.characters, entry)

Function Delete(text, char_id):
    // Don't actually remove — mark as tombstone
    entry = FindEntry(text.characters, char_id)
    entry.visible = false

Function Render(text):
    return Concatenate(entry.value for entry in text.characters if entry.visible)

Function Merge(a: TextCRDT, b: TextCRDT):
    // Union of all character entries
    // Entries present in both: take visible = a.visible AND b.visible
    //   (once deleted, a character stays deleted; tombstones never resurrect)
    // Entries present in only one: include them
    // Position determined by 'after' references and tie-breaking
    result = TextCRDT{}
    all_entries = Union(a.characters, b.characters) by id
    for each id in all_entries:
        if id in a AND id in b:
            result.add(CharEntry{
                id: id,
                value: a[id].value,  // same in both
                after: a[id].after,  // same in both
                visible: a[id].visible AND b[id].visible
            })
        else if id in a:
            result.add(a[id])
        else:
            result.add(b[id])
    // Re-sort by insertion order using 'after' references
    TopologicalSort(result.characters)
    return result

The Garbage Collection Problem

Every CRDT implementation eventually confronts the garbage collection problem. CRDTs are monotonically growing data structures — that is how they achieve convergence. But monotonically growing state eventually exhausts memory.

Tombstones in OR-Sets, deleted characters in text CRDTs, old counter values that have been superseded — all of this metadata accumulates over time and must eventually be cleaned up.

The irony is profound: garbage collecting a CRDT requires coordination. You need all replicas to agree that a tombstone is no longer needed (because all replicas have observed the corresponding deletion). This agreement is, itself, a form of consensus.

// CRDT garbage collection requires coordination — the irony
Procedure GarbageCollect(orset):
    // Find tombstones that ALL replicas have observed
    // This requires knowing each replica's "state version"
    min_version = Min(replica_versions for all replicas)

    for each (element, tag) in orset.tombstones:
        if tag.version <= min_version:
            // All replicas have seen this tombstone
            // Safe to remove the tombstone AND the corresponding entry
            orset.tombstones.remove((element, tag))
            orset.entries.remove((element, tag))

    // BUT: how do we know min_version without consensus?
    // Option 1: Use a background consensus protocol (defeats the purpose?)
    // Option 2: Use vector clocks and gossip (eventually consistent GC)
    // Option 3: Accept unbounded growth (not practical)
    // Option 4: "Epoch-based" GC with a coordinator (practical but impure)

Most practical CRDT systems use option 4: a coordinator periodically snapshots the state, all replicas sync to the snapshot, and old metadata is discarded. This works well in practice but means your “coordination-free” data structure actually requires periodic coordination for maintenance.
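The "stable threshold" idea behind option 4 can be sketched in Python. This is a deliberately simplified scalar version (real systems use vector clocks, and all names here are illustrative): each replica gossips the highest tag version it has observed, and tombstones at or below the global minimum are safe to drop.

```python
# GC sketch: a tombstone can be discarded once every replica has
# observed it, i.e. its version is <= the minimum observed version.
# Simplified to scalar versions; real systems use vector clocks.
def gc(tombstones, observed_versions):
    # observed_versions: {replica_id: highest version that replica has seen}
    stable = min(observed_versions.values())
    return {(e, tag) for (e, tag) in tombstones if tag > stable}

tombs = {("a", 1), ("b", 3), ("c", 5)}
versions = {"r1": 4, "r2": 3, "r3": 6}    # r2 lags at version 3
assert gc(tombs, versions) == {("c", 5)}  # versions 1 and 3 are stable
```

Note what the `min` hides: learning every replica's observed version is itself a coordination step, which is exactly the irony the section describes.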

When CRDTs Are Moving the Problem, Not Solving It

The most important criticism of CRDTs is not that they do not work — they do, for what they are designed for. The criticism is that they are sometimes used to avoid solving a coordination problem that actually needs to be solved.

Example: Shopping cart. A user adds item A on their phone and removes item B on their laptop. With an OR-Set CRDT, both operations succeed locally and merge correctly. But what if item B is a pre-order with a price guarantee? The removal might need to trigger a refund, which requires coordination with the payment system. The CRDT handles the data structure, but the business logic still requires coordination.

Example: Distributed counter for inventory. You use a PN-Counter to track inventory across warehouses. Warehouse A decrements (sold an item), warehouse B decrements (sold an item). The counter goes to -1. Now what? You have oversold. The CRDT correctly tracked the decrements, but the invariant you needed (inventory >= 0) was violated because CRDTs cannot enforce it.

Example: User permissions. You use an OR-Set for a user’s roles. Admin A revokes the “admin” role. Concurrently, admin B adds the “super-admin” role. The OR-Set merges with “add wins” semantics: the user ends up with “super-admin” but not “admin.” Whether this is the correct outcome depends on your security model, and it is almost certainly not what either admin intended.

In all of these cases, the CRDT correctly implements its specification. The problem is that the specification does not capture the actual requirements. The coordination problem has not been eliminated — it has been pushed from the data structure layer to the application layer, where it is harder to see and harder to handle correctly.

CRDTs vs. Consensus: An Honest Comparison

| Requirement | CRDTs | Consensus (Paxos/Raft) |
|---|---|---|
| Availability during partition | Full (reads and writes) | Majority side only |
| Consistency model | Strong eventual consistency | Linearizability |
| Conflict resolution | Automatic (by data structure design) | Prevention (by serialization) |
| Invariant enforcement | Not possible | Possible |
| Total ordering | Not possible | Guaranteed |
| Latency | Local (no coordination) | Round trip to quorum |
| Throughput | Unlimited (local operations) | Bounded by leader/quorum |
| Metadata overhead | Significant (grows over time) | Minimal (log compaction) |
| Implementation complexity | Moderate (basic types) to very high (rich types) | High |
| Correctness reasoning | Algebraic (lattice properties) | Protocol-based (message ordering) |

The fundamental trade-off is: CRDTs give you availability and low latency at the cost of giving up strong consistency and invariant enforcement. Consensus gives you strong consistency and invariant enforcement at the cost of availability during partitions and higher latency.

This is not a matter of one being “better.” It is the CAP theorem in action. CRDTs choose AP (availability and partition tolerance). Consensus chooses CP (consistency and partition tolerance). You cannot have all three, and which two you choose depends on your application.

When to Use CRDTs (For Real)

After all the caveats, here is when CRDTs genuinely shine:

  1. Multi-datacenter replication with independent operation. If each datacenter must continue operating during a network partition, and eventual convergence after the partition heals is acceptable, CRDTs are the right tool.

  2. Collaborative editing. Real-time collaborative editing on documents, whiteboards, or other creative tools. The “add wins” semantics of CRDTs is usually the right conflict resolution for human-generated content.

  3. Metrics and counters. Distributed counters, histograms, and other aggregation metrics where approximate real-time values are useful and exact consistency is not required.

  4. Caching with automatic conflict resolution. Distributed caches that replicate across nodes, where LWW or similar semantics are acceptable for stale data.

  5. Offline-first applications. Mobile or edge applications that must work without connectivity and sync when a connection is available. CRDTs let the application function fully offline with guaranteed convergence on reconnection.

And here is when you should not use CRDTs, no matter how appealing they seem:

  1. Financial transactions. Money requires invariants (no negative balances, atomic transfers). CRDTs cannot help.

  2. User authentication and authorization. Security-critical operations require strong consistency. A revoked permission must be revoked everywhere, not “eventually.”

  3. Coordination problems. Leader election, distributed locking, total ordering — these are fundamentally coordination problems. CRDTs are coordination-free by design.

  4. Anything where “both concurrent operations succeed” is the wrong answer. If two users try to book the same hotel room, exactly one should succeed. CRDTs will happily let both succeed and leave you with an overbooked hotel.

A Final Perspective

CRDTs are not a replacement for consensus. They are an alternative for situations where consensus is unnecessary, unavailable, or too expensive. The mathematical elegance is real, and the practical applications — particularly in collaborative editing and geo-replicated systems — are significant.

But the hype around CRDTs has sometimes obscured a fundamental truth: most interesting distributed systems problems require coordination. CRDTs are a way to carve out the subset of problems that do not, and solve those problems well. For everything else, you still need the agony of consensus.

The most dangerous CRDT deployment is one where someone has convinced themselves that they do not need coordination, when in fact they do. The data structure will work perfectly. The application will be subtly, silently wrong. And unlike a consensus protocol that fails loud (timeout, leader election, unavailability), a CRDT that cannot enforce your invariants fails quiet.

Quiet failures are the worst kind.