Chapter 15: Distributed Databases and Replication
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport
Single-server databases have limits: storage capacity, query throughput, availability. Distributed databases spread data across multiple machines to overcome these limits—but at the cost of significant complexity.
15.1 Why Distribute?
Scaling Vertically vs Horizontally
Vertical Scaling (Scale Up):
┌───────────────────┐      ┌───────────────────┐
│   Small Server    │  →   │    Big Server     │
│ 4 cores, 16GB RAM │      │ 64 cores, 1TB RAM │
└───────────────────┘      └───────────────────┘
Limits: Hardware maxes out, cost increases exponentially
Horizontal Scaling (Scale Out):
┌───────────────────┐      ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│   Small Server    │  →   │Node1│ │Node2│ │Node3│ │Node4│
└───────────────────┘      └─────┘ └─────┘ └─────┘ └─────┘
Benefits: Linear cost scaling, no hardware ceiling
Challenge: Distributed systems are HARD
Reasons to Distribute
- Capacity: Data doesn’t fit on one machine
- Throughput: More machines = more queries/second
- Availability: Survive machine failures
- Latency: Place data closer to users geographically
15.2 Replication
Replication copies data to multiple nodes:
                          REPLICATION
┌─────────────────────────────────────────────────────────────┐
│                           Primary                           │
│                         ┌─────────┐                         │
│                         │  Data   │                         │
│                         └────┬────┘                         │
│                              │                              │
│                ┌─────────────┼─────────────┐                │
│                │             │             │                │
│                ▼             ▼             ▼                │
│           ┌─────────┐   ┌─────────┐   ┌─────────┐           │
│           │ Replica │   │ Replica │   │ Replica │           │
│           │   #1    │   │   #2    │   │   #3    │           │
│           └─────────┘   └─────────┘   └─────────┘           │
│                                                             │
│  Benefits:                                                  │
│  - Fault tolerance (survive node failure)                   │
│  - Read scalability (spread reads across replicas)          │
│  - Geographic distribution (low latency)                    │
└─────────────────────────────────────────────────────────────┘
Synchronous vs Asynchronous Replication
Synchronous:
┌──────────────────────────────────────────────────────────┐
│ 1. Client writes to primary                              │
│ 2. Primary sends to replica                              │
│ 3. Replica acknowledges                                  │
│ 4. Primary acknowledges to client                        │
│                                                          │
│ Pros: No data loss on failure                            │
│ Cons: Higher latency (wait for replica)                  │
└──────────────────────────────────────────────────────────┘
Asynchronous:
┌──────────────────────────────────────────────────────────┐
│ 1. Client writes to primary                              │
│ 2. Primary acknowledges to client immediately            │
│ 3. Primary sends to replica (background)                 │
│                                                          │
│ Pros: Lower latency                                      │
│ Cons: Potential data loss if primary fails               │
└──────────────────────────────────────────────────────────┘
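The two acknowledgment orders can be contrasted in a toy sketch (these function names are illustrative, not any real database's API; replication logs are plain Python lists):

```python
# Toy sketch of the two replication modes described above.

def replicate(replica_log, entry):
    """Ship one entry to a replica; returns True on acknowledgment."""
    replica_log.append(entry)
    return True

def write_sync(primary_log, replica_logs, entry):
    """Synchronous: acknowledge the client only after every replica acks."""
    primary_log.append(entry)
    for replica_log in replica_logs:
        if not replicate(replica_log, entry):
            raise RuntimeError("replica unreachable; write not acknowledged")
    return "ack"  # once the client sees this, no replica is missing the entry

def write_async(primary_log, replica_logs, entry, pending):
    """Asynchronous: acknowledge immediately, replicate in the background."""
    primary_log.append(entry)
    pending.append((replica_logs, entry))  # shipped later; lost if primary dies first
    return "ack"
```

The difference is only *when* the client's acknowledgment fires relative to replication, which is exactly the latency-vs-durability trade-off above.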
Leader-Based Replication
               LEADER-FOLLOWER

              ┌─────────────┐
              │   Leader    │ ← All writes go here
              │  (Primary)  │
              └──────┬──────┘
                     │ Replication stream (WAL)
       ┌─────────────┼─────────────┐
       │             │             │
       ▼             ▼             ▼
  ┌─────────┐   ┌─────────┐   ┌─────────┐
  │Follower │   │Follower │   │Follower │ ← Serve reads
  │   #1    │   │   #2    │   │   #3    │
  └─────────┘   └─────────┘   └─────────┘
PostgreSQL, MySQL, and most traditional databases use this model.
Leaderless Replication
                          LEADERLESS
┌─────────────────────────────────────────────────────────────┐
│                  Client writes to N nodes                   │
│                                                             │
│           ┌─────────┐   ┌─────────┐   ┌─────────┐           │
│           │  Node   │   │  Node   │   │  Node   │           │
│           │    A    │   │    B    │   │    C    │           │
│           └─────────┘   └─────────┘   └─────────┘           │
│                ▲             ▲             ▲                │
│                │             │             │                │
│                └─────────────┴─────────────┘                │
│                              │                              │
│                        Write to all                         │
│                 Wait for W acknowledgments                  │
│                                                             │
│  Quorum: W + R > N ensures read sees latest write           │
│  Example: N=3, W=2, R=2                                     │
└─────────────────────────────────────────────────────────────┘
Cassandra and DynamoDB use leaderless replication.
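The quorum condition W + R > N can be demonstrated with a toy Dynamo-style sketch (illustrative code, not Cassandra's or DynamoDB's API); each node stores key → (version, value), and the highest version wins on read:

```python
# Toy quorum reads/writes. W + R > N forces the read set and the write
# set to overlap in at least one node, so some reply carries the latest write.
N, W, R = 3, 2, 2
assert W + R > N

nodes = [{} for _ in range(N)]  # each node: key -> (version, value)

def write(key, value, version):
    acks = 0
    for node in nodes:           # send to all; later nodes here model laggards
        node[key] = (version, value)
        acks += 1
        if acks >= W:            # acknowledge the client after W confirmations
            return True
    return False

def read(key):
    # Contact R nodes - deliberately the *last* R, which may include a
    # node the last write never reached.
    replies = [node[key] for node in nodes[-R:] if key in node]
    return max(replies)[1] if replies else None
```

With N=3, W=2, R=2, even a read that lands on one stale node also lands on one up-to-date node, and the version comparison picks the latest value.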
15.3 Partitioning (Sharding)
Partitioning splits data across multiple nodes:
                    HORIZONTAL PARTITIONING

Users Table (10 million rows):
┌─────────────────────────────────────────────────────────────┐
│                    Partitioning Strategy                    │
│                     Hash(user_id) mod 4                     │
└─────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
   ┌─────────┐            ┌─────────┐            ┌─────────┐
   │ Shard 0 │            │ Shard 1 │            │ Shard 2 │  ...
   │  2.5M   │            │  2.5M   │            │  2.5M   │
   │  rows   │            │  rows   │            │  rows   │
   │         │            │         │            │         │
   │ user_id │            │ user_id │            │ user_id │
   │mod 4 = 0│            │mod 4 = 1│            │mod 4 = 2│
   └─────────┘            └─────────┘            └─────────┘
Partitioning Strategies
Hash Partitioning:
partition = hash(key) mod num_partitions
Pros: Even distribution
Cons: Can't do range queries efficiently
Range Partitioning:
user_id 1 to 1M       → Partition 1
user_id 1M+1 to 2M    → Partition 2
...
Pros: Range queries stay on one partition
Cons: Hot spots if data is skewed
Directory-Based Partitioning:
Lookup table maps keys to partitions
Maximum flexibility
But: Lookup table is a bottleneck
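The first two strategies can be sketched in a few lines of Python (helper names and the boundary values are hypothetical, chosen to match the range example above):

```python
import bisect
import hashlib

NUM_PARTITIONS = 4

def hash_partition(key):
    """Hash partitioning: even spread, but a range scan must hit every partition."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Upper bounds of partitions 0, 1, 2; anything larger falls in partition 3.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_partition(user_id):
    """Range partitioning: a contiguous id range lives on one partition."""
    return bisect.bisect_left(RANGE_BOUNDS, user_id)
```

Note the hash is applied to the key, not the raw integer, so sequential ids still spread evenly; that same scrambling is why range queries lose locality under hash partitioning.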
Cross-Partition Queries
-- Single partition (fast):
SELECT * FROM users WHERE user_id = 12345;
-- Routes to one shard
-- Cross-partition (slow):
SELECT * FROM users WHERE name = 'Alice';
-- Must query ALL shards, combine results
-- Aggregation (very slow):
SELECT COUNT(*) FROM users;
-- Query all shards, sum results
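A toy router makes the cost difference between these three queries concrete (in-memory dicts stand in for the shards; the function names are illustrative):

```python
# Toy shard router: single-partition lookup vs scatter-gather.
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # each shard: user_id -> row

def insert(row):
    shards[row["user_id"] % NUM_SHARDS][row["user_id"]] = row

def get_by_id(user_id):
    """Single-partition: the shard key routes to exactly one shard."""
    return shards[user_id % NUM_SHARDS].get(user_id)

def get_by_name(name):
    """Cross-partition: scatter to every shard, gather, combine."""
    return [row for shard in shards for row in shard.values()
            if row["name"] == name]

def count_all():
    """Aggregation: ask every shard for a partial count, then sum."""
    return sum(len(shard) for shard in shards)
```

`get_by_id` touches one shard regardless of cluster size; the other two touch all of them, so their latency is bounded by the slowest shard.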
15.4 The CAP Theorem
You can’t have all three:
               CAP THEOREM

              Consistency
                   /\
                  /  \
                 /    \
                /  CA  \
               /(RDBMS) \
              /__________\
             /\          /\
            /  \        /  \
           /    \      /    \
          /  CP  \    /  AP  \
         /(HBase) \  /(Cassandra)
        /__________\/__________\
   Partition               Availability
   Tolerance
Consistency: All nodes see the same data
Availability: Every request gets a response
Partition Tolerance: System works despite network failures
In a distributed system, network partitions WILL happen. So you must choose:
- CP: Consistent but may be unavailable during partition
- AP: Available but may return stale data during partition
CAP in Practice
PostgreSQL (single node): CA
- Consistent and available
- Not partition tolerant (single node!)
PostgreSQL with sync replication: CP
- Consistent (waits for replica)
- Unavailable if replica unreachable
Cassandra: AP
- Available (writes succeed)
- Eventually consistent (reads may be stale)
CockroachDB, Spanner: CP
- Consistent (distributed transactions)
- Unavailable if can't reach quorum
15.5 Consistency Models
Strong Consistency
Every read sees the most recent write:
Time:    T1      T2      T3
Write:   X=1
Read:            X=1     X=1
All reads after write see the new value
Requires coordination (slow)
Eventual Consistency
Reads may see old data, but eventually converge:
Time:     T1     T2     T3     T4
Write:    X=1
Read A:          X=0  (stale!)
Read B:                 X=1
Read C:                        X=1
Eventually all reads return 1
No coordination required (fast)
Read-Your-Writes Consistency
A user always sees their own writes:
User A writes X=1
User A reads X → sees 1 (guaranteed)
User B reads X → might see 0 (stale OK)
Common in session-scoped scenarios
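One common recipe for this guarantee (an assumption here, not the only design) is for each session to remember the log position (LSN) of its own last write and read from a replica only once the replica has replayed past that position:

```python
# Sketch of session-scoped read-your-writes over async replication.

class Node:
    """Toy store: a dict plus a log sequence number."""
    def __init__(self):
        self.data = {}
        self.lsn = 0            # position of the last write applied here
        self.replayed_lsn = 0   # (replicas) how far replication has caught up

    def apply(self, key, value):
        self.lsn += 1
        self.data[key] = value
        return self.lsn

    def get(self, key):
        return self.data.get(key)

class Session:
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_write_lsn = 0

    def write(self, key, value):
        self.last_write_lsn = self.primary.apply(key, value)

    def read(self, key):
        if self.replica.replayed_lsn >= self.last_write_lsn:
            return self.replica.get(key)  # replica has caught up: cheap read
        return self.primary.get(key)      # else fall back to the primary
```

Other sessions never consult this LSN, which is why they may still see stale data: the guarantee is deliberately per-user, not global.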
Causal Consistency
If A causes B, everyone sees them in that order:
User posts message, then replies to it
Viewers always see post before reply (never reply without post)
15.6 Distributed Transactions
Transactions spanning multiple nodes are hard:
Two-Phase Commit (2PC)
                  TWO-PHASE COMMIT

                   ┌─────────────┐
                   │ Coordinator │
                   └──────┬──────┘
                          │
                   Phase 1: PREPARE
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
      ┌───────┐       ┌───────┐       ┌───────┐
      │Node A │       │Node B │       │Node C │
      │       │       │       │       │       │
      │Prepare│       │Prepare│       │Prepare│
      │  OK   │       │  OK   │       │  OK   │
      └───┬───┘       └───┬───┘       └───┬───┘
          │               │               │
          └───────────────┴───────────────┘
                          │
                     All said OK
                          │
                   Phase 2: COMMIT
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
      ┌───────┐       ┌───────┐       ┌───────┐
      │Node A │       │Node B │       │Node C │
      │COMMIT │       │COMMIT │       │COMMIT │
      └───────┘       └───────┘       └───────┘
Problems with 2PC:
- Coordinator is single point of failure
- Blocking: if coordinator dies after prepare, nodes stuck
- High latency (multiple round trips)
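The vote-then-decide flow above can be sketched in a few lines (illustrative class names, not a real transaction library; real 2PC must also log each step durably so it can recover):

```python
# Minimal two-phase commit sketch: unanimous yes votes, or everyone aborts.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote yes only if this node can guarantee a commit."""
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision."""
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]          # Phase 1: PREPARE all
    decision = "committed" if all(votes) else "aborted"  # one 'no' aborts all
    for p in participants:                               # Phase 2: broadcast
        p.finish(decision)
    return decision
```

The blocking problem is visible here: a participant that has returned from `prepare()` holds its locks and cannot decide alone; if the coordinator crashes before the `finish` loop, every prepared node is stuck waiting.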
Consensus Protocols
Paxos and Raft provide fault-tolerant consensus:
RAFT CONSENSUS
Leader Election:
- Nodes vote for leader
- Leader handles all writes
- Followers replicate
Write:
1. Client sends to leader
2. Leader writes to log
3. Leader replicates to followers
4. Majority acknowledge
5. Leader commits
6. Leader responds to client
If leader fails: New election
Safety: Never lose committed writes
CockroachDB, etcd, and TiKV use Raft.
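The majority rule in steps 4-5 is the key contrast with 2PC and can be stated in one line (a sketch of the commit condition only, not Raft itself):

```python
# Majority-commit rule: the leader commits once MORE THAN HALF of the
# cluster (leader included) has appended the entry - not all nodes, as 2PC requires.
def try_commit(acks, cluster_size):
    """acks counts the leader plus followers that appended the entry."""
    return acks > cluster_size // 2
```

With 5 nodes, 3 acknowledgments suffice, so a committed write survives any 2 failures: any future leader must win votes from a majority, and any two majorities overlap in at least one node that holds the entry.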
15.7 Distributed SQL Databases
NewSQL databases provide distributed ACID:
CockroachDB
-- Looks like PostgreSQL
CREATE TABLE accounts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
balance DECIMAL NOT NULL
);
-- But data is distributed across nodes
-- Transactions work across shards!
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT; -- Distributed transaction, ACID guaranteed
Google Spanner
Global distribution with strong consistency:
TrueTime: GPS-synchronized clocks
Enables: Global consistent reads
Trade-off: Higher latency (cross-region coordination)
TiDB
MySQL-compatible distributed database:
                       TIDB ARCHITECTURE
┌─────────────────────────────────────────────────────────────┐
│                         TiDB Server                         │
│                       (SQL processing)                      │
└────────────────────────────┬────────────────────────────────┘
                             │
                ┌────────────┴────────────┐
                │                         │
          ┌─────┴─────┐             ┌─────┴─────┐
          │   TiKV    │             │  TiFlash  │
          │ (Row OLTP)│             │  (Column  │
          │           │             │   OLAP)   │
          └───────────┘             └───────────┘
HTAP: Same data accessible via row or column engine
15.8 NoSQL Approaches
Key-Value Stores
Redis, etcd, DynamoDB
Simple model: GET(key) → value, PUT(key, value)
Distribution: Hash key → partition
Replication: Each partition replicated
No joins, limited queries
But: Extremely scalable
Document Stores
MongoDB, CouchDB
Store JSON/BSON documents
Query within documents
Flexible schema
Distribution: By document ID or shard key
Wide-Column Stores
Cassandra, HBase, Bigtable
Row key → columns (sparse, dynamic)
┌─────────────────────────────────────────────────────────────┐
│  Row: user:12345                                            │
│    email: alice@example.com                                 │
│    profile:name: Alice                                      │
│    profile:city: NYC                                        │
│    settings:theme: dark                                     │
│    login:2024-03-15: 10:30:00                               │
│    login:2024-03-14: 09:15:00                               │
└─────────────────────────────────────────────────────────────┘
Columns are dynamic, rows can have different columns
15.9 Challenges in Distribution
Handling Failures
Network Partitions:
- Nodes can't communicate
- Must decide: consistency or availability
Node Failures:
- Detect via heartbeat/gossip
- Failover to replica
- Rebalance data
Data Center Failures:
- Entire DC goes offline
- Multi-DC replication essential
Consistency vs Performance
Synchronous replication:
- Write latency = max(replica latencies)
- WAN: 50-200ms
- Users notice!
Asynchronous replication:
- Write latency = primary only
- But: data loss on failure
Trade-off: How much data can you lose?
Clock Synchronization
Without synchronized clocks:
- Can't order events globally
- Transaction conflicts ambiguous
Solutions:
- Logical clocks (Lamport, Vector clocks)
- TrueTime (GPS + atomic clocks)
- Hybrid logical clocks (HLC)
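The first of these, the Lamport clock, fits in a short sketch: logical timestamps that order causally related events without any synchronized physical clocks (vector clocks extend the same idea to one counter per node):

```python
# Lamport clock: if event A causally precedes event B, then timestamp(A) < timestamp(B).
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current time."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The converse does not hold: a smaller timestamp does not imply causality, which is exactly the gap TrueTime and hybrid logical clocks address.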
15.10 Practical Considerations
When to Distribute
✓ Data exceeds single machine capacity
✓ Need higher availability than one server
✓ Read throughput exceeds single server
✓ Users are geographically distributed
✗ "Just in case" - Premature optimization
✗ Single server handles load fine
✗ Data fits on one machine
Starting Simple
Stage 1: Single PostgreSQL
- Handles more than you think
- Vertical scaling first
Stage 2: Read Replicas
- Primary for writes
- Replicas for reads
- Still relatively simple
Stage 3: Sharding
- When single primary is bottleneck
- Significant complexity increase
Stage 4: Distributed Database
- When you truly need it
- Or use managed service
Managed Services
AWS RDS: Managed PostgreSQL/MySQL with read replicas
AWS Aurora: Distributed storage, PostgreSQL/MySQL compatible
Google Cloud Spanner: Globally distributed, strong consistency
CockroachDB Serverless: Distributed PostgreSQL-like
PlanetScale: Distributed MySQL (Vitess)
MongoDB Atlas: Managed distributed MongoDB
Trade-offs:
+ No operational burden
+ Expertise baked in
- Less control
- Vendor lock-in
- Cost at scale
15.11 Summary
Distributed databases trade complexity for scalability:
Replication:
- Copies data for availability and read scaling
- Synchronous = consistent but slow
- Asynchronous = fast but potential data loss
Partitioning:
- Splits data across nodes for capacity
- Hash partitioning for even distribution
- Range partitioning for range queries
CAP Theorem:
- Choose consistency or availability during partitions
- Most systems are AP or CP
Distributed Transactions:
- 2PC works but has issues
- Consensus protocols (Raft) are more robust
Key Advice:
- Don’t distribute prematurely
- Start with single node + read replicas
- Use managed services when possible
- Understand your consistency requirements
What’s Next
Congratulations! You’ve completed the main content of “Database Internals.” The appendices provide a glossary of terms and recommendations for further reading.
“Distributed systems are not more complex because engineers like complexity. They’re more complex because the real world is messy, networks fail, and we still want things to work.”