Chapter 15: Distributed Databases and Replication
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport
Single-server databases have limits: storage capacity, query throughput, availability. Distributed databases spread data across multiple machines to overcome these limits—but at the cost of significant complexity.
15.1 Why Distribute?
Scaling Vertically vs Horizontally
Vertical Scaling (Scale Up):
┌───────────────────┐      ┌───────────────────┐
│   Small Server    │  →   │    Big Server     │
│ 4 cores, 16GB RAM │      │ 64 cores, 1TB RAM │
└───────────────────┘      └───────────────────┘
Limits: Hardware maxes out, cost increases exponentially
Horizontal Scaling (Scale Out):
┌───────────────────┐      ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│   Small Server    │  →   │Node1│ │Node2│ │Node3│ │Node4│
└───────────────────┘      └─────┘ └─────┘ └─────┘ └─────┘
Benefits: Linear cost scaling, no hardware ceiling
Challenge: Distributed systems are HARD
Reasons to Distribute
- Capacity: Data doesn’t fit on one machine
- Throughput: More machines = more queries/second
- Availability: Survive machine failures
- Latency: Place data closer to users geographically
15.2 Replication
Replication copies data to multiple nodes:
                          REPLICATION
┌─────────────────────────────────────────────────────────────┐
│                           Primary                           │
│                         ┌─────────┐                         │
│                         │  Data   │                         │
│                         └────┬────┘                         │
│                              │                              │
│                ┌─────────────┼─────────────┐                │
│                │             │             │                │
│                ▼             ▼             ▼                │
│           ┌─────────┐   ┌─────────┐   ┌─────────┐           │
│           │ Replica │   │ Replica │   │ Replica │           │
│           │   #1    │   │   #2    │   │   #3    │           │
│           └─────────┘   └─────────┘   └─────────┘           │
│                                                             │
│  Benefits:                                                  │
│  - Fault tolerance (survive node failure)                   │
│  - Read scalability (spread reads across replicas)          │
│  - Geographic distribution (low latency)                    │
└─────────────────────────────────────────────────────────────┘
Synchronous vs Asynchronous Replication
Synchronous:
┌──────────────────────────────────────────────────────────┐
│ 1. Client writes to primary                              │
│ 2. Primary sends to replica                              │
│ 3. Replica acknowledges                                  │
│ 4. Primary acknowledges to client                        │
│                                                          │
│ Pros: No data loss on failure                            │
│ Cons: Higher latency (wait for replica)                  │
└──────────────────────────────────────────────────────────┘
Asynchronous:
┌──────────────────────────────────────────────────────────┐
│ 1. Client writes to primary                              │
│ 2. Primary acknowledges to client immediately            │
│ 3. Primary sends to replica (background)                 │
│                                                          │
│ Pros: Lower latency                                      │
│ Cons: Potential data loss if primary fails               │
└──────────────────────────────────────────────────────────┘
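The two acknowledgment orders can be contrasted in a toy sketch (these function names are illustrative, not any real database's API; replication logs are plain Python lists):

```python
# Toy sketch of the two replication modes described above.

def replicate(replica_log, entry):
    """Ship one entry to a replica; returns True on acknowledgment."""
    replica_log.append(entry)
    return True

def write_sync(primary_log, replica_logs, entry):
    """Synchronous: acknowledge the client only after every replica acks."""
    primary_log.append(entry)
    for replica_log in replica_logs:
        if not replicate(replica_log, entry):
            raise RuntimeError("replica unreachable; write not acknowledged")
    return "ack"  # once the client sees this, no replica is missing the entry

def write_async(primary_log, replica_logs, entry, pending):
    """Asynchronous: acknowledge immediately, replicate in the background."""
    primary_log.append(entry)
    pending.append((replica_logs, entry))  # shipped later; lost if primary dies first
    return "ack"
```

The difference is only *when* the client's acknowledgment fires relative to replication, which is exactly the latency-vs-durability trade-off above.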
Leader-Based Replication
               LEADER-FOLLOWER

              ┌─────────────┐
              │   Leader    │ ← All writes go here
              │  (Primary)  │
              └──────┬──────┘
                     │ Replication stream (WAL)
       ┌─────────────┼─────────────┐
       │             │             │
       ▼             ▼             ▼
  ┌─────────┐   ┌─────────┐   ┌─────────┐
  │Follower │   │Follower │   │Follower │ ← Serve reads
  │   #1    │   │   #2    │   │   #3    │
  └─────────┘   └─────────┘   └─────────┘
PostgreSQL, MySQL, and most traditional databases use this model.
Leaderless Replication
                          LEADERLESS
┌─────────────────────────────────────────────────────────────┐
│                  Client writes to N nodes                   │
│                                                             │
│           ┌─────────┐   ┌─────────┐   ┌─────────┐           │
│           │  Node   │   │  Node   │   │  Node   │           │
│           │    A    │   │    B    │   │    C    │           │
│           └─────────┘   └─────────┘   └─────────┘           │
│                ▲             ▲             ▲                │
│                │             │             │                │
│                └─────────────┴─────────────┘                │
│                              │                              │
│                        Write to all                         │
│                 Wait for W acknowledgments                  │
│                                                             │
│  Quorum: W + R > N ensures read sees latest write           │
│  Example: N=3, W=2, R=2                                     │
└─────────────────────────────────────────────────────────────┘
Cassandra and DynamoDB use leaderless replication.
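The quorum condition W + R > N can be demonstrated with a toy Dynamo-style sketch (illustrative code, not Cassandra's or DynamoDB's API); each node stores key → (version, value), and the highest version wins on read:

```python
# Toy quorum reads/writes. W + R > N forces the read set and the write
# set to overlap in at least one node, so some reply carries the latest write.
N, W, R = 3, 2, 2
assert W + R > N

nodes = [{} for _ in range(N)]  # each node: key -> (version, value)

def write(key, value, version):
    acks = 0
    for node in nodes:           # send to all; later nodes here model laggards
        node[key] = (version, value)
        acks += 1
        if acks >= W:            # acknowledge the client after W confirmations
            return True
    return False

def read(key):
    # Contact R nodes - deliberately the *last* R, which may include a
    # node the last write never reached.
    replies = [node[key] for node in nodes[-R:] if key in node]
    return max(replies)[1] if replies else None
```

With N=3, W=2, R=2, even a read that lands on one stale node also lands on one up-to-date node, and the version comparison picks the latest value.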
15.3 Partitioning (Sharding)
Partitioning splits data across multiple nodes:
                    HORIZONTAL PARTITIONING

Users Table (10 million rows):
┌─────────────────────────────────────────────────────────────┐
│                    Partitioning Strategy                    │
│                     Hash(user_id) mod 4                     │
└─────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
        ▼                      ▼                      ▼
   ┌─────────┐            ┌─────────┐            ┌─────────┐
   │ Shard 0 │            │ Shard 1 │            │ Shard 2 │  ...
   │  2.5M   │            │  2.5M   │            │  2.5M   │
   │  rows   │            │  rows   │            │  rows   │
   │         │            │         │            │         │
   │ user_id │            │ user_id │            │ user_id │
   │mod 4 = 0│            │mod 4 = 1│            │mod 4 = 2│
   └─────────┘            └─────────┘            └─────────┘
Partitioning Strategies
Hash Partitioning:
partition = hash(key) mod num_partitions
Pros: Even distribution
Cons: Can't do range queries efficiently
Range Partitioning:
user_id 1 to 1M       → Partition 1
user_id 1M+1 to 2M    → Partition 2
...
Pros: Range queries stay on one partition
Cons: Hot spots if data is skewed
Directory-Based Partitioning:
Lookup table maps keys to partitions
Maximum flexibility
But: Lookup table is a bottleneck
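The first two strategies can be sketched in a few lines of Python (helper names and the boundary values are hypothetical, chosen to match the range example above):

```python
import bisect
import hashlib

NUM_PARTITIONS = 4

def hash_partition(key):
    """Hash partitioning: even spread, but a range scan must hit every partition."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Upper bounds of partitions 0, 1, 2; anything larger falls in partition 3.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_partition(user_id):
    """Range partitioning: a contiguous id range lives on one partition."""
    return bisect.bisect_left(RANGE_BOUNDS, user_id)
```

Note the hash is applied to the key, not the raw integer, so sequential ids still spread evenly; that same scrambling is why range queries lose locality under hash partitioning.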
Cross-Partition Queries
-- Single partition (fast):
SELECT * FROM users WHERE user_id = 12345;
-- Routes to one shard
-- Cross-partition (slow):
SELECT * FROM users WHERE name = 'Alice';
-- Must query ALL shards, combine results
-- Aggregation (very slow):
SELECT COUNT(*) FROM users;
-- Query all shards, sum results
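A toy router makes the cost difference between these three queries concrete (in-memory dicts stand in for the shards; the function names are illustrative):

```python
# Toy shard router: single-partition lookup vs scatter-gather.
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # each shard: user_id -> row

def insert(row):
    shards[row["user_id"] % NUM_SHARDS][row["user_id"]] = row

def get_by_id(user_id):
    """Single-partition: the shard key routes to exactly one shard."""
    return shards[user_id % NUM_SHARDS].get(user_id)

def get_by_name(name):
    """Cross-partition: scatter to every shard, gather, combine."""
    return [row for shard in shards for row in shard.values()
            if row["name"] == name]

def count_all():
    """Aggregation: ask every shard for a partial count, then sum."""
    return sum(len(shard) for shard in shards)
```

`get_by_id` touches one shard regardless of cluster size; the other two touch all of them, so their latency is bounded by the slowest shard.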
15.4 The CAP Theorem
You can’t have all three:
               CAP THEOREM

              Consistency
                   /\
                  /  \
                 /    \
                /  CA  \
               /(RDBMS) \
              /__________\
             /\          /\
            /  \        /  \
           /    \      /    \
          /  CP  \    /  AP  \
         /(HBase) \  /(Cassandra)
        /__________\/__________\
   Partition               Availability
   Tolerance
Consistency: All nodes see the same data
Availability: Every request gets a response
Partition Tolerance: System works despite network failures
In a distributed system, network partitions WILL happen. So you must choose:
- CP: Consistent but may be unavailable during partition
- AP: Available but may return stale data during partition
CAP in Practice
PostgreSQL (single node): CA
- Consistent and available
- Not partition tolerant (single node!)
PostgreSQL with sync replication: CP
- Consistent (waits for replica)
- Unavailable if replica unreachable
Cassandra: AP
- Available (writes succeed)
- Eventually consistent (reads may be stale)
CockroachDB, Spanner: CP
- Consistent (distributed transactions)
- Unavailable if can't reach quorum
15.5 Consistency Models
Strong Consistency
Every read sees the most recent write:
Time:    T1      T2      T3
Write:   X=1
Read:            X=1     X=1
All reads after write see the new value
Requires coordination (slow)
Eventual Consistency
Reads may see old data, but eventually converge:
Time:     T1     T2     T3     T4
Write:    X=1
Read A:          X=0  (stale!)
Read B:                 X=1
Read C:                        X=1
Eventually all reads return 1
No coordination required (fast)
Read-Your-Writes Consistency
A user always sees their own writes:
User A writes X=1
User A reads X → sees 1 (guaranteed)
User B reads X → might see 0 (stale OK)
Common in session-scoped scenarios
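One common recipe for this guarantee (an assumption here, not the only design) is for each session to remember the log position (LSN) of its own last write and read from a replica only once the replica has replayed past that position:

```python
# Sketch of session-scoped read-your-writes over async replication.

class Node:
    """Toy store: a dict plus a log sequence number."""
    def __init__(self):
        self.data = {}
        self.lsn = 0            # position of the last write applied here
        self.replayed_lsn = 0   # (replicas) how far replication has caught up

    def apply(self, key, value):
        self.lsn += 1
        self.data[key] = value
        return self.lsn

    def get(self, key):
        return self.data.get(key)

class Session:
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_write_lsn = 0

    def write(self, key, value):
        self.last_write_lsn = self.primary.apply(key, value)

    def read(self, key):
        if self.replica.replayed_lsn >= self.last_write_lsn:
            return self.replica.get(key)  # replica has caught up: cheap read
        return self.primary.get(key)      # else fall back to the primary
```

Other sessions never consult this LSN, which is why they may still see stale data: the guarantee is deliberately per-user, not global.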
Causal Consistency
If A causes B, everyone sees them in that order:
User posts message, then replies to it
Viewers always see post before reply (never reply without post)
15.6 Distributed Transactions
Transactions spanning multiple nodes are hard:
Two-Phase Commit (2PC)
                  TWO-PHASE COMMIT

                   ┌─────────────┐
                   │ Coordinator │
                   └──────┬──────┘
                          │
                   Phase 1: PREPARE
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
      ┌───────┐       ┌───────┐       ┌───────┐
      │Node A │       │Node B │       │Node C │
      │       │       │       │       │       │
      │Prepare│       │Prepare│       │Prepare│
      │  OK   │       │  OK   │       │  OK   │
      └───┬───┘       └───┬───┘       └───┬───┘
          │               │               │
          └───────────────┴───────────────┘
                          │
                     All said OK
                          │
                   Phase 2: COMMIT
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
      ┌───────┐       ┌───────┐       ┌───────┐
      │Node A │       │Node B │       │Node C │
      │COMMIT │       │COMMIT │       │COMMIT │
      └───────┘       └───────┘       └───────┘
Problems with 2PC:
- Coordinator is single point of failure
- Blocking: if coordinator dies after prepare, nodes stuck
- High latency (multiple round trips)
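The vote-then-decide flow above can be sketched in a few lines (illustrative class names, not a real transaction library; real 2PC must also log each step durably so it can recover):

```python
# Minimal two-phase commit sketch: unanimous yes votes, or everyone aborts.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote yes only if this node can guarantee a commit."""
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision."""
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]          # Phase 1: PREPARE all
    decision = "committed" if all(votes) else "aborted"  # one 'no' aborts all
    for p in participants:                               # Phase 2: broadcast
        p.finish(decision)
    return decision
```

The blocking problem is visible here: a participant that has returned from `prepare()` holds its locks and cannot decide alone; if the coordinator crashes before the `finish` loop, every prepared node is stuck waiting.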
Consensus Protocols
Paxos and Raft provide fault-tolerant consensus:
RAFT CONSENSUS
Leader Election:
- Nodes vote for leader
- Leader handles all writes
- Followers replicate
Write:
1. Client sends to leader
2. Leader writes to log
3. Leader replicates to followers
4. Majority acknowledge
5. Leader commits
6. Leader responds to client
If leader fails: New election
Safety: Never lose committed writes
CockroachDB, etcd, and TiKV use Raft.
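The majority rule in steps 4-5 is the key contrast with 2PC and can be stated in one line (a sketch of the commit condition only, not Raft itself):

```python
# Majority-commit rule: the leader commits once MORE THAN HALF of the
# cluster (leader included) has appended the entry - not all nodes, as 2PC requires.
def try_commit(acks, cluster_size):
    """acks counts the leader plus followers that appended the entry."""
    return acks > cluster_size // 2
```

With 5 nodes, 3 acknowledgments suffice, so a committed write survives any 2 failures: any future leader must win votes from a majority, and any two majorities overlap in at least one node that holds the entry.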
15.7 Distributed SQL Databases
NewSQL databases provide distributed ACID:
CockroachDB
-- Looks like PostgreSQL
CREATE TABLE accounts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
balance DECIMAL NOT NULL
);
-- But data is distributed across nodes
-- Transactions work across shards!
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT; -- Distributed transaction, ACID guaranteed
Google Spanner
Global distribution with strong consistency:
TrueTime: GPS-synchronized clocks
Enables: Global consistent reads
Trade-off: Higher latency (cross-region coordination)
TiDB
MySQL-compatible distributed database:
                       TIDB ARCHITECTURE
┌─────────────────────────────────────────────────────────────┐
│                         TiDB Server                         │
│                       (SQL processing)                      │
└────────────────────────────┬────────────────────────────────┘
                             │
                ┌────────────┴────────────┐
                │                         │
          ┌─────┴─────┐             ┌─────┴─────┐
          │   TiKV    │             │  TiFlash  │
          │ (Row OLTP)│             │  (Column  │
          │           │             │   OLAP)   │
          └───────────┘             └───────────┘
HTAP: Same data accessible via row or column engine
15.8 NoSQL Approaches
Key-Value Stores
Redis, etcd, DynamoDB
Simple model: GET(key) → value, PUT(key, value)
Distribution: Hash key → partition
Replication: Each partition replicated
No joins, limited queries
But: Extremely scalable
Document Stores
MongoDB, CouchDB
Store JSON/BSON documents
Query within documents
Flexible schema
Distribution: By document ID or shard key
Wide-Column Stores
Cassandra, HBase, Bigtable
Row key → columns (sparse, dynamic)
┌─────────────────────────────────────────────────────────────┐
│  Row: user:12345                                            │
│    email: alice@example.com                                 │
│    profile:name: Alice                                      │
│    profile:city: NYC                                        │
│    settings:theme: dark                                     │
│    login:2024-03-15: 10:30:00                               │
│    login:2024-03-14: 09:15:00                               │
└─────────────────────────────────────────────────────────────┘
Columns are dynamic, rows can have different columns
15.9 Challenges in Distribution
Handling Failures
Network Partitions:
- Nodes can't communicate
- Must decide: consistency or availability
Node Failures:
- Detect via heartbeat/gossip
- Failover to replica
- Rebalance data
Data Center Failures:
- Entire DC goes offline
- Multi-DC replication essential
Consistency vs Performance
Synchronous replication:
- Write latency = max(replica latencies)
- WAN: 50-200ms
- Users notice!
Asynchronous replication:
- Write latency = primary only
- But: data loss on failure
Trade-off: How much data can you lose?
Clock Synchronization
Without synchronized clocks:
- Can't order events globally
- Transaction conflicts ambiguous
Solutions:
- Logical clocks (Lamport, Vector clocks)
- TrueTime (GPS + atomic clocks)
- Hybrid logical clocks (HLC)
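The first of these, the Lamport clock, fits in a short sketch: logical timestamps that order causally related events without any synchronized physical clocks (vector clocks extend the same idea to one counter per node):

```python
# Lamport clock: if event A causally precedes event B, then timestamp(A) < timestamp(B).
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current time."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The converse does not hold: a smaller timestamp does not imply causality, which is exactly the gap TrueTime and hybrid logical clocks address.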
15.10 Practical Considerations
When to Distribute
✓ Data exceeds single machine capacity
✓ Need higher availability than one server
✓ Read throughput exceeds single server
✓ Users are geographically distributed
✗ "Just in case" - Premature optimization
✗ Single server handles load fine
✗ Data fits on one machine
Starting Simple
Stage 1: Single PostgreSQL
- Handles more than you think
- Vertical scaling first
Stage 2: Read Replicas
- Primary for writes
- Replicas for reads
- Still relatively simple
Stage 3: Sharding
- When single primary is bottleneck
- Significant complexity increase
Stage 4: Distributed Database
- When you truly need it
- Or use managed service
Managed Services
AWS RDS: Managed PostgreSQL/MySQL with read replicas
AWS Aurora: Distributed storage, PostgreSQL/MySQL compatible
Google Cloud Spanner: Globally distributed, strong consistency
CockroachDB Serverless: Distributed PostgreSQL-like
PlanetScale: Distributed MySQL (Vitess)
MongoDB Atlas: Managed distributed MongoDB
Trade-offs:
+ No operational burden
+ Expertise baked in
- Less control
- Vendor lock-in
- Cost at scale
15.11 Summary
Distributed databases trade complexity for scalability:
Replication:
- Copies data for availability and read scaling
- Synchronous = consistent but slow
- Asynchronous = fast but potential data loss
Partitioning:
- Splits data across nodes for capacity
- Hash partitioning for even distribution
- Range partitioning for range queries
CAP Theorem:
- Choose consistency or availability during partitions
- Most systems are AP or CP
Distributed Transactions:
- 2PC works but has issues
- Consensus protocols (Raft) are more robust
Key Advice:
- Don’t distribute prematurely
- Start with single node + read replicas
- Use managed services when possible
- Understand your consistency requirements
What’s Next
Congratulations! You’ve completed the main content of “Database Internals.” The appendices provide a glossary of terms and recommendations for further reading.
“Distributed systems are not more complex because engineers like complexity. They’re more complex because the real world is messy, networks fail, and we still want things to work.”