Chapter 5: LSM Trees and Write-Optimized Structures
“If you can’t make writes fast by updating in place, batch them up and sort later.”
B-trees are excellent for reads, but every insert requires finding the right page and modifying it—potentially triggering cascading splits. For write-heavy workloads, this random I/O becomes a bottleneck. LSM trees take a radically different approach: buffer writes in memory, flush them sequentially, and merge later.
5.1 The Write Amplification Problem
With B-trees, a single logical write may cause multiple physical writes:
INSERT one row:
1. Write to WAL (sequential, good)
2. Read B-tree page from disk (random I/O)
3. Modify page in memory
4. Eventually write page back (random I/O)
If page split occurs:
5. Create new page
6. Update parent page
7. Possibly cascade upward
One INSERT → Multiple random writes
Write amplification = (Bytes written to storage) / (Bytes of actual data)
B-trees may have write amplification of 10x or more, especially with many indexes.
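As a rough, hypothetical illustration of the formula above (the byte counts are invented, and real engines amortize page writes across many row changes at checkpoint time):

```python
# Hypothetical back-of-the-envelope estimate; real numbers depend on the engine,
# page fill, checkpoint frequency, and how many indexes each row touches.
row_bytes = 200                 # logical size of the inserted row
wal_bytes = 300                 # WAL record: row image plus headers
page_bytes = 8 * 1024           # each dirtied page is eventually written in full
pages_touched = 3               # heap page + two secondary-index leaf pages

bytes_written = wal_bytes + pages_touched * page_bytes
print(f"write amplification ~= {bytes_written / row_bytes:.0f}x")  # ~124x worst case;
# far less in practice, because one page write usually covers many row changes
```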
5.2 The LSM Tree Insight
The Log-Structured Merge-tree (LSM tree) was introduced by Patrick O’Neil and his co-authors in their 1996 paper. The core insight:
Transform random writes into sequential writes by buffering and batching.
LSM TREE ARCHITECTURE
┌─────────────────────────────────────────────────────────────┐
│ IN-MEMORY COMPONENT (Memtable) │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Balanced tree (red-black, skip list) ││
│ │ All recent writes go here ││
│ │ Sorted by key for fast lookups ││
│ └─────────────────────────────────────────────────────────┘│
└──────────────────────────┬──────────────────────────────────┘
│ When full, flush to disk
▼
┌─────────────────────────────────────────────────────────────┐
│ ON-DISK COMPONENTS (SSTables) │
│ ┌──────────────────┐ │
│ │ Level 0 (newest) │ ← Recently flushed memtables │
│ │ SSTable SSTable │ │
│ └──────────────────┘ │
│ │ Compaction merges into larger files │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Level 1 │ ← Larger, sorted, non-overlapping │
│ │ SSTable SSTable │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Level 2 │ ← Even larger │
│ │ SSTable SSTable SSTable │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Level N (oldest) │ ← Oldest data │
│ │ Large SSTables │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
5.3 The Memtable
The memtable is an in-memory sorted data structure holding recent writes:
Common Implementations
Skip list: Fast O(log n) insert and lookup; used by LevelDB and RocksDB (see the sketch after this list)
SKIP LIST STRUCTURE
Level 3: ────────────────────────►[50]─────────────────►NIL
Level 2: ────►[10]───────────────►[50]─────────►[80]───►NIL
Level 1: ────►[10]──►[25]────────►[50]──►[60]──►[80]───►NIL
Level 0: ────►[10]──►[25]──►[30]─►[50]──►[60]──►[80]──►[90]►NIL
Random level assignment gives O(log n) average performance
Red-black tree: Balanced BST, used by some implementations
Concurrent data structures: Lock-free or fine-grained locking for multi-threaded access
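To make the skip-list option concrete, here is a minimal, illustrative sketch (not LevelDB's or RocksDB's actual implementation, which are concurrent C++ structures): random tower heights keep keys sorted while giving O(log n) expected insert and lookup.

```python
import random

class Node:
    __slots__ = ("key", "value", "forward")
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level               # one next-pointer per level

class SkipList:
    """Minimal skip-list memtable sketch: sorted keys, O(log n) expected ops."""
    MAX_LEVEL, P = 16, 0.5

    def __init__(self):
        self.head = Node(None, None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1                                # coin flips decide the tower height
        return lvl

    def put(self, key, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):     # walk down from the top level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt and nxt.key == key:                  # overwrite: newest value wins
            nxt.value = value
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, value, lvl)
        for i in range(lvl):                        # splice the new node into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def get(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node and node.key == key else None

mt = SkipList()
for k, v in [("user:3", "c"), ("user:1", "a"), ("user:2", "b")]:
    mt.put(k, v)
print(mt.get("user:2"))   # -> "b"
```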
Memtable Lifecycle
1. Writes go to active memtable
2. When memtable reaches threshold (e.g., 64MB):
a. Make it immutable (read-only)
b. Create new active memtable for new writes
c. Flush immutable memtable to disk as SSTable
3. Delete flushed memtable from memory
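A single-threaded sketch of that lifecycle, assuming a hypothetical flush_fn callback that plays the role of "write an SSTable to disk" (real engines rotate under a mutex and flush on background threads):

```python
class MemtableManager:
    """Illustrative rotate-and-flush lifecycle; sizes and names are hypothetical."""

    def __init__(self, flush_threshold_bytes=64 * 1024 * 1024, flush_fn=None):
        self.threshold = flush_threshold_bytes
        self.flush_fn = flush_fn or (lambda table: None)  # stand-in for the SSTable writer
        self.active = {}            # stand-in for a sorted structure such as a skip list
        self.active_bytes = 0
        self.immutable = []         # frozen memtables waiting to be flushed

    def put(self, key, value):
        self.active[key] = value
        self.active_bytes += len(key) + len(value)
        if self.active_bytes >= self.threshold:
            self._rotate()

    def _rotate(self):
        # a. freeze the current memtable, b. start a fresh one for new writes
        self.immutable.append(self.active)
        self.active, self.active_bytes = {}, 0
        self._flush_oldest()        # c. done on a background thread in a real engine

    def _flush_oldest(self):
        frozen = self.immutable.pop(0)
        self.flush_fn(dict(sorted(frozen.items())))   # SSTables are written in key order
        # the frozen memtable (and its WAL segment) can now be dropped

flushed = []
mm = MemtableManager(flush_threshold_bytes=30, flush_fn=flushed.append)
for k, v in [("user:2", "bob"), ("user:1", "alice"), ("user:3", "carol")]:
    mm.put(k, v)
print(flushed)   # [{'user:1': 'alice', 'user:2': 'bob', 'user:3': 'carol'}] after the threshold is crossed
```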
Write-Ahead Log Protection
The memtable is volatile—a crash would lose recent writes. To ensure durability:
Write operation:
1. Append to WAL (sequential, durable)
2. Insert into memtable (in-memory)
3. Return success to client
On crash recovery:
1. Replay WAL to reconstruct memtable
2. Continue normal operation
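A minimal sketch of WAL-protected writes and crash recovery, using an invented record format (one JSON record per line); a real WAL adds checksums, sequence numbers, and group commit:

```python
import json, os

class WriteAheadLog:
    """Toy WAL sketch: append one JSON record per write, fsync before acknowledging."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())       # durable on disk before we return success

    def replay(self):
        """On crash recovery: rebuild the memtable from surviving log records."""
        memtable = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    memtable[rec["k"]] = rec["v"]   # later records overwrite earlier ones
        return memtable

wal = WriteAheadLog("memtable.wal")
memtable = wal.replay()                  # empty on a fresh start, repopulated after a crash
wal.append("user:1", "Alice")            # 1. durable in the WAL
memtable["user:1"] = "Alice"             # 2. visible in the memtable
# 3. only now do we report success to the client
```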
5.4 SSTables (Sorted String Tables)
When a memtable flushes, it becomes an SSTable—an immutable, sorted file:
SSTABLE STRUCTURE
┌─────────────────────────────────────────────────────────────┐
│ DATA BLOCKS │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ key1:value1 │ │ key100:value │ │ key200:value │ ... │
│ │ key2:value2 │ │ key101:value │ │ key201:value │ │
│ │ ... │ │ ... │ │ ... │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ INDEX BLOCK │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ key1 → offset 0 │ key100 → offset 4096 │ │
│ │ key200 → offset 8192 │ ... │ │
│ └───────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ BLOOM FILTER │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Probabilistic membership test (is key maybe here?) │ │
│ └───────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ FOOTER │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index offset │ Filter offset │ Checksum │ Version │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Properties
- Immutable: Once written, never modified
- Sorted: Keys in lexicographic order
- Indexed: Sparse index for efficient lookup
- Compressed: Often with block-level compression
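The sketch below writes and point-reads a toy SSTable. The on-disk layout here (JSON entries, a JSON sparse index, a 16-byte footer) is invented for illustration; real formats add restart points, compression, checksums, and the bloom filter described next.

```python
import json, struct

def write_sstable(path, sorted_items, block_size=4096):
    """Write sorted (key, value) pairs as data blocks + sparse index + footer."""
    index = []                                     # (first_key_in_block, block_offset)
    with open(path, "wb") as f:
        block, block_bytes, block_start, first_key = [], 0, 0, None

        def flush_block():
            nonlocal block, block_bytes, block_start, first_key
            if not block:
                return
            index.append((first_key, block_start))
            payload = b"".join(block)
            f.write(payload)
            block_start += len(payload)
            block, block_bytes, first_key = [], 0, None

        for key, value in sorted_items:            # caller supplies keys in sorted order
            entry = json.dumps([key, value]).encode() + b"\n"
            if first_key is None:
                first_key = key
            block.append(entry)
            block_bytes += len(entry)
            if block_bytes >= block_size:
                flush_block()
        flush_block()

        index_bytes = json.dumps(index).encode()
        f.write(index_bytes)
        f.write(struct.pack("<QQ", block_start, len(index_bytes)))   # footer

def sstable_get(path, key):
    """Point lookup: footer -> sparse index -> read only the one candidate block."""
    with open(path, "rb") as f:
        f.seek(-16, 2)                             # footer lives in the last 16 bytes
        index_offset, index_len = struct.unpack("<QQ", f.read(16))
        f.seek(index_offset)
        index = json.loads(f.read(index_len))      # [[first_key, offset], ...]
        candidates = [off for first, off in index if first <= key]
        if not candidates:
            return None
        start = candidates[-1]                     # last block whose first key <= key
        end = next((off for _, off in index if off > start), index_offset)
        f.seek(start)
        for line in f.read(end - start).splitlines():
            k, v = json.loads(line)
            if k == key:
                return v
    return None

write_sstable("sst_0001.db", sorted({"user:1": "Alice", "user:2": "Bob"}.items()))
print(sstable_get("sst_0001.db", "user:2"))        # -> "Bob"
```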
Bloom Filters
A bloom filter is a probabilistic data structure that quickly tells you if a key is definitely not in the SSTable:
Bloom Filter Query:
"Is key X in this SSTable?"
Bloom filter says NO → Definitely not there (skip this SSTable)
Bloom filter says YES → Might be there (need to check)
False positives possible, false negatives impossible
This saves disk reads by quickly eliminating SSTables that don’t contain the key.
Without bloom filter:
Check SSTable 1: Read index, read block → Not found
Check SSTable 2: Read index, read block → Not found
Check SSTable 3: Read index, read block → Found!
= 6+ disk reads
With bloom filter:
Check SSTable 1 bloom: NO → Skip
Check SSTable 2 bloom: NO → Skip
Check SSTable 3 bloom: MAYBE → Read index, read block → Found!
= 2 disk reads
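A minimal bloom filter sketch (deriving the k probe positions from SHA-256 via double hashing is purely illustrative; real engines use much cheaper hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash positions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, num_bits=8192, num_hashes=7):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # derive k positions from two hash values (double hashing)
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:1", "user:2", "user:3"):
    bf.add(k)
print(bf.might_contain("user:2"))   # True: it was added, so never a false negative
print(bf.might_contain("user:99"))  # almost certainly False -> skip this SSTable
```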
5.5 Read Path
Reading from an LSM tree requires checking multiple locations:
Read key K:
1. Check memtable (in-memory)
Found? → Return value
2. Check immutable memtable (if any)
Found? → Return value
3. Check Level 0 SSTables (newest first)
For each SSTable:
- Check bloom filter
- If maybe present, search SSTable
Found? → Return value
4. Check Level 1 SSTables
(Binary search since non-overlapping)
Found? → Return value
5. Check Level 2, 3, ... N
Found? → Return value
6. Not found
Read amplification: May need to check multiple SSTables for one key.
Worst case: Key doesn't exist
Check: memtable + immutable memtable +
all L0 SSTables +
one SSTable per level L1-LN
With 4 L0 files and four deeper levels (L1–L4): ~8 SSTables to consult on disk, plus the memtables in memory
This is the trade-off: writes are fast (sequential), but reads may check many files.
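Putting the read path into code, with plain dicts standing in for memtables and SSTables (bloom-filter and block-index lookups are elided; the example data is made up):

```python
import bisect

def lsm_get(key, memtable, immutable, l0_tables, levels):
    """Illustrative read path over stand-in tables; a real engine would consult
    bloom filters and block indexes before touching any SSTable."""
    # 1-2. newest data first: the active memtable, then the frozen one (if any)
    for table in (memtable, immutable):
        if table is not None and key in table:
            return table[key]
    # 3. L0 files may have overlapping key ranges, so scan them newest-first
    for table in reversed(l0_tables):
        if key in table:                       # a bloom-filter check would go here
            return table[key]
    # 4-5. L1+ files are non-overlapping: binary-search for the single candidate
    for level in levels:                       # each level: sorted [(min_key, max_key, table)]
        starts = [min_key for min_key, _, _ in level]
        i = bisect.bisect_right(starts, key) - 1
        if i >= 0:
            min_key, max_key, table = level[i]
            if min_key <= key <= max_key and key in table:
                return table[key]
    return None                                # 6. not found

memtable = {"user:4": "dana"}
l0 = [{"user:2": "bob"}, {"user:2": "bobby"}]               # second file is newer
l1 = [("user:0", "user:5", {"user:1": "alice", "user:3": "carol"})]
print(lsm_get("user:2", memtable, None, l0, [l1]))           # "bobby": newest version wins
print(lsm_get("user:1", memtable, None, l0, [l1]))           # "alice": found in L1
```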
5.6 Compaction
Compaction is the process of merging SSTables to:
- Reduce read amplification (fewer files to check)
- Reclaim space from deleted/overwritten keys
- Reorganize data for efficient access
Size-Tiered Compaction
Group SSTables of similar size, merge when too many:
Size-Tiered Compaction:
Before:
Small: [S1] [S2] [S3] [S4] ← 4 small files, time to merge
Medium: [M1] [M2]
Large: [L1]
After merging the small tier:
Small: (empty)
Medium: [M1] [M2] [M3] ← M3 is the new file from merging S1+S2+S3+S4
Large: [L1]
Pros: Simple, good write amplification
Cons: Space amplification (temporary doubling), unpredictable read performance
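A toy version of the size-tiered trigger; the 64 MB base tier, the size ratio, and the four-file threshold are illustrative defaults, not any particular engine's:

```python
from collections import defaultdict

def pick_size_tiered_merge(sstable_sizes, base_bytes=64 * 1024 * 1024,
                           bucket_ratio=2.0, min_threshold=4):
    """Group SSTables into size tiers; once a tier collects min_threshold files,
    propose merging them into one larger file in the next tier."""
    buckets = defaultdict(list)
    for name, size in sstable_sizes.items():
        tier = 0
        while size > base_bytes * (bucket_ratio ** tier):
            tier += 1                               # bigger files land in higher tiers
        buckets[tier].append(name)
    for tier in sorted(buckets):
        if len(buckets[tier]) >= min_threshold:
            return buckets[tier]                    # candidates for one merge
    return None                                     # nothing to do yet

sizes = {"s1": 60e6, "s2": 58e6, "s3": 63e6, "s4": 61e6, "m1": 250e6}
print(pick_size_tiered_merge(sizes))                # ['s1', 's2', 's3', 's4']
```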
Leveled Compaction
Organize SSTables into levels with size ratios:
Leveled Compaction:
Level 0: [SST] [SST] [SST] (overlapping ranges, flush destination)
Level 1: [a-d] [e-h] [i-l] (non-overlapping, 10MB each)
Level 2: [a-b][c-d][e-f]... (non-overlapping, 10x larger total)
Level 3: [a][b][c][d]... (non-overlapping, 10x larger total)
Compaction: Pick L0 file, merge with overlapping L1 files
Level size ratio = 10 (typical)
Level 0: ~10MB total
Level 1: ~100MB total
Level 2: ~1GB total
Level 3: ~10GB total
Level 4: ~100GB total
90% of data is in the last level
Pros: Better read performance, bounded space amplification
Cons: Higher write amplification (data rewritten ~10x per level)
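The core leveled-compaction step, picking which L1 files an L0 file must be merged with, comes down to a key-range overlap check; the file names and ranges below are made up:

```python
def overlapping_l1_files(l0_file, l1_files):
    """Given an L0 file's key range, return the non-overlapping L1 files it must
    be merged with. Files are (min_key, max_key, name) tuples."""
    lo, hi, _ = l0_file
    return [f for f in l1_files if not (f[1] < lo or f[0] > hi)]

l0_file = ("c", "k", "L0-007")
l1_files = [("a", "d", "L1-001"), ("e", "h", "L1-002"),
            ("i", "l", "L1-003"), ("m", "p", "L1-004")]
print([name for _, _, name in overlapping_l1_files(l0_file, l1_files)])
# ['L1-001', 'L1-002', 'L1-003'] -> merge with L0-007, then write fresh non-overlapping L1 files
```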
Compaction Trade-offs
| | Size-Tiered | Leveled |
|---|---|---|
| Write amplification | Lower | Higher |
| Read amplification | Higher | Lower |
| Space amplification | Higher | Lower |
| Best for | Write-heavy | Read-heavy |
5.7 Handling Updates and Deletes
Updates
LSM trees don’t update in place. A new version is simply written:
Time T1: Write key="user:1", value="Alice"
Time T2: Write key="user:1", value="Alicia" (update)
Both versions exist in different SSTables!
Read sees newest version (T2): "Alicia"
Compaction eventually removes T1 version
Deletes: Tombstones
Deleting a key writes a special tombstone marker:
Time T1: Write key="user:1", value="Alice"
Time T2: Delete key="user:1" → Write tombstone
SSTable 1: [user:1 = "Alice"]
SSTable 2: [user:1 = TOMBSTONE]
Read checks SSTable 2 first, sees tombstone → Key is deleted
Tombstones persist until compaction merges all SSTables containing the key:
Compaction merges SSTable 1 and 2:
See "Alice" at T1
See TOMBSTONE at T2
T2 > T1, so tombstone wins
Result: Key is dropped entirely
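A simplified compaction merge (plain dicts instead of a streaming k-way merge): tables are given oldest-first, the newest version of each key wins, and tombstones are only dropped once the merge reaches the bottom level, so no older copy can resurface from a deeper level.

```python
TOMBSTONE = "__tombstone__"        # stand-in delete marker

def compact(tables, is_bottom_level=False):
    """Merge SSTables given oldest-first; the newest version of each key wins."""
    merged = {}
    for table in tables:                         # oldest -> newest
        merged.update(table)                     # newer entries overwrite older ones
    if is_bottom_level:
        # safe to drop tombstones: nothing older exists in any deeper level
        merged = {k: v for k, v in merged.items() if v != TOMBSTONE}
    return dict(sorted(merged.items()))          # written back in key order

older = {"user:1": "Alice", "user:2": "Bob"}
newer = {"user:1": TOMBSTONE, "user:3": "Cara"}
print(compact([older, newer]))                        # tombstone kept: not the bottom level yet
print(compact([older, newer], is_bottom_level=True))  # user:1 disappears for good
```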
Tombstone Accumulation
Deleted keys temporarily increase space usage:
Heavy delete workload:
Original data: 10GB
Delete 5GB worth of data
Until compaction:
- Original data still on disk: 10GB
- Plus tombstones: ~100MB
- Total: ~10.1GB (not 5GB!)
After full compaction: 5GB
5.8 Write Path
The complete write flow:
Write(key, value):
┌─────────────────────────────────────────────────────────────┐
│ 1. APPEND TO WAL │
│ [LSN: 1001 | key | value | checksum] │
│ Sequential write, fsync for durability │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. INSERT INTO MEMTABLE │
│ Skip list insert: O(log n) │
│ In-memory only │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. RETURN SUCCESS │
│ Client sees success once WAL is durable │
└─────────────────────────────────────────────────────────────┘
│
(Background, asynchronous)
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. MEMTABLE FLUSH (when threshold reached) │
│ - Make memtable immutable │
│ - Create new memtable for writes │
│ - Write SSTable to disk (sequential) │
│ - Build bloom filter and index │
│ - Delete old WAL segments │
└─────────────────────────────────────────────────────────────┘
│
(Background, asynchronous)
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. COMPACTION (when thresholds reached) │
│ - Merge SSTables │
│ - Remove obsolete versions │
│ - Remove tombstones (when safe) │
│ - Delete old SSTables │
└─────────────────────────────────────────────────────────────┘
The key insight: All disk writes are sequential. The only random I/O is during reads.
5.9 Comparing B-Trees and LSM Trees
Write Performance
B-tree write (random I/O):
Read page → Modify → Write page
~2 random I/Os per write
LSM write (sequential I/O):
Append to WAL → Insert memtable → Done
1 sequential I/O per write (amortized)
LSM trees win on write throughput. The gap is largest on HDDs, where random I/O dominates; on SSDs the gap narrows, but batched sequential writes still help.
Read Performance
B-tree read (one location):
Navigate tree → Read leaf → Done
~3-4 I/Os
LSM read (multiple locations):
Check memtable → Check each level → Maybe found
Varies: 1 to many I/Os
B-tree wins on point reads, especially for keys that exist.
Space Amplification
B-tree:
Data + indexes
~1.5-2x raw data size
LSM (before compaction):
Multiple versions, tombstones
Can be 2-10x temporarily
LSM (after compaction):
~1.1-1.2x raw data size
LSM has variable space usage, depending on compaction state.
Summary Table
| Factor | B-tree | LSM Tree |
|---|---|---|
| Write throughput | Good | Excellent |
| Point read latency | Excellent | Good |
| Range scan | Good | Good |
| Space efficiency | Good | Variable |
| Write amplification | Medium | Low-High |
| Read amplification | Low | Medium-High |
| Predictable latency | Yes | Less so |
5.10 LSM Trees in Practice
RocksDB
Facebook’s embedded storage engine, widely used:
Used by:
- MySQL (MyRocks storage engine)
- CockroachDB
- TiKV (TiDB)
- Apache Flink (state backend)
- Many more
Key features:
- Leveled and size-tiered compaction options
- Column families (multiple LSM trees)
- Rate limiting for compaction
- Block cache and bloom filters
LevelDB
Google’s original embedded LSM implementation:
- Simpler than RocksDB
- Single-threaded compaction
- Foundation for many derivatives
Cassandra
Distributed database using LSM trees:
- Size-tiered compaction by default
- Leveled compaction option
- Time-window compaction for time-series data
HBase
Hadoop-based distributed database:
- LSM-tree storage (its SSTable-equivalent files are called “HFiles”)
- Automatic region splitting
- Bloom filters and block cache
5.11 Tuning LSM Trees
Memtable Size
Larger memtable:
- More writes batched before flush
- Better write throughput
- More memory usage
- Longer recovery time (more WAL to replay)
# RocksDB
write_buffer_size = 64MB
max_write_buffer_number = 3
Level Sizes and Ratios
# Typical configuration:
Level 0: 4 files max before compaction
Level 1: 256MB
Level multiplier: 10
Result:
L0: ~256MB (4 × 64MB memtables)
L1: 256MB
L2: 2.56GB
L3: 25.6GB
L4: 256GB
Bloom Filter Configuration
More bits per key = lower false positive rate = fewer unnecessary reads:
| Bits per key | False positive rate |
|---|---|
| 8 | 2% |
| 10 | 1% |
| 12 | 0.3% |
| 14 | 0.1% |
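Those rates follow from the standard bloom-filter approximation: with the optimal number of hash functions k = (m/n)·ln 2, the false-positive rate is roughly p ≈ exp(−(ln 2)² · m/n), where m/n is bits per key. A quick check:

```python
import math

def bloom_false_positive_rate(bits_per_key: float) -> float:
    # p ~= exp(-(ln 2)^2 * bits_per_key), assuming the optimal hash-function count
    return math.exp(-(math.log(2) ** 2) * bits_per_key)

for bits in (8, 10, 12, 14):
    print(f"{bits:>2} bits/key -> ~{bloom_false_positive_rate(bits):.2%}")
# ~2.1%, ~0.8%, ~0.3%, ~0.1% -- in line with the table above
```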
Compaction Throttling
Limit compaction I/O to avoid interfering with foreground operations:
# RocksDB rate limiter
rate_limiter = 100MB/s # Cap compaction I/O
5.12 Summary
LSM trees optimize for writes by trading off reads:
- Memtable: In-memory buffer for recent writes
- SSTables: Immutable, sorted files on disk
- Compaction: Merges files to reduce read amplification
- Bloom filters: Skip files that definitely don’t have the key
- Tombstones: Logical deletes until compaction
LSM trees excel for:
- Write-heavy workloads
- Time-series data
- Log aggregation
- Situations where sequential I/O matters
B-trees are better for:
- Read-heavy workloads
- Predictable latency requirements
- Frequent updates to existing data
Many modern databases offer both or hybrid approaches.
What’s Next
In Chapter 6, we’ll explore hash indexes and other specialized index structures—bloom filters in depth, bitmap indexes, and when to use alternatives to B-trees and LSM trees.
“Every write is expensive eventually—the question is whether you pay now (B-tree) or later (LSM tree).”