Chapter 5: LSM Trees and Write-Optimized Structures

“If you can’t make writes fast by updating in place, batch them up and sort later.”

B-trees are excellent for reads, but every insert requires finding the right page and modifying it—potentially triggering cascading splits. For write-heavy workloads, this random I/O becomes a bottleneck. LSM trees take a radically different approach: buffer writes in memory, flush them sequentially, and merge later.


5.1 The Write Amplification Problem

With B-trees, a single logical write may cause multiple physical writes:

    INSERT one row:
    1. Write to WAL (sequential, good)
    2. Read B-tree page from disk (random I/O)
    3. Modify page in memory
    4. Eventually write page back (random I/O)

    If page split occurs:
    5. Create new page
    6. Update parent page
    7. Possibly cascade upward

    One INSERT → Multiple random writes

Write amplification = (Bytes written to storage) / (Bytes of actual data)

B-trees may have write amplification of 10x or more, especially with many indexes.
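To make the ratio concrete, here is a rough back-of-the-envelope calculation; the row size, WAL overhead, and page size below are illustrative assumptions, not measurements:

    # Illustrative numbers: a 200-byte row inserted into a B-tree with 8 KB pages.
    row_bytes = 200
    wal_record_bytes = 240          # row plus assumed log header and checksum
    page_bytes = 8192               # the dirtied leaf page is written back in full

    # Worst case: the page holds only this one change when it is written back.
    worst_case = (wal_record_bytes + page_bytes) / row_bytes                 # ~42x

    # Friendlier case: 20 inserts land on the same page before write-back.
    amortized = (20 * wal_record_bytes + page_bytes) / (20 * row_bytes)      # ~3.2x

    print(f"worst case ~{worst_case:.0f}x, amortized ~{amortized:.1f}x")

Each secondary index maintains its own tree and repeats the same pattern, which is how the 10x-and-up figures arise in practice.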


5.2 The LSM Tree Insight

The Log-Structured Merge-tree (LSM tree) was introduced by Patrick O’Neil and colleagues in 1996. The core insight:

Transform random writes into sequential writes by buffering and batching.

                    LSM TREE ARCHITECTURE

    ┌─────────────────────────────────────────────────────────────┐
    │  IN-MEMORY COMPONENT (Memtable)                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │  Balanced tree (red-black, skip list)                   ││
    │  │  All recent writes go here                               ││
    │  │  Sorted by key for fast lookups                          ││
    │  └─────────────────────────────────────────────────────────┘│
    └──────────────────────────┬──────────────────────────────────┘
                               │ When full, flush to disk
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │  ON-DISK COMPONENTS (SSTables)                               │
    │  ┌──────────────────┐                                       │
    │  │ Level 0 (newest) │  ← Recently flushed memtables         │
    │  │ SSTable SSTable  │                                       │
    │  └──────────────────┘                                       │
    │          │ Compaction merges into larger files               │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level 1          │  ← Larger, sorted, non-overlapping    │
    │  │ SSTable SSTable  │                                       │
    │  └──────────────────┘                                       │
    │          │                                                   │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level 2          │  ← Even larger                        │
    │  │ SSTable SSTable SSTable │                                │
    │  └──────────────────┘                                       │
    │          │                                                   │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level N (oldest) │  ← Oldest data                        │
    │  │ Large SSTables   │                                       │
    │  └──────────────────┘                                       │
    └─────────────────────────────────────────────────────────────┘

5.3 The Memtable

The memtable is an in-memory sorted data structure holding recent writes:

Common Implementations

Skip list: Fast O(log n) insert and lookup, used by LevelDB and RocksDB (see the sketch after this list)

    SKIP LIST STRUCTURE

    Level 3:    ────────────────────────►[50]─────────────────►NIL
    Level 2:    ────►[10]───────────────►[50]─────────►[80]───►NIL
    Level 1:    ────►[10]──►[25]────────►[50]──►[60]──►[80]───►NIL
    Level 0:    ────►[10]──►[25]──►[30]─►[50]──►[60]──►[80]──►[90]►NIL

    Random level assignment gives O(log n) average performance

Red-black tree: Balanced BST, used by some implementations

Concurrent data structures: Lock-free or fine-grained locking for multi-threaded access
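A compact, single-threaded skip list sketch follows; MAX_LEVEL and the promotion probability are illustrative defaults, not taken from any particular engine:

    import random

    class SkipListNode:
        def __init__(self, key, value, level):
            self.key = key
            self.value = value
            self.forward = [None] * level      # forward[i] = next node on level i

    class SkipList:
        MAX_LEVEL = 16                         # illustrative cap
        P = 0.5                                # chance of promoting a node one level up

        def __init__(self):
            self.head = SkipListNode(None, None, self.MAX_LEVEL)
            self.level = 1

        def _random_level(self):
            level = 1
            while random.random() < self.P and level < self.MAX_LEVEL:
                level += 1
            return level

        def search(self, key):
            node = self.head
            for i in reversed(range(self.level)):
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]     # skip ahead on the higher levels
            node = node.forward[0]
            return node.value if node is not None and node.key == key else None

        def insert(self, key, value):
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for i in reversed(range(self.level)):
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node               # last node before `key` on level i
            node = node.forward[0]
            if node is not None and node.key == key:
                node.value = value             # overwrite: the newest value wins
                return
            new_level = self._random_level()
            self.level = max(self.level, new_level)
            new_node = SkipListNode(key, value, new_level)
            for i in range(new_level):
                new_node.forward[i] = update[i].forward[i]
                update[i].forward[i] = new_node

    sl = SkipList()
    sl.insert("user:1", "Alice")
    sl.insert("user:1", "Alicia")              # an update overwrites in the memtable
    print(sl.search("user:1"))                 # Alicia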

Memtable Lifecycle

    1. Writes go to active memtable
    2. When memtable reaches threshold (e.g., 64MB):
       a. Make it immutable (read-only)
       b. Create new active memtable for new writes
       c. Flush immutable memtable to disk as SSTable
    3. Delete flushed memtable from memory
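A minimal sketch of this lifecycle, using a plain sorted list instead of a skip list so the flush logic stays in focus; the 64 MB threshold and the method names are illustrative:

    import bisect

    class Memtable:
        def __init__(self, threshold_bytes=64 * 1024 * 1024):
            self.keys = []                     # kept sorted for an in-order flush
            self.entries = {}                  # key -> latest value
            self.bytes_used = 0
            self.threshold = threshold_bytes

        def put(self, key, value):
            if key not in self.entries:
                bisect.insort(self.keys, key)
            self.entries[key] = value
            self.bytes_used += len(key) + len(value)

        def get(self, key):
            return self.entries.get(key)

        def is_full(self):
            return self.bytes_used >= self.threshold

        def sorted_items(self):
            # Entries in key order: exactly what an SSTable flush consumes.
            return ((k, self.entries[k]) for k in self.keys)

    def maybe_flush(active, write_sstable_fn):
        """Steps 2-3 above, done synchronously to keep the sketch short;
        real engines flush the sealed memtable in a background thread."""
        if not active.is_full():
            return active
        sealed = active                         # 2a. now treated as read-only
        fresh = Memtable(active.threshold)      # 2b. new writes go here
        write_sstable_fn(sealed.sorted_items()) # 2c. flush to disk as an SSTable
        return fresh                            # 3. the sealed memtable is dropped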

Write-Ahead Log Protection

The memtable is volatile—a crash would lose recent writes. To ensure durability:

    Write operation:
    1. Append to WAL (sequential, durable)
    2. Insert into memtable (in-memory)
    3. Return success to client

    On crash recovery:
    1. Replay WAL to reconstruct memtable
    2. Continue normal operation
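The WAL side can be sketched with a simple length-prefixed record format; real logs add checksums, sequence numbers, and segment rotation:

    import os

    def wal_append(wal_file, key: bytes, value: bytes):
        # Length-prefixed record; fsync before acknowledging the client.
        record = (len(key).to_bytes(4, "big") + key +
                  len(value).to_bytes(4, "big") + value)
        wal_file.write(record)
        wal_file.flush()
        os.fsync(wal_file.fileno())

    def wal_replay(path):
        """Yield (key, value) pairs so the memtable can be rebuilt after a crash."""
        with open(path, "rb") as f:
            data = f.read()
        i = 0
        while i + 4 <= len(data):
            klen = int.from_bytes(data[i:i + 4], "big")
            if i + 4 + klen + 4 > len(data):
                break                           # torn record at the tail: stop replay
            key = data[i + 4:i + 4 + klen]
            j = i + 4 + klen
            vlen = int.from_bytes(data[j:j + 4], "big")
            if j + 4 + vlen > len(data):
                break
            value = data[j + 4:j + 4 + vlen]
            yield key, value
            i = j + 4 + vlen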

5.4 SSTables (Sorted String Tables)

When a memtable flushes, it becomes an SSTable—an immutable, sorted file:

                    SSTABLE STRUCTURE

    ┌─────────────────────────────────────────────────────────────┐
    │  DATA BLOCKS                                                 │
    │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
    │  │ key1:value1   │ │ key100:value  │ │ key200:value  │ ...  │
    │  │ key2:value2   │ │ key101:value  │ │ key201:value  │      │
    │  │ ...           │ │ ...           │ │ ...           │      │
    │  └───────────────┘ └───────────────┘ └───────────────┘      │
    ├─────────────────────────────────────────────────────────────┤
    │  INDEX BLOCK                                                 │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ key1 → offset 0     │ key100 → offset 4096           │  │
    │  │ key200 → offset 8192 │ ...                            │  │
    │  └───────────────────────────────────────────────────────┘  │
    ├─────────────────────────────────────────────────────────────┤
    │  BLOOM FILTER                                                │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ Probabilistic membership test (is key maybe here?)    │  │
    │  └───────────────────────────────────────────────────────┘  │
    ├─────────────────────────────────────────────────────────────┤
    │  FOOTER                                                      │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ Index offset │ Filter offset │ Checksum │ Version    │  │
    │  └───────────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────────┘

Key Properties

  1. Immutable: Once written, never modified
  2. Sorted: Keys in lexicographic order
  3. Indexed: Sparse index for efficient lookup
  4. Compressed: Often with block-level compression
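A sketch of an SSTable writer that produces data blocks plus a sparse index (one entry per block); the text-based record format and 4 KB block size are illustrative, and compression, the bloom filter, and the footer are omitted:

    def write_sstable(path, sorted_items, block_size=4096):
        """sorted_items must yield (key, value) pairs in key order,
        e.g. a memtable's sorted contents."""
        sparse_index = []                  # (first_key_of_block, file_offset)
        with open(path, "wb") as f:
            offset = 0
            block = bytearray()
            for key, value in sorted_items:
                if not block:              # first entry of a new block
                    sparse_index.append((key, offset))
                block += f"{key}\t{value}\n".encode()
                if len(block) >= block_size:
                    f.write(block)
                    offset += len(block)
                    block = bytearray()
            if block:                      # final partial block
                f.write(block)
        return sparse_index                # real formats append this as the index block

A point read binary-searches the sparse index for the last block whose first key is less than or equal to the target, then scans only that one block.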

Bloom Filters

A bloom filter is a probabilistic data structure that quickly tells you if a key is definitely not in the SSTable:

    Bloom Filter Query:

    "Is key X in this SSTable?"

    Bloom filter says NO  → Definitely not there (skip this SSTable)
    Bloom filter says YES → Might be there (need to check)

    False positives possible, false negatives impossible

This saves disk reads by quickly eliminating SSTables that don’t contain the key.

    Without bloom filter:
    Check SSTable 1: Read index, read block → Not found
    Check SSTable 2: Read index, read block → Not found
    Check SSTable 3: Read index, read block → Found!
    = 6+ disk reads

    With bloom filter:
    Check SSTable 1 bloom: NO → Skip
    Check SSTable 2 bloom: NO → Skip
    Check SSTable 3 bloom: MAYBE → Read index, read block → Found!
    = 2 disk reads
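A small bloom filter sketch; deriving all probe positions from one SHA-256 digest is an illustrative shortcut (real engines use cheaper hashes such as xxHash or Murmur), and the sizes are arbitrary:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=8192, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: bytes):
            # Derive k bit positions from two hash values.
            digest = hashlib.sha256(key).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, key: bytes):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key: bytes) -> bool:
            # False means "definitely absent"; True means "maybe present".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(b"user:1")
    print(bf.might_contain(b"user:1"))     # True: maybe present, worth reading
    print(bf.might_contain(b"user:999"))   # almost certainly False: skip this SSTable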

5.5 Read Path

Reading from an LSM tree requires checking multiple locations:

    Read key K:

    1. Check memtable (in-memory)
       Found? → Return value

    2. Check immutable memtable (if any)
       Found? → Return value

    3. Check Level 0 SSTables (newest first)
       For each SSTable:
         - Check bloom filter
         - If maybe present, search SSTable
       Found? → Return value

    4. Check Level 1 SSTables
       (Binary search since non-overlapping)
       Found? → Return value

    5. Check Level 2, 3, ... N
       Found? → Return value

    6. Not found

Read amplification: May need to check multiple SSTables for one key.

    Worst case: Key doesn't exist

    Check: memtable + immutable memtable +
           all L0 SSTables +
           one SSTable per level L1-LN

    With both memtables, 4 L0 files, and one SSTable per level L1-L4,
    that's ~10 locations

This is the trade-off: writes are fast (sequential), but reads may check many files.
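The lookup order above can be sketched as follows; each SSTable handle is assumed to expose a bloom-filter check and a point lookup (as in the earlier sketches), and tombstone handling is deferred to section 5.7:

    def lsm_get(key, memtable, immutable_memtables, levels):
        """levels: list of SSTable lists, Level 0 first."""
        # 1-2. In-memory components hold the newest data.
        for table in [memtable, *immutable_memtables]:
            value = table.get(key)
            if value is not None:
                return value
        # 3-5. On-disk levels, newest level first.
        for sstables in levels:
            # L0 files may overlap, so every file is a candidate; for L1+ a real
            # engine narrows this to one file by binary search on key ranges.
            for sst in sstables:
                if not sst.might_contain(key):
                    continue               # bloom filter: definitely not here
                value = sst.get(key)
                if value is not None:
                    return value
        return None                        # 6. not found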


5.6 Compaction

Compaction is the process of merging SSTables to:

  1. Reduce read amplification (fewer files to check)
  2. Reclaim space from deleted/overwritten keys
  3. Reorganize data for efficient access

Size-Tiered Compaction

Group SSTables of similar size, merge when too many:

    Size-Tiered Compaction:

    Before:
    Small:  [S1] [S2] [S3] [S4]  ← 4 small files, time to merge
    Medium: [M1] [M2]
    Large:  [L1]

    After merging the small tier:
    Small:  (empty)
    Medium: [M1] [M2] [M3]       ← M3 = S1+S2+S3+S4 merged (grew to medium)
    Large:  [L1]

Pros: Simple, low write amplification
Cons: Space amplification (temporary doubling), unpredictable read performance
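One way to sketch the "too many similar-sized files" trigger; the bucket ratio and threshold are illustrative, and this is a simplified version of what strategies like Cassandra's size-tiered compaction do with their bucket bounds:

    def pick_size_tiered_bucket(sstables, bucket_ratio=1.5, min_threshold=4):
        """sstables are assumed to expose a .size attribute in bytes.
        Returns a group of similar-sized files to merge, or None."""
        buckets = []
        for sst in sorted(sstables, key=lambda s: s.size):
            if buckets and sst.size <= buckets[-1][0].size * bucket_ratio:
                buckets[-1].append(sst)    # close enough in size: same tier
            else:
                buckets.append([sst])      # start a new tier
        for bucket in buckets:
            if len(bucket) >= min_threshold:
                return bucket              # enough files piled up: compact these
        return None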

Leveled Compaction

Organize SSTables into levels with size ratios:

    Leveled Compaction:

    Level 0: [SST] [SST] [SST]     (overlapping ranges, flush destination)
    Level 1: [a-d] [e-h] [i-l]     (non-overlapping, 10MB each)
    Level 2: [a-b][c-d][e-f]...    (non-overlapping, 10x larger total)
    Level 3: [a][b][c][d]...       (non-overlapping, 10x larger total)

    Compaction: Pick L0 file, merge with overlapping L1 files
    Level size ratio = 10 (typical)

    Level 0: ~10MB total
    Level 1: ~100MB total
    Level 2: ~1GB total
    Level 3: ~10GB total
    Level 4: ~100GB total

    90% of data is in the last level

Pros: Better read performance, bounded space amplification
Cons: Higher write amplification (data rewritten ~10x per level)
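The "merge with overlapping L1 files" step needs the next-level files whose key ranges intersect the file being pushed down; because L1+ files are non-overlapping, that set is a contiguous run. A sketch, assuming each file handle exposes .min_key and .max_key:

    def overlapping_files(next_level_files, min_key, max_key):
        # next_level_files are assumed sorted by .min_key with disjoint ranges.
        return [f for f in next_level_files
                if not (f.max_key < min_key or f.min_key > max_key)]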

Compaction Trade-offs

                        | Size-Tiered  | Leveled
Write amplification     | Lower        | Higher
Read amplification      | Higher       | Lower
Space amplification     | Higher       | Lower
Best for                | Write-heavy  | Read-heavy

5.7 Handling Updates and Deletes

Updates

LSM trees don’t update in place. A new version is simply written:

    Time T1: Write key="user:1", value="Alice"
    Time T2: Write key="user:1", value="Alicia"  (update)

    Both versions exist in different SSTables!

    Read sees newest version (T2): "Alicia"
    Compaction eventually removes T1 version

Deletes: Tombstones

Deleting a key writes a special tombstone marker:

    Time T1: Write key="user:1", value="Alice"
    Time T2: Delete key="user:1"  → Write tombstone

    SSTable 1: [user:1 = "Alice"]
    SSTable 2: [user:1 = TOMBSTONE]

    Read checks SSTable 2 first, sees tombstone → Key is deleted

Tombstones persist until compaction merges all SSTables containing the key:

    Compaction merges SSTable 1 and 2:

    See "Alice" at T1
    See TOMBSTONE at T2
    T2 > T1, so tombstone wins
    Result: Key is dropped entirely
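A merge step along these lines can be sketched as follows, assuming each input yields (key, sequence_number, value) in key order and that higher sequence numbers are newer; the TOMBSTONE sentinel is illustrative:

    import heapq

    TOMBSTONE = object()                   # illustrative marker for a delete

    def compact(inputs, is_bottom_level=False):
        """inputs: iterables of (key, seqno, value), each sorted by key.
        Emits one entry per key: the newest version, with tombstones dropped
        only at the bottom level, where nothing older can still be shadowed."""
        merged = heapq.merge(*inputs, key=lambda e: (e[0], -e[1]))
        last_key = None
        for key, seqno, value in merged:
            if key == last_key:
                continue                   # older version of a key already emitted
            last_key = key
            if value is TOMBSTONE and is_bottom_level:
                continue                   # the delete marker itself can be dropped
            yield key, seqno, value

    older = [("user:1", 1, "Alice"), ("user:2", 2, "Bob")]
    newer = [("user:1", 3, TOMBSTONE)]
    print(list(compact([older, newer], is_bottom_level=True)))
    # [('user:2', 2, 'Bob')] -- "Alice" and her tombstone are both gone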

Tombstone Accumulation

Deleted keys temporarily increase space usage:

    Heavy delete workload:

    Original data: 10GB
    Delete 5GB worth of data

    Until compaction:
    - Original data still on disk: 10GB
    - Plus tombstones: ~100MB
    - Total: ~10.1GB (not 5GB!)

    After full compaction: 5GB

5.8 Write Path

The complete write flow:

    Write(key, value):

    ┌─────────────────────────────────────────────────────────────┐
    │ 1. APPEND TO WAL                                             │
    │    [LSN: 1001 | key | value | checksum]                     │
    │    Sequential write, fsync for durability                    │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. INSERT INTO MEMTABLE                                      │
    │    Skip list insert: O(log n)                                │
    │    In-memory only                                            │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 3. RETURN SUCCESS                                            │
    │    Client sees success once WAL is durable                   │
    └─────────────────────────────────────────────────────────────┘
                               │
                    (Background, asynchronous)
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 4. MEMTABLE FLUSH (when threshold reached)                   │
    │    - Make memtable immutable                                 │
    │    - Create new memtable for writes                          │
    │    - Write SSTable to disk (sequential)                      │
    │    - Build bloom filter and index                            │
    │    - Delete old WAL segments                                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                    (Background, asynchronous)
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 5. COMPACTION (when thresholds reached)                      │
    │    - Merge SSTables                                          │
    │    - Remove obsolete versions                                │
    │    - Remove tombstones (when safe)                           │
    │    - Delete old SSTables                                     │
    └─────────────────────────────────────────────────────────────┘

The key insight: All disk writes are sequential. The only random I/O is during reads.
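Tying steps 1-4 together in one sketch; the WAL appender, memtable, and SSTable writer are passed in so this composes with the sketches from sections 5.3 and 5.4, and all names here are illustrative:

    class LSMWriter:
        def __init__(self, wal_append, memtable, new_memtable, write_sstable):
            self.wal_append = wal_append       # step 1: durable, sequential log
            self.memtable = memtable
            self.new_memtable = new_memtable   # factory for a fresh memtable
            self.write_sstable = write_sstable # step 4: sequential SSTable write

        def put(self, key: str, value: str) -> None:
            self.wal_append(key.encode(), value.encode())  # 1. append + fsync
            self.memtable.put(key, value)                  # 2. in-memory insert
            # 3. the client can be acknowledged here; everything below is done
            #    synchronously only to keep the sketch short.
            if self.memtable.is_full():
                sealed = self.memtable                     # 4a. freeze
                self.memtable = self.new_memtable()        # 4b. fresh memtable
                self.write_sstable(sealed.sorted_items())  # 4c. flush sequentially
                # 4d. the WAL segment covering `sealed` can now be deleted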


5.9 Comparing B-Trees and LSM Trees

Write Performance

    B-tree write (random I/O):
    Read page → Modify → Write page
    ~2 random I/Os per write

    LSM write (sequential I/O):
    Append to WAL → Insert memtable → Done
    1 sequential I/O per write (amortized)

LSM trees win on write throughput. The advantage is largest on HDDs, where random I/O is most expensive; on SSDs the gap is smaller, but sequential writes still help by reducing write amplification and flash wear.

Read Performance

    B-tree read (one location):
    Navigate tree → Read leaf → Done
    ~3-4 I/Os

    LSM read (multiple locations):
    Check memtable → Check each level → Maybe found
    Varies: 1 to many I/Os

B-tree wins on point reads, especially for keys that exist.

Space Amplification

    B-tree:
    Data + indexes
    ~1.5-2x raw data size

    LSM (before compaction):
    Multiple versions, tombstones
    Can be 2-10x temporarily

    LSM (after compaction):
    ~1.1-1.2x raw data size

LSM has variable space usage, depending on compaction state.

Summary Table

Factor                  | B-tree     | LSM Tree
Write throughput        | Good       | Excellent
Point read latency      | Excellent  | Good
Range scan              | Good       | Good
Space efficiency        | Good       | Variable
Write amplification     | Medium     | Low-High
Read amplification      | Low        | Medium-High
Predictable latency     | Yes        | Less so

5.10 LSM Trees in Practice

RocksDB

Facebook’s embedded storage engine, widely used:

    Used by:
    - MySQL (MyRocks storage engine)
    - CockroachDB
    - TiKV (TiDB)
    - Apache Flink (state backend)
    - Many more

Key features:

  • Leveled and size-tiered compaction options
  • Column families (multiple LSM trees)
  • Rate limiting for compaction
  • Block cache and bloom filters

LevelDB

Google’s original embedded LSM implementation:

  • Simpler than RocksDB
  • Single-threaded compaction
  • Foundation for many derivatives

Cassandra

Distributed database using LSM trees:

  • Size-tiered compaction by default
  • Leveled compaction option
  • Time-window compaction for time-series data

HBase

Hadoop-based distributed database:

  • LSM tree storage: SSTables are called “HFiles”, the memtable the “MemStore”
  • Automatic region splitting
  • Bloom filters and block cache

5.11 Tuning LSM Trees

Memtable Size

Larger memtable:

  • More writes batched before flush
  • Better write throughput
  • More memory usage
  • Longer recovery time (more WAL to replay)

    # RocksDB
    write_buffer_size = 64MB
    max_write_buffer_number = 3

Level Sizes and Ratios

    # Typical configuration:
    Level 0: 4 files max before compaction
    Level 1: 256MB
    Level multiplier: 10

    Result:
    L0: ~256MB (4 × 64MB memtables)
    L1: 256MB
    L2: 2.56GB
    L3: 25.6GB
    L4: 256GB

Bloom Filter Configuration

More bits per key = lower false positive rate = fewer unnecessary reads:

Bits per key | False positive rate
     8       |       2%
    10       |       1%
    12       |      0.3%
    14       |      0.1%
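These figures line up with the standard approximation for a bloom filter using the optimal number of hash functions, where the false positive rate is roughly (1/2)^(ln 2 × bits per key); a quick check:

    import math

    # With the optimal hash count k = bits_per_key * ln(2), the false positive
    # rate is approximately (1/2)**k.
    for bits_per_key in (8, 10, 12, 14):
        k = bits_per_key * math.log(2)
        print(bits_per_key, f"{0.5 ** k:.2%}")
    # ~2.1%, ~0.8%, ~0.3%, ~0.1% -- in line with the table above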

Compaction Throttling

Limit compaction I/O to avoid interfering with foreground operations:

    # RocksDB rate limiter
    rate_limiter = 100MB/s  # Cap compaction I/O

5.12 Summary

LSM trees optimize for writes by trading off reads:

  • Memtable: In-memory buffer for recent writes
  • SSTables: Immutable, sorted files on disk
  • Compaction: Merges files to reduce read amplification
  • Bloom filters: Skip files that definitely don’t have the key
  • Tombstones: Logical deletes until compaction

LSM trees excel for:

  • Write-heavy workloads
  • Time-series data
  • Log aggregation
  • Situations where sequential I/O matters

B-trees are better for:

  • Read-heavy workloads
  • Predictable latency requirements
  • Frequent updates to existing data

Many modern databases offer both or hybrid approaches.


What’s Next

In Chapter 6, we’ll explore hash indexes and other specialized index structures—bloom filters in depth, bitmap indexes, and when to use alternatives to B-trees and LSM trees.


“Every write is expensive eventually—the question is whether you pay now (B-tree) or later (LSM tree).”