
Chapter 4: Indexing Structures - B-Trees and Beyond

“Show me your indexes, and I’ll show you your performance.”

Without indexes, every query would require scanning every row. Indexes are the data structures that make databases fast. In this chapter, we’ll explore the most important index structure in database history: the B-tree.


4.1 Why We Need Indexes

Consider a table with 10 million users:

SELECT * FROM users WHERE email = 'alice@example.com';

Without an index: Scan all 10 million rows, checking each email. Even at 100,000 rows/second, this takes 100 seconds.

With an index on email: Look up in index (~3 I/O operations), fetch the row. Takes milliseconds.

    Table Scan:         Index Lookup:

    O(n) = 10,000,000   O(log n) = ~23 (base-2) or ~4 (B-tree)

    100 seconds         5 milliseconds

Indexes trade space (storing extra data structures) and write performance (maintaining indexes on INSERT/UPDATE/DELETE) for dramatically faster reads.
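The arithmetic above is easy to check. A minimal back-of-envelope sketch in Python, using the row count and scan throughput assumed in the text:

```python
import math

N = 10_000_000  # rows in the example table

# Full table scan at 100,000 rows/second (the throughput assumed above)
scan_seconds = N / 100_000  # 100.0 seconds

# An index lookup costs roughly one page read per tree level
binary_levels = math.log2(N)      # ~23 levels for a binary tree
btree_levels = math.log(N, 500)   # ~2.6 levels for a B-tree with fanout 500
```

The base of the logarithm is what separates a binary tree from a B-tree: with a fanout of 500 instead of 2, the same 10 million keys fit in a tree only 3 levels deep.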


4.2 Index Basics

An index is a separate data structure that maps values to row locations:

                    CONCEPTUAL INDEX

    Index on email                          Table
    ┌─────────────────────┬──────┐         ┌────────────────────────┐
    │ alice@example.com   │ → ───│─────────│ id=42, Alice, alice@.. │
    │ bob@example.com     │ → ───│─────────│ id=17, Bob, bob@...    │
    │ carol@example.com   │ → ───│─────────│ id=91, Carol, carol@.. │
    │ ...                 │      │         │ ...                    │
    └─────────────────────┴──────┘         └────────────────────────┘

Key properties:

  • Keys: The values being indexed (email addresses)
  • Values: Pointers to the actual rows (page + slot, or primary key)
  • Sorted: Keys are kept in sorted order for efficient searching

4.3 From Binary Trees to B-Trees

Why Not Binary Search Trees?

A binary search tree (BST) seems like a natural choice:

                    BINARY SEARCH TREE

                         50
                        /  \
                      25    75
                     / \    / \
                   12  37  62  87
                   /\  /\  /\  /\
                  ...

Problem: With millions of keys, the tree becomes very deep.

  • 1 million keys → ~20 levels
  • Each level = 1 random I/O
  • 20 random I/Os per lookup = unacceptable

The B-Tree Insight

Rudolf Bayer and Ed McCreight invented B-trees in 1972 with a key insight: Increase node fanout to reduce tree height.

Instead of 2 children per node, have hundreds or thousands:

                    B-TREE (conceptual)

                    ┌─────────────────────────┐
       Level 0:     │ 100 │ 200 │ 300 │ 400  │    (1 node)
                    └───┬───┬───┬───┬───┬───┘
                       /    |   |   |    \
       Level 1:    ┌──┴──┐ ... ... ...  ┌──┴──┐   (~500 nodes)
                   │ 1-20│              │480-500│
                   └──┬──┘              └───┬──┘
                     / \                   / \
       Level 2:   ┌─┴─┐                 ┌─┴─┐     (~250,000 nodes)
                  │...│                 │...│
                  └───┘                 └───┘

With fanout of 500:

  • Level 0: 1 root node
  • Level 1: 500 nodes
  • Level 2: 250,000 leaf nodes × ~500 keys each = 125,000,000 keys

Three levels can index 125 million keys!
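The capacity of each level is just a power of the fanout. A quick sketch, assuming the fanout of 500 used above:

```python
fanout = 500  # children per node (the value assumed in the text)

# Node counts for levels 0, 1, 2 of the tree
nodes_at_level = [fanout ** level for level in range(3)]  # [1, 500, 250000]

# If the level-2 nodes are leaves holding ~500 keys each:
leaf_keys = nodes_at_level[-1] * fanout  # 125,000,000 keys in three levels
```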


4.4 B-Tree Structure

A B-tree of order m has these properties:

  1. Each node has at most m children
  2. Each internal node (except root) has at least ⌈m/2⌉ children
  3. The root has at least 2 children (if not a leaf)
  4. All leaves are at the same depth
  5. A node with k children contains k-1 keys

Node Structure

                    B-TREE NODE

    ┌─────────────────────────────────────────────────────────┐
    │  Header: node_type=internal, key_count=4, ...           │
    ├─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬────────┤
    │ P0  │ K1  │ P1  │ K2  │ P2  │ K3  │ P3  │ K4  │ P4     │
    └──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬─────┘
       │           │           │           │           │
       ▼           ▼           ▼           ▼           ▼
    Keys <K1    K1≤Keys<K2  K2≤Keys<K3  K3≤Keys<K4   Keys≥K4

  • Pn: Pointer to child page (for internal nodes) or row (for leaves)
  • Kn: Key values
  • Keys are sorted: K1 < K2 < K3 < K4

Fanout Calculation

How many keys fit in one B-tree page?

    Page size: 8192 bytes
    Header: 24 bytes
    Key size: 8 bytes (bigint)
    Pointer size: 6 bytes (page + offset)

    Available: 8192 - 24 = 8168 bytes
    Entry size: 8 + 6 = 14 bytes
    Entries per page: 8168 / 14 ≈ 583

    Fanout ≈ 583

With fanout of 583:

  • 4 levels can index: 583^4 = 115 billion keys
  • Most real indexes are 3-4 levels deep
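The fanout calculation translates directly to code. A sketch using the page, header, key, and pointer sizes given above:

```python
page_size, header = 8192, 24   # bytes per page, bytes of page header
key_size, ptr_size = 8, 6      # bigint key + (page, offset) pointer

# How many (key, pointer) entries fit in the space left after the header
fanout = (page_size - header) // (key_size + ptr_size)  # 583

# Capacity of a four-level tree with this fanout
four_level_capacity = fanout ** 4  # ~115.5 billion keys
```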

4.5 B-Tree Search

Searching a B-tree is straightforward:

Search(node, key):
    if node is leaf:
        binary search for key in node
        return found/not found
    else:
        find child pointer P where K[i-1] ≤ key < K[i]
        return Search(P, key)

Example: Search for key 75

                    Root: [50, 100, 150]
                          /    |     \
                        /      |      \
    [10,20,30,40]   [60,70,75,80,90]   [110,120,...]
                           |
                     Find: 75 is here!

Steps:

  1. At root: 50 ≤ 75 < 100, go to middle child
  2. At child: Binary search finds 75
  3. Total: 2 page reads
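The pseudocode above can be turned into a runnable sketch. This is an illustrative in-memory model (nodes as dicts), not any real database's implementation; `bisect` plays the role of the within-node binary search:

```python
import bisect

def search(node, key):
    """Descend from the root to a leaf, one node read per level."""
    if node["leaf"]:
        i = bisect.bisect_left(node["keys"], key)  # binary search in the node
        return i < len(node["keys"]) and node["keys"][i] == key
    # Find the child P[i] such that K[i-1] <= key < K[i]
    i = bisect.bisect_right(node["keys"], key)
    return search(node["children"][i], key)

# The example tree from the text: root [50, 100, 150] over four leaves
root = {
    "leaf": False,
    "keys": [50, 100, 150],
    "children": [
        {"leaf": True, "keys": [10, 20, 30, 40]},
        {"leaf": True, "keys": [60, 70, 75, 80, 90]},
        {"leaf": True, "keys": [110, 120, 130]},
        {"leaf": True, "keys": [160, 170]},
    ],
}
```

Searching for 75 reads exactly two nodes: the root, then the second leaf.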

4.6 B-Tree Insertion

Insertion is more complex because we must maintain balance:

Simple Case: Leaf Has Room

Insert 75 into:

    [60, 70, 80, 90, _, _]   (has room)
              ↓
    [60, 70, 75, 80, 90, _]  (inserted in order)

Complex Case: Page Split

When a page is full, we must split:

Insert 75 into full node:

    [60, 65, 70, 80, 85, 90]   (full!)

Step 1: Conceptually insert the new key in order
    [60, 65, 70, 75, 80, 85, 90]   (overflows)

Step 2: Split around the median key (75)
    [60, 65, 70]    [80, 85, 90]

Step 3: Promote the median key to the parent
    Parent: [..., 75, ...]
              /        \
    [60, 65, 70]    [80, 85, 90]

If the parent is also full, the split propagates upward. In the worst case, it reaches the root, creating a new root and increasing tree height by 1.

                    PAGE SPLIT CASCADE

    Before:
                    [Root: 50]
                    /         \
           [30, 40]            [70, 80] ← Insert 75 here (full!)

    After splitting the leaf (75 is promoted into the parent):
                    [Root: 50, 75]
                    /      |      \
             [30, 40]    [70]    [80]

    But if the parent had been full too... splits propagate!
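The leaf split can be sketched in a few lines. This is a simplified model (a list of keys, `max_keys=6` as in the example), not a real page layout:

```python
import bisect

def insert_into_leaf(keys, key, max_keys=6):
    """Insert key in sorted order; on overflow, split around the median.

    Returns (keys, None, None) if there is room, otherwise
    (left_half, promoted_median, right_half).
    """
    bisect.insort(keys, key)
    if len(keys) <= max_keys:
        return keys, None, None
    mid = len(keys) // 2
    # The median key moves up to the parent; the halves become siblings
    return keys[:mid], keys[mid], keys[mid + 1:]

# The worked example: inserting 75 into a full node
left, sep, right = insert_into_leaf([60, 65, 70, 80, 85, 90], 75)
```

Running this reproduces the example: `left` is [60, 65, 70], the promoted separator is 75, and `right` is [80, 85, 90].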

4.7 B-Tree Deletion

Deletion must also maintain balance:

Simple Case: Remove Without Underflow

Delete 75:

    [60, 70, 75, 80, 90]   (has enough keys)
              ↓
    [60, 70, 80, 90, _]    (removed, still valid)

Complex Case: Underflow

When a node has too few keys after deletion:

Option 1: Borrow from sibling

    [30, 40]  parent: [50]  [_, _]  (right sibling underflow)

    Rotate through parent:
    [30]  parent: [40]  [50]

Option 2: Merge with sibling

    [30]  parent: [50]  [70]  (both minimal)

    Merge:
    [30, 50, 70]  (parent removes key, may cascade)

Merges can propagate upward, potentially reducing tree height.

Deletion in Practice

Many databases use a simpler approach:

  • Mark entries as deleted (tombstones)
  • Periodically rebuild or compact the index
  • Space is reclaimed lazily

This avoids complex rebalancing during normal operations.
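The tombstone approach is easy to model. A hypothetical sketch (the names `delete_lazy` and `compact` are illustrative, not any database's API):

```python
TOMBSTONE = object()  # sentinel marking a logically deleted entry

def delete_lazy(keys, values, key):
    """Mark an entry deleted instead of rebalancing the tree immediately."""
    values[keys.index(key)] = TOMBSTONE

def compact(keys, values):
    """Periodic rebuild: drop tombstoned entries, reclaiming space lazily."""
    live = [(k, v) for k, v in zip(keys, values) if v is not TOMBSTONE]
    return [k for k, _ in live], [v for _, v in live]

keys, values = [10, 20, 30], ["a", "b", "c"]
delete_lazy(keys, values, 20)        # fast: no borrowing or merging
keys, values = compact(keys, values)  # later: space reclaimed in bulk
```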


4.8 B+ Trees: The Practical Variant

Most databases actually use B+ trees, a variant where:

  1. All values are in leaves: Internal nodes only have keys and child pointers
  2. Leaves are linked: Forms a doubly-linked list for range scans
  3. Keys may be duplicated: Internal keys are “separator” copies

                    B+ TREE

    Internal nodes (keys + child pointers only):
                    [50 | 100]
                   /    |    \
                 /      |      \
               /        |        \

    Leaf nodes (keys + values, linked):
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │ 10→row      │←→ │ 50→row      │←→ │ 100→row     │
    │ 20→row      │   │ 60→row      │   │ 110→row     │
    │ 30→row      │   │ 70→row      │   │ 120→row     │
    └─────────────┘   └─────────────┘   └─────────────┘
          ▲                                   ▲
          └─────── Linked list ───────────────┘

Why B+ Trees?

Advantage 1: Higher fanout in internal nodes

Internal nodes don’t store values, only keys. More keys per page = higher fanout = shallower tree.

Advantage 2: Efficient range scans

SELECT * FROM users WHERE age BETWEEN 20 AND 30;

  1. Find the first leaf with age ≥ 20
  2. Follow leaf pointers until age > 30
  3. No need to traverse internal nodes again

    Range scan: Follow the chain

    [..., 18, 19] → [20, 21, 22, ...] → [28, 29, 30, 31, ...] → ...
                     ↑                         ↑
                   Start                     Stop

Advantage 3: Sequential leaf scanning

For full index scans, just read all leaves in order. Very cache-friendly.
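The linked-leaf range scan can be sketched directly. A toy in-memory model (in a real B+ tree the starting leaf would be found by descending from the root, which is omitted here):

```python
class Leaf:
    def __init__(self, entries, next_leaf=None):
        self.entries = entries   # sorted (key, row) pairs
        self.next = next_leaf    # sibling pointer used by range scans

def range_scan(start_leaf, low, high):
    """Walk the leaf chain, yielding entries with low <= key <= high."""
    leaf = start_leaf
    while leaf is not None:
        for key, row in leaf.entries:
            if key > high:
                return           # keys are sorted: we can stop early
            if key >= low:
                yield key, row
        leaf = leaf.next         # no internal nodes touched after the start

# Leaves chained left to right, as in the diagram
l3 = Leaf([(100, "r100"), (110, "r110"), (120, "r120")])
l2 = Leaf([(50, "r50"), (60, "r60"), (70, "r70")], l3)
l1 = Leaf([(10, "r10"), (20, "r20"), (30, "r30")], l2)
```

A scan for keys between 20 and 60 starts in the first leaf and follows one sibling pointer, never revisiting the internal nodes.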


4.9 B-Tree Variations

B* Trees

Keep nodes at least 2/3 full (instead of 1/2) by redistributing keys between siblings before splitting. Results in better space utilization.

Prefix B-trees

Compress common key prefixes to save space (an idea shared with radix/Patricia tries):

    Keys: "application", "apple", "apply", "appreciate"

    Stored as:
    "appl" → [
        "ication" → row
        "e" → row
        "y" → row
        "reciate" → row
    ]

Fractal Trees

Cache insertions in internal nodes and batch them down to leaves. Better write performance at the cost of read performance. Used by TokuDB (later maintained by Percona).


4.10 Practical B-Tree Considerations

Key Size Matters

Large keys = lower fanout = deeper trees = more I/O

    Integer key (8 bytes):   Fanout ~500
    UUID key (16 bytes):     Fanout ~400
    VARCHAR(255) key:        Fanout ~30

Implications:

  • Prefer integer primary keys over UUIDs when possible
  • Use prefix indexes for long string columns
  • Consider hash of long keys
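The fanout numbers above follow from the page-budget formula in section 4.4. A sketch, reusing those assumptions (8 KB page, 24-byte header, 6-byte pointer); the exact results differ slightly from the rounded estimates in the table:

```python
def fanout(key_bytes, page=8192, header=24, ptr=6):
    """Entries per page for a given key size (assumptions as in section 4.4)."""
    return (page - header) // (key_bytes + ptr)

fanout(8)    # bigint key        -> 583
fanout(16)   # UUID key          -> 371
fanout(255)  # VARCHAR(255) key  -> 31
```

Going from an 8-byte integer to a 255-byte string cuts the fanout by roughly 19x, which can add a level or two to the tree.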

Fill Factor

When creating an index, you can specify how full to make pages:

-- PostgreSQL
CREATE INDEX idx ON users(email) WITH (fillfactor = 90);

-- Leave 10% empty for future inserts without immediate splits

Lower fill factor = more space overhead but fewer page splits on insert.

Index-Only Scans (Covering Indexes)

If all needed columns are in the index, we can skip reading the heap:

CREATE INDEX idx ON users(email, name);

SELECT name FROM users WHERE email = 'alice@example.com';
-- Can be satisfied from index alone!

    Index-Only Scan:

    Index: [email, name]                Table: (skip!)
    ┌───────────────────────────┐
    │ alice@example.com → Alice │   No need to read the table!
    └───────────────────────────┘

Composite (Multi-column) Indexes

Order matters! The index is sorted by first column, then second, etc.

CREATE INDEX idx ON orders(customer_id, order_date);

-- Uses index (customer_id is leftmost):
SELECT * FROM orders WHERE customer_id = 42;

-- Uses index (both columns, customer_id first):
SELECT * FROM orders WHERE customer_id = 42 AND order_date > '2024-01-01';

-- Cannot use index efficiently (order_date is not leftmost):
SELECT * FROM orders WHERE order_date > '2024-01-01';

    Composite Index Structure:

    [customer_id=1, date=2024-01-01] → row
    [customer_id=1, date=2024-01-02] → row
    [customer_id=1, date=2024-01-03] → row
    [customer_id=2, date=2024-01-01] → row
    [customer_id=2, date=2024-01-02] → row
    ...

    Sorted by customer_id first, then date within each customer
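The leftmost-prefix rule follows from how sorted tuples behave. A sketch modeling the index entries above as a sorted list of tuples:

```python
import bisect

# Index entries sorted by (customer_id, order_date), as a composite index is
entries = sorted([
    (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),
    (2, "2024-01-01"), (2, "2024-01-02"), (3, "2024-02-10"),
])

# Leftmost column given: all matches form one contiguous slice
lo = bisect.bisect_left(entries, (2, ""))
hi = bisect.bisect_left(entries, (3, ""))
customer_2 = entries[lo:hi]  # efficient: two binary searches plus a scan

# Only the second column given: matches are scattered throughout the
# index, so every entry must be examined (the inefficient case above)
late_orders = [e for e in entries if e[1] > "2024-01-01"]
```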

4.11 Index Maintenance and Bloat

Write Overhead

Every INSERT, UPDATE, DELETE must update all indexes:

    INSERT INTO users (id, name, email, phone) VALUES (...)

    Updates:
    1. Heap (the table itself)
    2. Primary key index
    3. Index on email
    4. Index on phone
    = 4 writes for 1 INSERT

More indexes = slower writes. Only create indexes you actually need.

Index Bloat

Deleted entries leave “holes” in index pages:

    Page after many deletes:

    [10, _, _, 40, _, 60, _, _, 90, _]  (50% wasted space!)

PostgreSQL: Dead entries are cleaned up by VACUUM.
MySQL InnoDB: Space is reused, but pages may not be reclaimed without OPTIMIZE TABLE.

Monitoring Index Health

-- PostgreSQL: Check index sizes and usage
SELECT
    schemaname || '.' || relname AS "table",
    indexrelname AS "index",
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan AS scans
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Rebuild bloated index
REINDEX INDEX idx_users_email;

-- Or create concurrently (less locking)
CREATE INDEX CONCURRENTLY idx_users_email_new ON users(email);
DROP INDEX idx_users_email;
ALTER INDEX idx_users_email_new RENAME TO idx_users_email;

4.12 Comparing Index Types

    Index Type   Best For                         Limitations
    ──────────────────────────────────────────────────────────────
    B-tree       General purpose, range queries   Higher write cost
    Hash         Exact matches only               No range queries
    GiST         Geometric, full-text             Specialized data
    GIN          Arrays, JSONB, full-text         Slower updates
    BRIN         Very large, naturally ordered    Low selectivity

We’ll cover hash indexes and specialized types in Chapter 6.


4.13 B-Trees in Different Databases

PostgreSQL

  • Uses B+ trees for all standard indexes
  • Index entries point to heap (page, offset)
  • Supports partial indexes, expression indexes
  • VACUUM cleans dead entries

MySQL InnoDB

  • Primary key is a clustered B+ tree (data in leaves)
  • Secondary indexes point to primary key (not heap)
  • Leaf pages store full rows for primary key lookups
  • Change buffer batches secondary index updates

SQLite

  • B+ trees for tables (rowid) and indexes
  • Each table is a B+ tree keyed by rowid
  • WITHOUT ROWID tables use the primary key as the B-tree key

4.14 Summary

B-trees are the foundation of database indexing:

  • High fanout keeps trees shallow (typically 3-4 levels)
  • B+ trees store all data in linked leaves for efficient range scans
  • Insertions may cause page splits; deletions may cause merges
  • Key size and fill factor affect performance
  • Index maintenance has overhead; don’t over-index
  • Composite indexes require careful column ordering

Understanding B-trees helps you design better schemas and interpret query plans.


What’s Next

In Chapter 5, we’ll explore LSM trees—a fundamentally different approach that optimizes for writes instead of reads. You’ll learn how databases like Cassandra and RocksDB achieve their write performance.


“A database without indexes is like a library without a card catalog—technically functional, practically useless.”