Chapter 4: Indexing Structures - B-Trees and Beyond
“Show me your indexes, and I’ll show you your performance.”
Without indexes, every query would require scanning every row. Indexes are the data structures that make databases fast. In this chapter, we’ll explore the most important index structure in database history: the B-tree.
4.1 Why We Need Indexes
Consider a table with 10 million users:
SELECT * FROM users WHERE email = 'alice@example.com';
Without an index: Scan all 10 million rows, checking each email. Even at 100,000 rows/second, this takes 100 seconds.
With an index on email: Look up in index (~3 I/O operations), fetch the row. Takes milliseconds.
Table Scan                     Index Lookup
O(n) = 10,000,000 rows         O(log n) ≈ 23 (binary tree) or ~4 (B-tree) pages
~100 seconds                   ~5 milliseconds
Indexes trade space (storing extra data structures) and write performance (maintaining indexes on INSERT/UPDATE/DELETE) for dramatically faster reads.
4.2 Index Basics
An index is a separate data structure that maps values to row locations:
CONCEPTUAL INDEX
Index on email Table
┌─────────────────────┬──────┐ ┌────────────────────────┐
│ alice@example.com │ → ───│─────────│ id=42, Alice, alice@.. │
│ bob@example.com │ → ───│─────────│ id=17, Bob, bob@... │
│ carol@example.com │ → ───│─────────│ id=91, Carol, carol@.. │
│ ... │ │ │ ... │
└─────────────────────┴──────┘ └────────────────────────┘
Key properties:
- Keys: The values being indexed (email addresses)
- Values: Pointers to the actual rows (page + slot, or primary key)
- Sorted: Keys are kept in sorted order for efficient searching
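The conceptual index above can be sketched as a sorted list of (key, row-pointer) pairs searched with binary search; real indexes page this structure to disk, but the lookup idea is the same. The class and pointer format here are illustrative assumptions, not any database's actual layout.

```python
import bisect

class SortedIndex:
    """Toy index: sorted keys plus a parallel list of row pointers."""
    def __init__(self):
        self.keys = []   # index keys (e.g. email addresses), kept sorted
        self.ptrs = []   # row pointers, e.g. (page, slot) tuples

    def insert(self, key, row_ptr):
        i = bisect.bisect_left(self.keys, key)   # keep sorted order
        self.keys.insert(i, key)
        self.ptrs.insert(i, row_ptr)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)   # O(log n) search
        if i < len(self.keys) and self.keys[i] == key:
            return self.ptrs[i]
        return None

idx = SortedIndex()
idx.insert("bob@example.com", ("page 3", "slot 1"))
idx.insert("alice@example.com", ("page 7", "slot 4"))
print(idx.lookup("alice@example.com"))  # ('page 7', 'slot 4')
```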
4.3 From Binary Trees to B-Trees
Why Not Binary Search Trees?
A binary search tree (BST) seems like a natural choice:
BINARY SEARCH TREE
50
/ \
25 75
/ \ / \
12 37 62 87
/\ /\ /\ /\
...
Problem: With millions of keys, the tree becomes very deep.
- 1 million keys → ~20 levels
- Each level = 1 random I/O
- 20 random I/Os per lookup = unacceptable
The B-Tree Insight
Rudolf Bayer and Edward McCreight introduced B-trees in 1972 with a key insight: increase node fanout to reduce tree height.
Instead of 2 children per node, have hundreds or thousands:
B-TREE (conceptual)
┌─────────────────────────┐
Level 0: │ 100 │ 200 │ 300 │ 400 │ (1 node)
└───┬───┬───┬───┬───┬───┘
/ | | | \
Level 1: ┌──┴──┐ ... ... ... ┌──┴──┐ (~500 nodes)
│ 1-20│ │480-500│
└──┬──┘ └───┬──┘
/ \ / \
Level 2: ┌─┴─┐ ┌─┴─┐ (~250,000 nodes)
│...│ │...│
└───┘ └───┘
With fanout of 500:
- Level 0: 1 node
- Level 1: 500 nodes
- Level 2: 250,000 nodes
- At the leaf level: 250,000 nodes × ~500 keys ≈ 125,000,000 keys
Three levels can index 125 million keys!
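The geometric growth above is easy to verify: each level multiplies the node count by the fanout.

```python
# Sketch: node counts per level for a B-tree with fanout 500.
fanout = 500
for level in range(3):
    print(f"Level {level}: {fanout ** level:,} nodes")
# With ~500 keys per leaf, three levels reach:
print(f"Keys indexed: {fanout ** 3:,}")  # 125,000,000
```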
4.4 B-Tree Structure
A B-tree of order m has these properties:
- Each node has at most m children
- Each internal node (except root) has at least ⌈m/2⌉ children
- The root has at least 2 children (if not a leaf)
- All leaves are at the same depth
- A node with k children contains k-1 keys
Node Structure
B-TREE NODE
┌─────────────────────────────────────────────────────────┐
│ Header: node_type=internal, key_count=4, ... │
├─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬────────┤
│ P0 │ K1 │ P1 │ K2 │ P2 │ K3 │ P3 │ K4 │ P4 │
└──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬─────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Keys <K1 K1≤Keys<K2 K2≤Keys<K3 K3≤Keys<K4 Keys≥K4
- Pn: Pointer to child page (for internal nodes) or row (for leaves)
- Kn: Key values
- Keys are sorted: K1 < K2 < K3 < K4
Fanout Calculation
How many keys fit in one B-tree page?
Page size: 8192 bytes
Header: 24 bytes
Key size: 8 bytes (bigint)
Pointer size: 6 bytes (page + offset)
Available: 8192 - 24 = 8168 bytes
Entry size: 8 + 6 = 14 bytes
Entries per page: 8168 / 14 ≈ 583
Fanout ≈ 583
With fanout of 583:
- 4 levels can index: 583^4 ≈ 115 billion keys
- Most real indexes are 3-4 levels deep
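The fanout arithmetic above can be checked with a few lines of Python; the sizes are the same assumptions used in the worked example (8 KB pages, bigint keys, 6-byte pointers).

```python
# Back-of-envelope fanout calculation for one B-tree page.
PAGE_SIZE = 8192   # bytes per page
HEADER = 24        # page header overhead
KEY_SIZE = 8       # bigint key
PTR_SIZE = 6       # page number + offset

available = PAGE_SIZE - HEADER        # 8168 bytes for entries
entry_size = KEY_SIZE + PTR_SIZE      # 14 bytes per (key, pointer) pair
fanout = available // entry_size      # entries per page
print(fanout)                         # 583
print(f"4 levels: {fanout ** 4:,} keys")  # ~115.5 billion
```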
4.5 B-Tree Search
Searching a B-tree is straightforward:
Search(node, key):
if node is leaf:
binary search for key in node
return found/not found
else:
find child pointer P where K[i-1] ≤ key < K[i]
return Search(P, key)
Example: Search for key 75
Root: [50, 100, 150]
/ | \
/ | \
[10,20,30,40] [60,70,80,90] [110,120,...]
|
Find: 75 is here!
Steps:
- At root: 50 ≤ 75 < 100, follow the child pointer between keys 50 and 100
- At the leaf: binary search finds 75
- Total: 2 page reads
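The search routine can be fleshed out as runnable Python. The dict-based node layout here (a `leaf` flag, sorted `keys`, and a `children` list with one more pointer than keys) is an assumption for illustration, not any particular database's page format.

```python
import bisect

def search(node, key):
    """Descend the tree; in a leaf, binary-search for the key."""
    if node["leaf"]:
        i = bisect.bisect_left(node["keys"], key)
        return i < len(node["keys"]) and node["keys"][i] == key
    # bisect_right selects child i such that K[i-1] <= key < K[i]
    i = bisect.bisect_right(node["keys"], key)
    return search(node["children"][i], key)

# The example tree from above: root [50, 100, 150] with four leaves.
root = {
    "leaf": False, "keys": [50, 100, 150],
    "children": [
        {"leaf": True, "keys": [10, 20, 30, 40]},
        {"leaf": True, "keys": [60, 70, 75, 80, 90]},
        {"leaf": True, "keys": [110, 120, 130]},
        {"leaf": True, "keys": [160, 170]},
    ],
}
print(search(root, 75))  # True  (2 page reads: root + one leaf)
print(search(root, 55))  # False
```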
4.6 B-Tree Insertion
Insertion is more complex because we must maintain balance:
Simple Case: Leaf Has Room
Insert 75 into:
[60, 70, 80, 90, _, _] (has room)
↓
[60, 70, 75, 80, 90, _] (inserted in order)
Complex Case: Page Split
When a page is full, we must split:
Insert 75 into full node:
[60, 65, 70, 80, 85, 90] (full!)
Step 1: Conceptually add the new key: [60, 65, 70, 75, 80, 85, 90]
Step 2: Split at the median (75): [60, 65, 70] and [80, 85, 90]
Step 3: Promote the median key (75) to the parent
Parent: [..., 75, ...]
/ \
[60, 65, 70] [80, 85, 90]
If the parent is also full, the split propagates upward. In the worst case, it reaches the root, creating a new root and increasing tree height by 1.
PAGE SPLIT CASCADE
Before:
[Root: 50]
/ \
[30, 40] [70, 80] ← Insert 75 here (full!)
After splitting leaf:
[Root: 50, 75]
/ | \
[30, 40] [70] [80]
The promoted key (75) moves into the parent; [70] and [80] become its children.
But parent might be full too... splits propagate!
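The split-and-promote scheme can be sketched as a small B-tree insert. This toy uses `MAX_KEYS = 4` so splits happen quickly, and it splits full nodes on the way down (the classic preemptive-split variant), which keeps cascades simple; real implementations differ in many details.

```python
import bisect

MAX_KEYS = 4  # tiny capacity so splits are easy to observe

def new_node(leaf=True):
    return {"leaf": leaf, "keys": [], "children": []}

def split_child(parent, i):
    """Split parent's full child i; promote its middle key into parent."""
    child = parent["children"][i]
    mid = len(child["keys"]) // 2
    right = {"leaf": child["leaf"],
             "keys": child["keys"][mid + 1:],
             "children": child["children"][mid + 1:]}
    parent["keys"].insert(i, child["keys"][mid])   # promote median
    parent["children"].insert(i + 1, right)
    child["keys"] = child["keys"][:mid]
    child["children"] = child["children"][:mid + 1]

def insert(root, key):
    if len(root["keys"]) == MAX_KEYS:        # full root: grow a new root
        new_root = new_node(leaf=False)      # (tree height increases by 1)
        new_root["children"].append(root)
        split_child(new_root, 0)
        root = new_root
    node = root
    while not node["leaf"]:
        i = bisect.bisect_right(node["keys"], key)
        if len(node["children"][i]["keys"]) == MAX_KEYS:
            split_child(node, i)             # split full nodes on the way down
            if key > node["keys"][i]:
                i += 1
        node = node["children"][i]
    bisect.insort(node["keys"], key)         # leaf has room by construction
    return root

root = new_node()
for k in [60, 65, 70, 80, 85, 90, 75]:
    root = insert(root, k)
print(root["keys"])  # the promoted separator: [70]
```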
4.7 B-Tree Deletion
Deletion must also maintain balance:
Simple Case: Remove Without Underflow
Delete 75:
[60, 70, 75, 80, 90] (has enough keys)
↓
[60, 70, 80, 90, _] (removed, still valid)
Complex Case: Underflow
When a node has too few keys after deletion:
Option 1: Borrow from a sibling (rotate through the parent)
Before: [30, 40]   parent key [50]   [_]   (right sibling underflowed)
After:  [30]       parent key [40]   [50]
Option 2: Merge with a sibling
Before: [30]   parent key [50]   [70]   (both at minimum occupancy)
After:  [30, 50, 70]   (parent drops the key; its own underflow may cascade)
Merges can propagate upward, potentially reducing tree height.
Deletion in Practice
Many databases use a simpler approach:
- Mark entries as deleted (tombstones)
- Periodically rebuild or compact the index
- Space is reclaimed lazily
This avoids complex rebalancing during normal operations.
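The tombstone approach can be sketched in a few lines: deletion only marks an entry, and a later compaction pass reclaims the space. The page-as-dict representation is an illustrative assumption.

```python
# Lazy deletion: mark entries dead now, reclaim space later.
TOMBSTONE = object()  # sentinel marking a deleted entry

page = {10: "row-a", 40: "row-b", 60: "row-c", 90: "row-d"}

def delete(page, key):
    page[key] = TOMBSTONE          # no rebalancing at delete time

def compact(page):
    """Periodic pass: rebuild the page without dead entries."""
    return {k: v for k, v in page.items() if v is not TOMBSTONE}

delete(page, 40)
delete(page, 90)
page = compact(page)
print(sorted(page))  # [10, 60]
```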
4.8 B+ Trees: The Practical Variant
Most databases actually use B+ trees, a variant where:
- All values are in leaves: Internal nodes only have keys and child pointers
- Leaves are linked: Forms a doubly-linked list for range scans
- Keys may be duplicated: Internal keys are “separator” copies
B+ TREE
Internal nodes (keys + child pointers only):
[50 | 100]
/ | \
/ | \
/ | \
Leaf nodes (keys + values, linked):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 10→row │←→ │ 50→row │←→ │ 100→row │
│ 20→row │ │ 60→row │ │ 110→row │
│ 30→row │ │ 70→row │ │ 120→row │
└─────────────┘ └─────────────┘ └─────────────┘
▲ ▲
└─────── Linked list ───────────────┘
Why B+ Trees?
Advantage 1: Higher fanout in internal nodes
Internal nodes don’t store values, only keys. More keys per page = higher fanout = shallower tree.
Advantage 2: Efficient range scans
SELECT * FROM users WHERE age BETWEEN 20 AND 30;
- Find first leaf with age ≥ 20
- Follow leaf pointers until age > 30
- No need to traverse internal nodes again
Range scan: Follow the chain
[..., 18, 19] → [20, 21, 22, ...] → [28, 29, 30, 31, ...] → ...
↑ ↑
Start Stop
Advantage 3: Sequential leaf scanning
For full index scans, just read all leaves in order. Very cache-friendly.
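The linked-leaf range scan can be sketched as follows. The `Leaf` class with a `next` pointer is an illustrative stand-in for sibling page pointers; a real scan would first descend the tree to find the starting leaf.

```python
class Leaf:
    """Toy B+ tree leaf: sorted keys plus a sibling pointer."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys
        self.next = next_leaf

# Build a small leaf chain (keys only; values omitted for brevity).
l3 = Leaf([100, 110, 120])
l2 = Leaf([50, 60, 70], l3)
l1 = Leaf([10, 20, 30], l2)

def range_scan(start_leaf, lo, hi):
    results, leaf = [], start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return results    # past the range: stop early
            if k >= lo:
                results.append(k)
        leaf = leaf.next          # follow the sibling pointer
    return results

print(range_scan(l1, 20, 70))  # [20, 30, 50, 60, 70]
```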
4.9 B-Tree Variations
B* Trees
Keep nodes at least 2/3 full (instead of 1/2) by redistributing keys between siblings before splitting. Results in better space utilization.
Prefix B-trees
Compress common key prefixes to save space (an idea shared with radix/Patricia tries, which are a separate structure):
Keys: "apple", "application", "apply", "appreciate"
Stored as:
"app" → [
"le" → row
"lication" → row
"ly" → row
"reciate" → row
]
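The compression step can be sketched with Python's `os.path.commonprefix`, which works character-wise on any strings. The common prefix of these four keys is "app" ("appreciate" diverges at the fourth character).

```python
from os.path import commonprefix

def compress(keys):
    """Store the shared prefix once; keep only the distinct suffixes."""
    keys = sorted(keys)                  # index keys are kept sorted
    prefix = commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

prefix, suffixes = compress(["application", "apple", "apply", "appreciate"])
print(prefix)    # app
print(suffixes)  # ['le', 'lication', 'ly', 'reciate']
```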
Fractal Trees
Cache insertions in internal-node buffers and flush them to leaves in batches. Better write performance at some cost to reads. Used by TokuDB (developed by Tokutek, later acquired by Percona).
4.10 Practical B-Tree Considerations
Key Size Matters
Large keys = lower fanout = deeper trees = more I/O
Integer key (8 bytes): Fanout ~500
UUID key (16 bytes): Fanout ~370
VARCHAR(255) key: Fanout ~30
Implications:
- Prefer integer primary keys over UUIDs when possible
- Use prefix indexes for long string columns
- Consider hash of long keys
Fill Factor
When creating an index, you can specify how full to make pages:
-- PostgreSQL
CREATE INDEX idx ON users(email) WITH (fillfactor = 90);
-- Leave 10% empty for future inserts without immediate splits
Lower fill factor = more space overhead but fewer page splits on insert.
Index-Only Scans (Covering Indexes)
If all needed columns are in the index, we can skip reading the heap:
CREATE INDEX idx ON users(email, name);
SELECT name FROM users WHERE email = 'alice@example.com';
-- Can be satisfied from index alone!
Index-Only Scan:
Index: [email, name]              Table: (skipped!)
┌─────────────────────────────┐
│ alice@example.com → Alice   │   No need to read the table!
└─────────────────────────────┘
Composite (Multi-column) Indexes
Order matters! The index is sorted by first column, then second, etc.
CREATE INDEX idx ON orders(customer_id, order_date);
-- Uses index (customer_id is leftmost):
SELECT * FROM orders WHERE customer_id = 42;
-- Uses index (both columns, customer_id first):
SELECT * FROM orders WHERE customer_id = 42 AND order_date > '2024-01-01';
-- Cannot use index efficiently (order_date is not leftmost):
SELECT * FROM orders WHERE order_date > '2024-01-01';
Composite Index Structure:
[customer_id=1, date=2024-01-01] → row
[customer_id=1, date=2024-01-02] → row
[customer_id=1, date=2024-01-03] → row
[customer_id=2, date=2024-01-01] → row
[customer_id=2, date=2024-01-02] → row
...
Sorted by customer_id first, then date within each customer
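The leftmost-prefix rule falls out of this sort order, and can be demonstrated with sorted tuples: a leading `customer_id` lets us binary-search to the right region, while a date-only predicate has no sorted prefix to seek on. The helper below is an illustrative sketch, not a real planner.

```python
import bisect

# A composite index is conceptually a sorted list of key tuples.
index = sorted([
    (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),
    (2, "2024-01-01"), (2, "2024-01-02"), (3, "2024-02-15"),
])

def scan_customer(index, customer_id, from_date=""):
    """Seek to (customer_id, from_date), then scan that customer's run."""
    lo = bisect.bisect_left(index, (customer_id, from_date))
    out = []
    for cid, d in index[lo:]:
        if cid != customer_id:     # left the customer's run: stop
            break
        out.append((cid, d))
    return out

print(scan_customer(index, 2))                # all rows for customer 2
print(scan_customer(index, 1, "2024-01-02"))  # customer 1 from that date on
```

A date-only query would have to walk all entries, because entries with the same date are scattered across every customer's run.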
4.11 Index Maintenance and Bloat
Write Overhead
Every INSERT, UPDATE, DELETE must update all indexes:
INSERT INTO users (id, name, email, phone) VALUES (...)
Updates:
1. Heap (the table itself)
2. Primary key index
3. Index on email
4. Index on phone
= 4 writes for 1 INSERT
More indexes = slower writes. Only create indexes you actually need.
Index Bloat
Deleted entries leave “holes” in index pages:
Page after many deletes:
[10, _, _, 40, _, 60, _, _, 90, _] (50% wasted space!)
PostgreSQL: Dead entries are cleaned up by VACUUM.
MySQL InnoDB: Space is reused, but pages may not be reclaimed without OPTIMIZE TABLE.
Monitoring Index Health
-- PostgreSQL: Check index bloat
SELECT
schemaname || '.' || relname AS "table",
indexrelname AS "index",
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan AS scans
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- PostgreSQL: Rebuild bloated index
REINDEX INDEX idx_users_email;
-- Or create concurrently (less locking)
CREATE INDEX CONCURRENTLY idx_users_email_new ON users(email);
DROP INDEX idx_users_email;
ALTER INDEX idx_users_email_new RENAME TO idx_users_email;
4.12 Comparing Index Types
| Index Type | Best For | Limitations |
|---|---|---|
| B-tree | General purpose, range queries | Higher write cost |
| Hash | Exact matches only | No range queries |
| GiST | Geometric, full-text | Specialized data |
| GIN | Arrays, JSONB, full-text | Slower updates |
| BRIN | Very large, naturally ordered tables | Coarse; relies on physical ordering |
We’ll cover hash indexes and specialized types in Chapter 6.
4.13 B-Trees in Different Databases
PostgreSQL
- Uses B+ trees for all standard indexes
- Index entries point to heap (page, offset)
- Supports partial indexes, expression indexes
- VACUUM cleans dead entries
MySQL InnoDB
- Primary key is a clustered B+ tree (data in leaves)
- Secondary indexes point to primary key (not heap)
- Leaf pages store full rows for primary key lookups
- Change buffer batches secondary index updates
SQLite
- B+ trees for tables (rowid) and indexes
- Each table is a B+ tree keyed by rowid
- WITHOUT ROWID tables use the primary key as the B-tree key
4.14 Summary
B-trees are the foundation of database indexing:
- High fanout keeps trees shallow (typically 3-4 levels)
- B+ trees store all data in linked leaves for efficient range scans
- Insertions may cause page splits; deletions may cause merges
- Key size and fill factor affect performance
- Index maintenance has overhead; don’t over-index
- Composite indexes require careful column ordering
Understanding B-trees helps you design better schemas and interpret query plans.
What’s Next
In Chapter 5, we’ll explore LSM trees—a fundamentally different approach that optimizes for writes instead of reads. You’ll learn how databases like Cassandra and RocksDB achieve their write performance.
“A database without indexes is like a library without a card catalog—technically functional, practically useless.”