
Database Internals: Where Your Data Actually Lives

A CloudStreet Educational Book

Written by Opus 4.5


“A database is not just a collection of tables—it’s a carefully orchestrated symphony of data structures, algorithms, and system design decisions, all working together to answer your queries while keeping your data safe.”


Preface

Every day, billions of database transactions occur around the world. When you check your bank balance, post a photo, or search for a product, databases spring into action—reading pages from disk, navigating tree structures, acquiring locks, and committing transactions. Yet for most developers, databases remain mysterious black boxes. SQL goes in, data comes out, and everything in between is magic.

This book aims to demystify that magic.

We’ll explore how databases actually store your data on disk, how they find it quickly using sophisticated index structures, how they allow thousands of users to read and write simultaneously without corrupting anything, and how they recover gracefully when servers crash. Along the way, we’ll encounter some of the most elegant algorithms and data structures in computer science—concepts that have been refined over five decades of database research.

Understanding database internals isn’t just academic curiosity. This knowledge helps you:

  • Write better queries by understanding what the database actually does with them
  • Design better schemas by knowing how data layout affects performance
  • Debug performance issues by recognizing symptoms of common problems
  • Choose the right database by understanding the trade-offs different architectures make
  • Build better systems by learning from decades of distributed systems research

Whether you’re a backend developer, a data engineer, or simply someone who wants to understand how their data is managed, this book will give you the foundation to think about databases at a deeper level.

Let’s begin our journey into the engine room.


Part I: Foundations

Before we can understand the sophisticated algorithms that make databases fast, we need to understand the physical reality they’re built on. Databases don’t exist in abstract mathematical space—they run on real machines with spinning disks (or flash storage), limited memory, and CPUs that can only do one thing at a time. The fundamental challenge of database design is bridging the gap between the logical world of tables and queries and the physical world of bytes on disk.

┌─────────────────────────────────────────────────────────────────┐
│                        YOUR APPLICATION                          │
│                     SELECT * FROM users                          │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                        SQL INTERFACE                             │
│              Parser → Planner → Optimizer → Executor             │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      STORAGE ENGINE                              │
│            Buffer Pool ← → Transaction Manager                   │
│                    ↓               ↓                             │
│              Index Manager    Lock Manager                       │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         DISK                                     │
│     Data Files    │    Index Files    │    WAL Files             │
│   ┌──┬──┬──┬──┐   │   ┌──┬──┬──┬──┐   │   ┌──┬──┬──┬──┐         │
│   │▓▓│▓▓│░░│▓▓│   │   │▓▓│▓▓│▓▓│▓▓│   │   │▓▓│▓▓│▓▓│░░│         │
│   └──┴──┴──┴──┘   │   └──┴──┴──┴──┘   │   └──┴──┴──┴──┘         │
└─────────────────────────────────────────────────────────────────┘

In Part I, we’ll build up from the physical layer:

  • Chapter 1 introduces the journey of a query from SQL text to disk operations
  • Chapter 2 explores storage engines and how they organize data in files
  • Chapter 3 dives into the realities of disk I/O and page-based storage

Chapter 1: Introduction - The Journey of a Query

Chapter 2: Storage Engines and File Formats

Chapter 3: Disk I/O and Page Management


Part II: Data Structures

With an understanding of how data lives on disk, we can now explore the data structures that make finding that data efficient. A naive approach—scanning every record to find a match—works fine for ten records but becomes catastrophic at ten million. Index structures solve this problem, trading storage space and write overhead for dramatic read performance improvements.

                    WITHOUT INDEX                    WITH B-TREE INDEX

Query: WHERE id = 42                    Query: WHERE id = 42

┌──────────────────────┐                ┌──────────────────────┐
│ Scan ALL 1,000,000   │                │ Navigate 3-4 tree    │
│ records sequentially │                │ levels to find it    │
│                      │                │                      │
│ Time: O(n)           │                │ Time: O(log n)       │
│ = 1,000,000 reads    │                │ = 3-4 page reads     │
└──────────────────────┘                └──────────────────────┘

In Part II, we’ll explore the major index structures:

  • Chapter 4 covers B-trees and B+ trees—the workhorses of database indexing
  • Chapter 5 introduces LSM trees—the write-optimized alternative
  • Chapter 6 examines hash indexes and specialized structures

Chapter 4: Indexing Structures - B-Trees and Beyond

Chapter 5: LSM Trees and Write-Optimized Structures

Chapter 6: Hash Indexes and Specialized Structures


Part III: Transactions and Concurrency

Databases would be much simpler if only one user could access them at a time. But that’s not how the real world works. Modern databases handle thousands of concurrent connections, each trying to read and write data simultaneously. Without careful management, this concurrency leads to chaos—lost updates, phantom reads, and corrupted data.

        THE CONCURRENCY CHALLENGE

    Transaction A              Transaction B
         │                          │
         ▼                          ▼
    Read balance: $100         Read balance: $100
         │                          │
         ▼                          ▼
    Add $50                    Subtract $30
    Balance = $150             Balance = $70
         │                          │
         ▼                          ▼
    Write balance              Write balance
         │                          │
         ▼                          ▼
    ┌─────────────────────────────────────┐
    │   FINAL BALANCE: $70 or $150?       │
    │   (Should be $120!)                  │
    │   This is a LOST UPDATE             │
    └─────────────────────────────────────┘

Part III explores how databases maintain order in this chaos:

  • Chapter 7 explains write-ahead logging and durability guarantees
  • Chapter 8 covers MVCC and transaction isolation levels
  • Chapter 9 examines locking strategies and concurrency control

Chapter 7: Write-Ahead Logging (WAL)

Chapter 8: MVCC and Transaction Isolation

Chapter 9: Locking and Concurrency Control


Part IV: Query Processing

So far, we’ve focused on how databases store and retrieve data at the storage engine level. But there’s a crucial layer above this: the query processor. This is where SQL—a declarative language describing what data you want—gets transformed into an execution plan describing how to get it.

                    THE QUERY PROCESSING PIPELINE

    "SELECT name FROM users WHERE age > 21"
                      │
                      ▼
              ┌───────────────┐
              │    PARSER     │  → Syntax check, build AST
              └───────────────┘
                      │
                      ▼
              ┌───────────────┐
              │   ANALYZER    │  → Semantic check, resolve names
              └───────────────┘
                      │
                      ▼
              ┌───────────────┐
              │   OPTIMIZER   │  → Find best execution plan
              └───────────────┘
                      │
                      ▼
              ┌───────────────┐
              │   EXECUTOR    │  → Run the plan, return results
              └───────────────┘
                      │
                      ▼
              [ Results ]

Part IV explores this transformation:

  • Chapter 10 covers query parsing and planning fundamentals
  • Chapter 11 dives deep into query optimization
  • Chapter 12 examines buffer pools and caching strategies

Chapter 10: Query Parsing and Planning

Chapter 11: Query Optimization

Chapter 12: Buffer Pools and Caching


Part V: Reliability and Scale

Everything we’ve discussed so far assumes a single database server running without failures. Reality is messier: servers crash, disks fail, networks partition, and eventually one machine isn’t enough to handle all your data. The final part of this book addresses these challenges.

           FROM SINGLE NODE TO DISTRIBUTED

    ┌─────────────┐          ┌─────────────────────────────────┐
    │   Single    │          │       Distributed Cluster        │
    │   Server    │    →     │  ┌─────┐  ┌─────┐  ┌─────┐      │
    │  ┌───────┐  │          │  │Node1│  │Node2│  │Node3│      │
    │  │  DB   │  │          │  │ DB  │←→│ DB  │←→│ DB  │      │
    │  └───────┘  │          │  └─────┘  └─────┘  └─────┘      │
    └─────────────┘          └─────────────────────────────────┘

    Simple, but:              Complex, but:
    - Single point of         - High availability
      failure                 - Horizontal scaling
    - Limited scale           - Geographic distribution

Part V covers reliability and scaling:

  • Chapter 13 explains recovery mechanisms and crash safety
  • Chapter 14 compares column stores and row stores
  • Chapter 15 introduces distributed databases and replication

Chapter 13: Recovery and Crash Safety

Chapter 14: Column Stores vs Row Stores

Chapter 15: Distributed Databases and Replication


Appendices

Appendix A: Glossary of Terms

Appendix B: Further Reading


Closing Thoughts

Databases are among the most complex and important pieces of software ever created. They embody decades of research in data structures, distributed systems, and systems programming. Every time you save a document, make a purchase, or post a message, database engineers’ work ensures your data is stored safely and retrieved quickly.

My hope is that this book has lifted the veil on some of that complexity. You now understand why B-trees are shaped the way they are, how MVCC enables concurrent access without locking, why write-ahead logging is essential for durability, and how query optimizers choose execution plans. This knowledge will serve you well—whether you’re debugging a slow query, designing a new system, or simply appreciating the engineering marvels that power our digital world.

Databases will continue to evolve. New hardware like NVMe SSDs and persistent memory are changing the I/O assumptions that shaped classical designs. Distributed systems are becoming the norm rather than the exception. New workloads in machine learning and real-time analytics are driving innovation in storage formats and query processing. But the fundamentals—the need to organize data for efficient access, maintain consistency under concurrent access, and survive failures—remain constant.

Thank you for joining me on this journey into database internals. Now go forth and write better queries.


“The best performance improvement is the transition from the nonworking state to the working state.”

— John Ousterhout


© 2024 CloudStreet Educational Series

Chapter 1: Introduction - The Journey of a Query

“To understand how databases work, follow the data.”

When you type a SQL query and press Enter, a remarkable sequence of events unfolds. Your simple SELECT * FROM users WHERE id = 42 triggers a cascade of parsing, planning, optimization, and execution—all within milliseconds. Understanding this journey is the key to understanding databases.

Let’s trace exactly what happens.


1.1 A Simple Query’s Not-So-Simple Journey

Consider this query:

SELECT name, email FROM users WHERE id = 42;

Here’s what happens, step by step:

Step 1: The Query Arrives

Your application sends the SQL text over a network connection (or local socket) to the database server. The database’s connection handler receives the text and assigns it to a worker process or thread.

┌──────────────────┐         ┌──────────────────┐
│   Application    │   SQL   │    Database      │
│                  │ ──────► │    Server        │
│  query("SELECT.. │         │                  │
└──────────────────┘         └──────────────────┘

Step 2: Parsing

The parser takes the raw SQL text and converts it into a structured representation—an Abstract Syntax Tree (AST). This involves:

  1. Lexical analysis: Breaking the text into tokens (SELECT, name, the comma, email, FROM, etc.)
  2. Syntax analysis: Checking that tokens form valid SQL grammar
  3. Building the AST: Creating a tree structure representing the query
         SELECT Statement
              │
    ┌─────────┴─────────┐
    │                   │
 Columns              FROM
 ┌──┴──┐                │
name  email          users
                        │
                     WHERE
                        │
                   id = 42

If you write SELEKT instead of SELECT, this is where the error is caught.
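
To make the first stage concrete, here is a minimal Python sketch of lexical analysis. It is purely illustrative (no real parser is this simple, and the token categories are invented for the example), but it shows how raw SQL text becomes a stream of classified tokens before any tree is built.

import re

KEYWORDS = {"SELECT", "FROM", "WHERE"}              # tiny subset, just for the example

def tokenize(sql):
    tokens = []
    for match in re.finditer(r"\d+|\w+|\S", sql):   # numbers, words, single symbols
        text = match.group()
        if text.isdigit():
            tokens.append(("NUMBER", text))
        elif text.upper() in KEYWORDS:
            tokens.append(("KEYWORD", text.upper()))
        elif text[0].isalpha() or text[0] == "_":
            tokens.append(("IDENT", text))
        else:
            tokens.append(("SYMBOL", text))
    return tokens

print(tokenize("SELECT name, email FROM users WHERE id = 42"))
# [('KEYWORD', 'SELECT'), ('IDENT', 'name'), ('SYMBOL', ','), ('IDENT', 'email'), ...]

A real parser would then run these tokens through the SQL grammar to build the AST shown above.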

Step 3: Semantic Analysis

The analyzer (or “binder”) verifies that the query makes sense:

  • Does the users table exist?
  • Does it have name, email, and id columns?
  • Is id = 42 a valid comparison (correct types)?

It also resolves any ambiguity (which id if there are multiple tables?) and expands wildcards (SELECT * becomes SELECT col1, col2, col3...).

Step 4: Query Optimization

Now comes the magic. The optimizer takes the logical query and figures out the best way to execute it. For our simple query, it might consider:

  • Full table scan: Read every row in users, check if id = 42
  • Index lookup: If there’s an index on id, use it to jump directly to row 42

The optimizer uses statistics about the data (how many rows? how many distinct values?) to estimate the cost of each approach and picks the cheapest one.

                    OPTIMIZER DECISION

    Option A: Full Table Scan          Option B: Index Lookup
    ┌─────────────────────────┐        ┌─────────────────────────┐
    │ Read all 1,000,000 rows │        │ Look up id=42 in index  │
    │ Cost: ~1,000,000 I/Os   │        │ Read 1 row from table   │
    │                         │        │ Cost: ~3 I/Os           │
    └─────────────────────────┘        └─────────────────────────┘
                                                  │
                                                  ▼
                                            WINNER!

The output is an execution plan—a recipe for how to retrieve the data.
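
The decision above is cost-based. The toy model below (invented numbers, nothing like a real optimizer's statistics machinery) shows the shape of the comparison: estimate the page reads each candidate plan needs, then pick the cheaper one.

# Toy cost model: compare estimated page reads for two candidate plans.
def full_scan_cost(total_rows, rows_per_page=100):
    return total_rows // rows_per_page        # must read every data page

def index_lookup_cost(btree_levels=3):
    return btree_levels + 1                   # walk the index, then fetch one data page

rows = 1_000_000
scan, lookup = full_scan_cost(rows), index_lookup_cost()
plan = "Index Lookup" if lookup < scan else "Full Table Scan"
print(f"full scan ~{scan} page reads, index lookup ~{lookup} page reads -> {plan}")
# full scan ~10000 page reads, index lookup ~4 page reads -> Index Lookup

Real optimizers also weigh sequential versus random I/O, CPU cost, and selectivity estimates derived from table statistics.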

Step 5: Execution

The executor runs the plan. For an index lookup:

  1. Navigate the B-tree index on id to find the entry for 42
  2. Follow the pointer to the actual row in the data file
  3. Read the row from disk (or buffer pool if cached)
  4. Extract the name and email columns
  5. Return the result to the client

But wait—what if that data page isn’t in memory?

Step 6: Buffer Pool Interaction

The executor doesn’t read directly from disk. Instead, it asks the buffer pool (also called buffer cache or page cache) for the needed pages. The buffer pool:

  1. Checks if the page is already in memory
  2. If yes, returns it immediately
  3. If no, reads it from disk, caches it, then returns it
┌─────────────────────────────────────────────────────────┐
│                     BUFFER POOL                          │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐          │
│  │Page 1│ │Page 7│ │Page 3│ │Page 42│ │Page 9│  ...     │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘          │
│                         ▲                                │
│                         │ HIT!                           │
│          Request: "I need page 42"                       │
└─────────────────────────────────────────────────────────┘
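
In code, the buffer pool is essentially a map from page numbers to cached page images, consulted before any disk read. The sketch below uses assumed names and omits eviction, pinning, and dirty-page tracking; it only shows the hit/miss logic.

import os

PAGE_SIZE = 8192

class BufferPool:
    def __init__(self, fd):
        self.fd = fd
        self.pages = {}                        # page_no -> bytes (real pools use fixed frames + eviction)

    def get_page(self, page_no):
        if page_no in self.pages:              # HIT: no disk I/O at all
            return self.pages[page_no]
        data = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)   # MISS: read from disk
        self.pages[page_no] = data             # cache it for the next request
        return data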

Step 7: Return Results

Finally, the result row is sent back to your application. The entire process might take under a millisecond if data is cached, or a few milliseconds if disk I/O is needed.


1.2 Why This Matters

Understanding this journey helps you:

Debug slow queries: Is the optimizer picking a bad plan? Is the buffer pool too small? Is there no suitable index?

Design better schemas: Knowing how indexes work helps you create the right ones.

Tune performance: Understanding the buffer pool helps you allocate memory effectively.

Choose the right database: Different databases make different trade-offs at each step.


1.3 The Write Path: What Happens on INSERT?

Reading is only half the story. Let’s trace an INSERT:

INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');

Step 1: Same Parsing and Planning

The query goes through parsing and planning, just like SELECT.

Step 2: Transaction Start

If not already in a transaction, an implicit one begins. The database assigns a transaction ID for tracking.

Step 3: Write-Ahead Log (WAL)

Before modifying any data, the database writes a log record describing the change to the Write-Ahead Log. This is crucial for crash recovery.

┌─────────────────────────────────────────────────────────┐
│                    WAL (on disk)                         │
│  ┌─────────────────────────────────────────────────────┐│
│  │ LSN: 1001 | TXN: 42 | INSERT users | ('Alice',...)  ││
│  └─────────────────────────────────────────────────────┘│
│                         ▲                                │
│                         │                                │
│                  Write log FIRST                         │
└─────────────────────────────────────────────────────────┘

Step 4: Modify Data Page

Now the actual data page is modified—but only in the buffer pool (memory). The page becomes “dirty” (modified but not yet written to disk).

┌─────────────────────────────────────────────────────────┐
│                     BUFFER POOL                          │
│  ┌──────────────────────────────────────┐               │
│  │ Page 7 (users table)                 │ ← DIRTY       │
│  │ ┌────────────────────────────────┐   │               │
│  │ │ ... existing rows ...          │   │               │
│  │ │ NEW: Alice, alice@example.com  │   │               │
│  │ └────────────────────────────────┘   │               │
│  └──────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────┘

Step 5: Update Indexes

If there are indexes on users, they must be updated too. Each index gets a new entry pointing to the new row.

Step 6: Commit

When the transaction commits:

  1. A commit record is written to the WAL
  2. The WAL is flushed to disk (this is the durability guarantee!)
  3. The transaction is now permanent

Note: The actual data pages may still be dirty in memory. They’ll be written to disk later by a background process. But because the WAL has the changes, we can recover them after a crash.
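
The ordering in steps 1-2 is the whole trick: nothing is acknowledged to the client until the commit record is safely on disk. A schematic sketch (assumed record format, not a real WAL encoding):

import json, os

def commit(wal_file, txn_id, changes):
    record = {"txn": txn_id, "changes": changes, "commit": True}
    wal_file.write(json.dumps(record) + "\n")   # 1. append the commit record to the WAL
    wal_file.flush()                            #    push it out of the process buffer
    os.fsync(wal_file.fileno())                 # 2. force it to stable storage (durability point)
    return "committed"                          # 3. only now can we tell the client it succeeded
    # The modified data pages can stay dirty in memory; recovery can redo them from the WAL.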


1.4 The Key Abstractions

Throughout this book, we’ll explore several key abstractions that make databases work:

Pages

The fundamental unit of storage. Everything—data, indexes, metadata—is stored in fixed-size pages (typically 4KB, 8KB, or 16KB).

┌─────────────────────────────────────────────┐
│              PAGE (8KB typical)              │
├─────────────────────────────────────────────┤
│ Page Header                                  │
│   - Page ID                                  │
│   - Free space pointer                       │
│   - Checksum                                 │
├─────────────────────────────────────────────┤
│                                             │
│ Row 1: id=1, name='Bob', email='...'        │
│ Row 2: id=2, name='Carol', email='...'      │
│ Row 3: id=3, name='Dave', email='...'       │
│ ...                                         │
│                                             │
├─────────────────────────────────────────────┤
│ [Free Space]                                │
└─────────────────────────────────────────────┘

Indexes

Data structures (usually B-trees) that speed up lookups. An index on id lets you find id=42 without scanning all rows.

Buffer Pool

The memory cache for pages. Most database memory is dedicated to keeping frequently-used pages in RAM.

Write-Ahead Log

An append-only file recording all changes. Enables crash recovery and replication.

Transactions

A way to group operations so they all succeed or all fail together. Provides ACID guarantees.

Locks

Mechanisms to prevent concurrent transactions from interfering with each other.


1.5 The Storage Engine

At the heart of every database is the storage engine—the component responsible for storing, retrieving, and managing data on disk. Some databases have a single storage engine; others let you choose.

Examples:

  • PostgreSQL: Has one integrated storage engine (with pluggable extensions)
  • MySQL: Pluggable engines—InnoDB (default), MyISAM, Memory, etc.
  • SQLite: Single embedded engine optimized for simplicity
  • MongoDB: WiredTiger (default), previously MMAPv1

The storage engine handles:

  • Data file format and organization
  • Index structures and maintenance
  • Buffer pool management
  • Crash recovery
  • Concurrency control

Everything else—SQL parsing, query optimization, networking—sits above the storage engine.

┌───────────────────────────────────────────────────────────┐
│                      SQL Layer                             │
│        Parser │ Optimizer │ Executor │ Network            │
├───────────────────────────────────────────────────────────┤
│                   Storage Engine                           │
│    Buffer Pool │ B-trees │ WAL │ Transactions │ Recovery  │
├───────────────────────────────────────────────────────────┤
│                   Operating System                         │
│              File System │ Disk I/O │ Memory              │
├───────────────────────────────────────────────────────────┤
│                      Hardware                              │
│              SSD/HDD │ RAM │ CPU                          │
└───────────────────────────────────────────────────────────┘

1.6 What We’ll Cover

This book explores each layer of this stack in detail:

Part I (Chapters 1-3): Foundations—how data is physically stored and managed

Part II (Chapters 4-6): Indexing—the data structures that make queries fast

Part III (Chapters 7-9): Transactions—how databases handle concurrency and guarantee durability

Part IV (Chapters 10-12): Query Processing—how SQL becomes execution plans

Part V (Chapters 13-15): Reliability and Scale—crash recovery, storage architectures, and distributed systems


1.7 A Note on Examples

Throughout this book, we’ll use PostgreSQL as our primary reference implementation. It’s open source, well-documented, and represents a mature, full-featured relational database. However, the concepts apply broadly:

  • B-trees work the same way in PostgreSQL, MySQL, Oracle, and SQL Server
  • WAL (Write-Ahead Logging) is used by virtually all modern databases
  • MVCC principles are similar across implementations (though details differ)
  • Buffer pool management uses similar algorithms everywhere

Where implementations differ significantly, we’ll note it.


Summary

  • A SQL query goes through parsing, optimization, and execution
  • The optimizer chooses the best execution plan based on statistics
  • The buffer pool caches pages to avoid disk I/O
  • Writes go to the WAL first, then to data pages
  • The storage engine is the heart of the database
  • Understanding this flow helps you debug and optimize

What’s Next

In Chapter 2, we’ll dive deep into storage engines and file formats—how databases actually organize bits on disk. We’ll see why page size matters, how rows are packed into pages, and the trade-offs different formats make.


“The query optimizer is the second-best place to put smarts in a database. The first is in the schema design.”

Chapter 2: Storage Engines and File Formats

“The way you store data determines everything about how you can access it.”

A storage engine is the component of a database that handles how data is stored on disk and retrieved. It’s the foundation upon which everything else is built. In this chapter, we’ll explore how storage engines organize data, the file formats they use, and the trade-offs they make.


2.1 What Is a Storage Engine?

A storage engine is responsible for:

  • Writing data to persistent storage (disk, SSD)
  • Reading data back when requested
  • Managing space as data grows and shrinks
  • Maintaining indexes for fast lookups
  • Ensuring durability through logging and recovery
  • Handling concurrency when multiple operations occur simultaneously

Different storage engines make different trade-offs. Some optimize for read-heavy workloads, others for writes. Some prioritize simplicity, others flexibility. Understanding these trade-offs helps you choose the right database for your needs.


2.2 The Fundamental Trade-off: Read vs Write Optimization

Every storage engine faces a fundamental trade-off:

Optimize for reads → Keep data sorted and organized, making lookups fast but updates expensive (need to maintain order)

Optimize for writes → Append new data quickly, defer organization for later (reads may be slower)

              THE READ-WRITE SPECTRUM

  ← Read Optimized              Write Optimized →

  ┌─────────────────────────────────────────────┐
  │                                             │
  │   B-tree          Hybrid          LSM-tree  │
  │   (PostgreSQL,    (some           (RocksDB, │
  │    MySQL InnoDB)   configurations) Cassandra)│
  │                                             │
  └─────────────────────────────────────────────┘

  - Fast reads         Mixed          - Fast writes
  - Slower writes      workloads      - Slower reads
  - More I/O on write                 - Background work
  - Data always sorted                - Eventual sorting

We’ll explore both approaches in detail in later chapters. For now, understand that this trade-off shapes every design decision.


2.3 Data Files and Page Structure

Most databases store data in files, organized as a sequence of fixed-size pages (also called blocks). Typical page sizes are 4KB, 8KB, or 16KB.

Why Fixed-Size Pages?

  1. Alignment with disk I/O: Disks read in fixed-size blocks (usually 512B or 4KB). Matching page size to disk block size avoids partial reads.

  2. Simpler memory management: The buffer pool can allocate a fixed number of equal-sized slots.

  3. Predictable addressing: Page N is at byte offset N * page_size in the file. No need for an index of page locations.

                    DATA FILE LAYOUT

    File offset:  0      8KB    16KB    24KB    32KB
                  │       │       │       │       │
                  ▼       ▼       ▼       ▼       ▼
    ┌─────────┬─────────┬─────────┬─────────┬─────────┐
    │ Page 0  │ Page 1  │ Page 2  │ Page 3  │ Page 4  │ ...
    │ (header)│ (data)  │ (data)  │ (data)  │ (index) │
    └─────────┴─────────┴─────────┴─────────┴─────────┘

    To read Page 3: seek to 24KB, read 8KB
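
Because the arithmetic is this simple, reading any page is two operations: a seek and a fixed-size read. A minimal sketch (the file name is hypothetical):

PAGE_SIZE = 8 * 1024                     # 8KB pages, as in the layout above

def read_page(f, page_no):
    f.seek(page_no * PAGE_SIZE)          # page N starts at byte offset N * page_size
    return f.read(PAGE_SIZE)

with open("table.data", "rb") as f:      # hypothetical data file
    page3 = read_page(f, 3)              # seek to 24,576, read 8,192 bytes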

Page Types

Different pages serve different purposes:

  • Data pages: Store actual table rows
  • Index pages: Store index entries (B-tree nodes)
  • Free space map pages: Track which pages have room for new data
  • Metadata pages: Store table definitions, statistics
  • Overflow pages: Handle data too large for one page

2.4 Heap Files: The Simplest Organization

The simplest way to store table data is as a heap file—pages containing rows in no particular order. New rows are inserted wherever there’s space.

                    HEAP FILE ORGANIZATION

    ┌──────────────────────────────────────────────────┐
    │ Page 0                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=47, name='Zoe', email='...'          │ │
    │ │ Row: id=3,  name='Carol', email='...'        │ │
    │ │ Row: id=91, name='Alice', email='...'        │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────────┐
    │ Page 1                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=12, name='Bob', email='...'          │ │
    │ │ Row: id=7,  name='Eve', email='...'          │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘

    Note: Rows are NOT sorted by id!

Advantages:

  • Simple to implement
  • Fast inserts (just find space and write)
  • No overhead of maintaining order

Disadvantages:

  • Finding a specific row requires scanning all pages (unless you have an index)
  • Deleted rows leave holes (fragmentation)

Who uses heap files?

  • PostgreSQL stores table data in heap files (indexes provide fast access)
  • MySQL InnoDB uses clustered indexes instead (data sorted by primary key)

2.5 Row Format: How Rows Are Stored in Pages

Within a page, how are individual rows laid out? Most databases use one of two approaches:

Slotted Pages

The most common format. Each page has:

  • A header with metadata
  • A slot array pointing to rows
  • Rows stored from the end of the page backward
  • Free space in the middle
                    SLOTTED PAGE FORMAT

    ┌────────────────────────────────────────────────────┐
    │ PAGE HEADER                                        │
    │   Page ID: 42                                      │
    │   Slot count: 3                                    │
    │   Free space start: 2048                           │
    │   Free space end: 5120                             │
    ├────────────────────────────────────────────────────┤
    │ SLOT ARRAY                                         │
    │   Slot 0: offset=7680, length=256                  │
    │   Slot 1: offset=7168, length=512                  │
    │   Slot 2: offset=5120, length=2048                 │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │           [FREE SPACE]                             │
    │                                                    │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │   Row 2 (at offset 5120, 2048 bytes)              │
    │   ┌────────────────────────────────────────────┐  │
    │   │ id=91 │ name='Alice' │ bio='Long text...' │  │
    │   └────────────────────────────────────────────┘  │
    │                                                    │
    │   Row 1 (at offset 7168, 512 bytes)               │
    │   ┌───────────────────────────────────┐           │
    │   │ id=3 │ name='Carol' │ bio='...'   │           │
    │   └───────────────────────────────────┘           │
    │                                                    │
    │   Row 0 (at offset 7680, 256 bytes)               │
    │   ┌─────────────────────────────┐                 │
    │   │ id=47 │ name='Zoe' │ bio='' │                 │
    │   └─────────────────────────────┘                 │
    │                                                    │
    └────────────────────────────────────────────────────┘

Why grow rows backward?

Free space stays contiguous. The slot array grows forward, rows grow backward, and they meet in the middle. No fragmentation until the page is full.

Why use a slot array?

  • Rows can be different sizes
  • Rows can be moved within the page (for compaction) without changing their external address—just update the slot
  • Indexes can reference rows as (page_id, slot_number), which remains stable
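
The sketch below builds one slotted page in memory following the layout just described: header, forward-growing slot array, backward-growing rows. The field widths and header format are invented for illustration; every engine defines its own.

import struct

PAGE_SIZE = 8192
HEADER = struct.Struct("<IHHH")              # page_id, slot_count, free_start, free_end
SLOT = struct.Struct("<HH")                  # offset, length

def build_page(page_id, rows):
    page = bytearray(PAGE_SIZE)
    slots, free_end = [], PAGE_SIZE
    for row in rows:                         # rows grow backward from the end of the page
        free_end -= len(row)
        page[free_end:free_end + len(row)] = row
        slots.append((free_end, len(row)))
    free_start = HEADER.size + SLOT.size * len(slots)    # slot array grows forward after the header
    HEADER.pack_into(page, 0, page_id, len(slots), free_start, free_end)
    for i, (offset, length) in enumerate(slots):
        SLOT.pack_into(page, HEADER.size + SLOT.size * i, offset, length)
    return bytes(page)

page = build_page(42, [b"id=47|Zoe", b"id=3|Carol", b"id=91|Alice"])

Row 0 can now be found through slot 0, regardless of where compaction later moves the bytes inside the page.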

Identifying Rows: TIDs and ROWIDs

Databases need a way to reference specific rows. Common approaches:

  • PostgreSQL TID (Tuple ID): (page_number, slot_number) — e.g., (42, 3)
  • Oracle ROWID: Encodes data file, block, and slot
  • MySQL InnoDB: Uses the primary key as the row identifier
-- PostgreSQL: You can see the TID of each row
SELECT ctid, * FROM users;

   ctid   | id | name
----------+----+-------
 (0,1)    |  1 | Alice
 (0,2)    |  2 | Bob
 (1,1)    |  3 | Carol

2.6 Row Storage Formats

How is the actual row data encoded? Several formats exist:

N-ary Storage Model (NSM) — Row-Oriented

Store all columns of a row together. This is the traditional approach.

Row format: [header][col1][col2][col3][col4]...

┌────────────────────────────────────────────────────────┐
│ Row 1: │header│ id=1 │ name='Alice' │ email='a@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 2: │header│ id=2 │ name='Bob'   │ email='b@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 3: │header│ id=3 │ name='Carol' │ email='c@b.com' │
└────────────────────────────────────────────────────────┘

Row header typically contains:

  • Row length
  • Null bitmap (which columns are NULL)
  • Transaction visibility info (for MVCC)

Handling Variable-Length Data

Fixed-length columns (INTEGER, CHAR(10)) are easy—always the same size. Variable-length columns (VARCHAR, TEXT) need special handling:

Approach 1: Length prefix

│ 5 │ A │ l │ i │ c │ e │   (5 bytes: "Alice")

Approach 2: Null-terminated

│ A │ l │ i │ c │ e │ \0 │  (terminated by null byte)

Approach 3: Offset array Store offsets to each variable-length column in the header:

[header: offset1=10, offset2=15][fixed cols][var1][var2]

Handling NULL Values

NULL values need representation without wasting space:

Null bitmap: A bit array in the row header. Bit N = 1 means column N is NULL.

Row header: null_bitmap = 0b00100000  (column 3 is NULL)

This is space-efficient—one bit per column regardless of column size.
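
A sketch of the idea in code (bit numbering is illustrative; engines differ on byte and bit order):

def make_null_bitmap(row):
    bitmap = 0
    for i, value in enumerate(row):
        if value is None:
            bitmap |= 1 << i                 # set bit i when column i is NULL
    return bitmap

def is_null(bitmap, column_index):
    return bool((bitmap >> column_index) & 1)

row = (1, "Alice", None, "alice@example.com")
bm = make_null_bitmap(row)
print(bin(bm), is_null(bm, 2), is_null(bm, 0))   # 0b100 True False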


2.7 Large Objects: TOAST and Overflow Pages

What if a value is too large to fit in a page? A single TEXT column might contain megabytes of data.

PostgreSQL TOAST (The Oversized-Attribute Storage Technique)

PostgreSQL automatically handles large values:

  1. Compression: Try to compress the value
  2. Out-of-line storage: If still too large, store in a separate TOAST table
  3. Chunking: Split very large values across multiple TOAST pages
                    TOAST ARCHITECTURE

    Main Table Page                    TOAST Table
    ┌─────────────────┐               ┌─────────────────┐
    │ Row: id=1,      │               │ Chunk 1 of 3    │
    │   name='Alice', │  ──────────►  │ (32KB of data)  │
    │   bio=TOAST_PTR │               ├─────────────────┤
    └─────────────────┘               │ Chunk 2 of 3    │
                                      │ (32KB of data)  │
    TOAST_PTR contains:               ├─────────────────┤
    - TOAST table OID                 │ Chunk 3 of 3    │
    - Chunk ID                        │ (15KB of data)  │
    - Original size                   └─────────────────┘

The main row stays small (just a pointer), so scans that don’t need the large column remain fast.

Other Databases

  • MySQL InnoDB: Stores large values in overflow pages, keeps a prefix in-line
  • Oracle: LOB (Large Object) storage with separate segments
  • SQL Server: Stores large values in separate allocation units

2.8 File Organization Strategies

Beyond individual pages, how are entire files organized?

Single File vs Multiple Files

Single file (SQLite):

  • Simple to manage
  • Database = one file
  • Limits: harder to grow, can’t easily use multiple disks

Multiple files (PostgreSQL, MySQL):

  • Separate files for data, indexes, WAL
  • Each table might have its own file
  • Can spread across disks, easier growth

PostgreSQL File Layout

$PGDATA/
├── base/                    # Database files
│   ├── 1/                   # Database OID 1 (template1)
│   ├── 16384/              # Your database
│   │   ├── 16385           # Table file (OID 16385)
│   │   ├── 16385_fsm       # Free space map
│   │   ├── 16385_vm        # Visibility map
│   │   └── 16386           # Index file
├── global/                  # Shared tables
├── pg_wal/                  # Write-ahead log
│   ├── 000000010000000000000001
│   └── 000000010000000000000002
├── pg_xact/                 # Transaction status
└── postgresql.conf          # Configuration

Segment Files

Large tables are split into segments (typically 1GB each):

16385        # First 1GB
16385.1      # Second 1GB
16385.2      # Third 1GB
...

This avoids filesystem limitations on maximum file size and allows parallel I/O.


2.9 Pluggable Storage Engines

Some databases support multiple storage engines:

MySQL Storage Engines

-- Create table with specific engine
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
) ENGINE=InnoDB;

-- Check table's engine
SHOW TABLE STATUS LIKE 'users';

InnoDB (default):

  • ACID compliant
  • Row-level locking
  • MVCC for concurrency
  • Clustered indexes (data sorted by primary key)
  • Crash recovery via redo logs

MyISAM (legacy):

  • Table-level locking only
  • No transactions
  • Faster for read-heavy, single-user workloads
  • Full-text search (before InnoDB had it)

Memory:

  • Data in RAM only
  • Lost on restart
  • Useful for temporary tables

Embedded Storage Engines

Some applications embed a storage engine directly:

  • RocksDB: LSM-tree engine, used by MySQL (MyRocks), CockroachDB
  • LevelDB: Google’s LSM-tree implementation
  • WiredTiger: B-tree + LSM hybrid, used by MongoDB
  • SQLite: Complete embedded database

2.10 Comparing Storage Formats

Feature               PostgreSQL            MySQL InnoDB           SQLite
────────────────────  ────────────────────  ─────────────────────  ────────────────────
Page size             8KB (configurable)    16KB (configurable)    4KB (configurable)
Row format            Heap + slots          Clustered B-tree       B-tree
Large objects         TOAST                 Overflow pages         Overflow pages
Files                 Multiple              Multiple               Single
Primary key storage   Separate index        Data sorted by PK      Separate

Clustered vs Non-Clustered Organization

Heap + Index (PostgreSQL):

  • Data stored in insertion order
  • Primary key is just another index
  • Index entries point to heap location
  • PRO: Updates don’t move data
  • CON: Extra lookup through index

Clustered Index (MySQL InnoDB):

  • Data stored sorted by primary key
  • Primary key lookup is direct
  • Secondary indexes point to primary key
  • PRO: Fast primary key lookups
  • CON: Large primary keys bloat secondary indexes
        HEAP + INDEX                    CLUSTERED INDEX
        (PostgreSQL)                    (MySQL InnoDB)

    Index on id          Heap          Data IS the index
    ┌─────────┐     ┌─────────┐        ┌─────────────────┐
    │ 1 → (0,1)│     │ id=47   │        │ id=1, Alice,... │
    │ 2 → (0,2)│     │ id=3    │        │ id=2, Bob,...   │
    │ 3 → (1,1)│────►│ id=91   │        │ id=3, Carol,... │
    │ ...     │     │ ...     │        │ id=4, Dave,...  │
    └─────────┘     └─────────┘        └─────────────────┘
       │                                     │
       │ Two lookups                         │ One lookup
       └─────────────────────────────────────┘

2.11 Checksums and Data Integrity

Disks can silently corrupt data. Good storage engines detect this:

Page Checksums

Each page includes a checksum (CRC32 or similar). On read:

  1. Read page from disk
  2. Compute checksum of data
  3. Compare with stored checksum
  4. If mismatch → corruption detected!
┌────────────────────────────────────────────────────┐
│ Page Header                                        │
│   Checksum: 0xA1B2C3D4  ◄── Computed from content │
│   ...                                              │
├────────────────────────────────────────────────────┤
│ Page Content                                       │
│   ... data ...                                     │
└────────────────────────────────────────────────────┘
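
A sketch of the read-time check, using CRC32 for illustration (real engines differ in the exact algorithm and in how the checksum field itself is excluded from the calculation):

import zlib

def compute_checksum(page_body):
    return zlib.crc32(page_body) & 0xFFFFFFFF

def verify_page(stored_checksum, page_body):
    if compute_checksum(page_body) != stored_checksum:
        raise IOError("page checksum mismatch: possible on-disk corruption")

body = b"...page content..."
verify_page(compute_checksum(body), body)    # passes; flipping any byte would raise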

PostgreSQL Data Checksums

# Enable checksums when creating cluster
initdb --data-checksums -D /path/to/data

# Check if enabled
pg_controldata /path/to/data | grep checksum

Torn Page Detection

What if a crash happens mid-write, leaving a page partially written?

Full page writes: After a checkpoint, write entire page to WAL before first modification. Allows recovery of torn pages.

Double-write buffer (MySQL InnoDB): Write pages to a special buffer first, then to final location. If crash during final write, recover from double-write buffer.


2.12 Summary

Storage engines are the foundation of database systems:

  • Data is stored in fixed-size pages for efficient I/O
  • Heap files store rows in no particular order
  • Slotted pages allow variable-size rows with stable addressing
  • Row formats handle fixed and variable-length columns, NULLs
  • TOAST and overflow pages handle large values
  • Clustered indexes store data sorted by primary key
  • Checksums detect corruption

The choice of storage engine—and understanding its characteristics—is crucial for database performance.


What’s Next

In Chapter 3, we’ll zoom in on the physical layer: disk I/O, page management, and why understanding your storage hardware is crucial for database performance.


“Storage is a resource. Space is a cost. I/O is a bottleneck. Design accordingly.”

Chapter 3: Disk I/O and Page Management

“Understanding your storage hardware is half the battle of database performance.”

Databases live at the intersection of two worlds: the fast world of CPU and memory, and the slow world of persistent storage. This chapter explores that boundary—how data moves between disk and memory, why I/O is usually the bottleneck, and how databases optimize for the physical realities of storage hardware.


3.1 The Memory-Storage Performance Gap

The performance gap between memory and storage is enormous:

                    ACCESS LATENCY COMPARISON

    L1 Cache         │▌ 1 ns
    L2 Cache         │▌▌ 4 ns
    L3 Cache         │▌▌▌▌ 10 ns
    RAM              │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌ 100 ns
    NVMe SSD         │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌ 20,000 ns (20 μs)
    SATA SSD         │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌ 100,000 ns (100 μs)
    HDD (spinning)   │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌▌▌▌▌ 10,000,000 ns (10 ms)

                     └─────────────────────────────────────────────────►
                     Logarithmic scale!

Key observations:

  • RAM is ~200x faster than NVMe SSD
  • RAM is ~100,000x faster than spinning disk
  • A single HDD seek (~10 ms) lasts as long as roughly 10 million L1 cache accesses

This is why databases do everything possible to avoid disk I/O.


3.2 Understanding Storage Hardware

Hard Disk Drives (HDDs)

HDDs store data magnetically on spinning platters:

                    HDD ANATOMY

                    Spindle (motor)
                         │
            ┌────────────┴────────────┐
            │                         │
    ┌───────▼───────────────────────▼───────┐
    │  ╭─────────────────────────────────╮  │
    │ ╭│─────────────────────────────────│╮ │
    │╭─│─────────────────────────────────│─╮│
    ││ │         PLATTER                 │ ││
    │╰─│─────────────────────────────────│─╯│
    │ ╰│─────────────────────────────────│╯ │
    │  ╰─────────────────────────────────╯  │
    └───────────────────────────────────────┘
                    ▲
                    │
              Read/Write Head
              (moves in/out)

Access time components:

  1. Seek time: Move head to correct track (~5-10ms average)
  2. Rotational latency: Wait for sector to rotate under head (~4ms for 7200 RPM)
  3. Transfer time: Read the data (fast once positioned)

Key insight: Sequential reads are MUCH faster than random reads. Once the head is positioned, reading consecutive sectors is nearly free.

    Sequential read: 150+ MB/s
    Random read (4KB): ~100 IOPS = 0.4 MB/s

    That's 375x difference!

Solid State Drives (SSDs)

SSDs use flash memory—no moving parts:

                    SSD ARCHITECTURE

    ┌───────────────────────────────────────────────┐
    │                  CONTROLLER                    │
    │   (manages wear leveling, garbage collection)  │
    ├───────┬───────┬───────┬───────┬───────┬──────┤
    │ NAND  │ NAND  │ NAND  │ NAND  │ NAND  │ NAND │
    │ Chip  │ Chip  │ Chip  │ Chip  │ Chip  │ Chip │
    │   0   │   1   │   2   │   3   │   4   │   5  │
    └───────┴───────┴───────┴───────┴───────┴──────┘

SSD characteristics:

  • No seek time—any location accessed equally fast
  • Parallel access across chips
  • Write amplification (must erase blocks before writing)
  • Wear leveling (spreads writes to extend lifespan)

Performance:

  • Random read: 10,000-500,000+ IOPS
  • Sequential read: 500-7000+ MB/s (NVMe)
  • Random writes slower than reads (erase-before-write)

NVMe vs SATA

SATA SSDs: Use the same interface as HDDs. Limited by SATA bandwidth (~550 MB/s).

NVMe SSDs: Connect directly to PCIe. Much higher bandwidth (3000-7000+ MB/s) and lower latency.

    Interface Comparison:

    SATA SSD:  CPU ──► SATA Controller ──► SSD
               Latency: ~100 μs
               Bandwidth: 550 MB/s

    NVMe SSD:  CPU ══════════════════════► SSD
               Latency: ~20 μs
               Bandwidth: 3000+ MB/s

3.3 I/O Patterns and Database Design

Database design is heavily influenced by I/O patterns:

Sequential vs Random I/O

            SEQUENTIAL I/O                    RANDOM I/O

    Read pages: 1, 2, 3, 4, 5, 6         Read pages: 47, 3, 891, 12, 456

    ┌──┬──┬──┬──┬──┬──┬──┬──┐            ┌──┬──┬──┬──┬──┬──┬──┬──┐
    │▓▓│▓▓│▓▓│▓▓│▓▓│▓▓│  │  │            │  │  │▓▓│  │  │  │  │  │ ...
    └──┴──┴──┴──┴──┴──┴──┴──┘            └──┴──┴──┴──┴──┴──┴──┴──┘
       ──────────────────►                  ↑    ↑         ↑
       One seek, continuous read            Multiple seeks required

    HDD: ~150 MB/s                        HDD: ~0.4 MB/s (100 IOPS × 4KB)
    SSD: ~3000 MB/s                       SSD: ~400 MB/s (100K IOPS × 4KB)

Design implications:

  • Log-structured storage (WAL, LSM trees) uses sequential writes
  • B-trees minimize random I/O through high fanout
  • Table scans (sequential) can be faster than index scans (random) for large result sets

Read Amplification vs Write Amplification

Read amplification: Reading more data than needed to answer a query

  • Example: Reading entire 8KB page to get one 100-byte row

Write amplification: Writing more data than the logical change

  • Example: Changing one byte requires rewriting entire 4KB flash block
  • Example: B-tree page splits propagate up the tree

Different storage engines trade off between these:

                READ AMPLIFICATION    WRITE AMPLIFICATION

    B-tree              Low                  Medium
    LSM-tree            High                 High
    Hash index          Very Low             Low

3.4 Page Management Fundamentals

The Page as Unit of I/O

Databases read and write in pages, not bytes:

    Application wants: 100-byte row
    Database reads: 8,192-byte page (8KB)

    Why? Because disk I/O works in blocks.
    Reading 100 bytes costs the same as reading 4KB (or even 8KB).

Page Addressing

Pages are identified by their offset in the file:

    Page Number → File Offset

    Page 0  →  Offset 0
    Page 1  →  Offset 8192
    Page 2  →  Offset 16384
    Page N  →  Offset N × 8192

    To read page 42:
    1. Seek to offset 42 × 8192 = 344,064
    2. Read 8192 bytes

Free Space Management

How does the database know which pages have room for new data?

Free Space Map (FSM): A bitmap or array tracking free space per page.

                    FREE SPACE MAP

    Page    Free Space    Category
    ────    ──────────    ────────
    0       8%            Nearly full
    1       45%           Half full
    2       0%            Full
    3       92%           Mostly empty
    4       100%          Empty
    ...

    INSERT needs 200 bytes → Check FSM → Use Page 3 or 4

PostgreSQL uses a hierarchical FSM:

  • Top level: Coarse-grained (ranges of pages)
  • Bottom level: Fine-grained (individual pages)

This allows quick finding of pages with sufficient space without scanning everything.
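
A toy version of the idea, as a flat per-page table (PostgreSQL's real FSM stores one byte per page and arranges the entries in a tree so the search is logarithmic; this sketch only shows the lookup):

class FreeSpaceMap:
    def __init__(self, page_size=8192):
        self.page_size = page_size
        self.free_bytes = {}                     # page_no -> bytes currently free

    def record(self, page_no, free):
        self.free_bytes[page_no] = free

    def find_page_with(self, needed):
        for page_no, free in sorted(self.free_bytes.items()):
            if free >= needed:
                return page_no                   # first page with enough room
        return None                              # nothing fits: extend the file

fsm = FreeSpaceMap()
fsm.record(0, 120); fsm.record(1, 3700); fsm.record(2, 0); fsm.record(3, 7500)
print(fsm.find_page_with(200))                   # 1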


3.5 Allocating and Extending Files

File Extension

When a table grows, the database must allocate more space:

Approach 1: Extend on demand

Need new page → Extend file by one page → Use it

Simple but causes many small extends.

Approach 2: Extend in chunks

Need new page → Extend file by 1MB (128 pages) → Use first one

Fewer, larger extends. Better sequential layout.

Segment Files

Large tables split into multiple files:

    users           # First 1GB
    users.1         # Second 1GB
    users.2         # Third 1GB

    Page 150,000 is in users.1 at offset:
    (150,000 - 131,072) × 8192 = 155,058,176

Benefits:

  • Avoids filesystem maximum file size limits
  • Allows parallel I/O to different segments
  • Easier backup of individual segments
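
The mapping from a logical page number to a segment file and an offset inside it is plain arithmetic; a sketch with the same 8KB pages and 1GB segments as the example above (the file naming is illustrative):

PAGE_SIZE = 8192
SEGMENT_BYTES = 1 << 30                          # 1GB per segment
PAGES_PER_SEGMENT = SEGMENT_BYTES // PAGE_SIZE   # 131,072 pages

def locate(page_no, base_name="users"):
    segment, page_in_segment = divmod(page_no, PAGES_PER_SEGMENT)
    filename = base_name if segment == 0 else f"{base_name}.{segment}"
    return filename, page_in_segment * PAGE_SIZE

print(locate(150_000))                           # ('users.1', 155058176)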

3.6 Direct I/O vs Buffered I/O

Operating systems cache file data in the page cache. Databases have a choice:

Buffered I/O (using OS page cache)

    Application ──► Database Buffer Pool ──► OS Page Cache ──► Disk

    Read path:
    1. Check database buffer pool
    2. If miss, read from OS (might be in OS cache)
    3. If miss, read from disk
    4. Data cached in BOTH locations

Problem: Double buffering wastes memory. Data may be cached twice.

Direct I/O (O_DIRECT, bypassing OS cache)

    Application ──► Database Buffer Pool ──────────────────► Disk

    Read path:
    1. Check database buffer pool
    2. If miss, read directly from disk
    3. Data cached only in database buffer pool

PostgreSQL: Uses buffered I/O by default. The OS provides read-ahead and write-behind.

MySQL InnoDB: Uses O_DIRECT by default. Takes full control of caching.

SQLite: Uses buffered I/O. Relies on OS for caching.

    When to use Direct I/O:
    + Database has sophisticated buffer management
    + Want predictable memory usage
    + Large buffer pool relative to OS memory

    When to use Buffered I/O:
    + Simpler implementation
    + Benefit from OS read-ahead for sequential scans
    + Small datasets that fit in OS cache

3.7 Read-Ahead and Prefetching

For sequential access, reading ahead improves performance:

OS Read-Ahead

The OS detects sequential access patterns and prefetches upcoming pages:

    Application reads: Page 1
    OS notices sequential pattern
    OS prefetches: Pages 2, 3, 4, 5, 6, 7, 8, ...

    When application reads Page 2: Already in memory!

Database-Level Prefetching

Databases can be smarter because they know the query plan:

    SELECT * FROM large_table;  -- Full table scan

The database knows it will need all pages, so it can issue large sequential reads.

    -- PostgreSQL effective_io_concurrency
    SET effective_io_concurrency = 200;  -- For SSDs

    -- Tells planner to issue multiple concurrent I/O requests

Bitmap Heap Scans

When an index scan finds many rows across many pages:

    Index scan finds rows on pages: 5, 7, 12, 15, 23, 24, 25, 31, ...

    Naive approach: Read pages in that order (random I/O)

    Bitmap approach:
    1. Build bitmap of pages needed: {5, 7, 12, 15, 23, 24, 25, 31}
    2. Sort by page number
    3. Read in sorted order: 5, 7, 12, 15, 23, 24, 25, 31
    4. Less seeking, benefits from read-ahead
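
The core of the trick is just deduplicating and sorting the page numbers before touching the heap. A sketch (the tuple format is assumed):

def pages_in_read_order(index_matches):
    # index_matches: (page, slot) pairs in whatever order the index produced them
    return sorted({page for page, _slot in index_matches})

matches = [(23, 4), (5, 1), (31, 2), (23, 9), (7, 3), (12, 0)]
print(pages_in_read_order(matches))              # [5, 7, 12, 23, 31]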

3.8 Write Strategies

Writes have different requirements than reads:

Synchronous Writes (fsync)

For durability, writes must actually reach disk:

    write(fd, data, size);   // Write to OS buffer
    fsync(fd);               // Force to disk - SLOW!

The fsync ensures data survives power failure. Without it, data might be in OS buffers only.

Write-Back Caching

Most writes go to buffer pool first, written to disk later:

    1. Modify page in buffer pool (fast)
    2. Mark page "dirty"
    3. Page stays in memory
    4. Background writer flushes dirty pages to disk
    5. Checkpoint ensures all changes are durable

Why delay writes?

  • Amortize I/O: Multiple changes to same page = one write
  • Batch writes: Write many pages in one I/O operation
  • Sequential writes: Background writer can sort pages

Write Ordering

Some writes must happen before others:

    WAL write MUST complete before data page write
    (Otherwise crash recovery breaks)

    Checkpoint MUST complete before WAL truncation
    (Otherwise we lose recovery data)

Databases carefully manage write ordering for correctness.


3.9 I/O Scheduling

Operating System I/O Schedulers

Linux I/O schedulers reorder requests for efficiency:

  • CFQ (Completely Fair Queuing): Balances I/O between processes
  • Deadline: Prevents starvation, good for databases
  • NOOP: No reordering, best for SSDs (they have internal schedulers)
  • mq-deadline/none: The equivalents on modern kernels with blk-mq

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set scheduler (example)
echo deadline > /sys/block/sda/queue/scheduler

Database I/O Scheduling

Databases also prioritize I/O:

    Priority 1: Transaction commit (fsync WAL) - User waiting!
    Priority 2: Active query I/O - User waiting!
    Priority 3: Background writer - No one waiting
    Priority 4: Vacuum/maintenance - Can wait

3.10 Filesystem Considerations

Filesystem Choice

Different filesystems have different characteristics:

Filesystem   Characteristics                         Database Use
──────────   ─────────────────────────────────────   ────────────────────
ext4         Mature, good all-around                 Common on Linux
XFS          Better for large files, parallel I/O    Good for databases
ZFS          Checksums, snapshots, compression       Data integrity focus
btrfs        Copy-on-write, snapshots                Use with caution

Important Mount Options

# Example for database filesystem
/dev/sda1 /data xfs noatime,nodiratime,nobarrier 0 0

noatime: Don’t update access time on read (reduces writes)
nobarrier: Disable write barriers (only if using battery-backed cache!)

File System Full

Databases need to handle disk full gracefully:

    1. Reserve some space for emergencies
    2. Monitor disk usage, alert before full
    3. Read-only mode when nearly full (allow queries, block writes)
    4. Graceful degradation, not crash

3.11 Page Size Trade-offs

The choice of page size affects performance:

Smaller Pages (4KB)

Advantages:

  • Less read amplification (read only what you need)
  • Better for random access patterns
  • Lower memory waste for small rows

Disadvantages:

  • More pages to manage (more metadata overhead)
  • More I/O operations for large scans
  • Higher B-tree depth (more tree levels)

Larger Pages (16KB, 32KB)

Advantages:

  • Fewer I/O operations for sequential scans
  • Lower B-tree depth (fewer tree levels)
  • Better compression ratios

Disadvantages:

  • More read amplification
  • Higher memory usage per cached page
  • More write amplification
    Page Size Guidelines:

    4KB:  Good for OLTP with small rows, random access
    8KB:  Good balance (PostgreSQL default)
    16KB: Good for larger rows, sequential scans (InnoDB default)
    32KB: Specialized analytics workloads

3.12 Practical Performance Implications

Symptoms of I/O Bottlenecks

    High disk wait time (iowait)
    Low cache hit ratio
    Slow queries despite good execution plans
    Performance varies with time of day (cache cold vs warm)

Monitoring I/O

# Linux: iostat
iostat -x 1
# Watch: await (latency), %util (utilization)

-- PostgreSQL: pg_stat_io (v16+)
SELECT * FROM pg_stat_io;

-- PostgreSQL: buffer cache hit ratio
SELECT
    sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) as cache_hit_ratio
FROM pg_stat_database;

Quick Wins

  1. Add more RAM: Larger buffer pool = fewer disk reads
  2. Use SSDs: 100x better random I/O than HDDs
  3. Optimize queries: Read fewer pages in the first place
  4. Add indexes: Turn random scans into index lookups
  5. Tune checkpoints: Spread writes over time

3.13 Summary

Disk I/O is usually the database bottleneck:

  • HDDs are good at sequential I/O, terrible at random I/O
  • SSDs are much better at random I/O but still slower than RAM
  • Databases use page-based I/O to amortize access costs
  • Buffer pools cache pages in memory to avoid disk access
  • Free space maps track available space for inserts
  • Direct I/O vs buffered I/O is a trade-off
  • Write ordering is critical for correctness
  • Page size choice affects performance characteristics

Understanding I/O helps you diagnose performance problems and make informed decisions about hardware and configuration.


What’s Next

In Chapter 4, we’ll explore the data structures that make finding data fast—B-trees and their variants. You’ll see how index structures are designed to minimize the I/O we’ve discussed in this chapter.


“The fastest I/O is no I/O. The second fastest is sequential I/O. Random I/O is where dreams go to die.”

Chapter 4: Indexing Structures - B-Trees and Beyond

“Show me your indexes, and I’ll show you your performance.”

Without indexes, every query would require scanning every row. Indexes are the data structures that make databases fast. In this chapter, we’ll explore the most important index structure in database history: the B-tree.


4.1 Why We Need Indexes

Consider a table with 10 million users:

SELECT * FROM users WHERE email = 'alice@example.com';

Without an index: Scan all 10 million rows, checking each email. Even at 100,000 rows/second, this takes 100 seconds.

With an index on email: Look up in index (~3 I/O operations), fetch the row. Takes milliseconds.

    Table Scan:         Index Lookup:

    O(n) = 10,000,000   O(log n) = ~23 (base-2) or ~4 (B-tree)

    100 seconds         5 milliseconds

Indexes trade space (storing extra data structures) and write performance (maintaining indexes on INSERT/UPDATE/DELETE) for dramatically faster reads.


4.2 Index Basics

An index is a separate data structure that maps values to row locations:

                    CONCEPTUAL INDEX

    Index on email                          Table
    ┌─────────────────────┬──────┐         ┌────────────────────────┐
    │ alice@example.com   │ → ───│─────────│ id=42, Alice, alice@.. │
    │ bob@example.com     │ → ───│─────────│ id=17, Bob, bob@...    │
    │ carol@example.com   │ → ───│─────────│ id=91, Carol, carol@.. │
    │ ...                 │      │         │ ...                    │
    └─────────────────────┴──────┘         └────────────────────────┘

Key properties:

  • Keys: The values being indexed (email addresses)
  • Values: Pointers to the actual rows (page + slot, or primary key)
  • Sorted: Keys are kept in sorted order for efficient searching

4.3 From Binary Trees to B-Trees

Why Not Binary Search Trees?

A binary search tree (BST) seems like a natural choice:

                    BINARY SEARCH TREE

                         50
                        /  \
                      25    75
                     / \    / \
                   12  37  62  87
                   /\  /\  /\  /\
                  ...

Problem: With millions of keys, the tree becomes very deep.

  • 1 million keys → ~20 levels
  • Each level = 1 random I/O
  • 20 random I/Os per lookup = unacceptable

The B-Tree Insight

Rudolf Bayer and Ed McCreight invented B-trees in 1972 with a key insight: Increase node fanout to reduce tree height.

Instead of 2 children per node, have hundreds or thousands:

                    B-TREE (conceptual)

                    ┌─────────────────────────┐
       Level 0:     │ 100 │ 200 │ 300 │ 400  │    (1 node)
                    └───┬───┬───┬───┬───┬───┘
                       /    |   |   |    \
       Level 1:    ┌──┴──┐ ... ... ...  ┌──┴──┐   (~500 nodes)
                   │ 1-20│              │480-500│
                   └──┬──┘              └───┬──┘
                     / \                   / \
       Level 2:   ┌─┴─┐                 ┌─┴─┐     (~250,000 nodes)
                  │...│                 │...│
                  └───┘                 └───┘

With a fanout of 500:

  • Level 0 (root): 1 node
  • Level 1: 500 nodes
  • Level 2 (leaves): 250,000 nodes × ~500 keys each ≈ 125,000,000 keys

Three levels can index 125 million keys!


4.4 B-Tree Structure

A B-tree of order m has these properties:

  1. Each node has at most m children
  2. Each internal node (except root) has at least ⌈m/2⌉ children
  3. The root has at least 2 children (if not a leaf)
  4. All leaves are at the same depth
  5. A node with k children contains k-1 keys

Node Structure

                    B-TREE NODE

    ┌─────────────────────────────────────────────────────────┐
    │  Header: node_type=internal, key_count=4, ...           │
    ├─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬────────┤
    │ P0  │ K1  │ P1  │ K2  │ P2  │ K3  │ P3  │ K4  │ P4     │
    └──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬──┴─────┴──┬─────┘
       │           │           │           │           │
       ▼           ▼           ▼           ▼           ▼
    Keys <K1    K1≤Keys<K2  K2≤Keys<K3  K3≤Keys<K4   Keys≥K4
  • Pn: Pointer to child page (for internal nodes) or row (for leaves)
  • Kn: Key values
  • Keys are sorted: K1 < K2 < K3 < K4

Fanout Calculation

How many keys fit in one B-tree page?

    Page size: 8192 bytes
    Header: 24 bytes
    Key size: 8 bytes (bigint)
    Pointer size: 6 bytes (page + offset)

    Available: 8192 - 24 = 8168 bytes
    Entry size: 8 + 6 = 14 bytes
    Entries per page: 8168 / 14 ≈ 583

    Fanout ≈ 583

With fanout of 583:

  • 4 levels can index: 583^4 = 115 billion keys
  • Most real indexes are 3-4 levels deep
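
To make the arithmetic concrete, here is a small Python sketch that reproduces the fanout estimate above for a few key sizes. The page size, header size, and pointer size are illustrative assumptions, not constants of any particular engine.

def btree_fanout(page_size=8192, header=24, key_size=8, pointer_size=6):
    """Approximate number of children per B-tree page."""
    usable = page_size - header
    return usable // (key_size + pointer_size)

def keys_indexed(fanout, levels):
    """Approximate number of keys reachable with the given number of levels."""
    return fanout ** levels

for key_size in (8, 16, 255):              # bigint, UUID, VARCHAR(255)
    f = btree_fanout(key_size=key_size)
    print(f"key={key_size:>3} bytes  fanout≈{f:>3}  "
          f"3 levels≈{keys_indexed(f, 3):,} keys  "
          f"4 levels≈{keys_indexed(f, 4):,} keys")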

4.5 B-Tree Search

Searching a B-tree is straightforward:

Search(node, key):
    if node is leaf:
        binary search for key in node
        return found/not found
    else:
        find child pointer P where K[i-1] ≤ key < K[i]
        return Search(P, key)

Example: Search for key 70

                    Root: [50, 100, 150]
                          /    |     \
                        /      |      \
    [10,20,30,40]   [60,70,80,90]   [110,120,...]
                          |
                    Find: 70 is here!

Steps:

  1. At root: 50 ≤ 70 < 100, go to the second child
  2. At that leaf: Binary search finds 70
  3. Total: 2 page reads
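
A minimal Python sketch of the same descent, using an in-memory toy tree rather than real pages (the Node class and its layout are my own simplification):

from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                      # sorted keys in this node
        self.children = children or []        # empty list means this is a leaf

def search(node, key):
    """Descend from the root, reading one node per level."""
    while node.children:                      # internal node
        i = bisect_right(node.keys, key)      # pick the child whose range covers key
        node = node.children[i]
    return key in node.keys                   # search within the leaf

# The example tree: root [50, 100, 150] with four leaves.
root = Node([50, 100, 150], [
    Node([10, 20, 30, 40]),
    Node([60, 70, 80, 90]),
    Node([110, 120, 130, 140]),
    Node([160, 170, 180, 190]),
])
print(search(root, 70))   # True  -- root plus one leaf visited
print(search(root, 75))   # False -- 75 is not there yet; we insert it in the next section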

4.6 B-Tree Insertion

Insertion is more complex because we must maintain balance:

Simple Case: Leaf Has Room

Insert 75 into:

    [60, 70, 80, 90, _, _]   (has room)
              ↓
    [60, 70, 75, 80, 90, _]  (inserted in order)

Complex Case: Page Split

When a page is full, we must split:

Insert 75 into full node:

    [60, 65, 70, 80, 85, 90]   (full!)

Step 1: Split into two nodes
    [60, 65, 70]    [80, 85, 90]

Step 2: Insert new key
    [60, 65, 70, 75]    [80, 85, 90]

Step 3: Promote middle key (75) to parent
    Parent: [..., 75, ...]
              /        \
    [60, 65, 70]    [80, 85, 90]

If the parent is also full, the split propagates upward. In the worst case, it reaches the root, creating a new root and increasing tree height by 1.

                    PAGE SPLIT CASCADE

    Before:
                    [Root: 50]
                    /         \
           [30, 40]            [70, 80] ← Insert 75 here (full!)

    After splitting leaf:
                    [Root: 50]
                    /         \
           [30, 40]            [75] ← Promote to parent
                              /    \
                         [70]      [80]

    But parent might be full too... splits propagate!
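
Below is a minimal sketch of the leaf-split step in Python, assuming a fixed capacity per node and returning the promoted separator rather than updating a real parent page. It uses the B+-tree "copy-up" convention (the separator is the first key of the new right node), which differs cosmetically from the step-by-step illustration above.

from bisect import insort

MAX_KEYS = 6    # illustrative page capacity

def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf; if it overflows, split and return (right_half, separator)."""
    insort(leaf, key)
    if len(leaf) <= MAX_KEYS:
        return None, None                     # simple case: the leaf had room
    mid = len(leaf) // 2
    right = leaf[mid:]                        # upper half moves to a new page
    del leaf[mid:]
    return right, right[0]                    # first key of the right page goes to the parent

leaf = [60, 65, 70, 80, 85, 90]               # the full leaf from the example
right, separator = insert_into_leaf(leaf, 75)
print(leaf, right, separator)                 # [60, 65, 70] [75, 80, 85, 90] 75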

4.7 B-Tree Deletion

Deletion must also maintain balance:

Simple Case: Remove Without Underflow

Delete 75:

    [60, 70, 75, 80, 90]   (has enough keys)
              ↓
    [60, 70, 80, 90, _]    (removed, still valid)

Complex Case: Underflow

When a node has too few keys after deletion:

Option 1: Borrow from sibling

    [30, 40]  parent: [50]  [_, _]  (right sibling underflow)

    Rotate through parent:
    [30]  parent: [40]  [50]

Option 2: Merge with sibling

    [30]  parent: [50]  [70]  (both minimal)

    Merge:
    [30, 50, 70]  (parent removes key, may cascade)

Merges can propagate upward, potentially reducing tree height.

Deletion in Practice

Many databases use a simpler approach:

  • Mark entries as deleted (tombstones)
  • Periodically rebuild or compact the index
  • Space is reclaimed lazily

This avoids complex rebalancing during normal operations.


4.8 B+ Trees: The Practical Variant

Most databases actually use B+ trees, a variant where:

  1. All values are in leaves: Internal nodes only have keys and child pointers
  2. Leaves are linked: Forms a doubly-linked list for range scans
  3. Keys may be duplicated: Internal keys are “separator” copies

                    B+ TREE

    Internal nodes (keys + child pointers only):
                    [50 | 100]
                   /    |    \
                 /      |      \
               /        |        \

    Leaf nodes (keys + values, linked):
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │ 10→row      │←→ │ 50→row      │←→ │ 100→row     │
    │ 20→row      │   │ 60→row      │   │ 110→row     │
    │ 30→row      │   │ 70→row      │   │ 120→row     │
    └─────────────┘   └─────────────┘   └─────────────┘
          ▲                                   ▲
          └─────── Linked list ───────────────┘

Why B+ Trees?

Advantage 1: Higher fanout in internal nodes

Internal nodes don’t store values, only keys. More keys per page = higher fanout = shallower tree.

Advantage 2: Efficient range scans

SELECT * FROM users WHERE age BETWEEN 20 AND 30;

  1. Find first leaf with age ≥ 20
  2. Follow leaf pointers until age > 30
  3. No need to traverse internal nodes again

    Range scan: Follow the chain

    [..., 18, 19] → [20, 21, 22, ...] → [28, 29, 30, 31, ...] → ...
                     ↑                         ↑
                   Start                     Stop

Advantage 3: Sequential leaf scanning

For full index scans, just read all leaves in order. Very cache-friendly.
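
Here is a short Python sketch of that range-scan pattern, assuming the leaves are already linked in key order; locating the starting leaf would normally be a root-to-leaf descent as in section 4.5. The Leaf class is a toy stand-in for a leaf page.

class Leaf:
    def __init__(self, entries):
        self.entries = entries    # sorted list of (key, row) pairs
        self.next = None          # pointer to the next leaf in key order

def range_scan(start_leaf, lo, hi):
    """Yield (key, row) pairs with lo <= key <= hi by walking the leaf chain."""
    leaf = start_leaf
    while leaf is not None:
        for key, row in leaf.entries:
            if key > hi:
                return                    # past the end of the range: stop early
            if key >= lo:
                yield key, row
        leaf = leaf.next                  # follow the sibling pointer, no re-descent

# Three linked leaves, as in the diagram above.
a = Leaf([(10, "row10"), (20, "row20"), (30, "row30")])
b = Leaf([(50, "row50"), (60, "row60"), (70, "row70")])
c = Leaf([(100, "row100"), (110, "row110")])
a.next, b.next = b, c
print(list(range_scan(a, 20, 60)))   # [(20, 'row20'), (30, 'row30'), (50, 'row50'), (60, 'row60')]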


4.9 B-Tree Variations

B* Trees

Keep nodes at least 2/3 full (instead of 1/2) by redistributing keys between siblings before splitting. Results in better space utilization.

Prefix B-Trees

Compress common key prefixes to save space (the idea is related to radix/Patricia tries):

    Keys: "application", "apple", "apply", "appreciate"

    Stored with the shared prefix factored out:
    "app" → [
        "lication" → row
        "le" → row
        "ly" → row
        "reciate" → row
    ]

Fractal Trees

Cache insertions in internal nodes, batch them to leaves. Better write performance at the cost of read performance. Used by TokuDB (later maintained by Percona).


4.10 Practical B-Tree Considerations

Key Size Matters

Large keys = lower fanout = deeper trees = more I/O

    Integer key (8 bytes):   Fanout ~500
    UUID key (16 bytes):     Fanout ~400
    VARCHAR(255) key:        Fanout ~30

Implications:

  • Prefer integer primary keys over UUIDs when possible
  • Use prefix indexes for long string columns
  • Consider hash of long keys

Fill Factor

When creating an index, you can specify how full to make pages:

-- PostgreSQL
CREATE INDEX idx ON users(email) WITH (fillfactor = 90);

-- Leave 10% empty for future inserts without immediate splits

Lower fill factor = more space overhead but fewer page splits on insert.

Index-Only Scans (Covering Indexes)

If all needed columns are in the index, we can skip reading the heap:

CREATE INDEX idx ON users(email, name);

SELECT name FROM users WHERE email = 'alice@example.com';
-- Can be satisfied from index alone!

    Index-Only Scan:

    Index: [email, name]               Table: (skip!)
    ┌────────────────────────────┐
    │ alice@example.com → Alice  │     No need to read table!
    └────────────────────────────┘

Composite (Multi-column) Indexes

Order matters! The index is sorted by first column, then second, etc.

CREATE INDEX idx ON orders(customer_id, order_date);

-- Uses index (customer_id is leftmost):
SELECT * FROM orders WHERE customer_id = 42;

-- Uses index (both columns, customer_id first):
SELECT * FROM orders WHERE customer_id = 42 AND order_date > '2024-01-01';

-- Cannot use index efficiently (order_date is not leftmost):
SELECT * FROM orders WHERE order_date > '2024-01-01';

    Composite Index Structure:

    [customer_id=1, date=2024-01-01] → row
    [customer_id=1, date=2024-01-02] → row
    [customer_id=1, date=2024-01-03] → row
    [customer_id=2, date=2024-01-01] → row
    [customer_id=2, date=2024-01-02] → row
    ...

    Sorted by customer_id first, then date within each customer

4.11 Index Maintenance and Bloat

Write Overhead

Every INSERT, UPDATE, DELETE must update all indexes:

    INSERT INTO users (id, name, email, phone) VALUES (...)

    Updates:
    1. Heap (the table itself)
    2. Primary key index
    3. Index on email
    4. Index on phone
    = 4 writes for 1 INSERT

More indexes = slower writes. Only create indexes you actually need.

Index Bloat

Deleted entries leave “holes” in index pages:

    Page after many deletes:

    [10, _, _, 40, _, 60, _, _, 90, _]  (50% wasted space!)

  • PostgreSQL: Dead entries are cleaned up by VACUUM
  • MySQL InnoDB: Space is reused, but pages may not be reclaimed without OPTIMIZE TABLE

Monitoring Index Health

-- PostgreSQL: Check index bloat
SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan as scans
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Rebuild bloated index
REINDEX INDEX idx_users_email;

-- Or create concurrently (less locking)
CREATE INDEX CONCURRENTLY idx_users_email_new ON users(email);
DROP INDEX idx_users_email;
ALTER INDEX idx_users_email_new RENAME TO idx_users_email;

4.12 Comparing Index Types

Index Type   Best For                        Limitations
B-tree       General purpose, range queries  Higher write cost
Hash         Exact matches only              No range queries
GiST         Geometric, full-text            Specialized data
GIN          Arrays, JSONB, full-text        Slower updates
BRIN         Very large, naturally ordered   Low selectivity

We’ll cover hash indexes and specialized types in Chapter 6.


4.13 B-Trees in Different Databases

PostgreSQL

  • Uses B+ trees for all standard indexes
  • Index entries point to heap (page, offset)
  • Supports partial indexes, expression indexes
  • VACUUM cleans dead entries

MySQL InnoDB

  • Primary key is a clustered B+ tree (data in leaves)
  • Secondary indexes point to primary key (not heap)
  • Leaf pages store full rows for primary key lookups
  • Change buffer batches secondary index updates

SQLite

  • B+ trees for tables (rowid) and indexes
  • Each table is a B+ tree keyed by rowid
  • WITHOUT ROWID tables use the primary key as the B-tree key

4.14 Summary

B-trees are the foundation of database indexing:

  • High fanout keeps trees shallow (typically 3-4 levels)
  • B+ trees store all data in linked leaves for efficient range scans
  • Insertions may cause page splits; deletions may cause merges
  • Key size and fill factor affect performance
  • Index maintenance has overhead; don’t over-index
  • Composite indexes require careful column ordering

Understanding B-trees helps you design better schemas and interpret query plans.


What’s Next

In Chapter 5, we’ll explore LSM trees—a fundamentally different approach that optimizes for writes instead of reads. You’ll learn how databases like Cassandra and RocksDB achieve their write performance.


“A database without indexes is like a library without a card catalog—technically functional, practically useless.”

Chapter 5: LSM Trees and Write-Optimized Structures

“If you can’t make writes fast by updating in place, batch them up and sort later.”

B-trees are excellent for reads, but every insert requires finding the right page and modifying it—potentially triggering cascading splits. For write-heavy workloads, this random I/O becomes a bottleneck. LSM trees take a radically different approach: buffer writes in memory, flush them sequentially, and merge later.


5.1 The Write Amplification Problem

With B-trees, a single logical write may cause multiple physical writes:

    INSERT one row:
    1. Write to WAL (sequential, good)
    2. Read B-tree page from disk (random I/O)
    3. Modify page in memory
    4. Eventually write page back (random I/O)

    If page split occurs:
    5. Create new page
    6. Update parent page
    7. Possibly cascade upward

    One INSERT → Multiple random writes

Write amplification = (Bytes written to storage) / (Bytes of actual data)

B-trees may have write amplification of 10x or more, especially with many indexes.


5.2 The LSM Tree Insight

The Log-Structured Merge-tree (LSM tree) was introduced by Patrick O’Neil and his co-authors in 1996. The core insight:

Transform random writes into sequential writes by buffering and batching.

                    LSM TREE ARCHITECTURE

    ┌─────────────────────────────────────────────────────────────┐
    │  IN-MEMORY COMPONENT (Memtable)                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │  Balanced tree (red-black, skip list)                   ││
    │  │  All recent writes go here                               ││
    │  │  Sorted by key for fast lookups                          ││
    │  └─────────────────────────────────────────────────────────┘│
    └──────────────────────────┬──────────────────────────────────┘
                               │ When full, flush to disk
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │  ON-DISK COMPONENTS (SSTables)                               │
    │  ┌──────────────────┐                                       │
    │  │ Level 0 (newest) │  ← Recently flushed memtables         │
    │  │ SSTable SSTable  │                                       │
    │  └──────────────────┘                                       │
    │          │ Compaction merges into larger files               │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level 1          │  ← Larger, sorted, non-overlapping    │
    │  │ SSTable SSTable  │                                       │
    │  └──────────────────┘                                       │
    │          │                                                   │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level 2          │  ← Even larger                        │
    │  │ SSTable SSTable SSTable │                                │
    │  └──────────────────┘                                       │
    │          │                                                   │
    │          ▼                                                   │
    │  ┌──────────────────┐                                       │
    │  │ Level N (oldest) │  ← Oldest data                        │
    │  │ Large SSTables   │                                       │
    │  └──────────────────┘                                       │
    └─────────────────────────────────────────────────────────────┘

5.3 The Memtable

The memtable is an in-memory sorted data structure holding recent writes:

Common Implementations

Skip list: Fast O(log n) insert and lookup, used by LevelDB, RocksDB

    SKIP LIST STRUCTURE

    Level 3:    ────────────────────────►[50]─────────────────►NIL
    Level 2:    ────►[10]───────────────►[50]─────────►[80]───►NIL
    Level 1:    ────►[10]──►[25]────────►[50]──►[60]──►[80]───►NIL
    Level 0:    ────►[10]──►[25]──►[30]─►[50]──►[60]──►[80]──►[90]►NIL

    Random level assignment gives O(log n) average performance

Red-black tree: Balanced BST, used by some implementations

Concurrent data structures: Lock-free or fine-grained locking for multi-threaded access

Memtable Lifecycle

    1. Writes go to active memtable
    2. When memtable reaches threshold (e.g., 64MB):
       a. Make it immutable (read-only)
       b. Create new active memtable for new writes
       c. Flush immutable memtable to disk as SSTable
    3. Delete flushed memtable from memory

Write-Ahead Log Protection

The memtable is volatile—a crash would lose recent writes. To ensure durability:

    Write operation:
    1. Append to WAL (sequential, durable)
    2. Insert into memtable (in-memory)
    3. Return success to client

    On crash recovery:
    1. Replay WAL to reconstruct memtable
    2. Continue normal operation
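
The ordering above is the whole point, so here is a minimal Python sketch of it: append to the log and force it to disk before acknowledging, then update the in-memory table. The file names, JSON encoding, and flush threshold are my own illustrative choices, not any engine's actual format.

import json, os

class TinyLSMWriter:
    def __init__(self, wal_path="wal.log", memtable_limit=4):
        self.wal = open(wal_path, "a")
        self.memtable = {}                    # recent writes, sorted only at flush time
        self.memtable_limit = memtable_limit
        self.flushed = 0

    def put(self, key, value):
        # 1. Append to the WAL and force it to stable storage.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Only now update the memtable; the write is durable even if we crash here.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable run (a toy SSTable).
        with open(f"sstable-{self.flushed:04d}.json", "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.flushed += 1
        self.memtable = {}                    # a real engine would also recycle old WAL segments

db = TinyLSMWriter()
db.put("user:1", "Alice")
db.put("user:2", "Bob")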

5.4 SSTables (Sorted String Tables)

When a memtable flushes, it becomes an SSTable—an immutable, sorted file:

                    SSTABLE STRUCTURE

    ┌─────────────────────────────────────────────────────────────┐
    │  DATA BLOCKS                                                 │
    │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
    │  │ key1:value1   │ │ key100:value  │ │ key200:value  │ ...  │
    │  │ key2:value2   │ │ key101:value  │ │ key201:value  │      │
    │  │ ...           │ │ ...           │ │ ...           │      │
    │  └───────────────┘ └───────────────┘ └───────────────┘      │
    ├─────────────────────────────────────────────────────────────┤
    │  INDEX BLOCK                                                 │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ key1 → offset 0     │ key100 → offset 4096           │  │
    │  │ key200 → offset 8192 │ ...                            │  │
    │  └───────────────────────────────────────────────────────┘  │
    ├─────────────────────────────────────────────────────────────┤
    │  BLOOM FILTER                                                │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ Probabilistic membership test (is key maybe here?)    │  │
    │  └───────────────────────────────────────────────────────┘  │
    ├─────────────────────────────────────────────────────────────┤
    │  FOOTER                                                      │
    │  ┌───────────────────────────────────────────────────────┐  │
    │  │ Index offset │ Filter offset │ Checksum │ Version    │  │
    │  └───────────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────────┘

Key Properties

  1. Immutable: Once written, never modified
  2. Sorted: Keys in lexicographic order
  3. Indexed: Sparse index for efficient lookup
  4. Compressed: Often with block-level compression

Bloom Filters

A bloom filter is a probabilistic data structure that quickly tells you if a key is definitely not in the SSTable:

    Bloom Filter Query:

    "Is key X in this SSTable?"

    Bloom filter says NO  → Definitely not there (skip this SSTable)
    Bloom filter says YES → Might be there (need to check)

    False positives possible, false negatives impossible

This saves disk reads by quickly eliminating SSTables that don’t contain the key.

    Without bloom filter:
    Check SSTable 1: Read index, read block → Not found
    Check SSTable 2: Read index, read block → Not found
    Check SSTable 3: Read index, read block → Found!
    = 6+ disk reads

    With bloom filter:
    Check SSTable 1 bloom: NO → Skip
    Check SSTable 2 bloom: NO → Skip
    Check SSTable 3 bloom: MAYBE → Read index, read block → Found!
    = 2 disk reads

5.5 Read Path

Reading from an LSM tree requires checking multiple locations:

    Read key K:

    1. Check memtable (in-memory)
       Found? → Return value

    2. Check immutable memtable (if any)
       Found? → Return value

    3. Check Level 0 SSTables (newest first)
       For each SSTable:
         - Check bloom filter
         - If maybe present, search SSTable
       Found? → Return value

    4. Check Level 1 SSTables
       (Binary search since non-overlapping)
       Found? → Return value

    5. Check Level 2, 3, ... N
       Found? → Return value

    6. Not found

Read amplification: May need to check multiple SSTables for one key.

    Worst case: Key doesn't exist

    Check: memtable + immutable memtable +
           all L0 SSTables +
           one SSTable per level L1-LN

    With 4 L0 files and 4 deeper levels: 1 + 1 + 4 + 4 ≈ 10 locations

This is the trade-off: writes are fast (sequential), but reads may check many files.
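
A compact Python sketch of this read path, with a stub standing in for real SSTables (the stub's membership check plays the role of the bloom filter; a real filter can also answer "maybe" for absent keys):

def lsm_get(key, memtable, immutable_memtables, sstables):
    """Return the newest value for key, or None. `sstables` is ordered newest-first."""
    if key in memtable:
        return memtable[key]
    for imm in immutable_memtables:           # flushed-but-not-yet-written memtables
        if key in imm:
            return imm[key]
    for sst in sstables:                      # L0 newest-first, then deeper levels
        if not sst.maybe_contains(key):       # bloom filter says definitely absent
            continue                          # skip the disk reads entirely
        value = sst.get(key)                  # index lookup + block read
        if value is not None:
            return value                      # newest version wins
    return None                               # checked everything: key does not exist

class StubSSTable:
    def __init__(self, data):
        self.data = dict(data)
    def maybe_contains(self, key):            # stands in for the bloom filter
        return key in self.data
    def get(self, key):
        return self.data.get(key)

memtable = {"user:3": "Carol"}
sstables = [StubSSTable({"user:1": "Alicia"}),                     # newer
            StubSSTable({"user:1": "Alice", "user:2": "Bob"})]     # older
print(lsm_get("user:1", memtable, [], sstables))   # 'Alicia' -- the newer version wins
print(lsm_get("user:9", memtable, [], sstables))   # None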


5.6 Compaction

Compaction is the process of merging SSTables to:

  1. Reduce read amplification (fewer files to check)
  2. Reclaim space from deleted/overwritten keys
  3. Reorganize data for efficient access

Size-Tiered Compaction

Group SSTables of similar size, merge when too many:

    Size-Tiered Compaction:

    Before:
    Small:  [S1] [S2] [S3] [S4]  ← 4 small files, time to merge
    Medium: [M1] [M2]
    Large:  [L1]

    After merging the four small files:
    Small:  (empty)
    Medium: [M1] [M2] [M3]       ← M3 = S1+S2+S3+S4 merged into one larger file
    Large:  [L1]

Pros: Simple, relatively low write amplification
Cons: Space amplification (temporary doubling), unpredictable read performance

Leveled Compaction

Organize SSTables into levels with size ratios:

    Leveled Compaction:

    Level 0: [SST] [SST] [SST]     (overlapping ranges, flush destination)
    Level 1: [a-d] [e-h] [i-l]     (non-overlapping, 10MB each)
    Level 2: [a-b][c-d][e-f]...    (non-overlapping, 10x larger total)
    Level 3: [a][b][c][d]...       (non-overlapping, 10x larger total)

    Compaction: Pick L0 file, merge with overlapping L1 files
    Level size ratio = 10 (typical)

    Level 0: ~10MB total
    Level 1: ~100MB total
    Level 2: ~1GB total
    Level 3: ~10GB total
    Level 4: ~100GB total

    90% of data is in the last level

Pros: Better read performance, bounded space amplification
Cons: Higher write amplification (data rewritten ~10x per level)

Compaction Trade-offs

                      Size-Tiered    Leveled
Write amplification   Lower          Higher
Read amplification    Higher         Lower
Space amplification   Higher         Lower
Best for              Write-heavy    Read-heavy

5.7 Handling Updates and Deletes

Updates

LSM trees don’t update in place. A new version is simply written:

    Time T1: Write key="user:1", value="Alice"
    Time T2: Write key="user:1", value="Alicia"  (update)

    Both versions exist in different SSTables!

    Read sees newest version (T2): "Alicia"
    Compaction eventually removes T1 version

Deletes: Tombstones

Deleting a key writes a special tombstone marker:

    Time T1: Write key="user:1", value="Alice"
    Time T2: Delete key="user:1"  → Write tombstone

    SSTable 1: [user:1 = "Alice"]
    SSTable 2: [user:1 = TOMBSTONE]

    Read checks SSTable 2 first, sees tombstone → Key is deleted

Tombstones persist until compaction merges all SSTables containing the key:

    Compaction merges SSTable 1 and 2:

    See "Alice" at T1
    See TOMBSTONE at T2
    T2 > T1, so tombstone wins
    Result: Key is dropped entirely
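
The merge rule is easy to express in code. Below is a minimal Python sketch of compacting two runs, assuming the "newer" run shadows the "older" one and that tombstones may be dropped only when there is no still-older data beneath the output level:

TOMBSTONE = object()    # sentinel value marking a delete

def compact(newer, older, is_bottom_level=True):
    """Merge two runs of (key, value) pairs; newer versions win, tombstones drop at the bottom."""
    merged = dict(older)          # start from the older versions
    merged.update(newer)          # newer values and tombstones overwrite them
    out = []
    for key in sorted(merged):
        value = merged[key]
        if value is TOMBSTONE and is_bottom_level:
            continue                          # the delete is now permanent: drop the key
        out.append((key, value))
    return out

older = [("user:1", "Alice"), ("user:2", "Bob")]
newer = [("user:1", TOMBSTONE), ("user:3", "Carol")]
print(compact(newer, older))
# [('user:2', 'Bob'), ('user:3', 'Carol')]  -- user:1 is gone entirely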

Tombstone Accumulation

Deleted keys temporarily increase space usage:

    Heavy delete workload:

    Original data: 10GB
    Delete 5GB worth of data

    Until compaction:
    - Original data still on disk: 10GB
    - Plus tombstones: ~100MB
    - Total: ~10.1GB (not 5GB!)

    After full compaction: 5GB

5.8 Write Path

The complete write flow:

    Write(key, value):

    ┌─────────────────────────────────────────────────────────────┐
    │ 1. APPEND TO WAL                                             │
    │    [LSN: 1001 | key | value | checksum]                     │
    │    Sequential write, fsync for durability                    │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. INSERT INTO MEMTABLE                                      │
    │    Skip list insert: O(log n)                                │
    │    In-memory only                                            │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 3. RETURN SUCCESS                                            │
    │    Client sees success once WAL is durable                   │
    └─────────────────────────────────────────────────────────────┘
                               │
                    (Background, asynchronous)
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 4. MEMTABLE FLUSH (when threshold reached)                   │
    │    - Make memtable immutable                                 │
    │    - Create new memtable for writes                          │
    │    - Write SSTable to disk (sequential)                      │
    │    - Build bloom filter and index                            │
    │    - Delete old WAL segments                                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                    (Background, asynchronous)
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 5. COMPACTION (when thresholds reached)                      │
    │    - Merge SSTables                                          │
    │    - Remove obsolete versions                                │
    │    - Remove tombstones (when safe)                           │
    │    - Delete old SSTables                                     │
    └─────────────────────────────────────────────────────────────┘

The key insight: All disk writes are sequential. The only random I/O is during reads.


5.9 Comparing B-Trees and LSM Trees

Write Performance

    B-tree write (random I/O):
    Read page → Modify → Write page
    ~2 random I/Os per write

    LSM write (sequential I/O):
    Append to WAL → Insert memtable → Done
    1 sequential I/O per write (amortized)

LSM wins on write throughput, on SSDs as well as HDDs (though the gap is smaller on SSDs).

Read Performance

    B-tree read (one location):
    Navigate tree → Read leaf → Done
    ~3-4 I/Os

    LSM read (multiple locations):
    Check memtable → Check each level → Maybe found
    Varies: 1 to many I/Os

B-tree wins on point reads, especially for keys that exist.

Space Amplification

    B-tree:
    Data + indexes
    ~1.5-2x raw data size

    LSM (before compaction):
    Multiple versions, tombstones
    Can be 2-10x temporarily

    LSM (after compaction):
    ~1.1-1.2x raw data size

LSM has variable space usage, depending on compaction state.

Summary Table

Factor                B-tree      LSM Tree
Write throughput      Good        Excellent
Point read latency    Excellent   Good
Range scan            Good        Good
Space efficiency      Good        Variable
Write amplification   Medium      Low-High
Read amplification    Low         Medium-High
Predictable latency   Yes         Less so

5.10 LSM Trees in Practice

RocksDB

Facebook’s embedded storage engine, widely used:

    Used by:
    - MySQL (MyRocks storage engine)
    - CockroachDB
    - TiKV (TiDB)
    - Apache Flink (state backend)
    - Many more

Key features:

  • Leveled and size-tiered compaction options
  • Column families (multiple LSM trees)
  • Rate limiting for compaction
  • Block cache and bloom filters

LevelDB

Google’s original embedded LSM implementation:

  • Simpler than RocksDB
  • Single-threaded compaction
  • Foundation for many derivatives

Cassandra

Distributed database using LSM trees:

  • Size-tiered compaction by default
  • Leveled compaction option
  • Time-window compaction for time-series data

HBase

Hadoop-based distributed database:

  • LSM trees whose on-disk sorted files are called “HFiles”
  • Automatic region splitting
  • Bloom filters and block cache

5.11 Tuning LSM Trees

Memtable Size

Larger memtable:

  • More writes batched before flush
  • Better write throughput
  • More memory usage
  • Longer recovery time (more WAL to replay)

# RocksDB
write_buffer_size = 64MB
max_write_buffer_number = 3

Level Sizes and Ratios

# Typical configuration:
Level 0: 4 files max before compaction
Level 1: 256MB
Level multiplier: 10

Result:
L0: ~256MB (4 × 64MB memtables)
L1: 256MB
L2: 2.56GB
L3: 25.6GB
L4: 256GB

Bloom Filter Configuration

More bits per key = lower false positive rate = fewer unnecessary reads:

Bits per key | False positive rate
     8       |       2%
    10       |       1%
    12       |      0.3%
    14       |      0.1%

Compaction Throttling

Limit compaction I/O to avoid interfering with foreground operations:

# RocksDB rate limiter
rate_limiter = 100MB/s  # Cap compaction I/O

5.12 Summary

LSM trees optimize for writes by trading off reads:

  • Memtable: In-memory buffer for recent writes
  • SSTables: Immutable, sorted files on disk
  • Compaction: Merges files to reduce read amplification
  • Bloom filters: Skip files that definitely don’t have the key
  • Tombstones: Logical deletes until compaction

LSM trees excel for:

  • Write-heavy workloads
  • Time-series data
  • Log aggregation
  • Situations where sequential I/O matters

B-trees are better for:

  • Read-heavy workloads
  • Predictable latency requirements
  • Frequent updates to existing data

Many modern databases offer both or hybrid approaches.


What’s Next

In Chapter 6, we’ll explore hash indexes and other specialized index structures—bloom filters in depth, bitmap indexes, and when to use alternatives to B-trees and LSM trees.


“Every write is expensive eventually—the question is whether you pay now (B-tree) or later (LSM tree).”

Chapter 6: Hash Indexes and Specialized Structures

“The right data structure for the job makes all the difference.”

B-trees and LSM trees are general-purpose workhorses, but sometimes specialized data structures offer better performance for specific access patterns. In this chapter, we’ll explore hash indexes, bitmap indexes, and other specialized structures.


6.1 Hash Indexes

Hash indexes provide O(1) average-case lookup by key—faster than B-trees for exact matches.

How Hash Indexes Work

    Hash Function: key → bucket number

    Example: hash("alice@example.com") mod 8 = 3

    ┌─────────────────────────────────────────────────────────────┐
    │  HASH TABLE                                                  │
    │                                                              │
    │  Bucket 0: [empty]                                          │
    │  Bucket 1: [bob@... → row] → [zoe@... → row]               │
    │  Bucket 2: [empty]                                          │
    │  Bucket 3: [alice@... → row] → [mike@... → row]            │
    │  Bucket 4: [carol@... → row]                                │
    │  Bucket 5: [empty]                                          │
    │  Bucket 6: [dave@... → row]                                 │
    │  Bucket 7: [eve@... → row] → [frank@... → row]             │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘

Lookup Process

    Find "alice@example.com":

    1. Compute hash: hash("alice@example.com") = 3
    2. Go to bucket 3
    3. Search chain for exact match
    4. Found! Return row pointer

    Time: O(1) average (O(n) worst case with bad hash distribution)
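
A few lines of Python make the lookup path concrete; the bucket count and the use of Python's built-in hash() are illustrative stand-ins for a real on-disk hash index:

NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]        # each bucket is a chain of (key, row_pointer)

def bucket_of(key):
    return hash(key) % NUM_BUCKETS                # hash the key, take it modulo the bucket count

def insert(key, row_ptr):
    buckets[bucket_of(key)].append((key, row_ptr))

def lookup(key):
    for k, row_ptr in buckets[bucket_of(key)]:    # scan only the one bucket's chain
        if k == key:
            return row_ptr
    return None                                   # not in the chain: key is absent

insert("alice@example.com", ("page 12", "slot 3"))
insert("bob@example.com", ("page 4", "slot 7"))
print(lookup("alice@example.com"))   # ('page 12', 'slot 3')
print(lookup("carol@example.com"))   # None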

Hash Index Limitations

No range queries: Hash destroys key ordering

    -- Can use hash index:
    SELECT * FROM users WHERE email = 'alice@example.com';

    -- CANNOT use hash index:
    SELECT * FROM users WHERE email > 'alice@example.com';
    SELECT * FROM users WHERE email LIKE 'alice%';
    SELECT * FROM users ORDER BY email;

No prefix matching: Must match entire key

No partial index scans: Can’t stop early

On-Disk Hash Indexes

In-memory hash tables are straightforward, but on-disk is harder:

Problem: Hash table resizing requires rewriting everything

Solution: Extendible hashing or linear hashing

    EXTENDIBLE HASHING

    Use first N bits of hash to select bucket

    Global depth = 2 (use 2 bits)

    00 ──► Bucket A: [keys with hash starting 00...]
    01 ──► Bucket B: [keys with hash starting 01...]
    10 ──► Bucket C: [keys with hash starting 10...]
    11 ──► Bucket C: [same bucket for 10 and 11]

    When Bucket A overflows:
    - Split it into A1 (000...) and A2 (001...)
    - Double the directory: global depth becomes 3
    - Other buckets are unchanged (more directory entries simply point to them)

6.2 Hash Indexes in Practice

PostgreSQL Hash Indexes

PostgreSQL supports hash indexes but with caveats:

-- Create hash index
CREATE INDEX idx_email_hash ON users USING hash (email);

-- Only for equality comparisons
SELECT * FROM users WHERE email = 'alice@example.com';  -- Uses index
SELECT * FROM users WHERE email > 'a';  -- Cannot use hash index

Historical note: Before PostgreSQL 10, hash indexes weren’t crash-safe (no WAL logging). Now they are, but B-trees are still usually preferred.

MySQL Memory Engine

The Memory (HEAP) storage engine supports hash indexes:

-- Memory table with hash index
CREATE TABLE cache (
    key_col VARCHAR(255),
    value_col TEXT,
    INDEX USING HASH (key_col)
) ENGINE=MEMORY;

Useful for temporary tables and caches where data is in memory anyway.

When to Use Hash Indexes

Good use cases:

  • High-cardinality exact-match lookups (UUIDs, email addresses)
  • In-memory tables/caches
  • Join keys when only equality joins are needed

Bad use cases:

  • Range queries of any kind
  • Columns used in ORDER BY
  • Low cardinality columns
  • General purpose indexing

6.3 Bloom Filters Deep Dive

We introduced bloom filters in Chapter 5. Let’s understand them more deeply.

Structure

A bloom filter is a bit array with k hash functions:

    Bloom Filter (m=16 bits, k=3 hash functions)

    Initial state (empty):
    [0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Inserting

    Insert "alice":
    h1("alice") = 3
    h2("alice") = 7
    h3("alice") = 11

    Set bits 3, 7, 11 to 1:
    [0][0][0][1][0][0][0][1][0][0][0][1][0][0][0][0]
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

    Insert "bob":
    h1("bob") = 1
    h2("bob") = 7  (already set!)
    h3("bob") = 14

    [0][1][0][1][0][0][0][1][0][0][0][1][0][0][1][0]
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Querying

    Query "alice":
    Check bits 3, 7, 11 → All are 1 → MAYBE present

    Query "carol":
    h1("carol") = 5
    h2("carol") = 9
    h3("carol") = 11

    Check bits 5, 9, 11
    Bit 5 = 0 → DEFINITELY NOT present

    (We can stop at first 0 bit)
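
The insert and query steps translate into a very small Python class. This is a toy sketch (16 bits, double hashing via SHA-256 to derive the k positions), not any production filter:

import hashlib

class BloomFilter:
    def __init__(self, m_bits=16, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0                             # a Python int used as the bit array

    def _positions(self, key):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos                 # set each of the k bits

    def might_contain(self, key):
        # Any zero bit means definitely absent; all ones means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))   # True: maybe present
print(bf.might_contain("carol"))   # Usually False: definitely absent (unless a false positive)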

False Positive Rate

    Probability of false positive ≈ (1 - e^(-kn/m))^k

    Where:
    m = number of bits
    n = number of inserted elements
    k = number of hash functions

    Example:
    m = 10 bits per element
    k = 7 hash functions
    False positive rate ≈ 0.8%

Optimal Configuration

    Given n elements and desired false positive rate p:

    Optimal m (bits) = -n * ln(p) / (ln(2))^2
    Optimal k (hash functions) = (m/n) * ln(2)

    Example: 1 million keys, 1% false positive rate
    m ≈ 9.6 million bits ≈ 1.2 MB
    k ≈ 7 hash functions
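
The sizing formulas are easy to sanity-check in Python; the numbers below should land close to the 1-million-key, 1% example:

import math

def bloom_sizing(n, p):
    """Return (bits, hash_functions) for n keys and a target false positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)     # optimal number of bits
    k = (m / n) * math.log(2)                     # optimal number of hash functions
    return math.ceil(m), round(k)

bits, k = bloom_sizing(1_000_000, 0.01)
print(bits, round(bits / 8 / 1e6, 2), k)          # ≈ 9.6 million bits, ≈ 1.2 MB, k = 7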

Bloom Filter Variants

  • Counting Bloom Filter: Uses counters instead of bits, allows deletes
  • Cuckoo Filter: Supports deletes, often more space-efficient
  • Quotient Filter: Cache-friendly, supports merging


6.4 Bitmap Indexes

Bitmap indexes are highly efficient for low-cardinality columns.

Structure

For each distinct value, store a bitmap indicating which rows have that value:

    Table: users
    ┌────┬─────────┬────────┐
    │ id │  name   │ status │
    ├────┼─────────┼────────┤
    │ 1  │ Alice   │ active │
    │ 2  │ Bob     │ active │
    │ 3  │ Carol   │ pending│
    │ 4  │ Dave    │ active │
    │ 5  │ Eve     │ deleted│
    │ 6  │ Frank   │ pending│
    └────┴─────────┴────────┘

    Bitmap Index on status:

    "active":  [1][1][0][1][0][0]  (rows 1, 2, 4)
    "pending": [0][0][1][0][0][1]  (rows 3, 6)
    "deleted": [0][0][0][0][1][0]  (row 5)

Query Evaluation

Queries become bitwise operations:

    SELECT * FROM users WHERE status = 'active';

    → Return rows where bitmap "active" has bit = 1
    → Rows 1, 2, 4

    SELECT * FROM users WHERE status = 'active' OR status = 'pending';

    "active":  [1][1][0][1][0][0]
    "pending": [0][0][1][0][0][1]
    OR result: [1][1][1][1][0][1]

    → Rows 1, 2, 3, 4, 6

    SELECT * FROM users WHERE status = 'active' AND country = 'US';

    "status=active": [1][1][0][1][0][0]
    "country=US":    [1][0][1][0][0][1]
    AND result:      [1][0][0][0][0][0]

    → Row 1 only
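
Because the bitmaps are just bit strings, the whole evaluation fits in a few lines of Python, using integers as bitmaps (bit i set means row i+1 matches); this mirrors the three queries above:

def bitmap(matching_rows):
    """Build a bitmap as an int, with bit (row - 1) set for each matching row number."""
    bits = 0
    for row in matching_rows:
        bits |= 1 << (row - 1)
    return bits

def rows_of(bits, n_rows):
    return [i + 1 for i in range(n_rows) if bits & (1 << i)]

N = 6
active  = bitmap([1, 2, 4])
pending = bitmap([3, 6])
us      = bitmap([1, 3, 6])

print(rows_of(active | pending, N))   # status = 'active' OR 'pending'  -> [1, 2, 3, 4, 6]
print(rows_of(active & us, N))        # active AND country = 'US'       -> [1]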

Advantages

  • Compact for low cardinality: 3 status values × n bits << n × pointer
  • Fast bitwise operations: Hardware-accelerated AND, OR, NOT
  • Excellent for analytics: Count queries, complex predicates

Disadvantages

  • Poor for high cardinality: 1 million distinct values = 1 million bitmaps
  • Expensive updates: Insert requires updating bitmaps
  • Space for sparse bitmaps: Mostly-zero bitmaps waste space

Bitmap Compression

Run-length encoding compresses sparse bitmaps:

    Raw bitmap:     [1][0][0][0][0][0][0][0][0][0][1]

    RLE compressed: [1 at position 0, 1 at position 10]
    Or: "1, skip 9, 1"

Word-Aligned Hybrid (WAH) and Roaring Bitmaps are common compression schemes:

    Roaring Bitmaps:
    - Divide into chunks of 65536 integers
    - For each chunk, choose representation:
      - Dense: raw bitmap (when > 4096 values)
      - Sparse: sorted array of values (when ≤ 4096 values)
      - Run: RLE (for runs of consecutive values)

6.5 GiST and GIN Indexes

PostgreSQL offers specialized index types for complex data.

GiST (Generalized Search Tree)

A framework for building tree-structured indexes for custom data types:

-- Full-text search
CREATE INDEX idx_content ON documents USING gist (to_tsvector('english', content));

-- Geometric queries
CREATE INDEX idx_location ON places USING gist (coordinates);

-- Range types
CREATE INDEX idx_dates ON events USING gist (date_range);

GiST supports operators:

  • Containment: @>, <@
  • Overlap: &&
  • Nearest neighbor: <->
  • Full-text search: @@

GIN (Generalized Inverted Index)

For indexing composite values (arrays, JSONB, full-text):

-- Array containment
CREATE INDEX idx_tags ON articles USING gin (tags);
SELECT * FROM articles WHERE tags @> ARRAY['database', 'index'];

-- JSONB queries
CREATE INDEX idx_data ON events USING gin (data);
SELECT * FROM events WHERE data @> '{"type": "click"}';

-- Full-text search
CREATE INDEX idx_body ON posts USING gin (to_tsvector('english', body));

GIN structure:

    Inverted Index for tags column:

    "database" → [article_ids: 1, 5, 12, 47, ...]
    "index"    → [article_ids: 1, 23, 47, 89, ...]
    "postgres" → [article_ids: 5, 12, 89, ...]

    Query: tags @> ARRAY['database', 'index']

    Intersect posting lists:
    "database": {1, 5, 12, 47}
    "index":    {1, 23, 47, 89}
    Result:     {1, 47}

GiST vs GIN

Aspect        GiST                 GIN
Build time    Faster               Slower
Index size    Smaller              Larger
Query speed   Slower               Faster
Updates       Faster               Slower
Best for      Geometric, ranges    Full-text, arrays, JSONB

6.6 BRIN Indexes (Block Range INdexes)

For very large tables with naturally ordered data, BRIN indexes are extremely compact.

Concept

Store min/max values for ranges of physical pages:

    Table: time_series_data (sorted by timestamp)

    Pages 0-127:    min_ts = 2024-01-01, max_ts = 2024-01-15
    Pages 128-255:  min_ts = 2024-01-15, max_ts = 2024-01-31
    Pages 256-383:  min_ts = 2024-02-01, max_ts = 2024-02-14
    ...

    Query: WHERE timestamp BETWEEN '2024-01-20' AND '2024-01-25'

    Check BRIN:
    Pages 0-127:   max < query_min → SKIP
    Pages 128-255: range overlaps  → SCAN
    Pages 256-383: min > query_max → SKIP
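
The pruning step is simple interval overlap, shown here as a Python sketch over made-up (first_page, last_page, min, max) summaries; ISO-formatted timestamps compare correctly as strings, which the sketch relies on:

# One summary per block range: (first_page, last_page, min_value, max_value)
brin_summaries = [
    (0,   127, "2024-01-01", "2024-01-15"),
    (128, 255, "2024-01-15", "2024-01-31"),
    (256, 383, "2024-02-01", "2024-02-14"),
]

def ranges_to_scan(summaries, lo, hi):
    """Keep only block ranges whose [min, max] interval overlaps the query interval [lo, hi]."""
    return [(first, last)
            for first, last, rng_min, rng_max in summaries
            if rng_max >= lo and rng_min <= hi]

print(ranges_to_scan(brin_summaries, "2024-01-20", "2024-01-25"))
# [(128, 255)] -- the other page ranges are skipped without being read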

When BRIN Excels

    Requirements:
    1. Very large table (millions+ rows)
    2. Data naturally ordered by indexed column
    3. Queries filter by that column

    Perfect for:
    - Time-series data (ordered by time)
    - Log tables (ordered by log_time)
    - Append-only data

    Example: 1 billion row table
    B-tree index: ~30 GB
    BRIN index:   ~1 MB

Creating BRIN Indexes

CREATE INDEX idx_logs_ts ON logs USING brin (created_at);

-- With custom pages per range
CREATE INDEX idx_logs_ts ON logs USING brin (created_at)
    WITH (pages_per_range = 32);

BRIN Limitations

  • Only useful if data is physically ordered
  • False positives: must scan matching ranges
  • Not useful for randomly distributed data

6.7 Partial and Expression Indexes

Sometimes you don’t need to index everything.

Partial Indexes

Index only a subset of rows:

-- Only index active users
CREATE INDEX idx_active_users ON users (email)
    WHERE status = 'active';

-- This query uses the partial index:
SELECT * FROM users WHERE email = 'alice@example.com' AND status = 'active';

-- This query cannot use it (no status filter):
SELECT * FROM users WHERE email = 'alice@example.com';

Benefits:

  • Smaller index size
  • Faster index maintenance (fewer entries)
  • Better cache efficiency

Common patterns:

-- Unprocessed items
CREATE INDEX idx_unprocessed ON jobs (created_at) WHERE processed = false;

-- Recent data
CREATE INDEX idx_recent ON orders (customer_id)
    WHERE created_at > '2024-01-01';

-- Non-null values
CREATE INDEX idx_phone ON contacts (phone) WHERE phone IS NOT NULL;

Expression Indexes

Index the result of an expression:

-- Case-insensitive search
CREATE INDEX idx_lower_email ON users (lower(email));
SELECT * FROM users WHERE lower(email) = 'alice@example.com';

-- Date extraction
CREATE INDEX idx_order_year ON orders (extract(year FROM created_at));
SELECT * FROM orders WHERE extract(year FROM created_at) = 2024;

-- JSON field
CREATE INDEX idx_type ON events ((data->>'type'));
SELECT * FROM events WHERE data->>'type' = 'click';

The query must match the expression exactly to use the index.


6.8 Covering Indexes and Index-Only Scans

Include extra columns in the index to avoid heap lookups:

-- Regular index
CREATE INDEX idx_email ON users (email);

-- Query needs name too:
SELECT name FROM users WHERE email = 'alice@example.com';
-- Must: Look up email in index → Get row pointer → Read heap for name

-- Covering index
CREATE INDEX idx_email_name ON users (email) INCLUDE (name);
-- or
CREATE INDEX idx_email_name ON users (email, name);

-- Now: Look up email in index → Name is already there!

PostgreSQL INCLUDE Clause

-- Include columns that aren't searchable, just needed in results
CREATE INDEX idx_email ON users (email) INCLUDE (name, avatar_url);

-- The included columns aren't in the search key
-- But they're stored in leaf pages

Trade-off: Larger index, but avoids heap access for covered queries.


6.9 Choosing the Right Index

Decision Flow

    ┌─────────────────────────────────────────┐
    │ What queries will use this index?       │
    └─────────────────────────────────────────┘
                        │
           ┌────────────┼────────────┐
           ▼            ▼            ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │ Equality │  │  Range   │  │ Full-text│
    │   only   │  │  scans   │  │  search  │
    └──────────┘  └──────────┘  └──────────┘
           │            │            │
           ▼            ▼            ▼
    Consider:     B-tree is      GIN for
    Hash index    usually best   tsvector
                        │
                        ▼
              ┌─────────────────┐
              │ Data naturally  │
              │ ordered by col? │
              └─────────────────┘
                   │        │
                  YES       NO
                   │        │
                   ▼        ▼
              Consider    Stick with
              BRIN        B-tree

Index Type Summary

Index Type   Best For                              Avoid When
B-tree       General purpose, ranges               N/A (default choice)
Hash         Equality only, high cardinality       Need ranges/ordering
GIN          Arrays, JSONB, full-text              Write-heavy, simple queries
GiST         Geometric, ranges, nearest neighbor   Simple equality queries
BRIN         Huge tables, naturally ordered        Random data distribution
Bitmap*      Low cardinality, analytics            High cardinality, OLTP

*PostgreSQL has no persistent bitmap indexes; it builds bitmaps on the fly (bitmap index scans) for complex queries


6.10 Summary

Different index types serve different purposes:

  • Hash indexes: O(1) lookup but no ranges
  • Bloom filters: Probabilistic membership testing
  • Bitmap indexes: Efficient for low-cardinality analytics
  • GiST/GIN: Specialized queries (full-text, geometric, JSONB)
  • BRIN: Extremely compact for ordered data
  • Partial indexes: Index only what you query
  • Expression indexes: Index computed values
  • Covering indexes: Avoid heap lookups

Choose indexes based on your query patterns, not just data types.


What’s Next

In Chapter 7, we’ll switch from indexing to durability. We’ll explore Write-Ahead Logging (WAL)—the mechanism that ensures your data survives crashes.


“The best index is the one that turns a table scan into a direct lookup. The second best is knowing when no index is needed.”

Chapter 7: Write-Ahead Logging (WAL)

“First, log it. Then, do it. Always in that order.”

What happens when your database server crashes mid-transaction? Without careful engineering, you could lose committed data or end up with corrupted state. Write-Ahead Logging (WAL) is the fundamental technique that makes databases durable and recoverable.


7.1 The Durability Problem

Consider this scenario:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Without WAL, the database might:

  1. Modify Account 1’s page in memory
  2. CRASH before modifying Account 2
  3. After restart: Account 1 lost $100, Account 2 gained nothing!

Money has vanished. The transaction was supposed to be atomic.

                    THE DURABILITY PROBLEM

    Transaction commits
           │
           ▼
    ┌──────────────────────────────────────────────────────┐
    │ Buffer Pool (Memory)                                  │
    │ ┌────────────┐  ┌────────────┐                       │
    │ │ Account 1  │  │ Account 2  │  ← Changes are here   │
    │ │ $900       │  │ $1100      │                       │
    │ │ (dirty)    │  │ (dirty)    │                       │
    │ └────────────┘  └────────────┘                       │
    └──────────────────────────────────────────────────────┘
                    │
                    │ Not yet written to disk!
                    ▼
              [POWER FAILURE]
                    │
                    ▼
    ┌──────────────────────────────────────────────────────┐
    │ Disk (Persistent Storage)                             │
    │ ┌────────────┐  ┌────────────┐                       │
    │ │ Account 1  │  │ Account 2  │  ← Old values!        │
    │ │ $1000      │  │ $1000      │                       │
    │ └────────────┘  └────────────┘                       │
    └──────────────────────────────────────────────────────┘

    Data loss!

7.2 The WAL Protocol

Write-Ahead Logging solves this with a simple rule:

Before any change is written to a data page on disk, the log record describing that change must first be written to the WAL on disk.

                    THE WAL PROTOCOL

    1. FIRST: Write to WAL (log the intent)
    ┌──────────────────────────────────────────────────────┐
    │ WAL (on disk, append-only)                           │
    │ [LSN 1001: UPDATE accounts SET balance=900 WHERE id=1]│
    │ [LSN 1002: UPDATE accounts SET balance=1100 WHERE id=2]│
    │ [LSN 1003: COMMIT Transaction 42]                    │
    └──────────────────────────────────────────────────────┘
           │
           │ fsync() - Force to disk!
           ▼

    2. THEN: Return success to client
           │
           ▼

    3. LATER: Write data pages to disk (background)
    ┌──────────────────────────────────────────────────────┐
    │ Data Files (on disk)                                  │
    │ Account 1: $900                                       │
    │ Account 2: $1100                                      │
    └──────────────────────────────────────────────────────┘

Key insight: WAL writes are sequential and append-only—fast even on spinning disks. Data page writes can be deferred and batched.
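
The protocol boils down to "log, fsync, acknowledge, write pages later." Here is a minimal Python sketch of that ordering; the record format, file name, and in-memory "pages" are illustrative, not any real WAL format:

import json, os

class MiniWAL:
    def __init__(self, path="wal.log"):
        self.log = open(path, "a")
        self.lsn = 0
        self.pages = {}                           # in-memory buffer pool: page_id -> {row: value}

    def _append(self, record):
        self.lsn += 1
        record["lsn"] = self.lsn
        self.log.write(json.dumps(record) + "\n")

    def update(self, txid, page_id, row, old, new):
        # WAL rule: describe the change in the log before touching the data page.
        self._append({"type": "UPDATE", "tx": txid, "page": page_id,
                      "row": row, "before": old, "after": new})
        self.pages.setdefault(page_id, {})[row] = new     # page changes stay in memory for now

    def commit(self, txid):
        self._append({"type": "COMMIT", "tx": txid})
        self.log.flush()
        os.fsync(self.log.fileno())               # the commit is durable once fsync returns
        # Dirty pages are written back later by the background writer / checkpointer.

wal = MiniWAL()
wal.update(42, page_id=7,  row=1, old=1000, new=900)
wal.update(42, page_id=12, row=2, old=1000, new=1100)
wal.commit(42)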


7.3 WAL Record Structure

Each WAL record contains enough information to redo (or undo) a change:

                    WAL RECORD FORMAT

    ┌─────────────────────────────────────────────────────────────┐
    │ LSN (Log Sequence Number): Unique identifier for this record│
    ├─────────────────────────────────────────────────────────────┤
    │ Transaction ID: Which transaction made this change          │
    ├─────────────────────────────────────────────────────────────┤
    │ Previous LSN: For chaining records in same transaction      │
    ├─────────────────────────────────────────────────────────────┤
    │ Page ID: Which page is affected                             │
    ├─────────────────────────────────────────────────────────────┤
    │ Record Type: INSERT, UPDATE, DELETE, COMMIT, ABORT, etc.   │
    ├─────────────────────────────────────────────────────────────┤
    │ Before Image: Original data (for UNDO)                      │
    ├─────────────────────────────────────────────────────────────┤
    │ After Image: New data (for REDO)                            │
    ├─────────────────────────────────────────────────────────────┤
    │ Checksum: Data integrity verification                       │
    └─────────────────────────────────────────────────────────────┘

LSN (Log Sequence Number)

Every WAL record has a unique, monotonically increasing LSN:

    LSN 1001: BEGIN Transaction 42
    LSN 1002: UPDATE on page 7
    LSN 1003: UPDATE on page 12
    LSN 1004: COMMIT Transaction 42
    LSN 1005: BEGIN Transaction 43
    ...

Pages track which LSN they were last modified at:

    Page 7 header: page_lsn = 1002
    Page 12 header: page_lsn = 1003

This enables recovery to know which changes have been applied.
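
That page-level LSN check is what makes redo idempotent. A minimal Python sketch of the rule (re-apply a record only if the page has not already seen it), with made-up record and page layouts:

def redo(pages, wal_records):
    """Re-apply logged changes to any page that has not seen them yet."""
    for rec in wal_records:
        if rec["type"] != "UPDATE":
            continue
        page = pages[rec["page"]]
        if page["page_lsn"] >= rec["lsn"]:
            continue                              # this change already reached the page: skip
        page["rows"][rec["row"]] = rec["after"]   # re-apply the change
        page["page_lsn"] = rec["lsn"]             # record how far this page has been replayed

pages = {
    7:  {"page_lsn": 1002, "rows": {1: 900}},     # this update already made it to disk
    12: {"page_lsn": 900,  "rows": {2: 1000}},    # this one did not
}
wal_records = [
    {"lsn": 1002, "type": "UPDATE", "page": 7,  "row": 1, "after": 900},
    {"lsn": 1003, "type": "UPDATE", "page": 12, "row": 2, "after": 1100},
    {"lsn": 1004, "type": "COMMIT"},
]
redo(pages, wal_records)
print(pages[12]["rows"][2])   # 1100 -- only the missing change was re-applied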


7.4 WAL and Checkpoints

Writing every change to WAL is great for durability, but:

  1. WAL files grow forever
  2. Recovery would replay from the beginning of time

Checkpoints solve this:

                    CHECKPOINT PROCESS

    Before checkpoint:
    ┌─────────────────────────────────────────────────────────────┐
    │ Buffer Pool: dirty pages scattered across time              │
    │ [Page 3: LSN 950] [Page 7: LSN 1002] [Page 12: LSN 1003]   │
    └─────────────────────────────────────────────────────────────┘

    Checkpoint (LSN 1100):
    1. Write all dirty pages to disk
    2. Write checkpoint record to WAL
    3. Record: "All changes up to LSN 1100 are on disk"

    After checkpoint:
    - WAL records before LSN 1100 can be deleted
    - Recovery only needs to replay from LSN 1100

    ┌─────────────────────────────────────────────────────────────┐
    │ WAL:                                                         │
    │ [OLD - can delete] │ CHECKPOINT │ [NEEDED for recovery]      │
    │  LSN < 1100        │  LSN 1100  │  LSN > 1100                │
    └─────────────────────────────────────────────────────────────┘

Checkpoint Trade-offs

Frequent checkpoints:

  • Faster recovery (less WAL to replay)
  • More I/O during normal operation
  • Can cause latency spikes

Infrequent checkpoints:

  • Longer recovery time
  • Less normal I/O
  • Risk of running out of WAL space

# PostgreSQL checkpoint configuration (postgresql.conf)
checkpoint_timeout = 5min            # Max time between checkpoints
checkpoint_completion_target = 0.9   # Spread writes over 90% of the interval
max_wal_size = 1GB                   # Trigger a checkpoint if WAL reaches this size

7.5 ARIES: The Industry Standard

Most modern databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM in the early 1990s. ARIES provides three key guarantees:

1. Write-Ahead Logging

Log before data. We’ve covered this.

2. Repeating History During Redo

On recovery, replay ALL changes (even for aborted transactions) to restore the exact pre-crash state, then undo incomplete transactions.

3. Logging Changes During Undo

When rolling back a transaction, log the undo operations too. This ensures idempotent recovery.

                    ARIES RECOVERY PHASES

    ┌───────────────────────────────────────────────────────────┐
    │ PHASE 1: ANALYSIS                                          │
    │   - Scan WAL from last checkpoint                          │
    │   - Build dirty page table (pages that might need redo)    │
    │   - Build active transaction table (incomplete transactions)│
    └───────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌───────────────────────────────────────────────────────────┐
    │ PHASE 2: REDO                                              │
    │   - Scan forward from oldest dirty page LSN                │
    │   - Reapply all changes (committed or not)                 │
    │   - Restore database to exact pre-crash state              │
    └───────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌───────────────────────────────────────────────────────────┐
    │ PHASE 3: UNDO                                              │
    │   - Roll back uncommitted transactions                     │
    │   - Process in reverse LSN order                           │
    │   - Log compensation records (CLRs)                        │
    └───────────────────────────────────────────────────────────┘

Compensation Log Records (CLRs)

When undoing a change, we log the undo:

    Original: LSN 1002 - UPDATE balance from 1000 to 900

    Undo: LSN 1050 - CLR: UPDATE balance from 900 to 1000
                    (undoes LSN 1002)

    If crash during undo:
    - Redo phase reapplies CLR 1050
    - Undo phase sees CLR, skips LSN 1002
    - No infinite undo loop!

7.6 Physical vs Logical Logging

Physical Logging

Log exact bytes changed:

    Page 7, offset 1024, old bytes: [00 00 03 E8], new bytes: [00 00 03 84]

Pros: Simple, exact replay
Cons: Large logs, tied to physical layout

Logical Logging

Log the operation:

    UPDATE accounts SET balance = 900 WHERE id = 1

Pros: Smaller logs, replication-friendly
Cons: Must handle schema changes, non-determinism

Physiological Logging (Most Common)

Hybrid: Physical page reference, logical within page:

    Page 7: UPDATE row in slot 3, set column 2 to 900

This is what PostgreSQL and most databases use:

  • Page-level addressing (physical)
  • Row-level operations (logical)
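
You can inspect these physiological records directly with pg_waldump, the WAL-decoding utility that ships with PostgreSQL (a quick sketch; the exact output varies by version):

pg_waldump "$PGDATA"/pg_wal/000000010000000000000001 | head
# Each record shows its resource manager (Heap, Btree, ...), its length and LSN,
# and a description of the page-level operation it performs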

7.7 Full Page Writes

What if a crash happens mid-page-write, leaving a torn page?

    8KB page write in progress:
    ┌────────────────────────────────────────────────────┐
    │ First 4KB written (new data)                       │
    │────────────────────────────────────────────────────│
    │ Last 4KB NOT written (old data)                   │ ← Crash here!
    └────────────────────────────────────────────────────┘

    Page is now corrupted: Half old, half new

Solution: Full Page Writes (FPW)

After each checkpoint, the first modification to any page writes the entire page to WAL:

    Checkpoint at LSN 1000

    First modification to Page 7 after checkpoint:
    WAL record LSN 1050:
    ┌─────────────────────────────────────────────────────────────┐
    │ Type: FULL PAGE IMAGE                                        │
    │ Page: 7                                                       │
    │ Content: [entire 8KB page content]                           │
    │ Plus: The actual change                                       │
    └─────────────────────────────────────────────────────────────┘

    Subsequent modifications to Page 7 before next checkpoint:
    Only log the delta (normal small records)

On recovery, if page is torn, restore from full page image in WAL.

Trade-off: Full page writes increase WAL volume significantly but ensure recoverability.
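
To get a feel for this overhead, PostgreSQL 13+ can report how much WAL a statement generated, including full-page images ("fpi"); a rough experiment, assuming the accounts table from earlier examples:

BEGIN;
EXPLAIN (ANALYZE, WAL) UPDATE accounts SET balance = balance + 0 WHERE id = 1;
-- Look for a line like: WAL: records=... fpi=... bytes=...
-- fpi > 0 means this was the page's first modification since the last checkpoint
ROLLBACK;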


7.8 WAL Archiving and PITR

WAL enables Point-in-Time Recovery (PITR):

                    POINT-IN-TIME RECOVERY

    ┌─────────────────────────────────────────────────────────────┐
    │ Full Backup (Sunday midnight)                                │
    │ [Snapshot of all data files]                                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Apply WAL...
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ WAL Segments (Sunday → Today)                                │
    │ [Mon] → [Tue] → [Wed] → [Thu] → [Fri] → [Sat]               │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Stop at specific point
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ "Restore database as of Thursday 3:47:23 PM"                 │
    └─────────────────────────────────────────────────────────────┘

Setting Up WAL Archiving (PostgreSQL)

# postgresql.conf
wal_level = replica          # or 'logical'
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

Performing PITR

# 1. Restore the base backup: a file-level copy of the data directory
#    (e.g. taken with pg_basebackup; pg_restore is only for logical dumps)
cp -a /backup/base_backup/. "$PGDATA"/

# 2. Configure recovery
# recovery.conf (before v12) or postgresql.conf plus an empty recovery.signal file (v12+)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 15:47:23'

# 3. Start server - it will replay WAL to target time

7.9 WAL and Replication

WAL is the foundation of database replication:

                    STREAMING REPLICATION

    Primary Server                      Standby Server
    ┌───────────────────┐              ┌───────────────────┐
    │ Transactions      │              │                   │
    │      │            │              │                   │
    │      ▼            │              │                   │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │   WAL   │───────│─ Stream ────│►│   WAL   │       │
    │ └─────────┘       │   WAL        │ └─────────┘       │
    │      │            │              │      │            │
    │      ▼            │              │      ▼            │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │  Data   │       │              │ │  Data   │       │
    │ └─────────┘       │              │ └─────────┘       │
    └───────────────────┘              └───────────────────┘

    Standby applies WAL records to stay in sync

Synchronous vs Asynchronous Replication

Asynchronous (default):

  • Primary doesn’t wait for standby
  • Possible data loss on failover
  • Better performance

Synchronous:

  • Primary waits for standby acknowledgment
  • No data loss on failover
  • Higher latency
-- PostgreSQL synchronous replication
-- postgresql.conf on primary:
synchronous_standby_names = 'standby1'

-- Commit waits for standby confirmation
synchronous_commit = remote_apply  -- or: on, remote_write
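
Replication health is visible from the primary; a quick check of each standby's state and lag (PostgreSQL 10+):

-- Run on the primary
SELECT client_addr, state, sync_state, replay_lag
FROM pg_stat_replication;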

7.10 WAL Performance Considerations

WAL Write Latency

Every commit requires:

  1. Write WAL record(s) to buffer
  2. fsync() WAL to disk
  3. Return to client

The fsync() is the bottleneck:

    HDD: fsync ≈ 5-10ms  → ~100-200 commits/second
    SSD: fsync ≈ 0.1-1ms → ~1,000-10,000 commits/second
    NVMe: fsync ≈ 20-50μs → ~20,000-50,000 commits/second
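
If losing the most recent commits in a crash is acceptable, PostgreSQL can also skip the fsync wait entirely with asynchronous commit; the trade is bounded (data is never corrupted, but commits from roughly the last three times wal_writer_delay can be lost):

-- Per session or per transaction: report success before the WAL flush
SET synchronous_commit = off;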

Group Commit

Batch multiple transactions’ fsync into one:

    Without group commit:
    T1: write → fsync (5ms)
    T2: write → fsync (5ms)
    T3: write → fsync (5ms)
    Total: 15ms for 3 transactions

    With group commit:
    T1: write → wait
    T2: write → wait
    T3: write → wait
    All: single fsync (5ms)
    Total: 5ms for 3 transactions
-- PostgreSQL group commit tuning
commit_delay = 10        -- Wait 10μs for more commits to batch
commit_siblings = 5      -- Only delay if 5+ active transactions

WAL Compression

Compress WAL to reduce I/O:

-- PostgreSQL 15+
wal_compression = lz4    -- or: pglz, zstd, on (pglz)

Reduces WAL size by ~50% with minimal CPU overhead.


7.11 WAL Segment Management

WAL is stored in fixed-size segment files:

    PostgreSQL WAL directory:
    pg_wal/
    ├── 000000010000000000000001   (16MB segment)
    ├── 000000010000000000000002   (16MB segment)
    ├── 000000010000000000000003   (16MB segment)
    └── ...

    Filename encodes: timeline + segment number

WAL Retention

Old segments are recycled or removed based on:

  • wal_keep_size: Minimum WAL to retain
  • Replication slots: Keep WAL needed by standbys
  • Archive status: Keep until archived
-- Check WAL usage
SELECT * FROM pg_stat_wal;

-- Check replication slot WAL retention
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots;

WAL Bloat Issues

WAL can grow unexpectedly due to:

  • Inactive replication slots
  • Slow archive commands
  • Long-running transactions
-- Find inactive slots consuming WAL
SELECT slot_name, active, pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) as retained
FROM pg_replication_slots;

-- Drop problematic slot
SELECT pg_drop_replication_slot('unused_slot');

7.12 Double-Write Buffer (InnoDB)

MySQL InnoDB uses a different approach to torn pages:

                    INNODB DOUBLE-WRITE BUFFER

    Dirty Page in Buffer Pool
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. Write to Double-Write Buffer (sequential, 2MB area)       │
    │    [Page1][Page2][Page3]...                                  │
    │    fsync()                                                    │
    └─────────────────────────────────────────────────────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. Write to actual page locations (random I/O)               │
    └─────────────────────────────────────────────────────────────┘

    On crash:
    - If torn page detected, restore from double-write buffer
    - Then apply redo log

Trade-off: Doubles write I/O for data pages, but provides torn page protection without full page writes in the redo log.
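
The double-write buffer is controlled by a server variable and is on by default; it should generally stay on unless the filesystem itself guarantees atomic page writes:

-- MySQL: check whether the double-write buffer is enabled
SHOW VARIABLES LIKE 'innodb_doublewrite';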


7.13 Summary

Write-Ahead Logging is fundamental to database durability:

  • WAL rule: Log before data, always
  • LSN: Unique identifier for log ordering
  • Checkpoints: Bound recovery time, enable WAL recycling
  • ARIES: Industry-standard recovery algorithm
  • Full page writes: Handle torn pages
  • PITR: Restore to any point in time
  • Replication: Stream WAL to standbys
  • Group commit: Batch fsyncs for performance

Understanding WAL helps you configure durability guarantees and diagnose recovery issues.


What’s Next

In Chapter 8, we’ll explore MVCC (Multi-Version Concurrency Control) and transaction isolation—how databases let multiple users read and write simultaneously without stepping on each other.


“The WAL is your safety net. Everything else is just performance.”

Chapter 8: MVCC and Transaction Isolation

“MVCC: Because readers should never wait for writers.”

When multiple transactions access the database simultaneously, chaos ensues—unless the database carefully manages what each transaction can see. MVCC (Multi-Version Concurrency Control) and transaction isolation levels are the mechanisms that maintain order.


8.1 The Concurrency Problem

Consider two transactions running simultaneously:

-- Transaction A                    -- Transaction B
BEGIN;                              BEGIN;
SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $1000
                                    UPDATE accounts
                                      SET balance = balance - 100
                                      WHERE id = 1;
                                    COMMIT;

SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $1000 or $900?

Should Transaction A see B’s change? It depends on what guarantees we want.


8.2 ACID: The Transaction Guarantees

Transactions provide four guarantees, known as ACID:

Atomicity

All or nothing. Either all changes commit, or none do.

    Transfer $100:
    1. Deduct from Account A
    2. Add to Account B

    If crash between steps:
    - Without atomicity: A lost $100, B gained nothing
    - With atomicity: Transaction rolls back, both unchanged

Consistency

Database moves from one valid state to another. Constraints are maintained.

    Constraint: balance >= 0

    Account has $50, tries to deduct $100:
    - Without consistency: Balance = -$50 (invalid)
    - With consistency: Transaction rejected

Isolation

Concurrent transactions don’t interfere. Each sees a consistent view.

    Transaction A reads balance
    Transaction B modifies balance
    Transaction A reads balance again

    - Without isolation: A sees different values
    - With isolation: A sees consistent view

Durability

Once committed, changes survive crashes.

    COMMIT succeeds
    Server crashes
    Server restarts

    - Without durability: Data might be lost
    - With durability: Committed data is there

8.3 MVCC: The Key Insight

Multi-Version Concurrency Control maintains multiple versions of each row:

                    MVCC: MULTIPLE VERSIONS

    Logical row: accounts WHERE id = 1

    Physical storage:
    ┌─────────────────────────────────────────────────────────────┐
    │ Version 1: balance=$1000, created=TxID 100, deleted=TxID 150│
    │ Version 2: balance=$900,  created=TxID 150, deleted=∞       │
    └─────────────────────────────────────────────────────────────┘

    Transaction 140 (started before 150):
    - Sees Version 1 (created 100 ≤ 140, deleted 150 > 140)

    Transaction 160 (started after 150):
    - Sees Version 2 (created 150 ≤ 160, deleted ∞)

Key Benefits

  1. Readers don’t block writers: Readers see old version while writer modifies
  2. Writers don’t block readers: Writer creates new version, readers see old
  3. Consistent snapshots: Each transaction sees data as of its start time

8.4 How MVCC Works: PostgreSQL

Row Structure

Each row has hidden system columns:

    Physical Row (tuple):
    ┌─────────────────────────────────────────────────────────────┐
    │ xmin:  Transaction that created this version (100)          │
    │ xmax:  Transaction that deleted/updated this version (150)  │
    │ ctid:  Current tuple ID (physical location)                 │
    │ data:  balance=$1000, name='Alice', ...                     │
    └─────────────────────────────────────────────────────────────┘
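
These system columns are real and queryable, which makes MVCC easy to watch in action (assuming the accounts table used throughout this chapter):

-- Inspect a row's version metadata
SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 1;
-- While another session holds an uncommitted UPDATE on this row, the version
-- you still see will show that session's transaction ID in xmax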

Visibility Rules

A row version is visible to transaction T if:

  1. xmin is committed AND xmin started before T
  2. xmax is empty OR xmax is not committed OR xmax started after T

    Transaction 140 checks visibility of Version 1:
    - xmin = 100 (committed, before 140) ✓
    - xmax = 150 (not yet committed OR after 140) ✓
    - VISIBLE

    Transaction 160 checks visibility of Version 1:
    - xmin = 100 (committed, before 160) ✓
    - xmax = 150 (committed, before 160) ✗
    - NOT VISIBLE (deleted)

UPDATE Creates New Version

UPDATE accounts SET balance = 900 WHERE id = 1;
    Before (TxID 150):
    ┌────────────────────────────────────────────────────────────┐
    │ xmin=100, xmax=∞, balance=$1000                            │
    └────────────────────────────────────────────────────────────┘

    After:
    ┌────────────────────────────────────────────────────────────┐
    │ xmin=100, xmax=150, balance=$1000  ← Old version (deleted) │
    └────────────────────────────────────────────────────────────┘
    ┌────────────────────────────────────────────────────────────┐
    │ xmin=150, xmax=∞, balance=$900     ← New version           │
    └────────────────────────────────────────────────────────────┘

The old row isn’t physically deleted—it’s marked as deleted by setting xmax.

VACUUM: Cleaning Up Old Versions

Old versions accumulate. VACUUM removes them when no transaction can see them:

    Before VACUUM:
    [xmin=100, xmax=150, balance=$1000]  ← Dead, no one can see
    [xmin=150, xmax=∞, balance=$900]     ← Current

    After VACUUM:
    [xmin=150, xmax=∞, balance=$900]     ← Only current remains

VACUUM is crucial for PostgreSQL performance. Without it, tables bloat with dead rows.


8.5 How MVCC Works: MySQL InnoDB

InnoDB uses a different MVCC implementation:

Undo Logs

Instead of keeping old versions in the table, InnoDB stores them in undo logs:

                    INNODB MVCC

    Clustered Index (current data):
    ┌─────────────────────────────────────────────────────────────┐
    │ id=1, balance=$900, TxID=150, roll_ptr=→                   │
    └────────────────────────────────────┬────────────────────────┘
                                         │
                                         ▼
    Undo Log:
    ┌─────────────────────────────────────────────────────────────┐
    │ Previous version: balance=$1000, TxID=100, roll_ptr=NULL    │
    └─────────────────────────────────────────────────────────────┘

Read View

When a transaction starts, it gets a read view:

    Read View for Transaction 140:
    - m_low_limit_id: 160 (TxIDs ≥ this are invisible)
    - m_up_limit_id: 100 (TxIDs < this are visible)
    - m_ids: [150, 155] (active transactions at start, invisible)

To read a row, follow undo chain until finding a visible version.

Purge Thread

InnoDB’s purge thread removes old undo records when no transaction needs them.
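
The backlog of undo records still awaiting purge is reported as the "history list length"; if it keeps growing, a long-running transaction is usually pinning old versions:

-- MySQL: check the purge backlog
SHOW ENGINE INNODB STATUS\G
-- Under TRANSACTIONS, look for "History list length"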


8.6 Transaction Isolation Levels

SQL defines four isolation levels, offering different trade-offs:

                    ISOLATION LEVEL SPECTRUM

    Less Isolation                              More Isolation
    Faster                                      More Correct
    ◄───────────────────────────────────────────────────────────►

    Read         Read          Repeatable      Serializable
    Uncommitted  Committed     Read

Read Uncommitted

See uncommitted changes from other transactions. (Rarely used)

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Transaction A                    -- Transaction B
BEGIN;                              BEGIN;
                                    UPDATE accounts SET balance = 0
                                      WHERE id = 1;
                                    -- NOT committed yet!

SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $0 (dirty read!)
                                    ROLLBACK;

-- A saw data that never existed!

Anomaly allowed: Dirty reads

Read Committed (PostgreSQL/Oracle default)

Only see committed changes. Each statement sees latest committed data.

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Transaction A                    -- Transaction B
BEGIN;                              BEGIN;
SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $1000
                                    UPDATE accounts SET balance = 900
                                      WHERE id = 1;
                                    COMMIT;

SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $900 (different!)

-- Same query, different result = Non-repeatable read

Anomaly allowed: Non-repeatable reads, phantoms

Repeatable Read (MySQL default)

See a snapshot as of transaction start. Same query returns same result.

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Transaction A                    -- Transaction B
BEGIN;                              BEGIN;
SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $1000
                                    UPDATE accounts SET balance = 900
                                      WHERE id = 1;
                                    COMMIT;

SELECT balance FROM accounts
  WHERE id = 1;
-- Returns: $1000 (same as before!)

COMMIT;

Anomaly allowed: Phantom reads (in standard SQL; PostgreSQL prevents them)

Serializable

Transactions execute as if in serial order. No anomalies.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Fully isolated: equivalent to running one after another

Most restrictive but guarantees correctness.


8.7 Isolation Anomalies Explained

Dirty Read

Reading uncommitted data that may be rolled back.

    T1: UPDATE row SET x = 2
    T2: SELECT x  -- Returns 2
    T1: ROLLBACK
    -- T2 read value that never existed!

Non-Repeatable Read

Same query returns different results within one transaction.

    T1: SELECT x  -- Returns 1
    T2: UPDATE row SET x = 2
    T2: COMMIT
    T1: SELECT x  -- Returns 2 (different!)

Phantom Read

New rows appear in repeated queries.

    T1: SELECT * FROM accounts WHERE balance > 100
        -- Returns: Alice, Bob

    T2: INSERT INTO accounts (name, balance) VALUES ('Carol', 150)
    T2: COMMIT

    T1: SELECT * FROM accounts WHERE balance > 100
        -- Returns: Alice, Bob, Carol (phantom!)

Write Skew

Two transactions read overlapping data, make decisions, write non-overlapping data, creating invalid state.

    Constraint: At least one doctor must be on call

    T1: SELECT count(*) FROM on_call  -- Returns 2
    T2: SELECT count(*) FROM on_call  -- Returns 2

    T1: DELETE FROM on_call WHERE doctor = 'Alice'
    T2: DELETE FROM on_call WHERE doctor = 'Bob'

    T1: COMMIT (1 doctor remains, OK)
    T2: COMMIT (0 doctors remain, VIOLATES CONSTRAINT!)

Neither transaction saw the other’s delete.


8.8 Isolation Levels and Anomalies Matrix

    Isolation Level   | Dirty Read | Non-Repeatable | Phantom   | Write Skew
    ──────────────────┼────────────┼────────────────┼───────────┼───────────
    Read Uncommitted  | Yes        | Yes            | Yes       | Yes
    Read Committed    | No         | Yes            | Yes       | Yes
    Repeatable Read   | No         | No             | Possible* | Yes
    Serializable      | No         | No             | No        | No

*PostgreSQL prevents phantoms at Repeatable Read; MySQL allows them


8.9 Snapshot Isolation

Many databases implement Snapshot Isolation (SI), which is between Repeatable Read and Serializable:

    Snapshot Isolation Rules:
    1. Transaction sees consistent snapshot from start time
    2. Write conflicts detected at commit time
    3. First committer wins

    T1: BEGIN (snapshot at time 100)
    T2: BEGIN (snapshot at time 100)

    T1: UPDATE accounts SET balance = 900 WHERE id = 1
    T2: UPDATE accounts SET balance = 800 WHERE id = 1

    T1: COMMIT  -- Succeeds
    T2: COMMIT  -- FAILS! Write conflict on id=1

PostgreSQL: Repeatable Read is actually Snapshot Isolation.
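
When two snapshot transactions do collide on the same row, the loser is aborted rather than silently overwritten. In PostgreSQL the conflict surfaces as soon as the winner commits, and the losing statement fails with an error like:

-- Transaction B's UPDATE, once A has committed:
ERROR:  could not serialize access due to concurrent update
-- The application should retry the whole transaction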

Difference from true Serializable: SI allows write skew anomalies.


8.10 Serializable Snapshot Isolation (SSI)

PostgreSQL’s Serializable level uses SSI (Serializable Snapshot Isolation):

                    SSI: DETECTING CONFLICTS

    Track read/write dependencies:

    T1: Read A
    T2: Write A  ← conflict: T2 wrote what T1 read
    T1: Write B  ← conflict: T1 wrote after reading T2's input

    Dependency graph:
    T1 ──rw──► T2 ──wr──► T1  (cycle detected!)

    Result: Abort one transaction to break cycle

SSI adds overhead for tracking dependencies but provides true serializability without the locking overhead of traditional approaches.
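
In practice this means a serializable transaction can be aborted through no fault of its own and must simply be retried; in PostgreSQL the failure looks like this:

ERROR:  could not serialize access due to read/write dependencies among transactions
HINT:  The transaction might succeed if retried.
-- Wrap SERIALIZABLE transactions in an application-level retry loop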


8.11 Practical Isolation Level Selection

When to Use Read Committed

  • Default for most OLTP applications
  • Acceptable when slight inconsistency is OK
  • When you want maximum concurrency
-- OK: User profile view (momentary inconsistency acceptable)
SELECT name, email, last_login FROM users WHERE id = ?;

When to Use Repeatable Read / Snapshot Isolation

  • Reports that need consistent view of data
  • Transactions reading the same data multiple times
  • When non-repeatable reads would cause bugs
-- Generate consistent report
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(balance) FROM accounts;
SELECT COUNT(*) FROM accounts WHERE balance < 0;
COMMIT;
-- Both queries see same snapshot

When to Use Serializable

  • Financial transactions where write skew is dangerous
  • When correctness is critical
  • Simpler application logic (fewer edge cases to handle)
-- On-call schedule modification
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM on_call;
DELETE FROM on_call WHERE doctor = ?;
-- Database prevents invalid state
COMMIT;

8.12 MVCC Overhead and Maintenance

Table Bloat

MVCC creates dead rows that consume space:

-- PostgreSQL: Check table bloat
SELECT
    schemaname || '.' || relname as table,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

VACUUM Strategies

-- Manual VACUUM
VACUUM VERBOSE accounts;

-- Aggressive space reclamation
VACUUM FULL accounts;  -- Warning: locks table!

-- Autovacuum configuration
ALTER TABLE accounts SET (
    autovacuum_vacuum_threshold = 100,
    autovacuum_vacuum_scale_factor = 0.1
);

Transaction ID Wraparound

PostgreSQL transaction IDs are 32-bit and wrap around. Long-running transactions or failed vacuums can cause issues:

-- Check transaction ID age
SELECT
    datname,
    age(datfrozenxid) as xid_age,
    2^31 - age(datfrozenxid) as remaining
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- If approaching 2^31, VACUUM urgently!

8.13 Read-Only Transactions

For read-only workloads, declare transactions as read-only:

BEGIN READ ONLY;
SELECT * FROM large_table;
COMMIT;

Benefits:

  • Database can optimize (skip locking, use replicas)
  • Prevents accidental writes
  • May use older snapshot if acceptable
-- PostgreSQL: attach to a snapshot exported by another session
-- with pg_export_snapshot(), e.g. for consistent parallel reads
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
SET TRANSACTION SNAPSHOT '00000003-0000001A-1';

8.14 Summary

MVCC enables concurrent access without readers blocking writers:

  • Multiple versions of each row coexist
  • Visibility rules determine which version each transaction sees
  • Isolation levels trade correctness for concurrency
  • Snapshot Isolation provides repeatable reads efficiently
  • SSI enables true serializability without traditional locking
  • VACUUM cleans up old versions to prevent bloat

Understanding MVCC helps you choose appropriate isolation levels and diagnose concurrency issues.


What’s Next

In Chapter 9, we’ll explore locking and other concurrency control mechanisms—the locks, latches, and protocols that databases use when MVCC alone isn’t enough.


“In MVCC, every writer is a creator, and every reader sees their own private reality.”

Chapter 9: Locking and Concurrency Control

“Locks are pessimism made manifest: assume the worst, prevent it.”

MVCC handles most read/write concurrency, but some scenarios require explicit coordination. When two transactions want to modify the same row, one must wait. Locking mechanisms manage this coordination.


9.1 Why Locking Still Matters

Even with MVCC, we need locks for:

  1. Write-write conflicts: Two transactions can’t modify the same row simultaneously
  2. Schema changes: ALTER TABLE must exclude all other access
  3. Explicit coordination: Application-level locks for critical sections
  4. Foreign key checks: Ensuring referenced rows aren’t deleted
  5. Serializable isolation: Preventing certain anomalies

    Transaction A                    Transaction B
    UPDATE row WHERE id = 1          UPDATE row WHERE id = 1

    Both want to modify the same row!
    Without locking: Lost update anomaly
    With locking: One waits for the other

9.2 Lock Types

Shared vs Exclusive Locks

The fundamental lock types:

    SHARED (S) Lock - For reading
    - Multiple transactions can hold simultaneously
    - "I'm reading this, don't change it"

    EXCLUSIVE (X) Lock - For writing
    - Only one transaction can hold
    - "I'm modifying this, everyone else wait"

    Compatibility Matrix:
              | Existing S | Existing X |
    ──────────┼────────────┼────────────┤
    Request S |    OK      |   BLOCK    │
    Request X |   BLOCK    |   BLOCK    │

Lock Granularity

Locks can apply at different levels:

                    LOCK HIERARCHY

    Database Lock ──────────────────────────────────────────
         │
         └── Schema Lock ─────────────────────────────────
                  │
                  └── Table Lock ──────────────────────
                           │
                           └── Page Lock ─────────────
                                    │
                                    └── Row Lock ───

Coarse granularity (table locks):

  • Few locks to manage
  • Less overhead
  • More contention (blocking)

Fine granularity (row locks):

  • Many locks to manage
  • More overhead
  • Less contention

9.3 Row-Level Locking

Most OLTP databases use row-level locking for data modifications:

-- Transaction A
BEGIN;
UPDATE accounts SET balance = 900 WHERE id = 1;
-- Holds exclusive lock on row id=1

-- Transaction B (blocked)
UPDATE accounts SET balance = 800 WHERE id = 1;
-- Waits for A's lock...

-- Transaction A
COMMIT;
-- Lock released

-- Transaction B
-- Now acquires lock, proceeds

Lock Escalation

When too many row locks are held, some databases escalate to table locks:

    SQL Server Lock Escalation:

    Transaction locks 5000 rows on table T
    → System says "too many locks"
    → Escalate to single table lock
    → Release all row locks

    Trade-off: Reduces lock memory but increases contention

PostgreSQL doesn’t escalate; it manages many row locks efficiently.


9.4 Intent Locks

To lock a row, we need to ensure no one has locked the entire table. Intent locks solve this:

                    INTENT LOCKS

    To lock a row:
    1. Acquire Intent Shared (IS) or Intent Exclusive (IX) on table
    2. Acquire actual S or X lock on row

    Intent Lock Compatibility:
              | IS  | IX  |  S  |  X  |
    ──────────┼─────┼─────┼─────┼─────┤
    IS        | OK  | OK  | OK  |BLOCK│
    IX        | OK  | OK  |BLOCK|BLOCK│
    S         | OK  |BLOCK| OK  |BLOCK│
    X         |BLOCK|BLOCK|BLOCK|BLOCK│

Example:

    Transaction A wants to UPDATE row
    1. Acquire IX lock on table (intent to write somewhere)
    2. Acquire X lock on row

    Transaction B wants to ALTER TABLE
    1. Try to acquire X lock on table
    2. BLOCKED by A's IX lock
    3. Wait until A commits

9.5 Deadlocks

When transactions wait for each other in a cycle:

    DEADLOCK SCENARIO

    T1: Lock row A
    T2: Lock row B
    T1: Try to lock row B → WAIT (T2 has it)
    T2: Try to lock row A → WAIT (T1 has it)

    ┌─────┐        ┌─────┐
    │ T1  │───────►│  B  │ waits for
    └──▲──┘        └──┬──┘
       │              │
       │ holds        │ holds
       │              │
    ┌──┴──┐        ┌──▼──┐
    │  A  │◄───────│ T2  │ waits for
    └─────┘        └─────┘

    CYCLE! Neither can proceed.

Deadlock Detection

Databases maintain a wait-for graph:

    Wait-For Graph:
    T1 → T2 (T1 waits for T2)
    T2 → T1 (T2 waits for T1)

    Cycle detected! Deadlock exists.

Deadlock Resolution

When detected, abort one transaction (the “victim”):

-- PostgreSQL error
ERROR:  deadlock detected
DETAIL:  Process 1234 waits for ShareLock on transaction 5678;
         blocked by process 5678.
         Process 5678 waits for ShareLock on transaction 1234;
         blocked by process 1234.
HINT:  See server log for query details.

Victim selection criteria:

  • Transaction with least work done
  • Transaction that’s youngest
  • Transaction with fewest locks held

Deadlock Prevention

Approach 1: Lock ordering

    Always lock resources in consistent order (e.g., by ID)

    Correct:
    T1: Lock A, Lock B
    T2: Lock A, Lock B

    Wrong (potential deadlock):
    T1: Lock A, Lock B
    T2: Lock B, Lock A
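
One convenient way to get a consistent order in SQL is to lock every row you will touch in a single statement, sorted by key (a common pattern, shown here against the accounts table):

BEGIN;
SELECT * FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;
-- Both rows are now locked in id order; proceed with the transfer
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;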

Approach 2: Lock timeout

-- PostgreSQL: Give up after 1 second
SET lock_timeout = '1s';
UPDATE accounts SET balance = 0 WHERE id = 1;
-- ERROR: canceling statement due to lock timeout

Approach 3: NOWAIT

-- Fail immediately if lock unavailable
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
-- ERROR: could not obtain lock on row

9.6 Explicit Locking in SQL

SELECT FOR UPDATE

Lock rows for later modification:

BEGIN;
SELECT * FROM inventory WHERE product_id = 42 FOR UPDATE;
-- Row is now locked exclusively

-- Check quantity, decide to decrement
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 42;
COMMIT;

Without FOR UPDATE, another transaction could modify between SELECT and UPDATE.

SELECT FOR SHARE

Lock rows to prevent modification (but allow other readers):

BEGIN;
SELECT * FROM parent_table WHERE id = 1 FOR SHARE;
-- Now inserting child row - parent can't be deleted
INSERT INTO child_table (parent_id, data) VALUES (1, 'something');
COMMIT;

SKIP LOCKED

For job queues—skip rows that are locked:

-- Worker 1
BEGIN;
SELECT * FROM jobs WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;
-- Gets job A (locks it)

-- Worker 2 (concurrent)
BEGIN;
SELECT * FROM jobs WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;
-- Skips job A (locked), gets job B

This enables efficient work distribution without blocking.


9.7 Table Locks

Sometimes you need to lock entire tables:

-- PostgreSQL table lock modes (partial list)
LOCK TABLE accounts IN ACCESS SHARE MODE;      -- Blocks nothing
LOCK TABLE accounts IN ROW SHARE MODE;         -- Blocks exclusive
LOCK TABLE accounts IN ROW EXCLUSIVE MODE;     -- Blocks share & exclusive
LOCK TABLE accounts IN SHARE MODE;             -- Blocks writers
LOCK TABLE accounts IN EXCLUSIVE MODE;         -- Blocks almost all
LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE;  -- Blocks everything

Common uses:

  • Schema changes (ACCESS EXCLUSIVE)
  • Bulk loads (SHARE or EXCLUSIVE)
  • Preventing concurrent DDL

9.8 Advisory Locks

Application-defined locks not tied to database objects:

-- PostgreSQL advisory locks
-- Acquire a lock on an application-defined resource; advisory locks are
-- keyed by integers, so "user 42's profile" is mapped to the key 42 here
SELECT pg_advisory_lock(42);

-- Do something critical
UPDATE user_profiles SET ... WHERE user_id = 42;

-- Release lock
SELECT pg_advisory_unlock(42);

Use cases:

  • Distributed coordination
  • Rate limiting
  • Singleton job execution
  • Custom critical sections
-- Try to get lock without waiting
SELECT pg_try_advisory_lock(12345);
-- Returns true if acquired, false if already held

-- Session-level lock (released when connection closes)
SELECT pg_advisory_lock(12345);

-- Transaction-level lock (released on commit/rollback)
SELECT pg_advisory_xact_lock(12345);

9.9 Optimistic vs Pessimistic Concurrency

Pessimistic Locking

Lock first, then operate. Assumes conflicts are common.

BEGIN;
SELECT * FROM inventory WHERE id = 42 FOR UPDATE;  -- Lock now!
-- Calculate new quantity
UPDATE inventory SET quantity = ? WHERE id = 42;
COMMIT;

Pros: Prevents conflicts
Cons: Reduces concurrency, potential deadlocks

Optimistic Locking

Check for conflicts at commit time. Assumes conflicts are rare.

-- Include version column
SELECT id, quantity, version FROM inventory WHERE id = 42;
-- Returns: quantity=100, version=5

-- Later, try to update with version check
UPDATE inventory
SET quantity = 99, version = version + 1
WHERE id = 42 AND version = 5;

-- If rows_affected = 0, someone else changed it!
-- Application must retry

Pros: More concurrency, no deadlocks
Cons: Must handle retries, wasted work on conflict

When to Use Which

    Pessimistic:
    - High contention expected
    - Short transactions
    - Can't afford to retry

    Optimistic:
    - Low contention expected
    - Long transactions (avoid holding locks)
    - Read-heavy with occasional writes
    - Web forms (user think time)

9.10 Latches vs Locks

Databases distinguish between:

Locks (heavyweight):

  • Protect logical data (rows, tables)
  • Held for duration of transaction
  • Involved in deadlock detection
  • Visible to users

Latches (lightweight):

  • Protect physical structures (pages, B-tree nodes)
  • Held for very short duration
  • No deadlock detection (ordered acquisition)
  • Internal implementation detail

    B-tree insertion:
    1. Acquire latch on root page (read)
    2. Navigate down, acquiring latches
    3. Find leaf page
    4. Acquire latch on leaf (write)
    5. Release latches on parent pages
    6. Modify leaf
    7. Release leaf latch

    Very fast - microseconds, not milliseconds

9.11 Lock Monitoring

PostgreSQL

-- View current locks
SELECT
    pid,
    locktype,
    relation::regclass,
    mode,
    granted
FROM pg_locks
WHERE relation IS NOT NULL;

-- View blocking relationships
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.query AS blocked_query,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
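
PostgreSQL 9.6+ also offers a shortcut, pg_blocking_pids(), which returns the processes blocking a given backend:

-- Who is blocking whom, the short way
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;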

MySQL InnoDB

-- View current locks
SELECT * FROM performance_schema.data_locks;

-- View lock waits
SELECT * FROM performance_schema.data_lock_waits;

-- Traditional text report (still useful for deadlock details)
SHOW ENGINE INNODB STATUS\G
-- Look for "TRANSACTIONS" and "LATEST DETECTED DEADLOCK"

9.12 Reducing Lock Contention

1. Shorter Transactions

-- Bad: Long transaction holding locks
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
-- ... application does slow processing ...
UPDATE accounts SET balance = ? WHERE id = 1;
COMMIT;

-- Better: Do processing outside transaction
-- Read data (no lock)
SELECT * FROM accounts WHERE id = 1;
-- Application processing...
-- Then quick update
BEGIN;
UPDATE accounts SET balance = ? WHERE id = 1 AND balance = original_balance;
COMMIT;

2. Lock Only What’s Needed

-- Bad: Lock entire table
LOCK TABLE accounts IN EXCLUSIVE MODE;
UPDATE accounts SET balance = 0 WHERE id = 1;

-- Better: Lock only the row
UPDATE accounts SET balance = 0 WHERE id = 1;

3. Access Rows in Consistent Order

-- Bad: Different orders → Deadlock risk
-- Transaction 1: UPDATE ... WHERE id = 1; UPDATE ... WHERE id = 2;
-- Transaction 2: UPDATE ... WHERE id = 2; UPDATE ... WHERE id = 1;

-- Better: Always order by ID
-- Transaction 1: UPDATE ... WHERE id = 1; UPDATE ... WHERE id = 2;
-- Transaction 2: UPDATE ... WHERE id = 1; UPDATE ... WHERE id = 2;

4. Use Appropriate Isolation Level

-- Don't use SERIALIZABLE if READ COMMITTED is sufficient
BEGIN ISOLATION LEVEL READ COMMITTED;
-- Less locking overhead

9.13 Two-Phase Locking (2PL)

Traditional concurrency control protocol:

    Two-Phase Locking Rules:
    1. Growing phase: Acquire locks, never release
    2. Shrinking phase: Release locks, never acquire

    Transaction Timeline:
    ┌────────────────────────────────────────────────────────┐
    │ Lock A │ Lock B │ Lock C │ Unlock A │ Unlock B │ Unlock C │
    └────────────────────────────────────────────────────────┘
    |←── Growing Phase ──→|←────── Shrinking Phase ──────→|

Strict 2PL: Hold all locks until commit/abort (common practice).

Guarantees serializability but can cause:

  • Blocking (transactions wait for locks)
  • Deadlocks (must be detected and resolved)

Most databases combine 2PL with MVCC: reads use MVCC snapshots, writes use 2PL.


9.14 Summary

Locking mechanisms coordinate concurrent access:

  • Shared/Exclusive locks control read/write access
  • Row-level locking provides fine-grained concurrency
  • Intent locks enable hierarchical locking
  • Deadlocks are detected and resolved by aborting a transaction
  • Explicit locks (FOR UPDATE, LOCK TABLE) provide application control
  • Advisory locks enable application-level coordination
  • Optimistic locking uses version checks instead of explicit locks
  • Latches protect internal structures

Understanding locking helps you design high-concurrency systems and debug performance issues.


What’s Next

In Chapter 10, we’ll explore query parsing and planning—how databases transform SQL text into executable operations.


“The best lock is the one you don’t need. The second best is the one you hold briefly.”

Chapter 10: Query Parsing and Planning

“SQL is declarative: you say what you want, not how to get it. The planner figures out the how.”

When you write SQL, you express intent—“give me users older than 30 sorted by name.” The database must figure out how to actually retrieve that data. This transformation from SQL text to executable plan happens through parsing, analysis, and planning.


10.1 The Query Processing Pipeline

                    QUERY PROCESSING STAGES

    "SELECT name FROM users WHERE age > 30 ORDER BY name"
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. LEXER/TOKENIZER                                          │
    │    Break into tokens: SELECT, name, FROM, users, WHERE, ... │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. PARSER                                                    │
    │    Build Abstract Syntax Tree (AST) from tokens              │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 3. ANALYZER/BINDER                                           │
    │    Resolve names, check types, validate semantics            │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 4. REWRITER                                                  │
    │    Apply rules: views, RLS policies, query rewriting         │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 5. PLANNER/OPTIMIZER                                         │
    │    Generate and evaluate execution plans, choose best        │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 6. EXECUTOR                                                  │
    │    Run the chosen plan, return results                       │
    └─────────────────────────────────────────────────────────────┘

10.2 Lexical Analysis (Tokenization)

The lexer breaks SQL text into tokens:

SELECT name, age FROM users WHERE age > 30
    Tokens:
    ┌──────────┬─────────────┐
    │ Type     │ Value       │
    ├──────────┼─────────────┤
    │ KEYWORD  │ SELECT      │
    │ IDENT    │ name        │
    │ COMMA    │ ,           │
    │ IDENT    │ age         │
    │ KEYWORD  │ FROM        │
    │ IDENT    │ users       │
    │ KEYWORD  │ WHERE       │
    │ IDENT    │ age         │
    │ OPERATOR │ >           │
    │ NUMBER   │ 30          │
    └──────────┴─────────────┘

Handling Strings and Identifiers

SELECT "column-with-dash" FROM 'my_table'
WHERE name = 'O''Brien'  -- Escaped quote

The lexer handles:

  • Quoted identifiers ("column-with-dash")
  • String literals ('O''Brien')
  • Escape sequences
  • Comments (-- comment, /* block */)

10.3 Parsing: Building the AST

The parser uses grammar rules to build an Abstract Syntax Tree:

                    ABSTRACT SYNTAX TREE

                        SelectStmt
                       /    |    \
                      /     |     \
                 target   from   where
                  list    clause  clause
                   |        |        |
              ┌────┴────┐   │    BinaryOp
              │         │   │      (>)
           ColRef   ColRef │     /   \
           (name)   (age)  │  ColRef  Const
                           │  (age)   (30)
                       TableRef
                       (users)

Parser Implementation

Most databases use parser generators (Bison/Yacc) with grammar rules:

select_stmt:
    SELECT target_list FROM from_clause where_clause
    ;

target_list:
    target_entry
    | target_list ',' target_entry
    ;

where_clause:
    /* empty */
    | WHERE expr
    ;

Syntax Errors

The parser catches syntax errors:

SELCT name FROM users;
-- Error: syntax error at or near "SELCT"

SELECT name FROM users WHERE;
-- Error: syntax error at or near ";" (incomplete WHERE clause)

10.4 Semantic Analysis (Binding)

The analyzer validates that the AST makes sense:

Name Resolution

SELECT name FROM users WHERE age > 30;
  1. Does table users exist?
  2. Does users have column name?
  3. Does users have column age?
  4. Is comparing age to 30 valid (types compatible)?

Type Checking

SELECT name FROM users WHERE name > 30;
-- PostgreSQL: Error: operator does not exist: text > integer (no implicit cast)
-- MySQL: allowed (name is implicitly cast to a number)

SELECT name FROM users WHERE name > ARRAY[1,2,3];
-- Error: operator does not exist: text > integer[]

Ambiguity Resolution

SELECT name FROM users, orders WHERE id = order_id;
-- Which table's id? Must specify: users.id or orders.id

Output Schema

The analyzer determines the result’s schema:

SELECT name, age + 1 AS next_age FROM users;
-- Result columns:
--   name: text
--   next_age: integer (expression type)

10.5 Query Rewriting

Before optimization, databases apply rewriting rules:

View Expansion

-- View definition
CREATE VIEW active_users AS
    SELECT * FROM users WHERE status = 'active';

-- Original query
SELECT name FROM active_users WHERE age > 30;

-- After expansion
SELECT name FROM users WHERE status = 'active' AND age > 30;

Subquery Flattening

-- Original
SELECT * FROM orders WHERE customer_id IN (
    SELECT id FROM customers WHERE country = 'USA'
);

-- Rewritten as join
SELECT orders.* FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'USA';

Predicate Pushdown

-- Original
SELECT * FROM (
    SELECT * FROM orders WHERE year = 2024
) sub
WHERE amount > 100;

-- Rewritten
SELECT * FROM orders WHERE year = 2024 AND amount > 100;

Row-Level Security (RLS)

-- Policy: users can only see their own data
CREATE POLICY user_isolation ON accounts
    USING (user_id = current_user_id());

-- Original query
SELECT * FROM accounts;

-- After RLS expansion
SELECT * FROM accounts WHERE user_id = current_user_id();

10.6 Logical vs Physical Plans

Logical Plan

Describes what operations are needed, abstractly:

    Logical Plan:
    ┌─────────────────────────────────────────────────────────────┐
    │ Project [name]                                               │
    │   └── Filter [age > 30]                                     │
    │         └── Scan [users]                                    │
    └─────────────────────────────────────────────────────────────┘

Physical Plan

Describes how to execute, concretely:

    Physical Plan:
    ┌─────────────────────────────────────────────────────────────┐
    │ Project [name]                                               │
    │   └── Filter [age > 30]                                     │
    │         └── Index Scan [idx_users_age]                      │
    │               range: age > 30                                │
    └─────────────────────────────────────────────────────────────┘

The same logical plan might have many possible physical plans.


10.7 Plan Operators

Scan Operators

    Sequential Scan:    Read all rows in table
    Index Scan:         Use index to find rows
    Index Only Scan:    Get data from index alone
    Bitmap Scan:        Build bitmap from index, then scan heap

Join Operators

    Nested Loop:        For each row in A, scan B
    Hash Join:          Build hash table from A, probe with B
    Merge Join:         Merge two sorted inputs

Other Operators

    Sort:               Order rows by key
    Aggregate:          Compute SUM, COUNT, etc.
    Limit:              Return only N rows
    Unique:             Remove duplicates
    Append:             Concatenate results (UNION)

Example Plan Tree

SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
ORDER BY COUNT(*) DESC
LIMIT 10;
    Plan:
    Limit (10)
      └── Sort (count DESC)
            └── Aggregate (GROUP BY d.name)
                  └── Hash Join (e.dept_id = d.id)
                        ├── Seq Scan (employees)
                        └── Hash
                              └── Seq Scan (departments)

10.8 Plan Representation in Databases

PostgreSQL EXPLAIN

EXPLAIN SELECT * FROM users WHERE age > 30;
                        QUERY PLAN
─────────────────────────────────────────────────────────
 Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
   Filter: (age > 30)

EXPLAIN ANALYZE (with execution)

EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;
                        QUERY PLAN
─────────────────────────────────────────────────────────
 Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                    (actual time=0.012..0.015 rows=3 loops=1)
   Filter: (age > 30)
   Rows Removed by Filter: 2
 Planning Time: 0.045 ms
 Execution Time: 0.029 ms
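
Adding the BUFFERS option shows how many pages the query touched and whether they came from the buffer pool or from disk:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE age > 30;
--   Buffers: shared hit=1
-- "hit" = pages found in the buffer pool, "read" = pages fetched from disk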

MySQL EXPLAIN

EXPLAIN SELECT * FROM users WHERE age > 30;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | users | ALL  | idx_age       | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

10.9 Understanding EXPLAIN Output

Cost Estimates

Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                         │      │    │      │
                         │      │    │      └── Average row width (bytes)
                         │      │    └── Estimated rows returned
                         │      └── Total cost
                         └── Startup cost

Startup cost: Cost before first row can be returned
Total cost: Cost to return all rows

Node Types

    Seq Scan         - Full table scan
    Index Scan       - Use index, then fetch rows
    Index Only Scan  - Use index, don't fetch rows
    Bitmap Index Scan - Build bitmap of matching rows
    Bitmap Heap Scan - Fetch rows indicated by bitmap

    Nested Loop      - For each outer row, scan inner
    Hash Join        - Build hash table, probe
    Merge Join       - Merge sorted inputs

    Sort             - Sort rows
    HashAggregate    - Aggregate via hash table
    GroupAggregate   - Aggregate pre-sorted groups

Reading Nested Plans

Plans are trees, read bottom-up (leaves execute first):

    Limit  (cost=1.15..1.15 rows=1 width=68)
      ->  Sort  (cost=1.15..1.15 rows=2 width=68)
            Sort Key: age DESC
            ->  Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                  Filter: (age > 30)

    Execution order:
    1. Seq Scan users (filter age > 30)
    2. Sort results by age DESC
    3. Limit to 1 row

10.10 Prepared Statements

Parsing and planning can be cached:

-- Prepare once
PREPARE find_user (int) AS
    SELECT * FROM users WHERE id = $1;

-- Execute many times (skip parse/plan)
EXECUTE find_user(42);
EXECUTE find_user(100);
EXECUTE find_user(7);

Generic vs Custom Plans

For prepared statements with parameters, databases choose:

Custom plan: Re-plan with actual parameter values. Better estimates.
Generic plan: Cache one plan, use it for all parameter values. Less planning overhead.

    First 5 executions: Custom plans (learning phase)
    After 5 executions: Compare costs
    If generic plan is close enough: Switch to generic
-- PostgreSQL: Control plan caching
SET plan_cache_mode = auto;           -- Default: decide automatically
SET plan_cache_mode = force_generic;  -- Always generic
SET plan_cache_mode = force_custom;   -- Always custom

10.11 Plan Caching and Invalidation

What Gets Cached

  • Prepared statement plans
  • PL/pgSQL function plans
  • View definitions

Invalidation Triggers

Plans are invalidated when:

  • Table structure changes (ALTER TABLE)
  • Indexes are created or dropped
  • Statistics are updated (ANALYZE)
  • Related objects are modified
-- This invalidates plans using the users table
ALTER TABLE users ADD COLUMN nickname TEXT;
CREATE INDEX idx_users_email ON users(email);
ANALYZE users;

10.12 Parser Limitations and Quirks

SQL Injection

The parser doesn’t prevent injection—you must use parameters:

-- DANGEROUS: String interpolation
"SELECT * FROM users WHERE name = '" + user_input + "'"

-- If user_input = "'; DROP TABLE users; --"
-- Parsed as:
SELECT * FROM users WHERE name = ''; DROP TABLE users; --'

-- SAFE: Parameterized query
PREPARE stmt AS SELECT * FROM users WHERE name = $1;
EXECUTE stmt('Alice');
-- $1 is always treated as a value, never as SQL

Case Sensitivity

-- These are the same column
SELECT Name FROM users;
SELECT name FROM users;
SELECT NAME FROM users;

-- Unless quoted (PostgreSQL)
SELECT "Name" FROM users;  -- Different from "name"

Operator Precedence

-- What does this mean?
SELECT * FROM t WHERE a = 1 OR b = 2 AND c = 3;

-- Parsed as (AND has higher precedence):
SELECT * FROM t WHERE a = 1 OR (b = 2 AND c = 3);

-- Use explicit parentheses!
SELECT * FROM t WHERE (a = 1 OR b = 2) AND c = 3;

10.13 Summary

Query parsing and planning transforms SQL into executable operations:

  • Lexer breaks SQL into tokens
  • Parser builds Abstract Syntax Tree
  • Analyzer validates semantics and resolves names
  • Rewriter expands views and applies rules
  • Planner generates execution plans

Understanding this pipeline helps you:

  • Debug SQL syntax errors
  • Interpret EXPLAIN output
  • Understand why plans are chosen
  • Use prepared statements effectively

What’s Next

In Chapter 11, we’ll dive deep into query optimization—how the planner evaluates alternatives and chooses the best execution plan.


“The parser is the bouncer at the SQL club: strict about syntax, indifferent to whether your query makes business sense.”

Chapter 11: Query Optimization

“The optimizer’s job is to find a needle in a haystack of possible plans—then realize it should be looking for a different needle.”

A single query can be executed in thousands of different ways. The optimizer’s job is to find a good one—ideally the best, but at least not terrible. This is one of the most complex components in any database system.


11.1 Why Optimization Matters

Consider a simple join:

SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';

Plan A: Scan orders first

For each of 10,000,000 orders:
    Look up customer
    Check if country = 'USA'
Cost: ~10,000,000 index lookups

Plan B: Filter customers first

Find customers where country = 'USA' (100,000 customers)
For each, find their orders via index
Cost: ~100,000 index lookups + filtered scan

Plan B is 100x faster. The optimizer’s job is to find it.


11.2 The Optimization Process

                    QUERY OPTIMIZATION PHASES

    ┌─────────────────────────────────────────────────────────────┐
    │ 1. GENERATE CANDIDATE PLANS                                  │
    │    - Different join orders                                   │
    │    - Different access methods (seq scan, index scan)        │
    │    - Different join methods (nested loop, hash, merge)      │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. ESTIMATE COSTS                                            │
    │    - Use statistics about data                               │
    │    - Model I/O, CPU, memory costs                           │
    │    - Estimate result sizes (cardinality)                    │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 3. SELECT BEST PLAN                                          │
    │    - Compare total costs                                     │
    │    - Return cheapest plan                                    │
    └─────────────────────────────────────────────────────────────┘

11.3 The Plan Space Explosion

For a query joining N tables, there are:

  • N! different join orders
  • Multiple access methods per table
  • Multiple join algorithms per join
    3 tables: 3! = 6 join orders
    5 tables: 5! = 120 join orders
    10 tables: 10! = 3,628,800 join orders

    Plus access methods and join types!

Exhaustive search is impossible for large queries. Optimizers use heuristics and dynamic programming.
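
PostgreSQL, for example, switches from exhaustive dynamic programming to a genetic search heuristic once a query joins enough relations, and caps how aggressively it reorders joins:

SHOW geqo_threshold;        -- default 12: joins of 12+ relations use the genetic optimizer
SHOW join_collapse_limit;   -- default 8: limit on reordering explicit JOINs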


11.4 Cost Estimation

The Cost Model

Optimizers estimate cost based on:

    Total Cost = CPU Cost + I/O Cost + Memory Cost

    I/O Cost:
    - Sequential page reads (cheap)
    - Random page reads (expensive)
    - Index accesses

    CPU Cost:
    - Tuple processing
    - Expression evaluation
    - Sorting overhead

    Memory Cost:
    - Hash table building
    - Sort buffers

Cost Parameters (PostgreSQL)

-- View current cost parameters
SHOW seq_page_cost;        -- 1.0 (baseline)
SHOW random_page_cost;     -- 4.0 (4x sequential)
SHOW cpu_tuple_cost;       -- 0.01
SHOW cpu_index_tuple_cost; -- 0.005
SHOW cpu_operator_cost;    -- 0.0025

-- Adjust for SSDs (random I/O is faster)
SET random_page_cost = 1.1;

Example Cost Calculation

    Seq Scan on users (10,000 rows, 100 pages):
    Cost = (100 pages × seq_page_cost) + (10,000 rows × cpu_tuple_cost)
    Cost = (100 × 1.0) + (10,000 × 0.01) = 200

    Index Scan (estimate 100 rows, 100 index + heap accesses):
    Cost = (100 × random_page_cost) + (100 × cpu_index_tuple_cost)
    Cost = (100 × 4.0) + (100 × 0.005) = 400.5

    Sequential scan wins! (Until we add WHERE clause selectivity)
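
You can explore this crossover yourself by temporarily discouraging one access path and comparing the estimated costs (a diagnostic knob for experiments, not a production setting):

SET enable_seqscan = off;   -- make sequential scans look prohibitively expensive
EXPLAIN SELECT * FROM users WHERE age > 30;
SET enable_seqscan = on;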

11.5 Statistics and Cardinality Estimation

What Statistics Are Stored

Databases maintain statistics about each column:

-- PostgreSQL: View statistics
SELECT
    attname,
    n_distinct,     -- Number of distinct values
    null_frac,      -- Fraction of NULLs
    avg_width,      -- Average value width in bytes
    correlation     -- Physical vs logical ordering
FROM pg_stats
WHERE tablename = 'users';

Histograms

For non-uniform distributions, histograms capture value frequencies:

    Column: age
    Histogram buckets:

    [0-18]:  500 rows   (5%)
    [19-25]: 2000 rows  (20%)
    [26-35]: 3500 rows  (35%)
    [36-50]: 2500 rows  (25%)
    [51+]:   1500 rows  (15%)

    Query: WHERE age > 35
    Estimated selectivity: 25% + 15% = 40%
    Estimated rows: 10,000 × 0.40 = 4,000
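
A sketch of that estimate in Python, assuming equal-width buckets with known row counts (real systems typically use equi-height histograms and interpolate within the boundary bucket):

# Hypothetical histogram: (low, high, row_count) per bucket; None marks an open end.
histogram = [(0, 18, 500), (19, 25, 2000), (26, 35, 3500),
             (36, 50, 2500), (51, None, 1500)]
total_rows = 10_000

def estimate_gt(threshold: int) -> int:
    """Estimate rows matching 'col > threshold' by summing whole buckets above it."""
    return sum(count for low, _high, count in histogram if low > threshold)

print(estimate_gt(35))                  # 4000 rows
print(estimate_gt(35) / total_rows)     # 0.4 selectivity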

Most Common Values (MCV)

For skewed distributions, store the most common values:

    Column: country
    Most Common Values:
    USA:     4000 (40%)
    UK:      1500 (15%)
    Canada:  1000 (10%)
    Germany: 800  (8%)
    Other:   2700 (27%)

    Query: WHERE country = 'USA'
    Estimated rows: 4000 (exact from MCV!)

    Query: WHERE country = 'France'
    Not in MCV, estimate from remaining: 2700 / (distinct - 4) ≈ 270

Updating Statistics

-- PostgreSQL: Manual statistics update
ANALYZE users;
ANALYZE users(email);  -- Specific column

-- View when last analyzed
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;

Autovacuum updates statistics automatically, but manual ANALYZE may be needed after bulk loads.


11.6 Join Optimization

Join Order Selection

For a join of tables A, B, C, possible orders include:

  • (A ⋈ B) ⋈ C
  • (A ⋈ C) ⋈ B
  • (B ⋈ C) ⋈ A
  • (B ⋈ A) ⋈ C
  • etc.

The optimizer uses dynamic programming to find the best order:

    1. Find best way to access each single table
    2. Find best way to join each pair of tables
    3. Find best way to join each triple (using best pairs)
    4. Continue until all tables are joined
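
The sketch below shows this bottom-up dynamic programming idea in Python, restricted to left-deep plans and a toy cost function (rows produced at each step). The table sizes and selectivities are invented, and a real optimizer also weighs access paths, join methods, and sort orders.

from itertools import combinations

# Hypothetical table sizes and join selectivities between table pairs.
rows = {"A": 1000, "B": 100, "C": 10}
selectivity = {frozenset({"A", "B"}): 0.01,
               frozenset({"B", "C"}): 0.1,
               frozenset({"A", "C"}): 1.0}   # no join predicate -> cross product

def output_rows(left_rows, new_table, joined):
    """Rows produced when joining new_table into an intermediate result."""
    sel = min(selectivity.get(frozenset({t, new_table}), 1.0) for t in joined)
    return left_rows * rows[new_table] * sel

# best[set of tables] = (total cost, output rows, join order)
best = {frozenset({t}): (0, rows[t], [t]) for t in rows}

for size in range(2, len(rows) + 1):
    for subset in map(frozenset, combinations(rows, size)):
        for table in subset:
            rest = subset - {table}
            cost, out, order = best[rest]
            new_out = output_rows(out, table, rest)
            new_cost = cost + new_out              # toy cost: rows materialized
            if subset not in best or new_cost < best[subset][0]:
                best[subset] = (new_cost, new_out, order + [table])

print(best[frozenset(rows)])   # cheapest left-deep order: B, then C, then A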

Join Methods

Nested Loop Join

For each row in outer:
    For each row in inner:
        If join condition matches:
            Output combined row

Best when: Outer is small, inner has index
Cost: O(N × M) without index, O(N × log M) with index

Hash Join

Build hash table from smaller relation
For each row in larger relation:
    Probe hash table
    Output matches

Best when: No useful indexes, one relation fits in memory
Cost: O(N + M) if hash table fits in memory
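
A minimal in-memory hash join in Python, assuming the build side fits in memory (real systems fall back to partitioning both inputs to disk when it does not); the table and key names are invented:

from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on the smaller input, then probe it with the larger one."""
    table = defaultdict(list)
    for row in build_rows:                       # build phase: O(M)
        table[row[build_key]].append(row)
    for row in probe_rows:                       # probe phase: O(N)
        for match in table.get(row[probe_key], []):
            yield {**match, **row}               # emit the combined row

customers = [{"id": 1, "country": "USA"}, {"id": 2, "country": "UK"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 2}]
print(list(hash_join(customers, orders, "id", "customer_id")))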

Merge Join

Sort both relations by join key
Merge like merging sorted lists

Best when: Both relations already sorted, or sort is cheap
Cost: O(N log N + M log M) for sorting + O(N + M) for merge

                    JOIN METHOD COMPARISON

    Nested Loop       Hash Join         Merge Join
    ┌─────────┐      ┌─────────┐       ┌─────────┐
    │ Outer   │      │  Hash   │       │ Sorted  │
    │  ↓      │      │  Table  │       │    A    │
    │ For each│      │ ┌─┬─┬─┐ │       │   ↓↓↓   │
    │  row:   │      │ │ │ │ │ │       │  Merge  │
    │ ┌─────┐ │      │ └─┴─┴─┘ │       │   ↓↓↓   │
    │ │Index│ │      │    ▲    │       │ Sorted  │
    │ │Scan │ │      │    │    │       │    B    │
    │ └─────┘ │      │ Probe   │       │         │
    └─────────┘      └─────────┘       └─────────┘

11.7 Access Path Selection

For each table, choose how to access it:

Sequential Scan

Read all pages, filter in memory.

Best when:
- Need most rows from table
- No useful index
- Table is small

Index Scan

Use index to find matching rows, fetch from heap.

Best when:
- High selectivity (few rows match)
- Index exists on filter column
- Results need to be ordered by indexed column

Index-Only Scan

Get all needed data from index, skip heap.

Best when:
- All needed columns are in the index
- Visibility map shows pages are all-visible

Bitmap Index Scan

Build bitmap from index, then scan heap in physical order.

Best when:
- Medium selectivity
- Multiple conditions can use different indexes
- Random I/O from regular index scan would be expensive

                    ACCESS PATH DECISION TREE

    ┌─────────────────────────────────────┐
    │ What percentage of rows needed?     │
    └─────────────────────────────────────┘
                    │
        ┌───────────┼───────────┐
        │           │           │
      >20%        5-20%       <5%
        │           │           │
        ▼           ▼           ▼
    Seq Scan    Bitmap Scan   Index Scan

11.8 Predicate Pushdown

Move filters as close to data as possible:

-- Original
SELECT * FROM (
    SELECT * FROM orders JOIN customers ON orders.cust_id = customers.id
) sub
WHERE customers.country = 'USA';

-- After pushdown
SELECT * FROM orders
JOIN (SELECT * FROM customers WHERE country = 'USA') c
ON orders.cust_id = c.id;

Benefit: Filter 90% of customers before the join instead of after.


11.9 Join Reordering

Optimizer chooses the best join order:

SELECT * FROM A, B, C WHERE A.x = B.x AND B.y = C.y;

-- Possible orders:
-- 1. (A ⋈ B) ⋈ C
-- 2. (B ⋈ C) ⋈ A
-- 3. (A ⋈ C) ⋈ B  -- Cartesian product, usually bad!

-- If A has 1000 rows, B has 100 rows, C has 10 rows:
-- Order 2 is best: (B ⋈ C) produces small result, then join A

Controlling Join Order

-- PostgreSQL: Disable join reordering
SET join_collapse_limit = 1;

-- Force specific order with explicit JOIN
SELECT * FROM A
JOIN B ON A.x = B.x
JOIN C ON B.y = C.y;

-- vs implicit join (optimizer chooses order)
SELECT * FROM A, B, C WHERE A.x = B.x AND B.y = C.y;

11.10 Common Optimization Techniques

Constant Folding

Evaluate constant expressions at plan time:

-- Before
SELECT * FROM orders WHERE date > '2024-01-01'::date + interval '30 days';

-- After constant folding
SELECT * FROM orders WHERE date > '2024-01-31';

Predicate Simplification

-- Before
SELECT * FROM users WHERE age > 10 AND age > 20;

-- After
SELECT * FROM users WHERE age > 20;

IN to JOIN Transformation

-- Before
SELECT * FROM orders WHERE customer_id IN (
    SELECT id FROM customers WHERE country = 'USA'
);

-- After
SELECT orders.* FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'USA';

EXISTS to Semi-Join

-- Before
SELECT * FROM departments d
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.id);

-- After: the planner executes this as a semi-join
-- (an internal plan operator, not SQL you write):
--   departments d SEMI JOIN employees e ON e.dept_id = d.id

11.11 When the Optimizer Gets It Wrong

Outdated Statistics

-- Just loaded 1M rows, stats say table has 1000 rows
ANALYZE my_table;  -- Fix it

Correlation Not Captured

-- city='NYC' AND state='NY' are correlated
-- Optimizer assumes independence: P(NYC) × P(NY)
-- Reality: P(NYC AND NY) ≈ P(NYC)

-- PostgreSQL 10+: Extended statistics
CREATE STATISTICS stats_city_state ON city, state FROM addresses;
ANALYZE addresses;

Parameter Sniffing (Prepared Statements)

-- First call: WHERE status = 'refunded' (rare value → index scan is ideal)
-- Plan cached for the prepared statement
-- Later call: WHERE status = 'completed' (matches most rows → should seq scan)
-- But the cached plan still uses the index scan!

Function Calls Hide Information

-- Optimizer can't see inside function
SELECT * FROM users WHERE my_function(email) = 'something';
-- Can't use index on email!

11.12 Influencing the Optimizer

Hints (MySQL, Oracle)

-- MySQL
SELECT /*+ INDEX(users idx_age) */ * FROM users WHERE age > 30;
SELECT /*+ NO_INDEX(users) */ * FROM users WHERE age > 30;

-- Oracle
SELECT /*+ FULL(users) */ * FROM users WHERE age > 30;
SELECT /*+ USE_HASH(orders customers) */ * FROM orders JOIN customers ...;

PostgreSQL: No Hints, But…

-- Disable specific plan types
SET enable_seqscan = off;        -- Discourage seq scans
SET enable_hashjoin = off;       -- Discourage hash joins
SET enable_mergejoin = off;      -- Discourage merge joins

-- After running query, re-enable
RESET enable_seqscan;

Better: Fix the Root Cause

  1. Update statistics: ANALYZE table
  2. Create appropriate indexes
  3. Create extended statistics for correlations
  4. Rewrite the query
  5. Adjust cost parameters for your hardware

11.13 Analyzing Query Plans

What to Look For

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42;

Index Scan using idx_orders_customer on orders
    (cost=0.43..8.45 rows=1 width=48)
    (actual time=0.019..0.021 rows=5 loops=1)
  Index Cond: (customer_id = 42)
  Buffers: shared hit=4
Planning Time: 0.065 ms
Execution Time: 0.035 ms

Red flags to watch for:

  • rows=1 estimated vs rows=1000000 actual (bad estimate!)
  • Seq Scan on large table with selective filter
  • Nested Loop with large inner table
  • Sort with external merge (spilling to disk)
  • Very high Buffers: shared read (cache misses)

Comparing Estimated vs Actual

    Estimated rows=100, Actual rows=100000

    The optimizer thought this would return 100 rows.
    It actually returned 100,000.

    Every downstream operation is based on wrong estimate!
    Likely cause: Outdated statistics or unrecognized correlation

11.14 Summary

Query optimization finds efficient execution plans:

  • Cost model estimates I/O, CPU, and memory costs
  • Statistics provide data distribution information
  • Cardinality estimation predicts result sizes
  • Join optimization chooses order and method
  • Access path selection chooses how to read tables

The optimizer can be wrong due to:

  • Outdated or missing statistics
  • Unrecognized correlations
  • Parameter sniffing
  • Complex expressions

Understanding optimization helps you:

  • Write optimizer-friendly queries
  • Diagnose slow queries
  • Create appropriate indexes
  • Know when to intervene

What’s Next

In Chapter 12, we’ll explore buffer pools and caching—how databases keep frequently-used data in memory.


“The optimizer is an oracle that predicts the future based on the past. When the past is wrong, so is the future.”

Chapter 12: Buffer Pools and Caching

“The fastest disk read is the one that never happens.”

Memory is fast; disk is slow. The buffer pool is the database’s primary defense against I/O latency—a sophisticated cache that keeps frequently-accessed pages in memory. Understanding how it works is essential for tuning database performance.


12.1 Why Buffer Pools Exist

                    THE MEMORY-DISK GAP

    RAM Access:    ~100 nanoseconds
    SSD Access:    ~100,000 nanoseconds (100 microseconds)
    HDD Access:    ~10,000,000 nanoseconds (10 milliseconds)

    RAM is 1,000x faster than SSD
    RAM is 100,000x faster than HDD

    If we can serve requests from memory, huge win!

The buffer pool bridges this gap by:

  1. Caching frequently-used pages in memory
  2. Batching writes to reduce I/O
  3. Managing memory pressure intelligently

12.2 Buffer Pool Architecture

                    BUFFER POOL STRUCTURE

    ┌─────────────────────────────────────────────────────────────┐
    │                     BUFFER POOL                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │                   PAGE TABLE                             ││
    │  │  PageID → Buffer Frame mapping                          ││
    │  │  ┌────────────┬─────────────┐                           ││
    │  │  │ (file, page) │ frame #    │                           ││
    │  │  ├────────────┼─────────────┤                           ││
    │  │  │ (1, 42)    │    17       │                           ││
    │  │  │ (1, 100)   │    3        │                           ││
    │  │  │ (2, 5)     │    42       │                           ││
    │  │  └────────────┴─────────────┘                           ││
    │  └─────────────────────────────────────────────────────────┘│
    │                                                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │                  BUFFER FRAMES                           ││
    │  │  ┌───────┬───────┬───────┬───────┬───────┬───────┐      ││
    │  │  │ Frame │ Frame │ Frame │ Frame │ Frame │ Frame │ ...  ││
    │  │  │   0   │   1   │   2   │   3   │   4   │   5   │      ││
    │  │  │ 8KB   │ 8KB   │ 8KB   │ 8KB   │ 8KB   │ 8KB   │      ││
    │  │  │       │       │       │(1,100)│       │       │      ││
    │  │  └───────┴───────┴───────┴───────┴───────┴───────┘      ││
    │  │                                                          ││
    │  │  Each frame holds one page and metadata:                 ││
    │  │  - Page content (8KB)                                    ││
    │  │  - Dirty flag (modified since read?)                     ││
    │  │  - Pin count (currently in use?)                         ││
    │  │  - Reference count (for eviction decisions)              ││
    │  └─────────────────────────────────────────────────────────┘│
    └─────────────────────────────────────────────────────────────┘

12.3 Page Table and Hash Index

The page table maps (file_id, page_number) to buffer frames:

    Request: Read page 42 of file 1

    1. Compute hash: hash(1, 42) = bucket 17
    2. Search bucket 17 for (1, 42)
    3. If found: Return frame pointer (HIT!)
    4. If not found: Load from disk (MISS)

Hash Table Considerations

  • Must handle concurrent access (latches)
  • Bucket chains for collisions
  • Resize when too full

12.4 The Page Request Flow

                    PAGE REQUEST FLOW

    Request page (1, 42)
           │
           ▼
    ┌─────────────────┐
    │ Check page table│
    └────────┬────────┘
             │
     ┌───────┴───────┐
     │               │
    HIT            MISS
     │               │
     ▼               ▼
    Pin page    ┌─────────────────┐
    in frame    │ Find free frame │
     │          │ (or evict one)  │
     │          └────────┬────────┘
     │                   │
     │                   ▼
     │          ┌─────────────────┐
     │          │ Read page from  │
     │          │ disk into frame │
     │          └────────┬────────┘
     │                   │
     │                   ▼
     │          ┌─────────────────┐
     │          │ Add to page     │
     │          │ table           │
     │          └────────┬────────┘
     │                   │
     └──────────┬────────┘
                │
                ▼
         Return frame pointer
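
The flow above fits in a few lines of code. Here is a toy buffer pool in Python that shows the page-table lookup, miss handling, and pinning; eviction, latching, and dirty-page tracking are deliberately left out (they are covered next), and every name is invented for illustration.

class ToyBufferPool:
    """Illustrative only: page table plus frames, no eviction or concurrency."""

    def __init__(self, num_frames, read_page_from_disk):
        self.frames = [None] * num_frames           # frame number -> page bytes
        self.pin_count = [0] * num_frames
        self.page_table = {}                        # (file_id, page_no) -> frame
        self.free_frames = list(range(num_frames))
        self.read_page_from_disk = read_page_from_disk

    def get_page(self, file_id, page_no):
        key = (file_id, page_no)
        frame = self.page_table.get(key)
        if frame is None:                           # MISS: load from disk
            frame = self.free_frames.pop()          # a real pool would evict here
            self.frames[frame] = self.read_page_from_disk(file_id, page_no)
            self.page_table[key] = frame
        self.pin_count[frame] += 1                  # pin while the caller uses it
        return self.frames[frame]

    def unpin(self, file_id, page_no):
        self.pin_count[self.page_table[(file_id, page_no)]] -= 1

pool = ToyBufferPool(4, lambda f, p: b"\x00" * 8192)   # fake 8KB pages
pool.get_page(1, 42)    # first call reads from "disk" (miss)
pool.get_page(1, 42)    # second call is served from memory (hit)
pool.unpin(1, 42)
pool.unpin(1, 42)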

12.5 Page Replacement Policies

When the buffer pool is full and we need to load a new page, which page do we evict?

LRU (Least Recently Used)

Evict the page that hasn’t been accessed for the longest time.

    Access sequence: A, B, C, D, A, B, E (pool size = 4)

    A: [A, _, _, _]
    B: [A, B, _, _]
    C: [A, B, C, _]
    D: [A, B, C, D]
    A: [B, C, D, A] (A moves to front)
    B: [C, D, A, B] (B moves to front)
    E: [D, A, B, E] (C evicted - least recently used)

Problem: Sequential scan evicts everything!

    Sequential scan of 1M pages:
    Each page accessed once, never again
    But they fill the entire buffer pool
    Evicting actually useful pages!

LRU-K

Track last K accesses, not just most recent. Evict page with oldest K-th access.

Clock (Second Chance)

Approximate LRU with less overhead:

                    CLOCK ALGORITHM

    Frames arranged in circle with a clock hand

         ┌─────┐
         │  A  │ ref=1
         │     │
    ┌────┴─────┴────┐
    │ D │         │ B │
    │ref=0        │ref=1
    │     ▲       │
    └─────│───────┘
          │  clock hand
         ┌┴────┐
         │  C  │ ref=0
         └─────┘

    To find victim:
    1. Check current frame
    2. If ref=0: Evict it
    3. If ref=1: Set ref=0, advance hand, repeat

    Accessing a page sets its ref=1
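
A minimal sketch of the clock sweep in Python, assuming one reference bit per frame and ignoring pinned pages:

class Clock:
    """Second-chance eviction: clear reference bits until a victim is found."""

    def __init__(self, num_frames):
        self.ref = [0] * num_frames
        self.hand = 0

    def access(self, frame):
        self.ref[frame] = 1              # touching a page grants a second chance

    def evict(self):
        while True:
            if self.ref[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.ref)
                return victim
            self.ref[self.hand] = 0      # spend the second chance
            self.hand = (self.hand + 1) % len(self.ref)

clock = Clock(4)
for frame in (0, 1, 3):
    clock.access(frame)
print(clock.evict())    # 2 -- the only frame with its reference bit clear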

2Q (Two Queues)

Protect hot pages from sequential scans:

    Queue 1 (A1): Recently added pages (FIFO)
    Queue 2 (Am): Frequently accessed pages (LRU)

    New page → A1
    Accessed again while in A1 → Promote to Am
    Evicted from A1 without reaccess → Gone

    Sequential scan pages never make it to Am!

12.6 Dirty Page Management

A dirty page has been modified but not yet written to disk.

    Page state transitions:

    CLEAN ───modify───► DIRTY
      ▲                   │
      │                   │ write to disk
      └───────────────────┘

Why Not Write Immediately?

  1. Batching: Accumulate multiple changes, write once
  2. Write ordering: WAL must be written first
  3. Reduced I/O: If same page modified multiple times, only write final state

Background Writer

A background process periodically writes dirty pages:

    Background Writer Loop:
    1. Sleep for configured interval
    2. Find dirty pages not recently accessed
    3. Write some to disk
    4. Mark as clean
    5. Repeat

    Benefits:
    - Spreads I/O over time (no spike at checkpoint)
    - Keeps some clean frames available for eviction

-- PostgreSQL background writer configuration
bgwriter_delay = 200ms          -- Sleep between rounds
bgwriter_lru_maxpages = 100     -- Max pages per round
bgwriter_lru_multiplier = 2.0   -- Aggressiveness

12.7 Buffer Pool Sizing

How much memory should the buffer pool get?

General Guidelines

    Dedicated database server:
    Buffer pool = 70-80% of total RAM

    Shared server:
    Buffer pool = 25-40% of total RAM

    Leave room for:
    - OS file cache
    - Connection memory
    - Sort/hash operations
    - Other processes

PostgreSQL

-- Shared buffers (main buffer pool)
shared_buffers = 8GB

-- Typical sizing:
-- ≥1GB RAM: start at ~25% of RAM (values beyond ~40% rarely help)
-- <1GB RAM: use a smaller percentage

-- Check current usage
SELECT
    pg_size_pretty(count(*) * 8192) as buffer_usage
FROM pg_buffercache;

MySQL InnoDB

-- InnoDB buffer pool
innodb_buffer_pool_size = 12G

-- Typical sizing:
-- 70-80% of available RAM on dedicated server

-- Multiple instances for parallelism
innodb_buffer_pool_instances = 8

12.8 Buffer Pool Hit Ratio

The hit ratio measures cache effectiveness:

    Hit Ratio = Buffer Hits / (Buffer Hits + Disk Reads)

    Target: > 99% for OLTP workloads
    Acceptable: > 95% for mixed workloads
    Concerning: < 90%

PostgreSQL

SELECT
    sum(blks_hit) as hits,
    sum(blks_read) as reads,
    sum(blks_hit) * 100.0 / nullif(sum(blks_hit) + sum(blks_read), 0) as ratio
FROM pg_stat_database;

MySQL InnoDB

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';

-- Calculate hit ratio
-- (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads)
--   / Innodb_buffer_pool_read_requests

When Hit Ratio Is Low

  1. Buffer pool too small: Increase memory
  2. Working set too large: Data doesn’t fit
  3. Sequential scans: Full table scans evict useful pages
  4. Cold start: Cache is still warming up

12.9 Multiple Buffer Pools

Large databases use multiple buffer pools for scalability:

                    MULTIPLE BUFFER POOLS

    ┌─────────────────────────────────────────────────────────────┐
    │ Each page maps to one pool by a hash of its page ID        │
    │                                                              │
    │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
    │   │ Buffer Pool │  │ Buffer Pool │  │ Buffer Pool │        │
    │   │     #1      │  │     #2      │  │     #3      │        │
    │   │             │  │             │  │             │        │
    │   │  Latch #1   │  │  Latch #2   │  │  Latch #3   │        │
    │   └─────────────┘  └─────────────┘  └─────────────┘        │
    │                                                              │
    │   Reduces contention on page table latch                    │
    └─────────────────────────────────────────────────────────────┘

-- MySQL: Multiple InnoDB buffer pool instances
innodb_buffer_pool_instances = 8

-- Pages hash to instances, reducing mutex contention

12.10 Page Prefetching

Prefetching loads pages before they’re requested:

Sequential Prefetch

Detect sequential access, load next pages:

    Read pages: 1, 2, 3, 4, ...

    System detects sequential pattern
    Prefetch pages: 5, 6, 7, 8, ... in background

    When query needs page 5: Already in memory!

Index Prefetch

When traversing an index, prefetch heap pages:

    Index scan returns row pointers:
    (page 5, slot 1), (page 12, slot 3), (page 5, slot 7), (page 20, slot 2)

    Sort by page: 5, 5, 12, 20
    Prefetch pages 5, 12, 20 in sorted order

    Reduces random I/O
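
A sketch of that reordering in Python, assuming row pointers of the form (page, slot):

# Row pointers returned by an index scan, in index-key order.
pointers = [(5, 1), (12, 3), (5, 7), (20, 2)]

# Visit heap pages in ascending order so page 5 is read once and the
# I/O pattern becomes closer to sequential.
pages_to_prefetch = sorted({page for page, _slot in pointers})
print(pages_to_prefetch)      # [5, 12, 20]

fetch_order = sorted(pointers)
print(fetch_order)            # [(5, 1), (5, 7), (12, 3), (20, 2)]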

PostgreSQL

-- effective_io_concurrency: prefetch depth
SET effective_io_concurrency = 200;  -- Good for SSDs

-- Controls bitmap heap scan prefetching

12.11 Buffer Pool and Large Operations

Bulk Loads

Loading large amounts of data shouldn’t evict the entire working set:

-- PostgreSQL: Use COPY, which is optimized
COPY users FROM '/data/users.csv' WITH CSV;

-- Or use unlogged tables for intermediate data
CREATE UNLOGGED TABLE staging (...);

Large Sequential Scans

PostgreSQL uses a small ring buffer for large sequential scans:

    Normal query: Uses main buffer pool
    Large sequential scan: Uses 256KB ring buffer

    Prevents one big scan from evicting everything

Sort Operations

When sorting exceeds work_mem, spill to disk:

-- PostgreSQL work_mem
SET work_mem = '256MB';  -- Per-operation sort memory

-- MySQL sort buffer
SET sort_buffer_size = 256 * 1024 * 1024;

12.12 Observing Buffer Pool Behavior

PostgreSQL pg_buffercache

-- Enable extension
CREATE EXTENSION pg_buffercache;

-- See what's in the buffer pool
SELECT
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  AND c.relname NOT LIKE 'pg_%'
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;

MySQL InnoDB Buffer Pool Stats

-- Buffer pool status
SHOW ENGINE INNODB STATUS\G

-- Look for "BUFFER POOL AND MEMORY" section:
-- Buffer pool size
-- Free buffers
-- Database pages
-- Modified db pages
-- Pages read, created, written

12.13 Double Buffering Problem

When using buffered I/O, data may be cached twice:

    Database Buffer Pool: Page 42
    OS Page Cache: Page 42 (same data!)

    Memory wasted!

Solutions

Direct I/O: Bypass OS cache entirely

-- MySQL InnoDB (commonly used on Linux deployments)
innodb_flush_method = O_DIRECT

Coordinated caching: Tell OS not to cache

posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);

Accept double buffering: For small databases, not a big deal


12.14 Summary

The buffer pool is crucial for database performance:

  • Purpose: Cache pages in memory to avoid disk I/O
  • Page table: Maps page IDs to buffer frames
  • Replacement policies: LRU, Clock, 2Q balance hit rate and overhead
  • Dirty pages: Modified pages written by background writer
  • Sizing: Typically 25-80% of RAM depending on workload
  • Hit ratio: Target >99% for OLTP
  • Prefetching: Load pages before they’re needed

Understanding the buffer pool helps you:

  • Size memory appropriately
  • Diagnose cache-related performance issues
  • Understand why queries are slow after restart
  • Tune for your specific workload

What’s Next

In Chapter 13, we’ll explore recovery and crash safety—how databases ensure durability and recover from failures.


“A well-tuned buffer pool makes your database feel like everything is in memory. A poorly-tuned one makes everything feel like tape.”

Chapter 13: Recovery and Crash Safety

“Crashes are not a matter of if, but when. The question is: what happens to your data?”

Servers crash. Power fails. Disks corrupt. Kubernetes restarts your pod. A database must survive all of these and emerge with data intact. This chapter explores how databases achieve crash safety and recover from failures.


13.1 The Recovery Challenge

Consider what’s in flight when a database crashes:

    State at crash:
    ┌─────────────────────────────────────────────────────────────┐
    │ Buffer Pool:                                                 │
    │   Page 42: Modified by committed Txn 100 (not on disk!)     │
    │   Page 67: Modified by uncommitted Txn 101                  │
    │                                                              │
    │ WAL:                                                         │
    │   Txn 100: COMMIT record written and fsynced                │
    │   Txn 101: INSERT record written, no COMMIT                 │
    │   Txn 102: Several records, no COMMIT                       │
    │                                                              │
    │ Disk:                                                        │
    │   Page 42: Contains old data (before Txn 100)               │
    │   Page 67: Contains old data (before Txn 101)               │
    └─────────────────────────────────────────────────────────────┘

    After restart, we need:
    - Txn 100 changes: RESTORED (it committed!)
    - Txn 101 changes: UNDONE (never committed)
    - Txn 102 changes: UNDONE (never committed)

13.2 The ARIES Recovery Algorithm

Most databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics):

                    ARIES RECOVERY PHASES

    ┌─────────────────────────────────────────────────────────────┐
    │                     1. ANALYSIS                              │
    │   Scan WAL from last checkpoint                              │
    │   Determine:                                                 │
    │   - Which transactions were active at crash                 │
    │   - Which pages might be dirty (need redo)                  │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                       2. REDO                                │
    │   Replay WAL forward from oldest dirty page                  │
    │   Reapply ALL changes (committed or not)                    │
    │   Restore database to exact pre-crash state                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                       3. UNDO                                │
    │   Roll back uncommitted transactions                        │
    │   Process in reverse order                                  │
    │   Log compensation records for idempotence                  │
    └─────────────────────────────────────────────────────────────┘

Phase 1: Analysis

Scan the WAL starting from the last checkpoint:

    Checkpoint Record (LSN 1000):
    - Active transactions: {Txn 100, Txn 101}
    - Dirty pages: {Page 42, Page 67}

    Continue scanning WAL:
    - LSN 1001: Txn 100 INSERT page 42
    - LSN 1002: Txn 100 COMMIT ← Txn 100 is done!
    - LSN 1003: Txn 101 UPDATE page 67
    - LSN 1004: Txn 102 BEGIN
    - LSN 1005: Txn 102 INSERT page 89
    [CRASH]

    Analysis results:
    - Winner (committed): Txn 100
    - Losers (uncommitted): Txn 101, Txn 102
    - Dirty pages to check: 42, 67, 89

Phase 2: Redo

Replay all logged changes to restore pre-crash state:

    For each WAL record from oldest dirty page LSN:
        If page_lsn < record_lsn:  (change not yet applied)
            Reapply the change
            Set page_lsn = record_lsn

    This handles:
    - Committed changes not yet on disk (REDO them)
    - Uncommitted changes not yet on disk (REDO them too!)

    Wait, why redo uncommitted changes?
    → Restores exact pre-crash state
    → Then undo phase can cleanly roll back

Phase 3: Undo

Roll back uncommitted transactions:

    Loser transactions: {Txn 101, Txn 102}

    Process in reverse LSN order:
    - LSN 1005: Txn 102 INSERT → UNDO: Delete the row
    - LSN 1003: Txn 101 UPDATE → UNDO: Restore old value

    For each undo, write Compensation Log Record (CLR):
    - CLR for LSN 1005: "Deleted row inserted by 1005"
    - CLR for LSN 1003: "Restored value changed by 1003"

    CLRs ensure idempotence:
    If crash during undo, redo phase replays CLRs,
    then undo skips already-undone records
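
The three phases can be sketched in a few dozen lines of Python. The log records, transaction IDs, and payloads below are invented, and real ARIES additionally tracks per-page LSNs, a dirty page table, and compensation log records, which appear here only as comments.

# Toy log: (lsn, txn, kind, payload). A real WAL record carries much more.
log = [
    (1001, 100, "UPDATE", ("page42", "old=A", "new=B")),
    (1002, 100, "COMMIT", None),
    (1003, 101, "UPDATE", ("page67", "old=X", "new=Y")),
    (1004, 102, "BEGIN",  None),
    (1005, 102, "INSERT", ("page89", "row=R")),
]   # crash happens here

def recover(log):
    # 1. ANALYSIS: winners are committed transactions, everyone else is a loser.
    committed = {txn for _lsn, txn, kind, _p in log if kind == "COMMIT"}
    losers = {txn for _lsn, txn, _kind, _p in log} - committed

    # 2. REDO: replay every change forward, committed or not.
    for lsn, txn, kind, payload in log:
        if kind in ("UPDATE", "INSERT"):
            print(f"REDO  lsn={lsn} txn={txn} {kind} {payload}")
            # Real ARIES skips records already reflected in the page's LSN.

    # 3. UNDO: roll back losers in reverse LSN order.
    for lsn, txn, kind, payload in reversed(log):
        if txn in losers and kind in ("UPDATE", "INSERT"):
            print(f"UNDO  lsn={lsn} txn={txn} {kind} {payload}")
            # Real ARIES writes a compensation log record (CLR) for each undo.

    return committed, losers

print(recover(log))    # winners: {100}; losers: {101, 102}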

13.3 Checkpoints

Checkpoints limit how much WAL must be replayed:

                    CHECKPOINT OPERATION

    1. Write CHECKPOINT BEGIN to WAL
    2. Note active transactions and dirty pages
    3. Flush dirty pages to disk (spread over time)
    4. Write CHECKPOINT END to WAL with:
       - Active transaction list
       - Dirty page list
       - Oldest LSN needed for recovery

    Recovery starts from checkpoint, not beginning of time

Fuzzy Checkpoints

Don’t stop the world to checkpoint:

    Fuzzy Checkpoint:
    - Mark checkpoint start
    - Continue normal operations
    - Background writer flushes dirty pages
    - Eventually all pages from before start are flushed
    - Mark checkpoint complete

    Database never stops serving requests

Checkpoint Frequency Trade-offs

    Frequent checkpoints:
    + Faster recovery (less WAL to replay)
    - More I/O during normal operation
    - Potential performance spikes

    Infrequent checkpoints:
    + Less I/O during normal operation
    + Smoother performance
    - Longer recovery time
    - More WAL space needed

-- PostgreSQL checkpoint tuning
checkpoint_timeout = 5min       -- Max time between checkpoints
checkpoint_completion_target = 0.9  -- Spread I/O over 90% of interval

13.4 WAL Record Types

Different record types serve different purposes:

    Data Records:
    - INSERT: New row data
    - UPDATE: Before and after images
    - DELETE: Deleted row data

    Transaction Records:
    - BEGIN: Transaction started
    - COMMIT: Transaction committed
    - ABORT: Transaction rolled back
    - PREPARE: For two-phase commit

    Control Records:
    - CHECKPOINT: Checkpoint marker
    - CLR: Compensation log record (undo result)

    DDL Records:
    - CREATE/DROP TABLE
    - CREATE/DROP INDEX
    - ALTER TABLE

13.5 Logical vs Physical Logging

Physical Logging

Record exact bytes changed:

    Page 42, offset 1024:
    Old bytes: 00 00 03 E8
    New bytes: 00 00 03 84

Pros: Simple, exact replay
Cons: Large logs; doesn't survive page reorganization

Logical Logging

Record the operation:

    UPDATE accounts SET balance = 900 WHERE id = 1

Pros: Compact, works across page changes
Cons: Must be deterministic; harder to implement

Physiological Logging

Most databases use a hybrid:

    Page 42: UPDATE row in slot 3, set column 2 to 900

    - Physical: References specific page
    - Logical: Describes row-level operation

13.6 Torn Page Protection

What if a crash occurs mid-page-write?

    8KB page write:
    First 4KB written → [CRASH] → Last 4KB not written

    Page is now "torn" - partially old, partially new
    Checksum will fail, but how to recover?

Solution 1: Full Page Writes

Write complete page to WAL on first modification after checkpoint:

    After checkpoint, first write to page 42:
    1. Write full 8KB page image to WAL
    2. Write actual change to WAL
    3. Modify page in buffer pool

    On recovery:
    - If page is torn, restore from WAL full page image
    - Then apply subsequent changes

Solution 2: Double-Write Buffer (InnoDB)

    1. Write dirty pages to double-write buffer (sequential)
    2. fsync double-write buffer
    3. Write pages to actual locations (random)

    On recovery:
    - If page is torn, restore from double-write buffer
    - Then apply redo log

13.7 Crash Recovery Scenarios

Scenario 1: Clean Shutdown

    CHECKPOINT completes
    All dirty pages flushed
    WAL ends with CHECKPOINT record

    Recovery: Check WAL, confirm clean shutdown, done!
    (Almost instant)

Scenario 2: Crash After Commit, Before Flush

    Txn 100: INSERT, UPDATE, COMMIT (in WAL)
    Dirty pages still in buffer pool
    [CRASH]

    Recovery:
    1. Analysis: Txn 100 committed
    2. Redo: Replay Txn 100 changes to disk
    3. Undo: Nothing to undo
    → Data preserved!

Scenario 3: Crash During Transaction

    Txn 101: INSERT, UPDATE (in WAL)
    No COMMIT record
    [CRASH]

    Recovery:
    1. Analysis: Txn 101 uncommitted (loser)
    2. Redo: Replay Txn 101 changes (restore state)
    3. Undo: Roll back Txn 101 changes
    → Transaction never happened!

Scenario 4: Crash During Recovery

    Recovering from crash #1
    In undo phase, rolling back Txn 101
    [CRASH #2]

    Recovery from crash #2:
    1. Analysis: See CLRs from previous recovery
    2. Redo: Replay CLRs
    3. Undo: Continue where we left off (CLRs mark progress)
    → Eventually completes!

13.8 Point-in-Time Recovery (PITR)

Restore to any point in time:

                    PITR ARCHITECTURE

    Base Backup (Sunday)
    ┌─────────────────────────────────────────────────────────────┐
    │ pg_basebackup or similar                                     │
    │ Contains: All data files as of Sunday 00:00                 │
    └─────────────────────────────────────────────────────────────┘
                               +
    WAL Archive (Sunday → Friday)
    ┌─────────────────────────────────────────────────────────────┐
    │ Every WAL segment archived                                   │
    │ Contains: All changes since Sunday                          │
    └─────────────────────────────────────────────────────────────┘
                               =
    Restore to Wednesday 14:30
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. Restore base backup                                       │
    │ 2. Replay WAL up to Wednesday 14:30                         │
    │ 3. Stop recovery, open database                             │
    └─────────────────────────────────────────────────────────────┘

Use Cases

  • Recover from accidental DELETE
  • Restore before corrupting UPDATE
  • Create database clone at specific time
  • Test “what if” scenarios

PostgreSQL PITR

# Take base backup
pg_basebackup -D /backup/base -Fp -Xs -P

# Configure WAL archiving
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

# Restore to point in time
# postgresql.conf on restore server:
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 14:30:00'

13.9 Backup Strategies

Full Backup

Copy everything:

    Pros: Simple restore (just copy back)
    Cons: Large, slow to create
    Frequency: Weekly or less

Incremental Backup

Copy only changed pages since last backup:

    Pros: Fast, small
    Cons: Restore requires all increments
    Frequency: Daily

    Restore: Full + Inc1 + Inc2 + Inc3 + ...

Continuous Archiving (WAL)

Archive every WAL segment:

    Pros: Point-in-time recovery
    Cons: Requires base backup + all WAL
    Frequency: Continuous

Backup Strategy Example

    Sunday 00:00: Full backup
    Mon-Sat 00:00: Incremental backup
    Continuous: WAL archiving

    Restore options:
    - Sunday data: Full backup
    - Wednesday data: Full + Mon + Tue + Wed incrementals
    - Wednesday 14:30: Full + incrementals + WAL to 14:30

13.10 Testing Recovery

If you haven’t tested recovery, you don’t have backups.

What to Test

    1. Full restore from backup
    2. Point-in-time recovery to specific timestamp
    3. Recovery from simulated crash
    4. Recovery time (is it acceptable?)
    5. Incremental restore chain
    6. Restore to different hardware

Automated Testing

#!/bin/bash
# Monthly recovery test

# 1. Restore latest backup into a fresh database on the test server
createdb test_restore
pg_restore -d test_restore /backup/latest.dump

# 2. Verify data integrity
psql test_restore -c "SELECT count(*) FROM users"

# 3. Run application smoke tests
./run_smoke_tests.sh test_restore

# 4. Record recovery time
echo "Recovery completed in $SECONDS seconds"

13.11 High Availability and Failover

Beyond crash recovery: survive hardware failures.

Streaming Replication

                    PRIMARY → STANDBY

    Primary                         Standby
    ┌───────────────────┐          ┌───────────────────┐
    │ Active database   │          │ Hot standby       │
    │                   │   WAL    │ (read-only)       │
    │ ┌─────────────┐   │ ──────► │ ┌─────────────┐   │
    │ │    WAL      │   │ stream  │ │    WAL      │   │
    │ └─────────────┘   │          │ └─────────────┘   │
    └───────────────────┘          └───────────────────┘

    If primary fails:
    1. Promote standby to primary
    2. Redirect clients to new primary

Synchronous vs Asynchronous

    Asynchronous:
    - Primary doesn't wait for standby
    - Possible data loss on failover (standby behind)
    - Lower latency

    Synchronous:
    - Primary waits for standby acknowledgment
    - Zero data loss on failover
    - Higher latency

Failover Process

    1. Detect primary failure
       - Heartbeat timeout
       - Failed health check

    2. Verify primary is really down
       - Avoid split-brain!
       - Use fencing/STONITH if needed

    3. Promote standby
       - pg_ctl promote
       - or trigger file

    4. Redirect clients
       - Update DNS/load balancer
       - Connection pooler reconfiguration

13.12 Corruption Detection and Handling

Checksum Validation

-- PostgreSQL: Enable checksums (at initdb or pg_checksums)
-- Detects: Disk corruption, memory errors, torn pages

-- Check for corruption
SELECT * FROM pg_stat_database WHERE checksum_failures > 0;

When Corruption Is Detected

    Options:
    1. Restore from backup (safest)
    2. Use pg_resetwal to skip corrupted records (data loss!)
    3. Attempt to read good data, skip bad pages

    Prevention:
    - Enable checksums
    - Use ECC RAM
    - Use reliable storage
    - Regular backups
    - Monitor for errors

13.13 Recovery Time Objectives

RTO (Recovery Time Objective)

How long can you be down?

    RTO = 1 hour:
    - Crash recovery acceptable
    - Maybe PITR restore acceptable
    - Single server might be OK

    RTO = 1 minute:
    - Need hot standby
    - Automatic failover
    - Pre-warmed cache

    RTO = 0:
    - Multi-master or distributed database
    - No single point of failure

RPO (Recovery Point Objective)

How much data can you lose?

    RPO = 1 hour:
    - Hourly backups sufficient
    - Some data loss acceptable

    RPO = 1 minute:
    - Continuous WAL archiving
    - Frequent checkpoints

    RPO = 0:
    - Synchronous replication
    - Every commit acknowledged by standby

13.14 Summary

Database recovery ensures durability and consistency:

  • ARIES provides industry-standard recovery
  • Three phases: Analysis, Redo, Undo
  • Checkpoints bound recovery time
  • Full page writes handle torn pages
  • PITR enables point-in-time recovery
  • Replication provides high availability
  • Test your backups or they’re worthless

Recovery is the foundation of database reliability—everything else depends on it working correctly.


What’s Next

In Chapter 14, we’ll compare column stores and row stores—different storage architectures optimized for different workloads.


“The best time to test your recovery process was before you needed it. The second best time is now.”

Chapter 14: Column Stores vs Row Stores

“How you store data determines what you can do with it efficiently.”

Traditional databases store data row by row. But for analytical queries that scan millions of rows but only need a few columns, this is wasteful. Column stores flip the model, storing data column by column. This chapter explores both approaches and when to use each.


14.1 The Row Store Model (NSM)

Row stores use N-ary Storage Model (NSM)—all columns of a row stored together:

                    ROW STORE LAYOUT

    Table: users (id, name, email, age, city)

    Disk/Page Layout:
    ┌─────────────────────────────────────────────────────────────┐
    │ Row 1: [id=1, name='Alice', email='a@b.com', age=30, city='NYC'] │
    │ Row 2: [id=2, name='Bob', email='b@b.com', age=25, city='LA']    │
    │ Row 3: [id=3, name='Carol', email='c@b.com', age=35, city='NYC'] │
    │ ...                                                              │
    └─────────────────────────────────────────────────────────────┘

    Query: SELECT * FROM users WHERE id = 1
    → Read one contiguous chunk → All columns available → Fast!

Row Store Strengths

    OLTP Workloads:
    ┌─────────────────────────────────────────────────────────────┐
    │ SELECT * FROM users WHERE id = 42                            │
    │ → Fetch one row, all columns, one I/O                       │
    │                                                              │
    │ INSERT INTO users VALUES (...)                               │
    │ → Write one contiguous chunk                                 │
    │                                                              │
    │ UPDATE users SET email = '...' WHERE id = 42                │
    │ → Read row, modify, write back                               │
    └─────────────────────────────────────────────────────────────┘

    Excellent for: Point queries, full-row access, transactions

Row Store Weaknesses

    Analytical Query:
    SELECT AVG(age) FROM users;

    ┌─────────────────────────────────────────────────────────────┐
    │ Row 1: [id=1, name='Alice', email='a@b.com', AGE=30, city='NYC'] │
    │ Row 2: [id=2, name='Bob', email='b@b.com', AGE=25, city='LA']    │
    │ Row 3: [id=3, name='Carol', email='c@b.com', AGE=35, city='NYC'] │
    └─────────────────────────────────────────────────────────────┘
                                          │
                              Only need AGE column!
                              But must read entire rows

    If each row is 200 bytes and age is 4 bytes:
    Reading 1M rows = 200MB of I/O
    Actually need only 4MB (2% of data!)

14.2 The Column Store Model (DSM)

Column stores use Decomposition Storage Model (DSM)—each column stored separately:

                    COLUMN STORE LAYOUT

    Table: users (id, name, email, age, city)

    id column:    [1, 2, 3, 4, 5, 6, 7, ...]
    name column:  ['Alice', 'Bob', 'Carol', 'Dave', ...]
    email column: ['a@b.com', 'b@b.com', 'c@b.com', ...]
    age column:   [30, 25, 35, 28, 42, 31, 27, ...]
    city column:  ['NYC', 'LA', 'NYC', 'Chicago', ...]

    Query: SELECT AVG(age) FROM users
    → Read only age column
    → 4MB instead of 200MB
    → 50x less I/O!

Column Store Strengths

    Analytical Queries:
    ┌─────────────────────────────────────────────────────────────┐
    │ SELECT AVG(age), city FROM users GROUP BY city               │
    │ → Read only age and city columns                             │
    │ → Ignore id, name, email                                     │
    │                                                              │
    │ SELECT COUNT(*) FROM users WHERE age > 30                   │
    │ → Read only age column                                       │
    │ → Extremely fast filtering                                   │
    └─────────────────────────────────────────────────────────────┘

    Benefits:
    - Read only columns you need
    - Better compression (similar values together)
    - Vectorized processing (operate on column chunks)
    - Better CPU cache utilization

Column Store Weaknesses

    Point Query:
    SELECT * FROM users WHERE id = 42

    ┌─────────────────────────────────────────────────────────────┐
    │ Must read from 5 different locations (one per column)       │
    │ Then reconstruct the row                                     │
    │ Much slower than row store for this pattern!                │
    └─────────────────────────────────────────────────────────────┘

    INSERT INTO users VALUES (...)
    ┌─────────────────────────────────────────────────────────────┐
    │ Must write to 5 different locations                         │
    │ Columnar format may require append buffers                  │
    │ Slower for single-row operations                            │
    └─────────────────────────────────────────────────────────────┘

14.3 Compression in Column Stores

Same-type values stored together compress extremely well:

Run-Length Encoding (RLE)

    city column: ['NYC', 'NYC', 'NYC', 'LA', 'LA', 'NYC', ...]

    RLE: [(NYC, 3), (LA, 2), (NYC, 1), ...]

    If sorted by city:
    ['LA', 'LA', ..., 'NYC', 'NYC', ...]
    RLE: [(LA, 50000), (NYC, 150000), ...]
    → Massive compression!

Dictionary Encoding

    city column: ['NYC', 'LA', 'NYC', 'Chicago', 'NYC', ...]

    Dictionary: {0: 'NYC', 1: 'LA', 2: 'Chicago'}
    Encoded:    [0, 1, 0, 2, 0, ...]

    Benefits:
    - Fixed-width codes (faster processing)
    - Much smaller storage
    - Comparisons on codes, not strings

Bit-Packing

    age column: [30, 25, 35, 28, 42, 31, 27, ...]

    All values < 128, so need only 7 bits each!
    Instead of 32 bits per integer → 4.5x compression
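
These encodings are easy to demonstrate. A short Python sketch of run-length, dictionary, and bit-width calculations on the toy columns above (illustration only; real formats pack bits and store extra metadata):

from itertools import groupby

cities = ["NYC", "NYC", "NYC", "LA", "LA", "NYC"]

# Run-length encoding: (value, run length) pairs.
rle = [(value, len(list(run))) for value, run in groupby(cities)]
print(rle)                     # [('NYC', 3), ('LA', 2), ('NYC', 1)]

# Dictionary encoding: map each distinct value to a small integer code.
dictionary = {value: code for code, value in enumerate(dict.fromkeys(cities))}
encoded = [dictionary[value] for value in cities]
print(dictionary, encoded)     # {'NYC': 0, 'LA': 1} [0, 0, 0, 1, 1, 0]

# Bit-packing: how many bits does the widest value in the column actually need?
ages = [30, 25, 35, 28, 42, 31, 27]
bits = max(ages).bit_length()
print(f"{bits} bits/value instead of 32 -> {32 / bits:.1f}x smaller")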

Compression Ratios

    Row Store: 1:1 to 1:3 typical
    Column Store: 1:5 to 1:20 typical

    Example:
    Raw data: 100GB
    Row store (compressed): 50GB
    Column store: 5-10GB

14.4 Vectorized Execution

Column stores enable vectorized processing—operating on batches of values:

                    SCALAR VS VECTORIZED

    Scalar (row-at-a-time):
    for each row:
        if row.age > 30:
            output row

    → Function call overhead per row
    → Poor CPU cache utilization
    → Branch mispredictions

    Vectorized (column-at-a-time):
    age_vector = load 1000 ages
    mask = age_vector > 30       # SIMD comparison
    output = apply mask          # Batch filtering

    → Process 1000 values with one CPU instruction
    → Excellent cache utilization
    → No branches, predictable execution

SIMD Operations

Modern CPUs can process multiple values simultaneously:

    AVX-512: Process 16 32-bit integers at once

    Traditional:
    for i in 0..16:
        result[i] = a[i] + b[i]
    → 16 add instructions

    SIMD:
    result = _mm512_add_epi32(a, b)
    → 1 instruction for 16 adds!
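
The same contrast is easy to feel from Python, using NumPy purely as an illustration of column-at-a-time execution (its array operations run in tight compiled loops that the compiler can vectorize with SIMD):

import numpy as np

ages = np.random.randint(18, 80, size=1_000_000)

# Scalar, row-at-a-time: one comparison and one branch per value.
count_scalar = 0
for age in ages:
    if age > 30:
        count_scalar += 1

# Vectorized, column-at-a-time: one expression produces a boolean mask
# for the whole column, then the mask is summed in bulk.
count_vectorized = int((ages > 30).sum())

assert count_scalar == count_vectorized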

14.5 Late Materialization

Column stores delay row reconstruction:

    Query: SELECT name, email FROM users WHERE age > 30 AND city = 'NYC'

    Early Materialization (reconstruct rows first):
    1. Scan age column, get matching row IDs: {1, 3, 5, 7, ...}
    2. For each ID, reconstruct full row
    3. Filter by city
    4. Extract name, email

    Late Materialization (defer reconstruction):
    1. Scan age column, get positions: {1, 3, 5, 7, ...}
    2. Scan city column at those positions only
    3. Filter: positions {1, 3, ...} remain
    4. Only now fetch name, email for final positions

    Benefits:
    - Skip columns that get filtered out
    - Less data movement
    - Better cache utilization
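
A toy Python version of late materialization, using position lists over invented columns; only the final survivors ever touch the wide name and email columns:

# Toy columns stored separately; row i is the i-th entry of every list.
age   = [30, 25, 35, 28, 42]
city  = ["NYC", "LA", "NYC", "NYC", "LA"]
name  = ["Alice", "Bob", "Carol", "Dave", "Eve"]
email = ["a@x", "b@x", "c@x", "d@x", "e@x"]

# 1. Scan the age column, keep only matching positions.
positions = [i for i, a in enumerate(age) if a > 30]

# 2. Check the city column at those positions only.
positions = [i for i in positions if city[i] == "NYC"]

# 3. Materialize name and email just for the final positions.
print([(name[i], email[i]) for i in positions])    # [('Carol', 'c@x')]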

14.6 Column Store Architectures

Pure Column Stores

Store nothing but columns:

    Examples: ClickHouse, DuckDB, Vertica

    Each column in separate file
    Row reconstruction via position matching

Hybrid Row/Column

Store data in columnar format within row-oriented pages:

    PAX (Partition Attributes Across):
    ┌─────────────────────────────────────────────────────────────┐
    │ Page contains N rows, stored column-by-column within page   │
    │                                                              │
    │ [all ids] [all names] [all emails] [all ages] [all cities] │
    │     │          │          │           │           │         │
    │   Row 1-N    Row 1-N    Row 1-N     Row 1-N     Row 1-N    │
    └─────────────────────────────────────────────────────────────┘

    Benefits:
    - Column-oriented within page (good compression, vectorization)
    - Row locality across pages (good for reconstruction)

PostgreSQL Column Extensions

-- cstore_fdw (now Citus columnar)
CREATE TABLE events (
    event_id bigint,
    event_type text,
    data jsonb,
    created_at timestamp
) USING columnar;

-- Benefits: Compression, column pruning
-- Trade-off: Slower updates, best for append-only

14.7 OLTP vs OLAP

The fundamental difference in workload patterns:

                    OLTP vs OLAP

    OLTP (Online Transaction Processing):
    ┌─────────────────────────────────────────────────────────────┐
    │ • Many small transactions                                    │
    │ • Point queries: SELECT * WHERE id = ?                      │
    │ • Single-row INSERT, UPDATE, DELETE                         │
    │ • Low latency critical                                       │
    │ • Concurrent users                                           │
    │                                                              │
    │ Best: Row stores (PostgreSQL, MySQL, Oracle)                │
    └─────────────────────────────────────────────────────────────┘

    OLAP (Online Analytical Processing):
    ┌─────────────────────────────────────────────────────────────┐
    │ • Few large queries                                          │
    │ • Full scans: SELECT AVG(x) FROM big_table                  │
    │ • Read-mostly, batch loads                                   │
    │ • Throughput critical                                        │
    │ • Few concurrent queries                                     │
    │                                                              │
    │ Best: Column stores (ClickHouse, DuckDB, Snowflake)         │
    └─────────────────────────────────────────────────────────────┘

14.8 Column Store Examples

ClickHouse

Open-source column store for real-time analytics:

CREATE TABLE events (
    event_date Date,
    event_type String,
    user_id UInt64,
    data String
) ENGINE = MergeTree()
ORDER BY (event_date, event_type);

-- Query billions of rows in seconds
SELECT
    event_type,
    count() as cnt
FROM events
WHERE event_date >= '2024-01-01'
GROUP BY event_type
ORDER BY cnt DESC;

DuckDB

Embedded analytical database:

import duckdb

# Analyze Parquet files directly
result = duckdb.sql("""
    SELECT
        product_category,
        SUM(amount) as total
    FROM 'sales/*.parquet'
    GROUP BY product_category
""").fetchall()

# Or query pandas DataFrames
import pandas as pd
df = pd.read_csv('large_file.csv')
duckdb.sql("SELECT AVG(value) FROM df WHERE category = 'A'")

Parquet File Format

Columnar storage format for data lakes:

    Parquet File Structure:
    ┌─────────────────────────────────────────────────────────────┐
    │ Row Group 1                                                  │
    │   Column Chunk: id (compressed)                             │
    │   Column Chunk: name (compressed)                           │
    │   Column Chunk: email (compressed)                          │
    │                                                              │
    │ Row Group 2                                                  │
    │   Column Chunk: id (compressed)                             │
    │   Column Chunk: name (compressed)                           │
    │   Column Chunk: email (compressed)                          │
    │                                                              │
    │ Footer: Schema, column statistics, offsets                  │
    └─────────────────────────────────────────────────────────────┘

    Benefits:
    - Column pruning (skip columns)
    - Row group pruning (skip row groups via statistics)
    - Excellent compression
    - Language-independent format

14.9 Hybrid Architectures

HTAP (Hybrid Transactional/Analytical Processing)

Single system for both workloads:

                    HTAP ARCHITECTURE

    ┌─────────────────────────────────────────────────────────────┐
    │                     HTAP Database                            │
    │                                                              │
    │  ┌──────────────────────┐  ┌──────────────────────┐        │
    │  │   Row Store          │  │   Column Store        │        │
    │  │   (OLTP Engine)      │  │   (OLAP Engine)       │        │
    │  │                      │  │                       │        │
    │  │  Transactions        │  │  Analytics            │        │
    │  │  Point queries       │  │  Aggregations         │        │
    │  │  Updates             │  │  Scans                │        │
    │  └──────────┬───────────┘  └───────────┬──────────┘        │
    │             │                          │                     │
    │             └──────────┬───────────────┘                     │
    │                        │                                     │
    │              Sync/Replication                                │
    │                                                              │
    │  Examples: TiDB, SingleStore, Oracle Dual Format            │
    └─────────────────────────────────────────────────────────────┘

PostgreSQL + Analytics

-- Use foreign data wrapper to columnar store
CREATE EXTENSION clickhouse_fdw;

CREATE SERVER clickhouse_server
    FOREIGN DATA WRAPPER clickhouse_fdw
    OPTIONS (host 'analytics-server');

CREATE FOREIGN TABLE events_analytical (...)
    SERVER clickhouse_server;

-- OLTP in PostgreSQL, OLAP in ClickHouse
-- Sync via logical replication

14.10 When to Choose What

Choose Row Store When:

    ✓ OLTP workload
    ✓ Point queries by primary key
    ✓ Full-row access common
    ✓ Many small transactions
    ✓ Need strong ACID guarantees
    ✓ Complex transactions with updates

    Examples: User data, orders, inventory

Choose Column Store When:

    ✓ OLAP workload
    ✓ Aggregate queries (SUM, AVG, COUNT)
    ✓ Accessing few columns from wide tables
    ✓ Large sequential scans
    ✓ Append-mostly data
    ✓ Historical/time-series data

    Examples: Event logs, metrics, analytics

Consider Hybrid When:

    ✓ Need both OLTP and OLAP
    ✓ Real-time analytics on transactional data
    ✓ Can tolerate some complexity
    ✓ Single source of truth required

14.11 Performance Comparison

    Benchmark: 1 billion rows, 100 columns

    Query: SELECT AVG(col1) FROM table

    Row Store:    ~300 seconds
    Column Store: ~3 seconds
    Improvement:  100x

    Query: SELECT * FROM table WHERE id = 12345

    Row Store:    ~1 millisecond (with index)
    Column Store: ~100 milliseconds
    Degradation:  100x

    The right tool for the right job!

14.12 Summary

Column stores and row stores serve different needs:

    Aspect           Row Store          Column Store
    ──────────────────────────────────────────────────
    Storage          Row by row         Column by column
    Best for         OLTP               OLAP
    Point queries    Fast               Slow
    Full scans       Slow               Fast
    Compression      Moderate           Excellent
    Write speed      Fast               Slower
    Update speed     Fast               Slow

Key takeaways:

  • Row stores excel at transactional workloads
  • Column stores excel at analytical workloads
  • Compression is much better in column stores
  • Vectorized execution accelerates column processing
  • Hybrid architectures try to provide both

What’s Next

In Chapter 15, we’ll explore distributed databases and replication—how databases scale beyond a single machine.


“A column store and a row store looking at the same data see entirely different things. Choose the perspective that matches your questions.”

Chapter 15: Distributed Databases and Replication

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport

Single-server databases have limits: storage capacity, query throughput, availability. Distributed databases spread data across multiple machines to overcome these limits—but at the cost of significant complexity.


15.1 Why Distribute?

Scaling Vertically vs Horizontally

    Vertical Scaling (Scale Up):
    ┌───────────────────┐      ┌───────────────────┐
    │ Small Server      │  →   │ Big Server        │
    │ 4 cores, 16GB RAM │      │ 64 cores, 1TB RAM │
    └───────────────────┘      └───────────────────┘

    Limits: Hardware maxes out, and cost grows much faster than capacity

    Horizontal Scaling (Scale Out):
    ┌───────────────────┐      ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
    │ Small Server      │  →   │Node1│ │Node2│ │Node3│ │Node4│
    │                   │      └─────┘ └─────┘ └─────┘ └─────┘
    └───────────────────┘

    Benefits: Linear cost scaling, no hardware ceiling
    Challenge: Distributed systems are HARD

Reasons to Distribute

  1. Capacity: Data doesn’t fit on one machine
  2. Throughput: More machines = more queries/second
  3. Availability: Survive machine failures
  4. Latency: Place data closer to users geographically

15.2 Replication

Replication copies data to multiple nodes:

                    REPLICATION

    ┌─────────────────────────────────────────────────────────────┐
    │                        Primary                               │
    │                     ┌─────────┐                             │
    │                     │  Data   │                             │
    │                     └────┬────┘                             │
    │                          │                                   │
    │            ┌─────────────┼─────────────┐                    │
    │            │             │             │                    │
    │            ▼             ▼             ▼                    │
    │      ┌─────────┐   ┌─────────┐   ┌─────────┐              │
    │      │ Replica │   │ Replica │   │ Replica │              │
    │      │   #1    │   │   #2    │   │   #3    │              │
    │      └─────────┘   └─────────┘   └─────────┘              │
    │                                                              │
    │  Benefits:                                                   │
    │  - Fault tolerance (survive node failure)                   │
    │  - Read scalability (spread reads across replicas)          │
    │  - Geographic distribution (low latency)                    │
    └─────────────────────────────────────────────────────────────┘

Synchronous vs Asynchronous Replication

    Synchronous:
    ┌──────────────────────────────────────────────────────────┐
    │ 1. Client writes to primary                               │
    │ 2. Primary sends to replica                               │
    │ 3. Replica acknowledges                                   │
    │ 4. Primary acknowledges to client                         │
    │                                                           │
    │ Pros: No data loss on failure                            │
    │ Cons: Higher latency (wait for replica)                  │
    └──────────────────────────────────────────────────────────┘

    Asynchronous:
    ┌──────────────────────────────────────────────────────────┐
    │ 1. Client writes to primary                               │
    │ 2. Primary acknowledges to client immediately            │
    │ 3. Primary sends to replica (background)                 │
    │                                                           │
    │ Pros: Lower latency                                       │
    │ Cons: Potential data loss if primary fails               │
    └──────────────────────────────────────────────────────────┘
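
To make the latency/durability trade-off concrete, here is a minimal Python sketch of the two write paths. It is not any real database's replication code; the replica objects and their apply() method are hypothetical.

import threading

class Primary:
    """Toy primary illustrating synchronous vs. asynchronous replication."""

    def __init__(self, replicas):
        self.replicas = replicas   # hypothetical replica objects with .apply(record)
        self.log = []

    def write_sync(self, record):
        # Synchronous: acknowledge the client only after every replica
        # has confirmed the write -> no data loss, higher latency.
        self.log.append(record)
        for replica in self.replicas:
            replica.apply(record)          # blocks until the replica confirms
        return "ACK"                        # record now exists on all copies

    def write_async(self, record):
        # Asynchronous: acknowledge immediately and ship to replicas in the
        # background -> low latency, but the record is lost if the primary
        # crashes before replication completes.
        self.log.append(record)
        threading.Thread(
            target=lambda: [r.apply(record) for r in self.replicas],
            daemon=True,
        ).start()
        return "ACK"                        # optimistic: replicas may lag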

Leader-Based Replication

                    LEADER-FOLLOWER

    ┌─────────────┐
    │   Leader    │ ← All writes go here
    │  (Primary)  │
    └──────┬──────┘
           │ Replication stream (WAL)
    ┌──────┴───────────────────┐
    │              │           │
    ▼              ▼           ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│Follower │  │Follower │  │Follower │  ← Serve reads
│   #1    │  │   #2    │  │   #3    │
└─────────┘  └─────────┘  └─────────┘

PostgreSQL, MySQL, and most traditional databases use this model.

Leaderless Replication

                    LEADERLESS

    ┌─────────────────────────────────────────────────────────────┐
    │                    Client writes to N nodes                  │
    │                                                              │
    │      ┌─────────┐   ┌─────────┐   ┌─────────┐               │
    │      │  Node   │   │  Node   │   │  Node   │               │
    │      │   A     │   │   B     │   │   C     │               │
    │      └─────────┘   └─────────┘   └─────────┘               │
    │          ▲             ▲             ▲                      │
    │          │             │             │                      │
    │          └─────────────┴─────────────┘                      │
    │                        │                                     │
    │                   Write to all                               │
    │              Wait for W acknowledgments                      │
    │                                                              │
    │  Quorum: W + R > N ensures read sees latest write           │
    │  Example: N=3, W=2, R=2                                     │
    └─────────────────────────────────────────────────────────────┘

Cassandra and DynamoDB use leaderless replication.
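
The quorum rule is easy to express in code. The sketch below assumes hypothetical replica objects with put() and get() methods that store a (version, value) pair per key; it illustrates the W + R > N overlap, not Cassandra's or DynamoDB's actual implementation.

N, W, R = 3, 2, 2                    # replicas, write quorum, read quorum
assert W + R > N                     # overlap guarantees a read sees the latest acknowledged write

def quorum_write(replicas, key, value, version):
    """Succeeds once W replicas have acknowledged the write."""
    acks = 0
    for replica in replicas:
        if replica.put(key, version, value):   # hypothetical replica API
            acks += 1
        if acks >= W:
            return True
    return False                               # not enough acknowledgments: write fails

def quorum_read(replicas, key):
    """Reads from R replicas and returns the value with the newest version."""
    replies = [replica.get(key) for replica in replicas[:R]]   # each reply is (version, value)
    return max(replies, key=lambda reply: reply[0])[1]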


15.3 Partitioning (Sharding)

Partitioning splits data across multiple nodes:

                    HORIZONTAL PARTITIONING

    Users Table (10 million rows):

    ┌─────────────────────────────────────────────────────────────┐
    │                      Partitioning Strategy                   │
    │                    Hash(user_id) mod 4                       │
    └─────────────────────────────────────────────────────────────┘
                                  │
         ┌────────────────────────┼────────────────────────┐
         │                        │                        │
         ▼                        ▼                        ▼
    ┌─────────┐             ┌─────────┐             ┌─────────┐
    │ Shard 0 │             │ Shard 1 │             │ Shard 2 │ ...
    │ 2.5M    │             │ 2.5M    │             │ 2.5M    │
    │ rows    │             │ rows    │             │ rows    │
    │         │             │         │             │         │
    │user_id  │             │user_id  │             │user_id  │
    │mod 4 = 0│             │mod 4 = 1│             │mod 4 = 2│
    └─────────┘             └─────────┘             └─────────┘

Partitioning Strategies

Hash Partitioning:

    partition = hash(key) mod num_partitions

    Pros: Even distribution
    Cons: Can't do range queries efficiently
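
As a minimal sketch (the partition count and the partition_for() helper are invented for illustration), hash routing looks like this; a stable hash such as SHA-1 is used so routing stays consistent across processes:

import hashlib

NUM_PARTITIONS = 4

def partition_for(key):
    """Hash partitioning: hash the key, then take it modulo the partition count."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for(12345))   # user 12345 always routes to the same shard

Note that plain modulo hashing reshuffles most keys whenever the partition count changes, which is why many systems layer consistent hashing on top of this idea.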

Range Partitioning:

    user_id 1 to 1,000,000         → Partition 1
    user_id 1,000,001 to 2,000,000 → Partition 2
    ...

    Pros: Range queries stay on one partition
    Cons: Hot spots if data is skewed

Directory-Based Partitioning:

    A lookup table maps each key to its partition
    Maximum flexibility (keys can be moved individually)
    But: the lookup table itself becomes a bottleneck and a single point of failure

Cross-Partition Queries

    -- Single partition (fast):
    SELECT * FROM users WHERE user_id = 12345;
    -- Routes to one shard

    -- Cross-partition (slow):
    SELECT * FROM users WHERE name = 'Alice';
    -- Must query ALL shards, combine results

    -- Aggregation (very slow):
    SELECT COUNT(*) FROM users;
    -- Query all shards, sum results
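
A router handles the cross-partition case by scattering the query to every shard and gathering the partial results. The sketch below assumes hypothetical shard objects with a query() method:

def scatter_gather_count(shards):
    """Cross-partition aggregation: each shard runs the query locally,
    the router sums the partial results. Total latency is governed by
    the slowest shard."""
    partial_counts = [
        shard.query("SELECT COUNT(*) FROM users")   # hypothetical shard API
        for shard in shards
    ]
    return sum(partial_counts)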

15.4 The CAP Theorem

You can’t have all three:

                    CAP THEOREM

              Consistency
                  /\
                 /  \
                /    \
               /      \
              /   CA   \
             /  (RDBMS) \
            /____________\
           /\            /\
          /  \          /  \
         / CP \        / AP \
         /(HBase)\     /(Cassandra)\
        /________\    /_____________\
    Partition        Availability
    Tolerance

  • Consistency: all nodes see the same data
  • Availability: every request receives a response
  • Partition tolerance: the system keeps working despite network failures

In a distributed system, network partitions WILL happen. So you must choose:

  • CP: Consistent but may be unavailable during partition
  • AP: Available but may return stale data during partition

CAP in Practice

    PostgreSQL (single node): CA
    - Consistent and available
    - Not partition tolerant (single node!)

    PostgreSQL with sync replication: CP
    - Consistent (waits for replica)
    - Unavailable if replica unreachable

    Cassandra: AP
    - Available (writes succeed)
    - Eventually consistent (reads may be stale)

    CockroachDB, Spanner: CP
    - Consistent (distributed transactions)
    - Unavailable if can't reach quorum

15.5 Consistency Models

Strong Consistency

Every read sees the most recent write:

    Time:     T1          T2          T3
    Write:   X=1    →
    Read:             →   X=1   →   X=1

    All reads after write see the new value
    Requires coordination (slow)

Eventual Consistency

Reads may see old data, but eventually converge:

    Time:     T1          T2          T3          T4
    Write:   X=1
    Read A:             X=0 (stale!)
    Read B:                         X=1
    Read C:                                     X=1

    Eventually all reads return 1
    No coordination required (fast)

Read-Your-Writes Consistency

A user always sees their own writes:

    User A writes X=1
    User A reads X → sees 1 (guaranteed)
    User B reads X → might see 0 (stale OK)

    Common in session-scoped scenarios
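
One common way to provide this guarantee is session routing: send a user's reads to the primary for a short window after that user writes. The sketch below illustrates the idea; the node API (put()/get()) and the 5-second lag bound are assumptions made for the example.

import time

READ_YOUR_WRITES_WINDOW = 5.0   # seconds; assumed upper bound on replication lag

class SessionRouter:
    """Route a user's reads to the primary shortly after they write, so they
    always observe their own writes; other users may read from replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.last_write = {}          # user_id -> timestamp of that user's last write

    def write(self, user_id, key, value):
        self.primary.put(key, value)  # hypothetical node API
        self.last_write[user_id] = time.monotonic()

    def read(self, user_id, key):
        recently_wrote = (
            time.monotonic() - self.last_write.get(user_id, 0.0)
            < READ_YOUR_WRITES_WINDOW
        )
        node = self.primary if recently_wrote else self.replicas[0]
        return node.get(key)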

Causal Consistency

If A causes B, everyone sees them in that order:

    User posts message, then replies to it
    Viewers always see post before reply (never reply without post)

15.6 Distributed Transactions

Transactions spanning multiple nodes are hard:

Two-Phase Commit (2PC)

                    TWO-PHASE COMMIT

    ┌─────────────┐
    │ Coordinator │
    └──────┬──────┘
           │
    Phase 1: PREPARE
           │
    ┌──────┴──────────────────────────────┐
    │              │                      │
    ▼              ▼                      ▼
┌───────┐      ┌───────┐             ┌───────┐
│Node A │      │Node B │             │Node C │
│       │      │       │             │       │
│Prepare│      │Prepare│             │Prepare│
│  OK   │      │  OK   │             │  OK   │
└───┬───┘      └───┬───┘             └───┬───┘
    │              │                      │
    └──────────────┴──────────────────────┘
           │
           │ All said OK
           │
    Phase 2: COMMIT
           │
    ┌──────┴──────────────────────────────┐
    │              │                      │
    ▼              ▼                      ▼
┌───────┐      ┌───────┐             ┌───────┐
│Node A │      │Node B │             │Node C │
│COMMIT │      │COMMIT │             │COMMIT │
└───────┘      └───────┘             └───────┘
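
In code, the coordinator's side of the protocol is short. This is a bare-bones sketch assuming hypothetical participant objects with prepare(), commit(), and abort() methods; real implementations also log every decision durably so the coordinator can resume after a crash.

def two_phase_commit(participants):
    """Coordinator side of 2PC."""
    # Phase 1: ask everyone to PREPARE (make the change durable but not visible)
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if every participant voted yes, otherwise abort all
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMITTED"
    for p in participants:
        p.abort()
    return "ABORTED"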

Problems with 2PC:

  • Coordinator is single point of failure
  • Blocking: if the coordinator dies after PREPARE, participants sit holding locks until it recovers
  • High latency (multiple round trips)

Consensus Protocols

Paxos and Raft provide fault-tolerant consensus:

                    RAFT CONSENSUS

    Leader Election:
    - Nodes vote for leader
    - Leader handles all writes
    - Followers replicate

    Write:
    1. Client sends to leader
    2. Leader writes to log
    3. Leader replicates to followers
    4. Majority acknowledge
    5. Leader commits
    6. Leader responds to client

    If leader fails: New election
    Safety: Never lose committed writes

CockroachDB, etcd, and TiKV use Raft.
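
The heart of the write path is the majority-commit rule. The sketch below is deliberately simplified (no terms, no log matching, no elections), and the append_entries() call on the follower objects is hypothetical:

def leader_append(leader_log, followers, entry):
    """Simplified Raft-style commit rule: an entry is committed once a
    majority of the cluster (leader + followers) has stored it."""
    leader_log.append(entry)
    acks = 1                                   # the leader counts as one copy
    for follower in followers:
        if follower.append_entries([entry]):   # hypothetical RPC, True on success
            acks += 1

    cluster_size = len(followers) + 1
    return acks > cluster_size // 2            # committed only with a strict majority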


15.7 Distributed SQL Databases

NewSQL databases provide distributed ACID:

CockroachDB

-- Looks like PostgreSQL
CREATE TABLE accounts (
    id STRING PRIMARY KEY,
    balance DECIMAL NOT NULL
);

-- But data is distributed across nodes
-- Transactions work across shards!
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT;  -- Distributed transaction, ACID guaranteed

Google Spanner

Global distribution with strong consistency:

    TrueTime: GPS-synchronized clocks
    Enables: Global consistent reads
    Trade-off: Higher latency (cross-region coordination)

TiDB

MySQL-compatible distributed database:

                    TIDB ARCHITECTURE

    ┌─────────────────────────────────────────────────────────────┐
    │                        TiDB Server                           │
    │                    (SQL processing)                          │
    └────────────────────────────┬────────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │                         │
              ┌─────┴─────┐             ┌─────┴─────┐
              │    TiKV   │             │   TiFlash │
              │ (Row OLTP)│             │ (Column   │
              │           │             │   OLAP)   │
              └───────────┘             └───────────┘

    HTAP: Same data accessible via row or column engine

15.8 NoSQL Approaches

Key-Value Stores

    Redis, etcd, DynamoDB

    Simple model: GET(key) → value, PUT(key, value)

    Distribution: Hash key → partition
    Replication: Each partition replicated

    No joins, limited queries
    But: Extremely scalable

Document Stores

    MongoDB, CouchDB

    Store JSON/BSON documents
    Query within documents
    Flexible schema

    Distribution: By document ID or shard key

Wide-Column Stores

    Cassandra, HBase, Bigtable

    Row key → columns (sparse, dynamic)

    ┌─────────────────────────────────────────────────────────────┐
    │ Row: user:12345                                              │
    │   email: alice@example.com                                   │
    │   profile:name: Alice                                        │
    │   profile:city: NYC                                          │
    │   settings:theme: dark                                       │
    │   login:2024-03-15: 10:30:00                                │
    │   login:2024-03-14: 09:15:00                                │
    └─────────────────────────────────────────────────────────────┘

    Columns are dynamic, rows can have different columns

15.9 Challenges in Distribution

Handling Failures

    Network Partitions:
    - Nodes can't communicate
    - Must decide: consistency or availability

    Node Failures:
    - Detect via heartbeat/gossip
    - Failover to replica
    - Rebalance data

    Data Center Failures:
    - Entire DC goes offline
    - Multi-DC replication essential

Consistency vs Performance

    Synchronous replication:
    - Write latency = max(replica latencies)
    - WAN: 50-200ms
    - Users notice!

    Asynchronous replication:
    - Write latency = primary only
    - But: data loss on failure

    Trade-off: How much data can you lose?

Clock Synchronization

    Without synchronized clocks:
    - Can't order events globally
    - Transaction conflicts ambiguous

    Solutions:
    - Logical clocks (Lamport, Vector clocks)
    - TrueTime (GPS + atomic clocks)
    - Hybrid logical clocks (HLC)
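
Lamport clocks are simple enough to sketch in a few lines. This is a generic illustration of the idea, not any particular system's implementation:

class LamportClock:
    """Logical clock: orders events without synchronized wall clocks.
    Increment on local events; on receive, take max(local, received) + 1."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time              # timestamp attached to the outgoing message

    def receive(self, message_time):
        self.time = max(self.time, message_time) + 1
        return self.time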

15.10 Practical Considerations

When to Distribute

    ✓ Data exceeds single machine capacity
    ✓ Need higher availability than one server
    ✓ Read throughput exceeds single server
    ✓ Users are geographically distributed

    ✗ "Just in case" - Premature optimization
    ✗ Single server handles load fine
    ✗ Data fits on one machine

Starting Simple

    Stage 1: Single PostgreSQL
    - Handles more than you think
    - Vertical scaling first

    Stage 2: Read Replicas
    - Primary for writes
    - Replicas for reads
    - Still relatively simple

    Stage 3: Sharding
    - When single primary is bottleneck
    - Significant complexity increase

    Stage 4: Distributed Database
    - When you truly need it
    - Or use managed service

Managed Services

    AWS RDS: Managed PostgreSQL/MySQL with read replicas
    AWS Aurora: Distributed storage, PostgreSQL/MySQL compatible
    Google Cloud Spanner: Globally distributed, strong consistency
    CockroachDB Serverless: Distributed PostgreSQL-like
    PlanetScale: Distributed MySQL (Vitess)
    MongoDB Atlas: Managed distributed MongoDB

    Trade-offs:
    + No operational burden
    + Expertise baked in
    - Less control
    - Vendor lock-in
    - Cost at scale

15.11 Summary

Distributed databases trade complexity for scalability:

Replication:

  • Copies data for availability and read scaling
  • Synchronous = consistent but slow
  • Asynchronous = fast but potential data loss

Partitioning:

  • Splits data across nodes for capacity
  • Hash partitioning for even distribution
  • Range partitioning for range queries

CAP Theorem:

  • Choose consistency or availability during partitions
  • Most systems are AP or CP

Distributed Transactions:

  • 2PC works but has issues
  • Consensus protocols (Raft) are more robust

Key Advice:

  • Don’t distribute prematurely
  • Start with single node + read replicas
  • Use managed services when possible
  • Understand your consistency requirements

What’s Next

Congratulations! You’ve completed the main content of “Database Internals.” The appendices provide a glossary of terms and recommendations for further reading.


“Distributed systems are not more complex because engineers like complexity. They’re more complex because the real world is messy, networks fail, and we still want things to work.”

Appendix A: Glossary of Terms

This glossary defines key terms used throughout the book.


A

ACID
The four properties of database transactions: Atomicity, Consistency, Isolation, and Durability.
Analyzer
The query processing component that validates semantic correctness, resolves names, and checks types.
ARIES
Algorithms for Recovery and Isolation Exploiting Semantics. The industry-standard crash recovery algorithm.
AST (Abstract Syntax Tree)
A tree representation of parsed SQL, used for semantic analysis and optimization.

B

B-tree
A self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in logarithmic time. The foundation of most database indexes.
B+ tree
A variant of B-tree where all values are stored in leaf nodes, which are linked together for efficient range scans.
Bitmap Index
An index that uses bitmaps to represent which rows contain each distinct value. Efficient for low-cardinality columns.
Bloom Filter
A probabilistic data structure that can tell you if an element is definitely not in a set. Used by LSM trees to avoid unnecessary disk reads.
BRIN (Block Range Index)
A compact index storing min/max values for ranges of pages. Efficient for naturally ordered large tables.
Buffer Pool
An area of memory that caches disk pages to reduce I/O. Also called buffer cache or page cache.

C

CAP Theorem
The principle that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance.
Cardinality
The number of distinct values in a column, or the estimated number of rows a query will return.
Checkpoint
A point at which the database ensures all committed changes are written to disk, limiting recovery time.
Clustered Index
An index where the table data is physically stored in index order. InnoDB primary keys are clustered indexes.
CLR (Compensation Log Record)
A WAL record written during undo operations to ensure idempotent recovery.
Column Store
A storage format where each column is stored separately, optimized for analytical queries.
Compaction
The process of merging LSM tree SSTables to reduce read amplification and reclaim space.
Covering Index
An index containing all columns needed by a query, enabling index-only scans.

D

Deadlock
A situation where two or more transactions are waiting for each other’s locks, creating a cycle where none can proceed.
Dirty Page
A page in the buffer pool that has been modified but not yet written to disk.
Double-Write Buffer
An InnoDB mechanism that writes pages to a sequential buffer before their final location, protecting against torn pages.
Durability
The ACID guarantee that committed transactions survive system failures.

E

Eventual Consistency
A consistency model where reads may return stale data, but all replicas will eventually converge to the same state.
Exclusive Lock (X Lock)
A lock that prevents all other access to the locked resource.
Executor
The query processing component that runs the execution plan and returns results.

F

Fanout
The number of children a B-tree node can have. Higher fanout means shallower trees.
Fill Factor
The percentage of space to use when initially filling index pages, leaving room for future inserts.
Foreign Data Wrapper (FDW)
PostgreSQL mechanism to access external data sources as if they were local tables.
Free Space Map (FSM)
A data structure tracking available space in each page for efficient insert placement.
fsync
A system call that forces buffered data to be written to physical storage, ensuring durability.
Full Page Write
Writing an entire page to WAL after a checkpoint, enabling torn page recovery.

G

GIN (Generalized Inverted Index)
A PostgreSQL index type for composite values like arrays, JSONB, and full-text search.
GiST (Generalized Search Tree)
A PostgreSQL index type supporting custom data types and operations like geometric queries.

H

Hash Index
An index using a hash table for O(1) lookup, but not supporting range queries.
Hash Join
A join algorithm that builds a hash table from one relation and probes it with the other.
Heap
A table organization where rows are stored in no particular order. PostgreSQL uses heap storage.
Histogram
Statistics showing the distribution of values in a column, used by the query optimizer.
Hot Standby
A replica that can serve read queries while replicating from the primary.
HTAP
Hybrid Transactional/Analytical Processing. Systems handling both OLTP and OLAP workloads.

I

Index-Only Scan
A scan that retrieves all needed data from the index without accessing the heap.
Isolation Level
A transaction setting controlling what data modifications are visible. Levels include Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Intent Lock
A lock indicating intention to lock finer-grained resources, enabling hierarchical locking.

J

Join
An operation combining rows from two or more tables based on a related column.

K

Key
A value or set of values used to identify or locate data, such as a primary key or index key.

L

Latch
A lightweight lock protecting internal data structures (like B-tree pages) for very short durations.
Leader-Based Replication
A replication topology where one node (leader/primary) handles all writes and replicates to followers.
Leveled Compaction
An LSM tree compaction strategy organizing SSTables into levels with size ratios.
Lock
A mechanism preventing concurrent access to data to maintain consistency.
Lock Escalation
Converting many fine-grained locks to fewer coarse-grained locks to reduce overhead.
Logical Logging
Recording the operation performed rather than the bytes changed.
LSM Tree (Log-Structured Merge Tree)
A write-optimized data structure that buffers writes in memory and periodically flushes sorted files to disk.
LSN (Log Sequence Number)
A unique, monotonically increasing identifier for each WAL record.

M

MCV (Most Common Values)
Statistics listing the most frequent values in a column and their frequencies.
Memtable
The in-memory component of an LSM tree where recent writes are stored.
Merge Join
A join algorithm that merges two sorted inputs.
MVCC (Multi-Version Concurrency Control)
A technique maintaining multiple versions of data to allow concurrent reads and writes without blocking.

N

Nested Loop Join
A join algorithm that, for each row in the outer relation, scans the inner relation.
NVMe
A storage protocol optimized for SSDs, providing lower latency than SATA.

O

OLAP (Online Analytical Processing)
Workloads characterized by complex queries over large datasets, typically aggregations and scans.
OLTP (Online Transaction Processing)
Workloads characterized by many small transactions with point queries and updates.
Optimizer
The query processing component that evaluates execution strategies and chooses the best plan.

P

Page
The fixed-size unit of storage and I/O, typically 4KB, 8KB, or 16KB.
Parser
The query processing component that converts SQL text into an Abstract Syntax Tree.
Partition Tolerance
The ability of a distributed system to continue operating despite network partitions.
Partitioning (Sharding)
Dividing data across multiple nodes based on a partition key.
Phantom Read
An anomaly where new rows appear in repeated queries within a transaction.
Physical Logging
Recording the exact bytes changed in storage.
PITR (Point-in-Time Recovery)
Restoring a database to a specific moment by replaying WAL.
Plan Cache
Storage for prepared statement execution plans to avoid repeated planning.
Predicate Pushdown
Moving filter conditions closer to data sources for earlier elimination.
Prefetching
Reading data into cache before it’s requested, anticipating future access.
Primary Key
A column or set of columns uniquely identifying each row in a table.

Q

Query Plan
A tree of operators describing how to execute a query.
Quorum
The minimum number of nodes that must agree for a distributed operation to succeed.

R

Raft
A consensus protocol for maintaining a replicated log across distributed nodes.
Range Partitioning
Distributing data based on value ranges of the partition key.
Read Amplification
The ratio of data read from storage to data actually needed.
Read Committed
An isolation level where transactions only see committed data, but may see different data on repeated reads.
Read-Your-Writes Consistency
A guarantee that a user always sees their own recent writes.
Redo
Reapplying logged changes during recovery to restore committed data.
Repeatable Read
An isolation level where transactions see a consistent snapshot, preventing non-repeatable reads.
Replication
Copying data to multiple nodes for availability and read scalability.
Row Store
A storage format where all columns of a row are stored together, optimized for transactional queries.
RTO (Recovery Time Objective)
The maximum acceptable time to restore service after a failure.
RPO (Recovery Point Objective)
The maximum acceptable data loss measured in time.

S

Selectivity
The fraction of rows that match a predicate, used in cost estimation.
Sequential Scan (Seq Scan)
Reading all pages of a table in order.
Serializable
The strictest isolation level, guaranteeing transactions execute as if in serial order.
Shared Lock (S Lock)
A lock allowing multiple readers but preventing writers.
Sharding
See Partitioning.
Slotted Page
A page format with a slot array pointing to variable-length records.
Snapshot Isolation
A consistency level where transactions see a snapshot from their start time.
SSD (Solid State Drive)
Flash-based storage with no moving parts, providing faster random I/O than HDDs.
SSI (Serializable Snapshot Isolation)
A technique providing serializable isolation using snapshot isolation with conflict detection.
SSTable (Sorted String Table)
An immutable, sorted file in an LSM tree containing key-value pairs.
Statistics
Data about table contents (row counts, distinct values, distributions) used by the optimizer.
Strong Consistency
A guarantee that reads always see the most recent write.

T

TID (Tuple ID)
PostgreSQL’s row identifier consisting of (page number, slot number).
TOAST (The Oversized-Attribute Storage Technique)
PostgreSQL’s mechanism for storing large values out-of-line.
Tombstone
A marker indicating a deleted key in an LSM tree, persisting until compaction.
Torn Page
A page partially written due to a crash, containing a mix of old and new data.
Transaction
A unit of work that is atomic, consistent, isolated, and durable.
Two-Phase Commit (2PC)
A protocol for distributed transactions where participants first prepare, then commit.
Two-Phase Locking (2PL)
A protocol where transactions acquire all locks before releasing any.

U

Undo
Rolling back uncommitted changes during recovery.

V

Vacuum
PostgreSQL’s process for removing dead rows and updating statistics.
Vectorized Execution
Processing data in batches (vectors) rather than row-by-row for better performance.
Visibility Map
PostgreSQL structure tracking which pages contain only visible (all-committed) tuples.

W

WAL (Write-Ahead Log)
A log of changes written before data modifications, ensuring durability and enabling recovery.
Working Set
The portion of data actively being accessed, ideally fitting in the buffer pool.
Write Amplification
The ratio of data written to storage versus data logically changed.

X, Y, Z

xmin/xmax
PostgreSQL tuple header fields recording the creating and deleting transaction IDs.

Appendix B: Further Reading

This appendix provides recommendations for deepening your understanding of database internals.


Books

Database Fundamentals

“Database Internals” by Alex Petrov (O’Reilly, 2019) A comprehensive deep-dive into storage engines, distributed systems, and the algorithms powering modern databases. Covers both B-trees and LSM trees in depth.

“Designing Data-Intensive Applications” by Martin Kleppmann (O’Reilly, 2017) Essential reading for anyone building data systems. Covers storage, replication, partitioning, transactions, and distributed systems with exceptional clarity.

“Database System Concepts” by Silberschatz, Korth, and Sudarshan The standard academic textbook covering database theory, SQL, storage, indexing, transactions, and distributed databases.

Specific Topics

“Transaction Processing: Concepts and Techniques” by Jim Gray and Andreas Reuter (1993) The definitive reference on transaction processing, written by pioneers in the field. Covers ACID, recovery, and distributed transactions in depth.

“The Art of PostgreSQL” by Dimitri Fontaine A practical guide to PostgreSQL, covering SQL techniques, performance tuning, and real-world applications.

“High Performance MySQL” by Silvia Botros and Jeremy Tinley (O’Reilly, 4th Edition) Comprehensive coverage of MySQL architecture, optimization, replication, and operations.


Papers

Foundational Papers

“A Relational Model of Data for Large Shared Data Banks” — E.F. Codd (1970) The paper that started it all. Introduced the relational model that underlies SQL databases.

“The Design and Implementation of INGRES” — Stonebraker et al. (1976) Describes one of the first relational database implementations, influencing PostgreSQL’s design.

“ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking” — Mohan et al. (1992) The definitive paper on write-ahead logging and recovery. Still the basis for most database recovery systems.

Index Structures

“Organization and Maintenance of Large Ordered Indexes” — Bayer and McCreight (1972) The original B-tree paper.

“The Ubiquitous B-Tree” — Douglas Comer (1979) An accessible survey of B-tree variants and applications.

“The Log-Structured Merge-Tree (LSM-Tree)” — O’Neil et al. (1996) Introduced LSM trees, now used by LevelDB, RocksDB, Cassandra, and many others.

Concurrency and Transactions

“Granularity of Locks and Degrees of Consistency in a Shared Data Base” — Gray et al. (1976) Introduced isolation levels and hierarchical locking.

“A Critique of ANSI SQL Isolation Levels” — Berenson et al. (1995) Analyzes ANSI isolation levels and introduces snapshot isolation.

“Serializable Snapshot Isolation in PostgreSQL” — Ports and Grittner (2012) Describes PostgreSQL’s implementation of serializable isolation using SSI.

Distributed Systems

“Dynamo: Amazon’s Highly Available Key-value Store” — DeCandia et al. (2007) Influenced many NoSQL databases including Cassandra and Riak.

“Bigtable: A Distributed Storage System for Structured Data” — Chang et al. (2006) Google’s wide-column store that influenced HBase and Cassandra.

“Spanner: Google’s Globally-Distributed Database” — Corbett et al. (2012) Describes TrueTime and globally consistent distributed transactions.

“Calvin: Fast Distributed Transactions for Partitioned Database Systems” — Thomson et al. (2012) An alternative approach to distributed transactions using deterministic ordering.

“In Search of an Understandable Consensus Algorithm (Raft)” — Ongaro and Ousterhout (2014) A more accessible alternative to Paxos, widely implemented in modern systems.

Query Processing

“Access Path Selection in a Relational Database Management System” — Selinger et al. (1979) The classic paper on query optimization from System R.

“Volcano—An Extensible and Parallel Query Evaluation System” — Graefe (1994) Describes the iterator model used by most database executors.

“MonetDB/X100: Hyper-Pipelining Query Execution” — Boncz et al. (2005) Introduced vectorized query execution.


Online Resources

Documentation

PostgreSQL Documentation https://www.postgresql.org/docs/ Exceptionally thorough and readable. The internals sections are particularly valuable.

MySQL Reference Manual https://dev.mysql.com/doc/refman/ Comprehensive documentation including InnoDB internals.

SQLite Documentation https://www.sqlite.org/docs.html Includes detailed explanations of B-tree implementation and file format.

Blogs and Articles

The Morning Paper https://blog.acolyer.org/ Adrian Colyer’s summaries of computer science papers, including many database papers.

Use The Index, Luke https://use-the-index-luke.com/ An excellent resource on SQL indexing and performance.

Brandur’s Blog https://brandur.org/ Deep dives into PostgreSQL internals and database design.

CockroachDB Blog https://www.cockroachlabs.com/blog/ Technical articles on distributed databases and Raft implementation.

PlanetScale Blog https://planetscale.com/blog Articles on MySQL, Vitess, and distributed databases.

Video Resources

CMU Database Group (YouTube) Andy Pavlo’s database courses are freely available and excellent.

PGCon Talks Annual PostgreSQL conference talks cover internals in depth.


Source Code

Reading source code is one of the best ways to understand database internals.

SQLite https://sqlite.org/src/doc/trunk/README.md Small, readable codebase. Excellent for understanding B-trees and the query pipeline.

LevelDB https://github.com/google/leveldb Clean implementation of LSM trees. Well-documented code.

PostgreSQL https://github.com/postgres/postgres Large but well-organized. Start with the documentation on source code structure.

DuckDB https://github.com/duckdb/duckdb Modern analytical database with clean, modern C++ codebase.


Tools for Exploration

PostgreSQL

-- See execution plans
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- Examine buffer cache contents
CREATE EXTENSION pg_buffercache;
SELECT * FROM pg_buffercache;

-- View index structure
CREATE EXTENSION pageinspect;
SELECT * FROM bt_metap('index_name');
SELECT * FROM bt_page_items('index_name', 1);

-- Statistics
SELECT * FROM pg_stats WHERE tablename = 'your_table';

MySQL

-- Execution plans
EXPLAIN FORMAT=TREE SELECT ...;

-- InnoDB status
SHOW ENGINE INNODB STATUS\G

-- Index statistics
SELECT * FROM mysql.innodb_index_stats WHERE table_name = 'your_table';

System Tools

# I/O monitoring
iostat -x 1

# Memory monitoring
vmstat 1

# Trace system calls
strace -f -e read,write,fsync postgres

# Examine file blocks
xxd -l 8192 /path/to/tablefile

Conferences

SIGMOD (ACM Special Interest Group on Management of Data) Premier academic database conference.

VLDB (Very Large Data Bases) Major academic conference on database systems.

PGCon Annual PostgreSQL conference with deep technical content.

Percona Live MySQL and open-source database conference.

CMU Database Conference Industry/academic conference organized by CMU’s database group.


Practice Projects

The best way to learn is to build:

  1. Build a simple key-value store with persistence
  2. Implement a B-tree with insertion, deletion, and search
  3. Create a simple query executor with scan and filter operators
  4. Build a write-ahead log with crash recovery
  5. Implement MVCC for a simple database

Communities

PostgreSQL Mailing Lists https://www.postgresql.org/list/

MySQL Forums https://forums.mysql.com/

Database Internals Discord Community for discussing database internals

r/PostgreSQL and r/Database on Reddit Active communities for database discussions


“The more you learn about how databases work, the more you appreciate both their complexity and their elegance. Keep exploring!”