Chapter 3: Disk I/O and Page Management

“Understanding your storage hardware is half the battle of database performance.”

Databases live at the intersection of two worlds: the fast world of CPU and memory, and the slow world of persistent storage. This chapter explores that boundary—how data moves between disk and memory, why I/O is usually the bottleneck, and how databases optimize for the physical realities of storage hardware.


3.1 The Memory-Storage Performance Gap

The performance gap between memory and storage is enormous:

                    ACCESS LATENCY COMPARISON

    L1 Cache         │▌ 1 ns
    L2 Cache         │▌▌ 4 ns
    L3 Cache         │▌▌▌▌ 10 ns
    RAM              │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌ 100 ns
    NVMe SSD         │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌ 20,000 ns (20 μs)
    SATA SSD         │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌ 100,000 ns (100 μs)
    HDD (spinning)   │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌▌▌▌▌ 10,000,000 ns (10 ms)

                     └─────────────────────────────────────────────────►
                     Logarithmic scale!

Key observations:

  • RAM is ~200x faster than NVMe SSD
  • RAM is ~100,000x faster than spinning disk
  • A single HDD seek (~10 ms) takes as long as ~100,000 RAM accesses (~100 ns each)

This is why databases do everything possible to avoid disk I/O.
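The key observations above follow directly from the latency table; a quick sanity check, using Python as a calculator:

```python
# Approximate latencies from the table above, in nanoseconds.
RAM_NS = 100
NVME_NS = 20_000
HDD_SEEK_NS = 10_000_000

print(NVME_NS // RAM_NS)      # 200    -> RAM is ~200x faster than NVMe
print(HDD_SEEK_NS // RAM_NS)  # 100000 -> one HDD seek ~ 100,000 RAM accesses
```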


3.2 Understanding Storage Hardware

Hard Disk Drives (HDDs)

HDDs store data magnetically on spinning platters:

                    HDD ANATOMY

                    Spindle (motor)
                         │
            ┌────────────┴────────────┐
            │                         │
    ┌───────▼───────────────────────▼───────┐
    │  ╭─────────────────────────────────╮  │
    │ ╭│─────────────────────────────────│╮ │
    │╭─│─────────────────────────────────│─╮│
    ││ │         PLATTER                 │ ││
    │╰─│─────────────────────────────────│─╯│
    │ ╰│─────────────────────────────────│╯ │
    │  ╰─────────────────────────────────╯  │
    └───────────────────────────────────────┘
                    ▲
                    │
              Read/Write Head
              (moves in/out)

Access time components:

  1. Seek time: Move head to correct track (~5-10ms average)
  2. Rotational latency: Wait for sector to rotate under head (~4ms for 7200 RPM)
  3. Transfer time: Read the data (fast once positioned)

Key insight: Sequential reads are MUCH faster than random reads. Once the head is positioned, reading consecutive sectors is nearly free.

    Sequential read: 150+ MB/s
    Random read (4KB): ~100 IOPS = 0.4 MB/s

    That's 375x difference!

Solid State Drives (SSDs)

SSDs use flash memory—no moving parts:

                    SSD ARCHITECTURE

    ┌───────────────────────────────────────────────┐
    │                  CONTROLLER                   │
    │  (manages wear leveling, garbage collection)  │
    ├───────┬───────┬───────┬───────┬───────┬───────┤
    │ NAND  │ NAND  │ NAND  │ NAND  │ NAND  │ NAND  │
    │ Chip  │ Chip  │ Chip  │ Chip  │ Chip  │ Chip  │
    │   0   │   1   │   2   │   3   │   4   │   5   │
    └───────┴───────┴───────┴───────┴───────┴───────┘

SSD characteristics:

  • No seek time—any location accessed equally fast
  • Parallel access across chips
  • Write amplification (must erase blocks before writing)
  • Wear leveling (spreads writes to extend lifespan)

Performance:

  • Random read: 10,000-500,000+ IOPS
  • Sequential read: 500-7000+ MB/s (NVMe)
  • Random writes slower than reads (erase-before-write)

NVMe vs SATA

SATA SSDs: Use the same interface as HDDs. Limited by SATA bandwidth (~550 MB/s).

NVMe SSDs: Connect directly to PCIe. Much higher bandwidth (3000-7000+ MB/s) and lower latency.

    Interface Comparison:

    SATA SSD:  CPU ──► SATA Controller ──► SSD
               Latency: ~100 μs
               Bandwidth: 550 MB/s

    NVMe SSD:  CPU ══════════════════════► SSD
               Latency: ~20 μs
               Bandwidth: 3000+ MB/s

3.3 I/O Patterns and Database Design

Database design is heavily influenced by I/O patterns:

Sequential vs Random I/O

            SEQUENTIAL I/O                    RANDOM I/O

    Read pages: 1, 2, 3, 4, 5, 6         Read pages: 47, 3, 891, 12, 456

    ┌──┬──┬──┬──┬──┬──┬──┬──┐            ┌──┬──┬──┬──┬──┬──┬──┬──┐
    │▓▓│▓▓│▓▓│▓▓│▓▓│▓▓│  │  │            │  │  │▓▓│  │  │  │  │  │ ...
    └──┴──┴──┴──┴──┴──┴──┴──┘            └──┴──┴──┴──┴──┴──┴──┴──┘
       ──────────────────►                  ↑    ↑         ↑
       One seek, continuous read            Multiple seeks required

    HDD: ~150 MB/s                        HDD: ~0.4 MB/s (100 IOPS × 4KB)
    SSD: ~3000 MB/s                       SSD: ~400 MB/s (100K IOPS × 4KB)

Design implications:

  • Log-structured storage (WAL, LSM trees) uses sequential writes
  • B-trees minimize random I/O through high fanout
  • Table scans (sequential) can be faster than index scans (random) for large result sets

Read Amplification vs Write Amplification

Read amplification: Reading more data than needed to answer a query

  • Example: Reading entire 8KB page to get one 100-byte row

Write amplification: Writing more data than the logical change

  • Example: Changing one byte requires rewriting entire 4KB flash block
  • Example: B-tree page splits propagate up the tree

Different storage engines trade off between these:

                READ AMPLIFICATION    WRITE AMPLIFICATION

    B-tree              Low                  Medium
    LSM-tree            High                 High
    Hash index          Very Low             Low

3.4 Page Management Fundamentals

The Page as Unit of I/O

Databases read and write in pages, not bytes:

    Application wants: 100-byte row
    Database reads: 8,192-byte page (8KB)

    Why? Because disk I/O works in blocks.
    Reading 100 bytes costs the same as reading 4KB (or even 8KB).

Page Addressing

Pages are identified by their offset in the file:

    Page Number → File Offset

    Page 0  →  Offset 0
    Page 1  →  Offset 8192
    Page 2  →  Offset 16384
    Page N  →  Offset N × 8192

    To read page 42:
    1. Seek to offset 42 × 8192 = 344,064
    2. Read 8192 bytes
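The two-step recipe above maps directly onto a positional read. A minimal sketch (the page contents here are invented for the demo):

```python
import os
import tempfile

PAGE_SIZE = 8192  # 8 KB pages, as in the example above

def read_page(fd, page_no):
    """Read one fixed-size page: offset = page_no * PAGE_SIZE."""
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)  # seek + read in one call

# Demo: write three pages, then fetch page 2 directly by number.
with tempfile.TemporaryFile() as f:
    for i in range(3):
        f.write(bytes([i]) * PAGE_SIZE)
    f.flush()
    page = read_page(f.fileno(), 2)
    assert page == b"\x02" * PAGE_SIZE
```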

Free Space Management

How does the database know which pages have room for new data?

Free Space Map (FSM): A bitmap or array tracking free space per page.

                    FREE SPACE MAP

    Page    Free Space    Category
    ────    ──────────    ────────
    0       8%            Nearly full
    1       45%           Half full
    2       0%            Full
    3       92%           Mostly empty
    4       100%          Empty
    ...

    INSERT needs 200 bytes → Check FSM → Use Page 3 or 4

PostgreSQL uses a hierarchical FSM:

  • Top level: Coarse-grained (ranges of pages)
  • Bottom level: Fine-grained (individual pages)

This lets the database quickly find a page with enough free space without scanning the entire map.
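A first-fit lookup over a toy FSM might look like this. The byte counts roughly match the percentages in the table above; real FSMs store coarser categories and may prefer emptier pages, which is why the example above picks page 3 or 4:

```python
PAGE_SIZE = 8192
# Toy free-space map: page number -> free bytes (illustrative values).
fsm = {0: 655, 1: 3686, 2: 0, 3: 7536, 4: 8192}

def find_page_with_space(fsm, needed):
    """Return the first page with at least `needed` free bytes, else None."""
    for page, free in sorted(fsm.items()):
        if free >= needed:
            return page
    return None

print(find_page_with_space(fsm, 5000))  # -> 3 (first page with >= 5000 free)
```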


3.5 Allocating and Extending Files

File Extension

When a table grows, the database must allocate more space:

Approach 1: Extend on demand

Need new page → Extend file by one page → Use it

Simple but causes many small extends.

Approach 2: Extend in chunks

Need new page → Extend file by 1MB (128 pages) → Use first one

Fewer, larger extends. Better sequential layout.
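A sketch of chunked extension, assuming 8KB pages and 1MB chunks (`ensure_page` is a hypothetical helper; real engines also zero-fill and register the new pages in the FSM):

```python
import os
import tempfile

PAGE_SIZE = 8192
CHUNK_PAGES = 128  # grow 1 MB (128 pages) at a time

def ensure_page(f, page_no):
    """Extend the file in whole chunks so that page_no exists."""
    needed = (page_no + 1) * PAGE_SIZE
    if os.fstat(f.fileno()).st_size < needed:
        chunk = CHUNK_PAGES * PAGE_SIZE
        f.truncate(((needed + chunk - 1) // chunk) * chunk)  # round up to chunk

with tempfile.TemporaryFile() as f:
    ensure_page(f, 0)    # first page -> file grows to a full 1 MB chunk
    assert os.fstat(f.fileno()).st_size == 1 << 20
    ensure_page(f, 127)  # still inside the first chunk: no extension needed
    assert os.fstat(f.fileno()).st_size == 1 << 20
    ensure_page(f, 128)  # page 128 spills into a second chunk
    assert os.fstat(f.fileno()).st_size == 2 << 20
```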

Segment Files

Large tables split into multiple files:

    users           # First 1GB
    users.1         # Second 1GB
    users.2         # Third 1GB

    Page 150,000 is in users.1 (pages 131,072 and up) at offset:
    (150,000 - 131,072) × 8192 = 155,058,176

Benefits:

  • Avoids filesystem maximum file size limits
  • Allows parallel I/O to different segments
  • Easier backup of individual segments
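Checking the segment arithmetic: with 8KB pages, a 1GB segment holds 131,072 pages, so locating a logical page is two integer operations (a sketch; real systems also track and cache the open segment files):

```python
PAGE_SIZE = 8192
SEGMENT_PAGES = (1 << 30) // PAGE_SIZE  # 1 GB segments -> 131,072 pages each

def locate(page_no):
    """Map a logical page number to (segment index, byte offset in segment)."""
    return page_no // SEGMENT_PAGES, (page_no % SEGMENT_PAGES) * PAGE_SIZE

print(locate(150_000))  # (1, 155058176) -> segment users.1, offset 155,058,176
```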

3.6 Direct I/O vs Buffered I/O

Operating systems cache file data in the page cache. Databases have a choice:

Buffered I/O (using OS page cache)

    Application ──► Database Buffer Pool ──► OS Page Cache ──► Disk

    Read path:
    1. Check database buffer pool
    2. If miss, read from OS (might be in OS cache)
    3. If miss, read from disk
    4. Data cached in BOTH locations

Problem: Double buffering wastes memory. Data may be cached twice.

Direct I/O (O_DIRECT, bypassing OS cache)

    Application ──► Database Buffer Pool ──────────────────► Disk

    Read path:
    1. Check database buffer pool
    2. If miss, read directly from disk
    3. Data cached only in database buffer pool

PostgreSQL: Uses buffered I/O by default. The OS provides read-ahead and write-behind.

MySQL InnoDB: Commonly configured with O_DIRECT (via innodb_flush_method). Takes full control of caching.

SQLite: Uses buffered I/O. Relies on OS for caching.

    When to use Direct I/O:
    + Database has sophisticated buffer management
    + Want predictable memory usage
    + Large buffer pool relative to OS memory

    When to use Buffered I/O:
    + Simpler implementation
    + Benefit from OS read-ahead for sequential scans
    + Small datasets that fit in OS cache

3.7 Read-Ahead and Prefetching

For sequential access, reading ahead improves performance:

OS Read-Ahead

The OS detects sequential access patterns and prefetches upcoming pages:

    Application reads: Page 1
    OS notices sequential pattern
    OS prefetches: Pages 2, 3, 4, 5, 6, 7, 8, ...

    When application reads Page 2: Already in memory!

Database-Level Prefetching

Databases can be smarter because they know the query plan:

    SELECT * FROM large_table;  -- Full table scan

The database knows it will need all pages, so it can issue large sequential reads.

    -- PostgreSQL effective_io_concurrency
    SET effective_io_concurrency = 200;  -- For SSDs

    -- Tells planner to issue multiple concurrent I/O requests

Bitmap Heap Scans

When an index scan finds many rows across many pages:

    Index scan finds rows on pages: 5, 7, 12, 15, 23, 24, 25, 31, ...

    Naive approach: Read pages in that order (random I/O)

    Bitmap approach:
    1. Build bitmap of pages needed: {5, 7, 12, 15, 23, 24, 25, 31}
    2. Sort by page number
    3. Read in sorted order: 5, 7, 12, 15, 23, 24, 25, 31
    4. Less seeking, benefits from read-ahead
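Steps 1–3 amount to de-duplicating and sorting the page numbers before issuing any reads:

```python
def bitmap_order(page_hits):
    """Collapse index hits into a sorted, de-duplicated page read order."""
    return sorted(set(page_hits))

hits = [23, 5, 12, 7, 25, 24, 15, 31, 23]  # raw index-scan order (random)
print(bitmap_order(hits))  # [5, 7, 12, 15, 23, 24, 25, 31]
```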

3.8 Write Strategies

Writes have different requirements than reads:

Synchronous Writes (fsync)

For durability, writes must actually reach disk:

    write(fd, data, size);   // Write to OS buffer
    fsync(fd);               // Force to disk - SLOW!

The fsync ensures data survives power failure. Without it, data might be in OS buffers only.
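In Python the pattern is the same two calls (`durable_write` is an illustrative helper, not any real database API):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)   # may land only in OS buffers...
        os.fsync(fd)         # ...until fsync blocks for the disk - SLOW
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "wal")
    durable_write(p, b"commit record")
    with open(p, "rb") as f:
        assert f.read() == b"commit record"
```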

Write-Back Caching

Most writes go to buffer pool first, written to disk later:

    1. Modify page in buffer pool (fast)
    2. Mark page "dirty"
    3. Page stays in memory
    4. Background writer flushes dirty pages to disk
    5. Checkpoint ensures all changes are durable

Why delay writes?

  • Amortize I/O: Multiple changes to same page = one write
  • Batch writes: Write many pages in one I/O operation
  • Sequential writes: Background writer can sort pages
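The dirty-page mechanics can be sketched in a few lines (`BufferPool` here is a toy, not any real engine's structure):

```python
class BufferPool:
    """Toy write-back cache: mutate in memory, flush dirty pages later."""

    def __init__(self):
        self.pages = {}     # page_no -> bytearray
        self.dirty = set()  # pages modified since the last flush

    def modify(self, page_no, offset, data):
        page = self.pages.setdefault(page_no, bytearray(8192))
        page[offset:offset + len(data)] = data
        self.dirty.add(page_no)  # one flag, however many times we touch it

    def flush(self, write_fn):
        for page_no in sorted(self.dirty):  # sorted -> sequential-ish writes
            write_fn(page_no, self.pages[page_no])
        self.dirty.clear()

pool = BufferPool()
pool.modify(7, 0, b"a")
pool.modify(7, 1, b"b")  # second change to page 7: still just one flush
pool.modify(3, 0, b"c")
written = []
pool.flush(lambda n, p: written.append(n))
print(written)  # [3, 7] -> three modifications, two writes, in page order
```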

Write Ordering

Some writes must happen before others:

    WAL write MUST complete before data page write
    (Otherwise crash recovery breaks)

    Checkpoint MUST complete before WAL truncation
    (Otherwise we lose recovery data)

Databases carefully manage write ordering for correctness.


3.9 I/O Scheduling

Operating System I/O Schedulers

Linux I/O schedulers reorder requests for efficiency:

  • CFQ (Completely Fair Queuing): Balances between processes
  • Deadline: Prevents starvation, good for databases
  • NOOP: No reordering, best for SSDs (they have internal schedulers)
  • mq-deadline/none: Modern kernels with blk-mq

    # Check current scheduler
    cat /sys/block/sda/queue/scheduler

    # Set scheduler (example)
    echo deadline > /sys/block/sda/queue/scheduler

Database I/O Scheduling

Databases also prioritize I/O:

    Priority 1: Transaction commit (fsync WAL) - User waiting!
    Priority 2: Active query I/O - User waiting!
    Priority 3: Background writer - No one waiting
    Priority 4: Vacuum/maintenance - Can wait

3.10 Filesystem Considerations

Filesystem Choice

Different filesystems have different characteristics:

    Filesystem    Characteristics                         Database Use
    ──────────    ───────────────                         ────────────
    ext4          Mature, good all-around                 Common on Linux
    XFS           Better for large files, parallel I/O    Good for databases
    ZFS           Checksums, snapshots, compression       Data integrity focus
    btrfs         Copy-on-write, snapshots                Use with caution

Important Mount Options

    # Example for database filesystem
    /dev/sda1 /data xfs noatime,nodiratime,nobarrier 0 0

  • noatime: Don't update access time on read (reduces writes)
  • nobarrier: Disable write barriers (only if using battery-backed cache!)

File System Full

Databases need to handle disk full gracefully:

    1. Reserve some space for emergencies
    2. Monitor disk usage, alert before full
    3. Read-only mode when nearly full (allow queries, block writes)
    4. Graceful degradation, not crash

3.11 Page Size Trade-offs

The choice of page size affects performance:

Smaller Pages (4KB)

Advantages:

  • Less read amplification (read only what you need)
  • Better for random access patterns
  • Lower memory waste for small rows

Disadvantages:

  • More pages to manage (more metadata overhead)
  • More I/O operations for large scans
  • Higher B-tree depth (more tree levels)

Larger Pages (16KB, 32KB)

Advantages:

  • Fewer I/O operations for sequential scans
  • Lower B-tree depth (fewer tree levels)
  • Better compression ratios

Disadvantages:

  • More read amplification
  • Higher memory usage per cached page
  • More write amplification

    Page Size Guidelines:

    4KB:  Good for OLTP with small rows, random access
    8KB:  Good balance (PostgreSQL default)
    16KB: Good for larger rows, sequential scans (InnoDB default)
    32KB: Specialized analytics workloads
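The B-tree depth effect can be estimated with a toy model, assuming 100-byte index entries (real fanout depends on key sizes, page headers, and fill factor):

```python
import math

def btree_depth(num_rows, page_size, entry_size=100):
    """Approximate B-tree depth: fanout = entries that fit in one page."""
    fanout = page_size // entry_size
    return math.ceil(math.log(num_rows, fanout))

# Smaller pages -> lower fanout -> deeper tree, e.g. for 100M rows:
for size in (4096, 8192, 16384, 32768):
    print(size, btree_depth(100_000_000, size))  # depth shrinks as pages grow
```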

3.12 Practical Performance Implications

Symptoms of I/O Bottlenecks

    High disk wait time (iowait)
    Low cache hit ratio
    Slow queries despite good execution plans
    Performance varies with time of day (cache cold vs warm)

Monitoring I/O

    # Linux: iostat
    iostat -x 1
    # Watch: await (latency), %util (utilization)

    -- PostgreSQL: pg_stat_io (v16+)
    SELECT * FROM pg_stat_io;

    -- PostgreSQL: buffer cache hit ratio
    SELECT
        sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS cache_hit_ratio
    FROM pg_stat_database;

Quick Wins

  1. Add more RAM: Larger buffer pool = fewer disk reads
  2. Use SSDs: 100x better random I/O than HDDs
  3. Optimize queries: Read fewer pages in the first place
  4. Add indexes: Turn random scans into index lookups
  5. Tune checkpoints: Spread writes over time

3.13 Summary

Disk I/O is usually the database bottleneck:

  • HDDs are good at sequential I/O, terrible at random I/O
  • SSDs are much better at random I/O but still slower than RAM
  • Databases use page-based I/O to amortize access costs
  • Buffer pools cache pages in memory to avoid disk access
  • Free space maps track available space for inserts
  • Direct I/O vs buffered I/O is a trade-off
  • Write ordering is critical for correctness
  • Page size choice affects performance characteristics

Understanding I/O helps you diagnose performance problems and make informed decisions about hardware and configuration.


What’s Next

In Chapter 4, we’ll explore the data structures that make finding data fast—B-trees and their variants. You’ll see how index structures are designed to minimize the I/O we’ve discussed in this chapter.


“The fastest I/O is no I/O. The second fastest is sequential I/O. Random I/O is where dreams go to die.”