Chapter 3: Disk I/O and Page Management
“Understanding your storage hardware is half the battle of database performance.”
Databases live at the intersection of two worlds: the fast world of CPU and memory, and the slow world of persistent storage. This chapter explores that boundary—how data moves between disk and memory, why I/O is usually the bottleneck, and how databases optimize for the physical realities of storage hardware.
3.1 The Memory-Storage Performance Gap
The performance gap between memory and storage is enormous:
ACCESS LATENCY COMPARISON
L1 Cache │▌ 1 ns
L2 Cache │▌▌ 4 ns
L3 Cache │▌▌▌▌ 10 ns
RAM │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌ 100 ns
NVMe SSD │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌ 20,000 ns (20 μs)
SATA SSD │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌ 100,000 ns (100 μs)
HDD (spinning) │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌...▌▌▌▌▌▌▌▌▌▌▌ 10,000,000 ns (10 ms)
└─────────────────────────────────────────────────►
Logarithmic scale!
Key observations:
- RAM is ~200x faster than NVMe SSD
- RAM is ~100,000x faster than spinning disk
- A single HDD seek (~10 ms) takes as long as roughly 100,000 RAM accesses
This is why databases do everything possible to avoid disk I/O.
3.2 Understanding Storage Hardware
Hard Disk Drives (HDDs)
HDDs store data magnetically on spinning platters:
HDD ANATOMY
Spindle (motor)
│
┌────────────┴────────────┐
│ │
┌───────▼───────────────────────▼───────┐
│ ╭─────────────────────────────────╮ │
│ ╭│─────────────────────────────────│╮ │
│╭─│─────────────────────────────────│─╮│
││ │ PLATTER │ ││
│╰─│─────────────────────────────────│─╯│
│ ╰│─────────────────────────────────│╯ │
│ ╰─────────────────────────────────╯ │
└───────────────────────────────────────┘
▲
│
Read/Write Head
(moves in/out)
Access time components:
- Seek time: Move head to correct track (~5-10ms average)
- Rotational latency: Wait for sector to rotate under head (~4ms for 7200 RPM)
- Transfer time: Read the data (fast once positioned)
Key insight: Sequential reads are MUCH faster than random reads. Once the head is positioned, reading consecutive sectors is nearly free.
Sequential read: 150+ MB/s
Random read (4KB): ~100 IOPS = 0.4 MB/s
That's a 375x difference!
Solid State Drives (SSDs)
SSDs use flash memory—no moving parts:
SSD ARCHITECTURE
┌───────────────────────────────────────────────┐
│ CONTROLLER │
│ (manages wear leveling, garbage collection) │
├───────┬───────┬───────┬───────┬───────┬──────┤
│ NAND │ NAND │ NAND │ NAND │ NAND │ NAND │
│ Chip │ Chip │ Chip │ Chip │ Chip │ Chip │
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │
└───────┴───────┴───────┴───────┴───────┴──────┘
SSD characteristics:
- No seek time—any location accessed equally fast
- Parallel access across chips
- Write amplification (must erase blocks before writing)
- Wear leveling (spreads writes to extend lifespan)
Performance:
- Random read: 10,000-500,000+ IOPS
- Sequential read: 500-7000+ MB/s (NVMe)
- Random writes slower than reads (erase-before-write)
NVMe vs SATA
SATA SSDs: Use the same interface as HDDs. Limited by SATA bandwidth (~550 MB/s).
NVMe SSDs: Connect directly to PCIe. Much higher bandwidth (3000-7000+ MB/s) and lower latency.
Interface Comparison:
SATA SSD: CPU ──► SATA Controller ──► SSD
Latency: ~100 μs
Bandwidth: 550 MB/s
NVMe SSD: CPU ══════════════════════► SSD
Latency: ~20 μs
Bandwidth: 3000+ MB/s
3.3 I/O Patterns and Database Design
Database design is heavily influenced by I/O patterns:
Sequential vs Random I/O
SEQUENTIAL I/O RANDOM I/O
Read pages: 1, 2, 3, 4, 5, 6 Read pages: 47, 3, 891, 12, 456
┌──┬──┬──┬──┬──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┬──┬──┬──┐
│▓▓│▓▓│▓▓│▓▓│▓▓│▓▓│ │ │ │ │ │▓▓│ │ │ │ │ │ ...
└──┴──┴──┴──┴──┴──┴──┴──┘ └──┴──┴──┴──┴──┴──┴──┴──┘
──────────────────► ↑ ↑ ↑
One seek, continuous read Multiple seeks required
HDD: ~150 MB/s HDD: ~0.4 MB/s (100 IOPS × 4KB)
SSD: ~3000 MB/s SSD: ~400 MB/s (100K IOPS × 4KB)
Design implications:
- Log-structured storage (WAL, LSM trees) uses sequential writes
- B-trees minimize random I/O through high fanout
- Table scans (sequential) can be faster than index scans (random) for large result sets
Read Amplification vs Write Amplification
Read amplification: Reading more data than needed to answer a query
- Example: Reading entire 8KB page to get one 100-byte row
Write amplification: Writing more data than the logical change
- Example: Changing one byte requires rewriting entire 4KB flash block
- Example: B-tree page splits propagate up the tree
Different storage engines trade off between these:
| Structure | Read amplification | Write amplification |
|---|---|---|
| B-tree | Low | Medium |
| LSM-tree | High | High |
| Hash index | Very low | Low |
3.4 Page Management Fundamentals
The Page as Unit of I/O
Databases read and write in pages, not bytes:
Application wants: 100-byte row
Database reads: 8,192-byte page (8KB)
Why? Because disk I/O works in blocks.
Reading 100 bytes costs the same as reading 4KB (or even 8KB).
Page Addressing
Pages are identified by their offset in the file:
Page Number → File Offset
Page 0 → Offset 0
Page 1 → Offset 8192
Page 2 → Offset 16384
Page N → Offset N × 8192
To read page 42:
1. Seek to offset 42 × 8192 = 344,064
2. Read 8192 bytes
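In code, "read page N" is a single positioned read at a computed offset. A minimal sketch in C, assuming 8KB pages and a POSIX file descriptor (the function name is illustrative, not any particular database's API):

#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 8192  /* 8KB pages, as in the examples above */

/* Read page `page_no` from file `fd` into `buf` (at least PAGE_SIZE bytes).
   Returns 0 on success, -1 on a short read or error. */
int read_page(int fd, uint64_t page_no, void *buf) {
    off_t offset = (off_t)page_no * PAGE_SIZE;   /* page 42 -> offset 344,064 */
    ssize_t n = pread(fd, buf, PAGE_SIZE, offset);
    return (n == PAGE_SIZE) ? 0 : -1;
}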
Free Space Management
How does the database know which pages have room for new data?
Free Space Map (FSM): A bitmap or array tracking free space per page.
FREE SPACE MAP
Page Free Space Category
──── ────────── ────────
0 8% Nearly full
1 45% Half full
2 0% Full
3 92% Mostly empty
4 100% Empty
...
INSERT needs 200 bytes → Check FSM → Use Page 3 or 4
PostgreSQL uses a hierarchical FSM:
- Top level: Coarse-grained (ranges of pages)
- Bottom level: Fine-grained (individual pages)
This lets the database quickly find a page with enough free space without scanning the whole map.
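To make the idea concrete, here is a toy single-level FSM in C: one byte per page, recording free space in 32-byte categories. PostgreSQL's real FSM stores similar per-page categories, but in the hierarchical layout described above; the structure and names here are invented for illustration.

#include <stdint.h>

/* One byte per page: free space in 32-byte units (0 = full, 255 = empty). */
typedef struct {
    uint8_t  *category;   /* category[i] ~= free bytes on page i / 32 */
    uint64_t  n_pages;
} FreeSpaceMap;

/* Find a page with at least `needed` bytes free; -1 if none.
   A flat scan is O(n); the hierarchical FSM exists to avoid exactly this.
   Real systems may also prefer emptier pages to reduce future churn. */
int64_t fsm_find(const FreeSpaceMap *fsm, uint32_t needed) {
    uint32_t cat = (needed + 31) / 32;      /* 200 bytes -> category 7 */
    if (cat > 255) cat = 255;
    for (uint64_t i = 0; i < fsm->n_pages; i++)
        if (fsm->category[i] >= cat)
            return (int64_t)i;
    return -1;                              /* no room anywhere: extend file */
}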
3.5 Allocating and Extending Files
File Extension
When a table grows, the database must allocate more space:
Approach 1: Extend on demand
Need new page → Extend file by one page → Use it
Simple but causes many small extends.
Approach 2: Extend in chunks
Need new page → Extend file by 1MB (128 pages) → Use first one
Fewer, larger extends. Better sequential layout.
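A chunked extension routine might look like the sketch below; posix_fallocate asks the filesystem to reserve the whole chunk up front, which encourages contiguous on-disk layout. The 1MB chunk size is the figure from the example above, not a universal constant.

#include <fcntl.h>
#include <sys/stat.h>

#define PAGE_SIZE   8192
#define CHUNK_BYTES (128 * PAGE_SIZE)   /* extend 1MB (128 pages) at a time */

/* Make sure the file can hold at least `n_pages + 1` pages,
   growing it a whole chunk at a time. Returns 0 on success. */
int ensure_page_capacity(int fd, long n_pages) {
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    off_t needed = (off_t)(n_pages + 1) * PAGE_SIZE;
    if (st.st_size >= needed) return 0;              /* already big enough */
    /* Round up to the next chunk boundary, then reserve in one call. */
    off_t new_size = ((needed + CHUNK_BYTES - 1) / CHUNK_BYTES) * CHUNK_BYTES;
    return posix_fallocate(fd, 0, new_size) == 0 ? 0 : -1;
}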
Segment Files
Large tables split into multiple files:
users # First 1GB
users.1 # Second 1GB
users.2 # Third 1GB
Page 150,000 is in users.1 (which holds pages 131,072 through 262,143) at offset:
(150,000 - 131,072) × 8192 = 18,928 × 8192 = 155,058,176
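With 8KB pages, each 1GB segment holds 131,072 pages, so mapping a global page number to a (segment, in-file offset) pair is two lines of arithmetic. A sketch:

#include <stdint.h>
#include <sys/types.h>

#define PAGE_SIZE         8192
#define PAGES_PER_SEGMENT 131072   /* 1GB / 8KB */

typedef struct { uint32_t segment; off_t offset; } PageLocation;

PageLocation locate_page(uint64_t page_no) {
    PageLocation loc;
    loc.segment = (uint32_t)(page_no / PAGES_PER_SEGMENT);  /* 150,000 -> 1 */
    loc.offset  = (off_t)(page_no % PAGES_PER_SEGMENT) * PAGE_SIZE;
    return loc;                    /* 150,000 -> offset 155,058,176 */
}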
Benefits:
- Avoids filesystem maximum file size limits
- Allows parallel I/O to different segments
- Easier backup of individual segments
3.6 Direct I/O vs Buffered I/O
Operating systems cache file data in the page cache. Databases have a choice:
Buffered I/O (using OS page cache)
Application ──► Database Buffer Pool ──► OS Page Cache ──► Disk
Read path:
1. Check database buffer pool
2. If miss, read from OS (might be in OS cache)
3. If miss, read from disk
4. Data cached in BOTH locations
Problem: Double buffering wastes memory. Data may be cached twice.
Direct I/O (O_DIRECT, bypassing OS cache)
Application ──► Database Buffer Pool ──────────────────► Disk
Read path:
1. Check database buffer pool
2. If miss, read directly from disk
3. Data cached only in database buffer pool
PostgreSQL: Uses buffered I/O by default. The OS provides read-ahead and write-behind.
MySQL InnoDB: Typically runs with O_DIRECT (the default on supported platforms in recent versions). Takes full control of caching.
SQLite: Uses buffered I/O. Relies on OS for caching.
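Direct I/O also imposes mechanical constraints: on Linux, O_DIRECT requires the buffer, file offset, and transfer length to be aligned, typically to the logical block size. A minimal sketch, assuming a hypothetical data file table.dat:

#define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SIZE 8192

int main(void) {
    int fd = open("table.dat", O_RDONLY | O_DIRECT);  /* bypass OS cache */
    if (fd < 0) return 1;

    void *buf;
    /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes. */
    if (posix_memalign(&buf, 4096, PAGE_SIZE) != 0) return 1;

    ssize_t n = pread(fd, buf, PAGE_SIZE, 0);  /* reads straight from disk */

    free(buf);
    close(fd);
    return (n == PAGE_SIZE) ? 0 : 1;
}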
When to use Direct I/O:
+ Database has sophisticated buffer management
+ Want predictable memory usage
+ Large buffer pool relative to OS memory
When to use Buffered I/O:
+ Simpler implementation
+ Benefit from OS read-ahead for sequential scans
+ Small datasets that fit in OS cache
3.7 Read-Ahead and Prefetching
For sequential access, reading ahead improves performance:
OS Read-Ahead
The OS detects sequential access patterns and prefetches upcoming pages:
Application reads: Page 1
OS notices sequential pattern
OS prefetches: Pages 2, 3, 4, 5, 6, 7, 8, ...
When application reads Page 2: Already in memory!
Database-Level Prefetching
Databases can be smarter because they know the query plan:
SELECT * FROM large_table; -- Full table scan
The database knows it will need all pages, so it can issue large sequential reads.
-- PostgreSQL effective_io_concurrency
SET effective_io_concurrency = 200; -- For SSDs
-- Lets the executor issue multiple concurrent I/O requests
-- (notably during bitmap heap scans)
Bitmap Heap Scans
When an index scan finds many rows across many pages:
Index scan finds rows on pages: 5, 7, 12, 15, 23, 24, 25, 31, ...
Naive approach: Read pages in that order (random I/O)
Bitmap approach:
1. Build bitmap of pages needed: {5, 7, 12, 15, 23, 24, 25, 31}
2. Sort by page number
3. Read in sorted order: 5, 7, 12, 15, 23, 24, 25, 31
4. Less seeking, benefits from read-ahead
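A sketch of that idea in C: sort the page numbers, hint the kernel with posix_fadvise (which only helps under buffered I/O), then read in ascending order. The names here are illustrative:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static int cmp_page(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Visit pages found by an index scan in sorted order, with prefetch hints. */
void bitmap_heap_read(int fd, uint64_t *pages, size_t n, void *buf) {
    qsort(pages, n, sizeof(uint64_t), cmp_page);  /* 47,3,891 -> 3,47,891 */
    for (size_t i = 0; i < n; i++)                /* tell the OS what's coming */
        posix_fadvise(fd, (off_t)pages[i] * PAGE_SIZE, PAGE_SIZE,
                      POSIX_FADV_WILLNEED);
    for (size_t i = 0; i < n; i++)                /* error handling omitted */
        pread(fd, buf, PAGE_SIZE, (off_t)pages[i] * PAGE_SIZE);
}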
3.8 Write Strategies
Writes have different requirements than reads:
Synchronous Writes (fsync)
For durability, writes must actually reach disk:
write(fd, data, size); // Write to OS buffer
fsync(fd); // Force to disk - SLOW!
The fsync ensures data survives power failure. Without it, data might be in OS buffers only.
Write-Back Caching
Most writes go to buffer pool first, written to disk later:
1. Modify page in buffer pool (fast)
2. Mark page "dirty"
3. Page stays in memory
4. Background writer flushes dirty pages to disk
5. Checkpoint ensures all changes are durable
Why delay writes?
- Amortize I/O: Multiple changes to same page = one write
- Batch writes: Write many pages in one I/O operation
- Sequential writes: Background writer can sort pages
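Reduced to a toy loop, the write-back steps above look like this. The BufferPool structure and field names are invented for illustration; a real background writer throttles itself and coordinates with checkpoints and the WAL.

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 8192

typedef struct {
    uint64_t page_no;
    bool     dirty;               /* modified in memory, not yet on disk */
    char     data[PAGE_SIZE];
} Buffer;

typedef struct {
    Buffer *buffers;
    size_t  n;
    int     fd;
} BufferPool;

/* One background-writer pass: write dirty pages back, clear the flag. */
void bgwriter_pass(BufferPool *bp) {
    for (size_t i = 0; i < bp->n; i++) {
        Buffer *b = &bp->buffers[i];
        if (!b->dirty) continue;
        pwrite(bp->fd, b->data, PAGE_SIZE, (off_t)b->page_no * PAGE_SIZE);
        b->dirty = false;         /* durable only after a later fsync */
    }
}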
Write Ordering
Some writes must happen before others:
WAL write MUST complete before data page write
(Otherwise crash recovery breaks)
Checkpoint MUST complete before WAL truncation
(Otherwise we lose recovery data)
Databases carefully manage write ordering for correctness.
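In code, the WAL-before-data rule comes down to an fsync between the two writes. A hedged sketch (function and parameter names are illustrative):

#include <unistd.h>

/* The WAL record describing a page change must be durable BEFORE the
   modified page itself is written. Returns 0 on success. */
int flush_page_wal_first(int wal_fd, const void *wal_rec, size_t wal_len,
                         int data_fd, const void *page, size_t page_len,
                         off_t page_off) {
    if (write(wal_fd, wal_rec, wal_len) != (ssize_t)wal_len) return -1;
    if (fsync(wal_fd) != 0) return -1;      /* WAL is durable from here on */
    if (pwrite(data_fd, page, page_len, page_off) != (ssize_t)page_len)
        return -1;                          /* now the page may follow */
    return 0;
}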
3.9 I/O Scheduling
Operating System I/O Schedulers
Linux I/O schedulers reorder requests for efficiency:
- CFQ (Completely Fair Queuing): balances I/O between processes
- Deadline: prevents starvation; good for databases
- NOOP: no reordering; best for SSDs (they have internal schedulers)
- mq-deadline/none: their equivalents on modern blk-mq kernels
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set scheduler (example)
echo deadline > /sys/block/sda/queue/scheduler
Database I/O Scheduling
Databases also prioritize I/O:
Priority 1: Transaction commit (fsync WAL) - User waiting!
Priority 2: Active query I/O - User waiting!
Priority 3: Background writer - No one waiting
Priority 4: Vacuum/maintenance - Can wait
3.10 Filesystem Considerations
Filesystem Choice
Different filesystems have different characteristics:
| Filesystem | Characteristics | Database Use |
|---|---|---|
| ext4 | Mature, good all-around | Common on Linux |
| XFS | Better for large files, parallel I/O | Good for databases |
| ZFS | Checksums, snapshots, compression | Data integrity focus |
| btrfs | Copy-on-write, snapshots | Use with caution |
Important Mount Options
# Example for database filesystem
/dev/sda1 /data xfs noatime,nodiratime,nobarrier 0 0
- noatime: Don't update access time on reads (reduces write traffic; also implies nodiratime)
- nobarrier: Disable write barriers (only if using a battery-backed cache! Note that recent XFS kernels have removed this option entirely)
File System Full
Databases need to handle disk full gracefully:
1. Reserve some space for emergencies
2. Monitor disk usage, alert before full (see the sketch after this list)
3. Read-only mode when nearly full (allow queries, block writes)
4. Graceful degradation, not crash
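The monitoring piece (item 2) can be as small as a statvfs call; the thresholds here are illustrative:

#include <stdbool.h>
#include <sys/statvfs.h>

/* Fraction of the filesystem at `path` still available (0.0-1.0), or -1.0
   on error. f_bavail excludes root-reserved blocks, so the root reserve
   doubles as the emergency space from item 1. */
double free_fraction(const char *path) {
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) return -1.0;
    return (double)vfs.f_bavail / (double)vfs.f_blocks;
}

/* Policy sketch: refuse new writes below 2% free, keep reads working. */
bool writes_allowed(const char *path) {
    double f = free_fraction(path);
    return f < 0.0 ? false : f > 0.02;
}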
3.11 Page Size Trade-offs
The choice of page size affects performance:
Smaller Pages (4KB)
Advantages:
- Less read amplification (read only what you need)
- Better for random access patterns
- Lower memory waste for small rows
Disadvantages:
- More pages to manage (more metadata overhead)
- More I/O operations for large scans
- Higher B-tree depth (more tree levels)
Larger Pages (16KB, 32KB)
Advantages:
- Fewer I/O operations for sequential scans
- Lower B-tree depth (fewer tree levels)
- Better compression ratios
Disadvantages:
- More read amplification
- Higher memory usage per cached page
- More write amplification
Page Size Guidelines:
4KB: Good for OLTP with small rows, random access
8KB: Good balance (PostgreSQL default)
16KB: Good for larger rows, sequential scans (InnoDB default)
32KB: Specialized analytics workloads
3.12 Practical Performance Implications
Symptoms of I/O Bottlenecks
- High disk wait time (iowait)
- Low cache hit ratio
- Slow queries despite good execution plans
- Performance varies with time of day (cache cold vs warm)
Monitoring I/O
# Linux: iostat
iostat -x 1
# Watch: await (latency), %util (utilization)
-- PostgreSQL: pg_stat_io (v16+)
SELECT * FROM pg_stat_io;

-- PostgreSQL: buffer cache hit ratio
SELECT
  sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) as cache_hit_ratio
FROM pg_stat_database;
Quick Wins
- Add more RAM: Larger buffer pool = fewer disk reads
- Use SSDs: 100x better random I/O than HDDs
- Optimize queries: Read fewer pages in the first place
- Add indexes: Turn random scans into index lookups
- Tune checkpoints: Spread writes over time
3.13 Summary
Disk I/O is usually the database bottleneck:
- HDDs are good at sequential I/O, terrible at random I/O
- SSDs are much better at random I/O but still slower than RAM
- Databases use page-based I/O to amortize access costs
- Buffer pools cache pages in memory to avoid disk access
- Free space maps track available space for inserts
- Direct I/O vs buffered I/O is a trade-off: control and predictable memory use versus free OS read-ahead and caching
- Write ordering is critical for correctness
- Page size choice affects performance characteristics
Understanding I/O helps you diagnose performance problems and make informed decisions about hardware and configuration.
What’s Next
In Chapter 4, we’ll explore the data structures that make finding data fast—B-trees and their variants. You’ll see how index structures are designed to minimize the I/O we’ve discussed in this chapter.
“The fastest I/O is no I/O. The second fastest is sequential I/O. Random I/O is where dreams go to die.”