Chapter 2: Storage Engines and File Formats
“The way you store data determines everything about how you can access it.”
A storage engine is the component of a database that handles how data is stored on disk and retrieved. It’s the foundation upon which everything else is built. In this chapter, we’ll explore how storage engines organize data, the file formats they use, and the trade-offs they make.
2.1 What Is a Storage Engine?
A storage engine is responsible for:
- Writing data to persistent storage (disk, SSD)
- Reading data back when requested
- Managing space as data grows and shrinks
- Maintaining indexes for fast lookups
- Ensuring durability through logging and recovery
- Handling concurrency when multiple operations occur simultaneously
Different storage engines make different trade-offs. Some optimize for read-heavy workloads, others for writes. Some prioritize simplicity, others flexibility. Understanding these trade-offs helps you choose the right database for your needs.
2.2 The Fundamental Trade-off: Read vs Write Optimization
Every storage engine faces a fundamental trade-off:
Optimize for reads → Keep data sorted and organized, making lookups fast but updates expensive (need to maintain order)
Optimize for writes → Append new data quickly, defer organization for later (reads may be slower)
THE READ-WRITE SPECTRUM
← Read Optimized Write Optimized →
┌─────────────────────────────────────────────┐
│ │
│ B-tree Hybrid LSM-tree │
│ (PostgreSQL, (some (RocksDB, │
│ MySQL InnoDB) configurations) Cassandra)│
│ │
└─────────────────────────────────────────────┘
- Fast reads Mixed - Fast writes
- Slower writes workloads - Slower reads
- More I/O on write - Background work
- Data always sorted - Eventual sorting
We’ll explore both approaches in detail in later chapters. For now, understand that this trade-off shapes every design decision.
2.3 Data Files and Page Structure
Most databases store data in files, organized as a sequence of fixed-size pages (also called blocks). Typical page sizes are 4KB, 8KB, or 16KB.
Why Fixed-Size Pages?
-
Alignment with disk I/O: Disks read in fixed-size blocks (usually 512B or 4KB). Matching page size to disk block size avoids partial reads.
-
Simpler memory management: The buffer pool can allocate a fixed number of equal-sized slots.
-
Predictable addressing: Page N is at byte offset
N * page_sizein the file. No need for an index of page locations.
DATA FILE LAYOUT
File offset: 0 8KB 16KB 24KB 32KB
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │ ...
│ (header)│ (data) │ (data) │ (data) │ (index) │
└─────────┴─────────┴─────────┴─────────┴─────────┘
To read Page 3: seek to 24KB, read 8KB
Page Types
Different pages serve different purposes:
- Data pages: Store actual table rows
- Index pages: Store index entries (B-tree nodes)
- Free space map pages: Track which pages have room for new data
- Metadata pages: Store table definitions, statistics
- Overflow pages: Handle data too large for one page
2.4 Heap Files: The Simplest Organization
The simplest way to store table data is as a heap file—pages containing rows in no particular order. New rows are inserted wherever there’s space.
HEAP FILE ORGANIZATION
┌──────────────────────────────────────────────────┐
│ Page 0 │
│ ┌──────────────────────────────────────────────┐ │
│ │ Row: id=47, name='Zoe', email='...' │ │
│ │ Row: id=3, name='Carol', email='...' │ │
│ │ Row: id=91, name='Alice', email='...' │ │
│ │ [free space] │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ Page 1 │
│ ┌──────────────────────────────────────────────┐ │
│ │ Row: id=12, name='Bob', email='...' │ │
│ │ Row: id=7, name='Eve', email='...' │ │
│ │ [free space] │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Note: Rows are NOT sorted by id!
Advantages:
- Simple to implement
- Fast inserts (just find space and write)
- No overhead of maintaining order
Disadvantages:
- Finding a specific row requires scanning all pages (unless you have an index)
- Deleted rows leave holes (fragmentation)
Who uses heap files?
- PostgreSQL stores table data in heap files (indexes provide fast access)
- MySQL InnoDB uses clustered indexes instead (data sorted by primary key)
2.5 Row Format: How Rows Are Stored in Pages
Within a page, how are individual rows laid out? Most databases use one of two approaches:
Slotted Pages
The most common format. Each page has:
- A header with metadata
- A slot array pointing to rows
- Rows stored from the end of the page backward
- Free space in the middle
SLOTTED PAGE FORMAT
┌────────────────────────────────────────────────────┐
│ PAGE HEADER │
│ Page ID: 42 │
│ Slot count: 3 │
│ Free space start: 2048 │
│ Free space end: 5120 │
├────────────────────────────────────────────────────┤
│ SLOT ARRAY │
│ Slot 0: offset=7680, length=256 │
│ Slot 1: offset=7168, length=512 │
│ Slot 2: offset=5120, length=2048 │
├────────────────────────────────────────────────────┤
│ │
│ [FREE SPACE] │
│ │
├────────────────────────────────────────────────────┤
│ │
│ Row 2 (at offset 5120, 2048 bytes) │
│ ┌────────────────────────────────────────────┐ │
│ │ id=91 │ name='Alice' │ bio='Long text...' │ │
│ └────────────────────────────────────────────┘ │
│ │
│ Row 1 (at offset 7168, 512 bytes) │
│ ┌───────────────────────────────────┐ │
│ │ id=3 │ name='Carol' │ bio='...' │ │
│ └───────────────────────────────────┘ │
│ │
│ Row 0 (at offset 7680, 256 bytes) │
│ ┌─────────────────────────────┐ │
│ │ id=47 │ name='Zoe' │ bio='' │ │
│ └─────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────┘
Why grow rows backward?
Free space stays contiguous. The slot array grows forward, rows grow backward, and they meet in the middle. No fragmentation until the page is full.
Why use a slot array?
- Rows can be different sizes
- Rows can be moved within the page (for compaction) without changing their external address—just update the slot
- Indexes can reference rows as (page_id, slot_number), which remains stable
Identifying Rows: TIDs and ROWIDs
Databases need a way to reference specific rows. Common approaches:
- PostgreSQL TID (Tuple ID):
(page_number, slot_number)— e.g.,(42, 3) - Oracle ROWID: Encodes data file, block, and slot
- MySQL InnoDB: Uses the primary key as the row identifier
-- PostgreSQL: You can see the TID of each row
SELECT ctid, * FROM users;
ctid | id | name
----------+----+-------
(0,1) | 1 | Alice
(0,2) | 2 | Bob
(1,1) | 3 | Carol
2.6 Row Storage Formats
How is the actual row data encoded? Several formats exist:
N-ary Storage Model (NSM) — Row-Oriented
Store all columns of a row together. This is the traditional approach.
Row format: [header][col1][col2][col3][col4]...
┌────────────────────────────────────────────────────────┐
│ Row 1: │header│ id=1 │ name='Alice' │ email='a@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 2: │header│ id=2 │ name='Bob' │ email='b@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 3: │header│ id=3 │ name='Carol' │ email='c@b.com' │
└────────────────────────────────────────────────────────┘
Row header typically contains:
- Row length
- Null bitmap (which columns are NULL)
- Transaction visibility info (for MVCC)
Handling Variable-Length Data
Fixed-length columns (INTEGER, CHAR(10)) are easy—always the same size. Variable-length columns (VARCHAR, TEXT) need special handling:
Approach 1: Length prefix
│ 5 │ A │ l │ i │ c │ e │ (5 bytes: "Alice")
Approach 2: Null-terminated
│ A │ l │ i │ c │ e │ \0 │ (terminated by null byte)
Approach 3: Offset array Store offsets to each variable-length column in the header:
[header: offset1=10, offset2=15][fixed cols][var1][var2]
Handling NULL Values
NULL values need representation without wasting space:
Null bitmap: A bit array in the row header. Bit N = 1 means column N is NULL.
Row header: null_bitmap = 0b00100000 (column 3 is NULL)
This is space-efficient—one bit per column regardless of column size.
2.7 Large Objects: TOAST and Overflow Pages
What if a value is too large to fit in a page? A single TEXT column might contain megabytes of data.
PostgreSQL TOAST (The Oversized-Attribute Storage Technique)
PostgreSQL automatically handles large values:
- Compression: Try to compress the value
- Out-of-line storage: If still too large, store in a separate TOAST table
- Chunking: Split very large values across multiple TOAST pages
TOAST ARCHITECTURE
Main Table Page TOAST Table
┌─────────────────┐ ┌─────────────────┐
│ Row: id=1, │ │ Chunk 1 of 3 │
│ name='Alice', │ ──────────► │ (32KB of data) │
│ bio=TOAST_PTR │ ├─────────────────┤
└─────────────────┘ │ Chunk 2 of 3 │
│ (32KB of data) │
TOAST_PTR contains: ├─────────────────┤
- TOAST table OID │ Chunk 3 of 3 │
- Chunk ID │ (15KB of data) │
- Original size └─────────────────┘
The main row stays small (just a pointer), so scans that don’t need the large column remain fast.
Other Databases
- MySQL InnoDB: Stores large values in overflow pages, keeps a prefix in-line
- Oracle: LOB (Large Object) storage with separate segments
- SQL Server: Stores large values in separate allocation units
2.8 File Organization Strategies
Beyond individual pages, how are entire files organized?
Single File vs Multiple Files
Single file (SQLite):
- Simple to manage
- Database = one file
- Limits: harder to grow, can’t easily use multiple disks
Multiple files (PostgreSQL, MySQL):
- Separate files for data, indexes, WAL
- Each table might have its own file
- Can spread across disks, easier growth
PostgreSQL File Layout
$PGDATA/
├── base/ # Database files
│ ├── 1/ # Database OID 1 (template1)
│ ├── 16384/ # Your database
│ │ ├── 16385 # Table file (OID 16385)
│ │ ├── 16385_fsm # Free space map
│ │ ├── 16385_vm # Visibility map
│ │ └── 16386 # Index file
├── global/ # Shared tables
├── pg_wal/ # Write-ahead log
│ ├── 000000010000000000000001
│ └── 000000010000000000000002
├── pg_xact/ # Transaction status
└── postgresql.conf # Configuration
Segment Files
Large tables are split into segments (typically 1GB each):
16385 # First 1GB
16385.1 # Second 1GB
16385.2 # Third 1GB
...
This avoids filesystem limitations on maximum file size and allows parallel I/O.
2.9 Pluggable Storage Engines
Some databases support multiple storage engines:
MySQL Storage Engines
-- Create table with specific engine
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100)
) ENGINE=InnoDB;
-- Check table's engine
SHOW TABLE STATUS LIKE 'users';
InnoDB (default):
- ACID compliant
- Row-level locking
- MVCC for concurrency
- Clustered indexes (data sorted by primary key)
- Crash recovery via redo logs
MyISAM (legacy):
- Table-level locking only
- No transactions
- Faster for read-heavy, single-user workloads
- Full-text search (before InnoDB had it)
Memory:
- Data in RAM only
- Lost on restart
- Useful for temporary tables
Embedded Storage Engines
Some applications embed a storage engine directly:
- RocksDB: LSM-tree engine, used by MySQL (MyRocks), CockroachDB
- LevelDB: Google’s LSM-tree implementation
- WiredTiger: B-tree + LSM hybrid, used by MongoDB
- SQLite: Complete embedded database
2.10 Comparing Storage Formats
| Feature | PostgreSQL | MySQL InnoDB | SQLite |
|---|---|---|---|
| Page size | 8KB (configurable) | 16KB (configurable) | 4KB (configurable) |
| Row format | Heap + slots | Clustered B-tree | B-tree |
| Large objects | TOAST | Overflow pages | Overflow pages |
| Files | Multiple | Multiple | Single |
| Primary key storage | Separate index | Data sorted by PK | Separate |
Clustered vs Non-Clustered Organization
Heap + Index (PostgreSQL):
- Data stored in insertion order
- Primary key is just another index
- Index entries point to heap location
- PRO: Updates don’t move data
- CON: Extra lookup through index
Clustered Index (MySQL InnoDB):
- Data stored sorted by primary key
- Primary key lookup is direct
- Secondary indexes point to primary key
- PRO: Fast primary key lookups
- CON: Large primary keys bloat secondary indexes
HEAP + INDEX CLUSTERED INDEX
(PostgreSQL) (MySQL InnoDB)
Index on id Heap Data IS the index
┌─────────┐ ┌─────────┐ ┌─────────────────┐
│ 1 → (0,1)│ │ id=47 │ │ id=1, Alice,... │
│ 2 → (0,2)│ │ id=3 │ │ id=2, Bob,... │
│ 3 → (1,1)│────►│ id=91 │ │ id=3, Carol,... │
│ ... │ │ ... │ │ id=4, Dave,... │
└─────────┘ └─────────┘ └─────────────────┘
│ │
│ Two lookups │ One lookup
└─────────────────────────────────────┘
2.11 Checksums and Data Integrity
Disks can silently corrupt data. Good storage engines detect this:
Page Checksums
Each page includes a checksum (CRC32 or similar). On read:
- Read page from disk
- Compute checksum of data
- Compare with stored checksum
- If mismatch → corruption detected!
┌────────────────────────────────────────────────────┐
│ Page Header │
│ Checksum: 0xA1B2C3D4 ◄── Computed from content │
│ ... │
├────────────────────────────────────────────────────┤
│ Page Content │
│ ... data ... │
└────────────────────────────────────────────────────┘
PostgreSQL Data Checksums
# Enable checksums when creating cluster
initdb --data-checksums -D /path/to/data
# Check if enabled
pg_controldata /path/to/data | grep checksum
Torn Page Detection
What if a crash happens mid-write, leaving a page partially written?
Full page writes: After a checkpoint, write entire page to WAL before first modification. Allows recovery of torn pages.
Double-write buffer (MySQL InnoDB): Write pages to a special buffer first, then to final location. If crash during final write, recover from double-write buffer.
2.12 Summary
Storage engines are the foundation of database systems:
- Data is stored in fixed-size pages for efficient I/O
- Heap files store rows in no particular order
- Slotted pages allow variable-size rows with stable addressing
- Row formats handle fixed and variable-length columns, NULLs
- TOAST and overflow pages handle large values
- Clustered indexes store data sorted by primary key
- Checksums detect corruption
The choice of storage engine—and understanding its characteristics—is crucial for database performance.
What’s Next
In Chapter 3, we’ll zoom in on the physical layer: disk I/O, page management, and why understanding your storage hardware is crucial for database performance.
“Storage is a resource. Space is a cost. I/O is a bottleneck. Design accordingly.”