Chapter 2: Storage Engines and File Formats

“The way you store data determines everything about how you can access it.”

A storage engine is the component of a database that handles how data is stored on disk and retrieved. It’s the foundation upon which everything else is built. In this chapter, we’ll explore how storage engines organize data, the file formats they use, and the trade-offs they make.


2.1 What Is a Storage Engine?

A storage engine is responsible for:

  • Writing data to persistent storage (disk, SSD)
  • Reading data back when requested
  • Managing space as data grows and shrinks
  • Maintaining indexes for fast lookups
  • Ensuring durability through logging and recovery
  • Handling concurrency when multiple operations occur simultaneously

Different storage engines make different trade-offs. Some optimize for read-heavy workloads, others for writes. Some prioritize simplicity, others flexibility. Understanding these trade-offs helps you choose the right database for your needs.


2.2 The Fundamental Trade-off: Read vs Write Optimization

Every storage engine faces a fundamental trade-off:

Optimize for reads → Keep data sorted and organized, making lookups fast but updates expensive (need to maintain order)

Optimize for writes → Append new data quickly, defer organization for later (reads may be slower)

              THE READ-WRITE SPECTRUM

  ← Read Optimized              Write Optimized →

  ┌─────────────────────────────────────────────┐
  │                                             │
  │   B-tree          Hybrid          LSM-tree  │
  │   (PostgreSQL,    (some           (RocksDB, │
  │    MySQL InnoDB)   configurations) Cassandra)│
  │                                             │
  └─────────────────────────────────────────────┘

  - Fast reads         Mixed          - Fast writes
  - Slower writes      workloads      - Slower reads
  - More I/O on write                 - Background work
  - Data always sorted                - Eventual sorting

We’ll explore both approaches in detail in later chapters. For now, understand that this trade-off shapes every design decision.


2.3 Data Files and Page Structure

Most databases store data in files, organized as a sequence of fixed-size pages (also called blocks). Typical page sizes are 4KB, 8KB, or 16KB.

Why Fixed-Size Pages?

  1. Alignment with disk I/O: Disks read and write in fixed-size blocks (usually 512B or 4KB). Making the page size a multiple of the block size keeps page I/O aligned and avoids partial block transfers.

  2. Simpler memory management: The buffer pool can allocate a fixed number of equal-sized slots.

  3. Predictable addressing: Page N is at byte offset N * page_size in the file. No need for an index of page locations.

                    DATA FILE LAYOUT

    File offset:  0      8KB    16KB    24KB    32KB
                  │       │       │       │       │
                  ▼       ▼       ▼       ▼       ▼
    ┌─────────┬─────────┬─────────┬─────────┬─────────┐
    │ Page 0  │ Page 1  │ Page 2  │ Page 3  │ Page 4  │ ...
    │ (header)│ (data)  │ (data)  │ (data)  │ (index) │
    └─────────┴─────────┴─────────┴─────────┴─────────┘

    To read Page 3: seek to 24KB, read 8KB
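
You can check these numbers on a live PostgreSQL instance. A minimal sketch (the table name users is a placeholder for one of your own tables):

-- Page size is fixed when the cluster is built; 8192 bytes is the default
SHOW block_size;

-- Number of 8KB pages the table currently occupies on disk
SELECT pg_relation_size('users') / 8192 AS pages;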

Page Types

Different pages serve different purposes:

  • Data pages: Store actual table rows
  • Index pages: Store index entries (B-tree nodes)
  • Free space map pages: Track which pages have room for new data
  • Metadata pages: Store table definitions, statistics
  • Overflow pages: Handle data too large for one page

2.4 Heap Files: The Simplest Organization

The simplest way to store table data is as a heap file—pages containing rows in no particular order. New rows are inserted wherever there’s space.

                    HEAP FILE ORGANIZATION

    ┌──────────────────────────────────────────────────┐
    │ Page 0                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=47, name='Zoe', email='...'          │ │
    │ │ Row: id=3,  name='Carol', email='...'        │ │
    │ │ Row: id=91, name='Alice', email='...'        │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────────┐
    │ Page 1                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=12, name='Bob', email='...'          │ │
    │ │ Row: id=7,  name='Eve', email='...'          │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘

    Note: Rows are NOT sorted by id!

Advantages:

  • Simple to implement
  • Fast inserts (just find space and write)
  • No overhead of maintaining order

Disadvantages:

  • Finding a specific row requires scanning all pages (unless you have an index)
  • Deleted rows leave holes (fragmentation)

Who uses heap files?

  • PostgreSQL stores table data in heap files (indexes provide fast access)
  • MySQL InnoDB uses clustered indexes instead (data sorted by primary key)
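
The scan cost mentioned above is easy to see in PostgreSQL: without an index, a lookup on an unindexed column reads every heap page. A quick illustration (users and its email column are placeholders; the cost figures in your output will differ):

-- EXPLAIN shows a sequential scan over the whole heap file
EXPLAIN SELECT * FROM users WHERE email = 'a@b.com';
--  Seq Scan on users  (cost=... rows=... width=...)
--    Filter: (email = 'a@b.com'::text)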

2.5 Row Format: How Rows Are Stored in Pages

Within a page, how are individual rows laid out? Most databases use one of two approaches:

Slotted Pages

The most common format. Each page has:

  • A header with metadata
  • A slot array pointing to rows
  • Rows stored from the end of the page backward
  • Free space in the middle

                    SLOTTED PAGE FORMAT

    ┌────────────────────────────────────────────────────┐
    │ PAGE HEADER                                        │
    │   Page ID: 42                                      │
    │   Slot count: 3                                    │
    │   Free space start: 2048                           │
    │   Free space end: 5120                             │
    ├────────────────────────────────────────────────────┤
    │ SLOT ARRAY                                         │
    │   Slot 0: offset=7680, length=256                  │
    │   Slot 1: offset=7168, length=512                  │
    │   Slot 2: offset=5120, length=2048                 │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │           [FREE SPACE]                             │
    │                                                    │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │   Row 2 (at offset 5120, 2048 bytes)              │
    │   ┌────────────────────────────────────────────┐  │
    │   │ id=91 │ name='Alice' │ bio='Long text...' │  │
    │   └────────────────────────────────────────────┘  │
    │                                                    │
    │   Row 1 (at offset 7168, 512 bytes)               │
    │   ┌───────────────────────────────────┐           │
    │   │ id=3 │ name='Carol' │ bio='...'   │           │
    │   └───────────────────────────────────┘           │
    │                                                    │
    │   Row 0 (at offset 7680, 256 bytes)               │
    │   ┌─────────────────────────────┐                 │
    │   │ id=47 │ name='Zoe' │ bio='' │                 │
    │   └─────────────────────────────┘                 │
    │                                                    │
    └────────────────────────────────────────────────────┘

Why grow rows backward?

Free space stays contiguous. The slot array grows forward, rows grow backward, and they meet in the middle. No fragmentation until the page is full.

Why use a slot array?

  • Rows can be different sizes
  • Rows can be moved within the page (for compaction) without changing their external address—just update the slot
  • Indexes can reference rows as (page_id, slot_number), which remains stable
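
If you want to see a real slotted page, PostgreSQL ships a contrib extension, pageinspect, that exposes the page header and slot array directly. A rough sketch (users is a placeholder table; superuser privileges are required):

-- lower/upper in the page header are the free-space boundaries
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lower, upper, pagesize FROM page_header(get_raw_page('users', 0));

-- lp is the slot number; lp_off and lp_len are each row's offset and length
SELECT lp, lp_off, lp_len FROM heap_page_items(get_raw_page('users', 0));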

Identifying Rows: TIDs and ROWIDs

Databases need a way to reference specific rows. Common approaches:

  • PostgreSQL TID (Tuple ID): (page_number, slot_number) — e.g., (42, 3)
  • Oracle ROWID: Encodes data file, block, and slot
  • MySQL InnoDB: Uses the primary key as the row identifier

-- PostgreSQL: You can see the TID of each row
SELECT ctid, * FROM users;

   ctid   | id | name
----------+----+-------
 (0,1)    |  1 | Alice
 (0,2)    |  2 | Bob
 (1,1)    |  3 | Carol

2.6 Row Storage Formats

How is the actual row data encoded? Several formats exist:

N-ary Storage Model (NSM) — Row-Oriented

Store all columns of a row together. This is the traditional approach.

Row format: [header][col1][col2][col3][col4]...

┌────────────────────────────────────────────────────────┐
│ Row 1: │header│ id=1 │ name='Alice' │ email='a@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 2: │header│ id=2 │ name='Bob'   │ email='b@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 3: │header│ id=3 │ name='Carol' │ email='c@b.com' │
└────────────────────────────────────────────────────────┘

Row header typically contains:

  • Row length
  • Null bitmap (which columns are NULL)
  • Transaction visibility info (for MVCC)

Handling Variable-Length Data

Fixed-length columns (INTEGER, CHAR(10)) are easy—always the same size. Variable-length columns (VARCHAR, TEXT) need special handling:

Approach 1: Length prefix

│ 5 │ A │ l │ i │ c │ e │   (5 bytes: "Alice")

Approach 2: Null-terminated

│ A │ l │ i │ c │ e │ \0 │  (terminated by null byte)

Approach 3: Offset array Store offsets to each variable-length column in the header:

[header: offset1=10, offset2=15][fixed cols][var1][var2]

Handling NULL Values

NULL values need representation without wasting space:

Null bitmap: A bit array in the row header. Bit N = 1 means column N is NULL.

Row header: null_bitmap = 0b00100000  (column 3 is NULL)

This is space-efficient—one bit per column regardless of column size.
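
You can observe these encoding details from SQL with pg_column_size(), which reports the stored size of a value in PostgreSQL (exact byte counts depend on version and alignment):

-- A fixed-width integer is always 4 bytes; a short text value is its
-- length plus a small varlena length header
SELECT pg_column_size(42::int)       AS int_bytes,
       pg_column_size('Alice'::text) AS text_bytes;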


2.7 Large Objects: TOAST and Overflow Pages

What if a value is too large to fit in a page? A single TEXT column might contain megabytes of data.

PostgreSQL TOAST (The Oversized-Attribute Storage Technique)

PostgreSQL automatically handles large values:

  1. Compression: Try to compress the value
  2. Out-of-line storage: If still too large, store in a separate TOAST table
  3. Chunking: Split very large values across multiple TOAST pages

                    TOAST ARCHITECTURE

    Main Table Page                    TOAST Table
    ┌─────────────────┐               ┌─────────────────┐
    │ Row: id=1,      │               │ Chunk 1 of 3    │
    │   name='Alice', │  ──────────►  │ (~2KB of data)  │
    │   bio=TOAST_PTR │               ├─────────────────┤
    └─────────────────┘               │ Chunk 2 of 3    │
                                      │ (~2KB of data)  │
    TOAST_PTR contains:               ├─────────────────┤
    - TOAST table OID                 │ Chunk 3 of 3    │
    - Chunk ID                        │ (remaining data)│
    - Original size                   └─────────────────┘

The main row stays small (just a pointer), so scans that don’t need the large column remain fast.
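
You can see TOAST in action from SQL. A small sketch (articles and its body column are placeholders for a table with large text values):

-- Every table with potentially-large columns gets an associated TOAST table
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class WHERE relname = 'articles';

-- pg_column_size() reports the stored (possibly compressed, out-of-line) size,
-- while octet_length() reports the full logical size of the value
SELECT pg_column_size(body) AS stored_bytes,
       octet_length(body)   AS logical_bytes
FROM articles LIMIT 5;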

Other Databases

  • MySQL InnoDB: Stores large values in overflow pages, keeps a prefix in-line
  • Oracle: LOB (Large Object) storage with separate segments
  • SQL Server: Stores large values in separate allocation units

2.8 File Organization Strategies

Beyond individual pages, how are entire files organized?

Single File vs Multiple Files

Single file (SQLite):

  • Simple to manage
  • Database = one file
  • Limits: harder to grow, can’t easily use multiple disks

Multiple files (PostgreSQL, MySQL):

  • Separate files for data, indexes, WAL
  • Each table might have its own file
  • Can spread across disks, easier growth

PostgreSQL File Layout

$PGDATA/
├── base/                    # Database files
│   ├── 1/                   # Database OID 1 (template1)
│   ├── 16384/              # Your database
│   │   ├── 16385           # Table file (OID 16385)
│   │   ├── 16385_fsm       # Free space map
│   │   ├── 16385_vm        # Visibility map
│   │   └── 16386           # Index file
├── global/                  # Shared tables
├── pg_wal/                  # Write-ahead log
│   ├── 000000010000000000000001
│   └── 000000010000000000000002
├── pg_xact/                 # Transaction status
└── postgresql.conf          # Configuration

Segment Files

Large tables are split into segments (typically 1GB each):

16385        # First 1GB
16385.1      # Second 1GB
16385.2      # Third 1GB
...

This avoids filesystem limitations on maximum file size and allows parallel I/O.
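
The mapping from a table to its files is visible from SQL. A sketch (big_table is a placeholder):

-- Path of the first segment, relative to the data directory (e.g. base/16384/16385)
SELECT pg_relation_filepath('big_table');

-- Total on-disk size; anything over 1GB means .1, .2, ... segment files exist
SELECT pg_relation_size('big_table') AS bytes;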


2.9 Pluggable Storage Engines

Some databases support multiple storage engines:

MySQL Storage Engines

-- Create table with specific engine
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
) ENGINE=InnoDB;

-- Check table's engine
SHOW TABLE STATUS LIKE 'users';

InnoDB (default):

  • ACID compliant
  • Row-level locking
  • MVCC for concurrency
  • Clustered indexes (data sorted by primary key)
  • Crash recovery via redo logs

MyISAM (legacy):

  • Table-level locking only
  • No transactions
  • Faster for read-heavy, single-user workloads
  • Full-text search (before InnoDB had it)

Memory:

  • Data in RAM only
  • Lost on restart
  • Useful for temporary tables
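
To see which engines your MySQL server actually supports, and which one is the default, you can ask the server directly:

-- Lists every available engine, whether it supports transactions, and the default
SHOW ENGINES;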

Embedded Storage Engines

Some applications embed a storage engine directly:

  • RocksDB: LSM-tree engine, used by MySQL (MyRocks), CockroachDB
  • LevelDB: Google’s LSM-tree implementation
  • WiredTiger: B-tree + LSM hybrid, used by MongoDB
  • SQLite: Complete embedded database

2.10 Comparing Storage Formats

  Feature              PostgreSQL           MySQL InnoDB          SQLite
  ────────────────────────────────────────────────────────────────────────────────
  Page size            8KB (configurable)   16KB (configurable)   4KB (configurable)
  Row format           Heap + slots         Clustered B-tree      B-tree
  Large objects        TOAST                Overflow pages        Overflow pages
  Files                Multiple             Multiple              Single
  Primary key storage  Separate index       Data sorted by PK     Separate

Clustered vs Non-Clustered Organization

Heap + Index (PostgreSQL):

  • Data stored in insertion order
  • Primary key is just another index
  • Index entries point to heap location
  • PRO: Updates don’t move data
  • CON: Extra lookup through index

Clustered Index (MySQL InnoDB):

  • Data stored sorted by primary key
  • Primary key lookup is direct
  • Secondary indexes point to primary key
  • PRO: Fast primary key lookups
  • CON: Large primary keys bloat secondary indexes

        HEAP + INDEX                    CLUSTERED INDEX
        (PostgreSQL)                    (MySQL InnoDB)

    Index on id          Heap          Data IS the index
    ┌─────────┐     ┌─────────┐        ┌─────────────────┐
    │ 1→(0,1) │     │ id=47   │        │ id=1, Alice,... │
    │ 2→(0,2) │     │ id=3    │        │ id=2, Bob,...   │
    │ 3→(1,1) │────►│ id=91   │        │ id=3, Carol,... │
    │ ...     │     │ ...     │        │ id=4, Dave,...  │
    └─────────┘     └─────────┘        └─────────────────┘
       │                                     │
       │ Two lookups                         │ One lookup
       └─────────────────────────────────────┘
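
The difference shows up in secondary-index lookups. A sketch in MySQL/InnoDB syntax (table and column names are illustrative): because a secondary index entry stores the primary key rather than a physical location, a query by email walks the email index, finds the id, then descends the clustered B-tree to fetch the row.

-- InnoDB: idx_email entries store (email, id); a lookup by email is
-- index(email) -> primary key -> clustered B-tree
CREATE TABLE users (
    id    BIGINT PRIMARY KEY,
    email VARCHAR(100),
    name  VARCHAR(100),
    KEY idx_email (email)
) ENGINE=InnoDB;

SELECT name FROM users WHERE email = 'a@b.com';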

2.11 Checksums and Data Integrity

Disks can silently corrupt data. Good storage engines detect this:

Page Checksums

Each page includes a checksum (CRC32 or similar). On read:

  1. Read page from disk
  2. Compute checksum of data
  3. Compare with stored checksum
  4. If mismatch → corruption detected!
┌────────────────────────────────────────────────────┐
│ Page Header                                        │
│   Checksum: 0xA1B2C3D4  ◄── Computed from content │
│   ...                                              │
├────────────────────────────────────────────────────┤
│ Page Content                                       │
│   ... data ...                                     │
└────────────────────────────────────────────────────┘

PostgreSQL Data Checksums

# Enable checksums when creating cluster
initdb --data-checksums -D /path/to/data

# Check if enabled
pg_controldata /path/to/data | grep checksum
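
From inside a running server (recent PostgreSQL releases), the same information is exposed as a read-only setting:

-- Reports 'on' if the cluster was initialized with --data-checksums
SHOW data_checksums;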

Torn Page Detection

What if a crash happens mid-write, leaving a page partially written?

Full page writes (PostgreSQL): After each checkpoint, the first modification to a page writes the entire page image to the WAL. During recovery, a torn page can be restored from that full image.

Double-write buffer (MySQL InnoDB): Pages are written to a special doublewrite area first, then to their final location. If a crash interrupts the final write, the page is recovered from the doublewrite buffer.
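
Both protections can be checked from SQL. A quick sketch (both settings are typically enabled by default):

-- PostgreSQL: full-page images in the WAL after each checkpoint
SHOW full_page_writes;

-- MySQL/InnoDB: the doublewrite buffer
SHOW VARIABLES LIKE 'innodb_doublewrite';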


2.12 Summary

Storage engines are the foundation of database systems:

  • Data is stored in fixed-size pages for efficient I/O
  • Heap files store rows in no particular order
  • Slotted pages allow variable-size rows with stable addressing
  • Row formats handle fixed and variable-length columns, NULLs
  • TOAST and overflow pages handle large values
  • Clustered indexes store data sorted by primary key
  • Checksums detect corruption

The choice of storage engine—and understanding its characteristics—is crucial for database performance.


What’s Next

In Chapter 3, we’ll zoom in on the physical layer: disk I/O, page management, and why understanding your storage hardware is crucial for database performance.


“Storage is a resource. Space is a cost. I/O is a bottleneck. Design accordingly.”