Chapter 2: Storage Engines and File Formats

“The way you store data determines everything about how you can access it.”

A storage engine is the component of a database that handles how data is stored on disk and retrieved. It’s the foundation upon which everything else is built. In this chapter, we’ll explore how storage engines organize data, the file formats they use, and the trade-offs they make.


2.1 What Is a Storage Engine?

A storage engine is responsible for:

  • Writing data to persistent storage (disk, SSD)
  • Reading data back when requested
  • Managing space as data grows and shrinks
  • Maintaining indexes for fast lookups
  • Ensuring durability through logging and recovery
  • Handling concurrency when multiple operations occur simultaneously

Different storage engines make different trade-offs. Some optimize for read-heavy workloads, others for writes. Some prioritize simplicity, others flexibility. Understanding these trade-offs helps you choose the right database for your needs.


2.2 The Fundamental Trade-off: Read vs Write Optimization

Every storage engine faces a fundamental trade-off:

Optimize for reads → Keep data sorted and organized, making lookups fast but updates expensive (need to maintain order)

Optimize for writes → Append new data quickly, defer organization for later (reads may be slower)

              THE READ-WRITE SPECTRUM

  ← Read Optimized              Write Optimized →

  ┌─────────────────────────────────────────────┐
  │                                             │
  │   B-tree          Hybrid          LSM-tree  │
  │   (PostgreSQL,    (some           (RocksDB, │
  │    MySQL InnoDB)   configurations) Cassandra)│
  │                                             │
  └─────────────────────────────────────────────┘

  - Fast reads         Mixed          - Fast writes
  - Slower writes      workloads      - Slower reads
  - More I/O on write                 - Background work
  - Data always sorted                - Eventual sorting

We’ll explore both approaches in detail in later chapters. For now, understand that this trade-off shapes every design decision.


2.3 Data Files and Page Structure

Most databases store data in files, organized as a sequence of fixed-size pages (also called blocks). Typical page sizes are 4KB, 8KB, or 16KB.

Why Fixed-Size Pages?

  1. Alignment with disk I/O: Disks read and write in fixed-size blocks (usually 512B or 4KB). Making the page size a multiple of the block size keeps page I/O aligned and avoids partial block transfers.

  2. Simpler memory management: The buffer pool can allocate a fixed number of equal-sized slots.

  3. Predictable addressing: Page N is at byte offset N * page_size in the file. No need for an index of page locations.

                    DATA FILE LAYOUT

    File offset:  0      8KB    16KB    24KB    32KB
                  │       │       │       │       │
                  ▼       ▼       ▼       ▼       ▼
    ┌─────────┬─────────┬─────────┬─────────┬─────────┐
    │ Page 0  │ Page 1  │ Page 2  │ Page 3  │ Page 4  │ ...
    │ (header)│ (data)  │ (data)  │ (data)  │ (index) │
    └─────────┴─────────┴─────────┴─────────┴─────────┘

    To read Page 3: seek to 24KB, read 8KB
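
You can check these numbers on a live PostgreSQL instance. A minimal sketch (the table name users is a placeholder for one of your own tables):

-- Page size is fixed when the cluster is built; 8192 bytes is the default
SHOW block_size;

-- Number of 8KB pages the table currently occupies on disk
SELECT pg_relation_size('users') / 8192 AS pages;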

Page Types

Different pages serve different purposes:

  • Data pages: Store actual table rows
  • Index pages: Store index entries (B-tree nodes)
  • Free space map pages: Track which pages have room for new data
  • Metadata pages: Store table definitions, statistics
  • Overflow pages: Handle data too large for one page

2.4 Heap Files: The Simplest Organization

The simplest way to store table data is as a heap file—pages containing rows in no particular order. New rows are inserted wherever there’s space.

                    HEAP FILE ORGANIZATION

    ┌──────────────────────────────────────────────────┐
    │ Page 0                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=47, name='Zoe', email='...'          │ │
    │ │ Row: id=3,  name='Carol', email='...'        │ │
    │ │ Row: id=91, name='Alice', email='...'        │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────────┐
    │ Page 1                                           │
    │ ┌──────────────────────────────────────────────┐ │
    │ │ Row: id=12, name='Bob', email='...'          │ │
    │ │ Row: id=7,  name='Eve', email='...'          │ │
    │ │ [free space]                                 │ │
    │ └──────────────────────────────────────────────┘ │
    └──────────────────────────────────────────────────┘

    Note: Rows are NOT sorted by id!

Advantages:

  • Simple to implement
  • Fast inserts (just find space and write)
  • No overhead of maintaining order

Disadvantages:

  • Finding a specific row requires scanning all pages (unless you have an index)
  • Deleted rows leave holes (fragmentation)

Who uses heap files?

  • PostgreSQL stores table data in heap files (indexes provide fast access)
  • MySQL InnoDB uses clustered indexes instead (data sorted by primary key)
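
The scan cost mentioned above is easy to see in PostgreSQL: without an index, a lookup on an unindexed column reads every heap page. A quick illustration (users and its email column are placeholders; the cost figures in your output will differ):

-- EXPLAIN shows a sequential scan over the whole heap file
EXPLAIN SELECT * FROM users WHERE email = 'a@b.com';
--  Seq Scan on users  (cost=... rows=... width=...)
--    Filter: (email = 'a@b.com'::text)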

2.5 Row Format: How Rows Are Stored in Pages

Within a page, how are individual rows laid out? Most databases use one of two approaches:

Slotted Pages

The most common format. Each page has:

  • A header with metadata
  • A slot array pointing to rows
  • Rows stored from the end of the page backward
  • Free space in the middle

                    SLOTTED PAGE FORMAT

    ┌────────────────────────────────────────────────────┐
    │ PAGE HEADER                                        │
    │   Page ID: 42                                      │
    │   Slot count: 3                                    │
    │   Free space start: 2048                           │
    │   Free space end: 5120                             │
    ├────────────────────────────────────────────────────┤
    │ SLOT ARRAY                                         │
    │   Slot 0: offset=7680, length=256                  │
    │   Slot 1: offset=7168, length=512                  │
    │   Slot 2: offset=5120, length=2048                 │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │           [FREE SPACE]                             │
    │                                                    │
    ├────────────────────────────────────────────────────┤
    │                                                    │
    │   Row 2 (at offset 5120, 2048 bytes)              │
    │   ┌────────────────────────────────────────────┐  │
    │   │ id=91 │ name='Alice' │ bio='Long text...' │  │
    │   └────────────────────────────────────────────┘  │
    │                                                    │
    │   Row 1 (at offset 7168, 512 bytes)               │
    │   ┌───────────────────────────────────┐           │
    │   │ id=3 │ name='Carol' │ bio='...'   │           │
    │   └───────────────────────────────────┘           │
    │                                                    │
    │   Row 0 (at offset 7680, 256 bytes)               │
    │   ┌─────────────────────────────┐                 │
    │   │ id=47 │ name='Zoe' │ bio='' │                 │
    │   └─────────────────────────────┘                 │
    │                                                    │
    └────────────────────────────────────────────────────┘

Why grow rows backward?

Free space stays contiguous. The slot array grows forward, rows grow backward, and they meet in the middle. No fragmentation until the page is full.

Why use a slot array?

  • Rows can be different sizes
  • Rows can be moved within the page (for compaction) without changing their external address—just update the slot
  • Indexes can reference rows as (page_id, slot_number), which remains stable
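
If you want to see a real slotted page, PostgreSQL ships a contrib extension, pageinspect, that exposes the page header and slot array directly. A rough sketch (users is a placeholder table; superuser privileges are required):

-- lower/upper in the page header are the free-space boundaries
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lower, upper, pagesize FROM page_header(get_raw_page('users', 0));

-- lp is the slot number; lp_off and lp_len are each row's offset and length
SELECT lp, lp_off, lp_len FROM heap_page_items(get_raw_page('users', 0));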

Identifying Rows: TIDs and ROWIDs

Databases need a way to reference specific rows. Common approaches:

  • PostgreSQL TID (Tuple ID): (page_number, slot_number) — e.g., (42, 3)
  • Oracle ROWID: Encodes data file, block, and slot
  • MySQL InnoDB: Uses the primary key as the row identifier

-- PostgreSQL: You can see the TID of each row
SELECT ctid, * FROM users;

   ctid   | id | name
----------+----+-------
 (0,1)    |  1 | Alice
 (0,2)    |  2 | Bob
 (1,1)    |  3 | Carol

2.6 Row Storage Formats

How is the actual row data encoded? Several formats exist:

N-ary Storage Model (NSM) — Row-Oriented

Store all columns of a row together. This is the traditional approach.

Row format: [header][col1][col2][col3][col4]...

┌────────────────────────────────────────────────────────┐
│ Row 1: │header│ id=1 │ name='Alice' │ email='a@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 2: │header│ id=2 │ name='Bob'   │ email='b@b.com' │
├────────────────────────────────────────────────────────┤
│ Row 3: │header│ id=3 │ name='Carol' │ email='c@b.com' │
└────────────────────────────────────────────────────────┘

Row header typically contains:

  • Row length
  • Null bitmap (which columns are NULL)
  • Transaction visibility info (for MVCC)

Handling Variable-Length Data

Fixed-length columns (INTEGER, CHAR(10)) are easy—always the same size. Variable-length columns (VARCHAR, TEXT) need special handling:

Approach 1: Length prefix

│ 5 │ A │ l │ i │ c │ e │   (5 bytes: "Alice")

Approach 2: Null-terminated

│ A │ l │ i │ c │ e │ \0 │  (terminated by null byte)

Approach 3: Offset array Store offsets to each variable-length column in the header:

[header: offset1=10, offset2=15][fixed cols][var1][var2]

Handling NULL Values

NULL values need representation without wasting space:

Null bitmap: A bit array in the row header. Bit N = 1 means column N is NULL.

Row header: null_bitmap = 0b00100000  (column 3 is NULL)

This is space-efficient—one bit per column regardless of column size.
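
You can observe these encoding details from SQL with pg_column_size(), which reports the stored size of a value in PostgreSQL (exact byte counts depend on version and alignment):

-- A fixed-width integer is always 4 bytes; a short text value is its
-- length plus a small varlena length header
SELECT pg_column_size(42::int)       AS int_bytes,
       pg_column_size('Alice'::text) AS text_bytes;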


2.7 Large Objects: TOAST and Overflow Pages

What if a value is too large to fit in a page? A single TEXT column might contain megabytes of data.

PostgreSQL TOAST (The Oversized-Attribute Storage Technique)

PostgreSQL automatically handles large values:

  1. Compression: Try to compress the value
  2. Out-of-line storage: If still too large, store in a separate TOAST table
  3. Chunking: Split very large values across multiple TOAST pages

                    TOAST ARCHITECTURE

    Main Table Page                    TOAST Table
    ┌─────────────────┐               ┌─────────────────┐
    │ Row: id=1,      │               │ Chunk 1 of 3    │
    │   name='Alice', │  ──────────►  │ (~2KB of data)  │
    │   bio=TOAST_PTR │               ├─────────────────┤
    └─────────────────┘               │ Chunk 2 of 3    │
                                      │ (~2KB of data)  │
    TOAST_PTR contains:               ├─────────────────┤
    - TOAST table OID                 │ Chunk 3 of 3    │
    - Chunk ID                        │ (remaining data)│
    - Original size                   └─────────────────┘

The main row stays small (just a pointer), so scans that don’t need the large column remain fast.
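
You can see TOAST in action from SQL. A small sketch (articles and its body column are placeholders for a table with large text values):

-- Every table with potentially-large columns gets an associated TOAST table
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class WHERE relname = 'articles';

-- pg_column_size() reports the stored (possibly compressed, out-of-line) size,
-- while octet_length() reports the full logical size of the value
SELECT pg_column_size(body) AS stored_bytes,
       octet_length(body)   AS logical_bytes
FROM articles LIMIT 5;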

Other Databases

  • MySQL InnoDB: Stores large values in overflow pages, keeps a prefix in-line
  • Oracle: LOB (Large Object) storage with separate segments
  • SQL Server: Stores large values in separate allocation units

2.8 File Organization Strategies

Beyond individual pages, how are entire files organized?

Single File vs Multiple Files

Single file (SQLite):

  • Simple to manage
  • Database = one file
  • Limits: harder to grow, can’t easily use multiple disks

Multiple files (PostgreSQL, MySQL):

  • Separate files for data, indexes, WAL
  • Each table might have its own file
  • Can spread across disks, easier growth

PostgreSQL File Layout

$PGDATA/
├── base/                    # Database files
│   ├── 1/                   # Database OID 1 (template1)
│   ├── 16384/              # Your database
│   │   ├── 16385           # Table file (OID 16385)
│   │   ├── 16385_fsm       # Free space map
│   │   ├── 16385_vm        # Visibility map
│   │   └── 16386           # Index file
├── global/                  # Shared tables
├── pg_wal/                  # Write-ahead log
│   ├── 000000010000000000000001
│   └── 000000010000000000000002
├── pg_xact/                 # Transaction status
└── postgresql.conf          # Configuration

Segment Files

Large tables are split into segments (typically 1GB each):

16385        # First 1GB
16385.1      # Second 1GB
16385.2      # Third 1GB
...

This avoids filesystem limitations on maximum file size and allows parallel I/O.
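
The mapping from a table to its files is visible from SQL. A sketch (big_table is a placeholder):

-- Path of the first segment, relative to the data directory (e.g. base/16384/16385)
SELECT pg_relation_filepath('big_table');

-- Total on-disk size; anything over 1GB means .1, .2, ... segment files exist
SELECT pg_relation_size('big_table') AS bytes;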


2.9 Pluggable Storage Engines

Some databases support multiple storage engines:

MySQL Storage Engines

-- Create table with specific engine
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
) ENGINE=InnoDB;

-- Check table's engine
SHOW TABLE STATUS LIKE 'users';

InnoDB (default):

  • ACID compliant
  • Row-level locking
  • MVCC for concurrency
  • Clustered indexes (data sorted by primary key)
  • Crash recovery via redo logs

MyISAM (legacy):

  • Table-level locking only
  • No transactions
  • Faster for read-heavy, single-user workloads
  • Full-text search (before InnoDB had it)

Memory:

  • Data in RAM only
  • Lost on restart
  • Useful for temporary tables
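
To see which engines your MySQL server actually supports, and which one is the default, you can ask the server directly:

-- Lists every available engine, whether it supports transactions, and the default
SHOW ENGINES;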

Embedded Storage Engines

Some applications embed a storage engine directly:

  • RocksDB: LSM-tree engine, used by MySQL (MyRocks), CockroachDB
  • LevelDB: Google’s LSM-tree implementation
  • WiredTiger: B-tree + LSM hybrid, used by MongoDB
  • SQLite: Complete embedded database

2.10 Comparing Storage Formats

  Feature              PostgreSQL           MySQL InnoDB          SQLite
  ────────────────────────────────────────────────────────────────────────────────
  Page size            8KB (configurable)   16KB (configurable)   4KB (configurable)
  Row format           Heap + slots         Clustered B-tree      B-tree
  Large objects        TOAST                Overflow pages        Overflow pages
  Files                Multiple             Multiple              Single
  Primary key storage  Separate index       Data sorted by PK     Separate

Clustered vs Non-Clustered Organization

Heap + Index (PostgreSQL):

  • Data stored in insertion order
  • Primary key is just another index
  • Index entries point to heap location
  • PRO: Updates don’t move data
  • CON: Extra lookup through index

Clustered Index (MySQL InnoDB):

  • Data stored sorted by primary key
  • Primary key lookup is direct
  • Secondary indexes point to primary key
  • PRO: Fast primary key lookups
  • CON: Large primary keys bloat secondary indexes

        HEAP + INDEX                    CLUSTERED INDEX
        (PostgreSQL)                    (MySQL InnoDB)

    Index on id          Heap          Data IS the index
    ┌─────────┐     ┌─────────┐        ┌─────────────────┐
    │ 1→(0,1) │     │ id=47   │        │ id=1, Alice,... │
    │ 2→(0,2) │     │ id=3    │        │ id=2, Bob,...   │
    │ 3→(1,1) │────►│ id=91   │        │ id=3, Carol,... │
    │ ...     │     │ ...     │        │ id=4, Dave,...  │
    └─────────┘     └─────────┘        └─────────────────┘
       │                                     │
       │ Two lookups                         │ One lookup
       └─────────────────────────────────────┘
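
The difference shows up in secondary-index lookups. A sketch in MySQL/InnoDB syntax (table and column names are illustrative): because a secondary index entry stores the primary key rather than a physical location, a query by email walks the email index, finds the id, then descends the clustered B-tree to fetch the row.

-- InnoDB: idx_email entries store (email, id); a lookup by email is
-- index(email) -> primary key -> clustered B-tree
CREATE TABLE users (
    id    BIGINT PRIMARY KEY,
    email VARCHAR(100),
    name  VARCHAR(100),
    KEY idx_email (email)
) ENGINE=InnoDB;

SELECT name FROM users WHERE email = 'a@b.com';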

2.11 Checksums and Data Integrity

Disks can silently corrupt data. Good storage engines detect this:

Page Checksums

Each page includes a checksum (CRC32 or similar). On read:

  1. Read page from disk
  2. Compute checksum of data
  3. Compare with stored checksum
  4. If mismatch → corruption detected!
┌────────────────────────────────────────────────────┐
│ Page Header                                        │
│   Checksum: 0xA1B2C3D4  ◄── Computed from content │
│   ...                                              │
├────────────────────────────────────────────────────┤
│ Page Content                                       │
│   ... data ...                                     │
└────────────────────────────────────────────────────┘

PostgreSQL Data Checksums

# Enable checksums when creating cluster
initdb --data-checksums -D /path/to/data

# Check if enabled
pg_controldata /path/to/data | grep checksum
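
From inside a running server (recent PostgreSQL releases), the same information is exposed as a read-only setting:

-- Reports 'on' if the cluster was initialized with --data-checksums
SHOW data_checksums;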

Torn Page Detection

What if a crash happens mid-write, leaving a page partially written?

Full page writes (PostgreSQL): After each checkpoint, the first modification to a page writes the entire page image to the WAL. During recovery, a torn page can be restored from that full image.

Double-write buffer (MySQL InnoDB): Pages are written to a special doublewrite area first, then to their final location. If a crash interrupts the final write, the page is recovered from the doublewrite buffer.
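
Both protections can be checked from SQL. A quick sketch (both settings are typically enabled by default):

-- PostgreSQL: full-page images in the WAL after each checkpoint
SHOW full_page_writes;

-- MySQL/InnoDB: the doublewrite buffer
SHOW VARIABLES LIKE 'innodb_doublewrite';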


2.12 Summary

Storage engines are the foundation of database systems:

  • Data is stored in fixed-size pages for efficient I/O
  • Heap files store rows in no particular order
  • Slotted pages allow variable-size rows with stable addressing
  • Row formats handle fixed and variable-length columns, NULLs
  • TOAST and overflow pages handle large values
  • Clustered indexes store data sorted by primary key
  • Checksums detect corruption

The choice of storage engine—and understanding its characteristics—is crucial for database performance.


What’s Next

In Chapter 3, we’ll zoom in on the physical layer: disk I/O, page management, and why understanding your storage hardware is crucial for database performance.


“Storage is a resource. Space is a cost. I/O is a bottleneck. Design accordingly.”