Chapter 7: Write-Ahead Logging (WAL)

“First, log it. Then, do it. Always in that order.”

What happens when your database server crashes mid-transaction? Without careful engineering, you could lose committed data or end up with corrupted state. Write-Ahead Logging (WAL) is the fundamental technique that makes databases durable and recoverable.


7.1 The Durability Problem

Consider this scenario:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Without WAL, the database might:

  1. Modify Account 1’s page in memory
  2. CRASH before modifying Account 2
  3. After restart: Account 1 lost $100, Account 2 gained nothing!

Money has vanished. The transaction was supposed to be atomic.

                    THE DURABILITY PROBLEM

    Transaction commits
           │
           ▼
    ┌────────────────────────────────────────────────────┐
    │ Buffer Pool (Memory)                                │
    │ ┌────────────┐  ┌────────────┐                     │
    │ │ Account 1  │  │ Account 2  │  ← Changes are here │
    │ │ $900       │  │ $1100      │                     │
    │ │ (dirty)    │  │ (dirty)    │                     │
    │ └────────────┘  └────────────┘                     │
    └────────────────────────────────────────────────────┘
                    │
                    │ Not yet written to disk!
                    ▼
              [POWER FAILURE]
                    │
                    ▼
    ┌────────────────────────────────────────────────────┐
    │ Disk (Persistent Storage)                           │
    │ ┌────────────┐  ┌────────────┐                     │
    │ │ Account 1  │  │ Account 2  │  ← Old values!      │
    │ │ $1000      │  │ $1000      │                     │
    │ └────────────┘  └────────────┘                     │
    └────────────────────────────────────────────────────┘

    Data loss!

7.2 The WAL Protocol

Write-Ahead Logging solves this with a simple rule:

Before any change is written to a data page on disk, the log record describing that change must first be written to the WAL on disk.

                    THE WAL PROTOCOL

    1. FIRST: Write to WAL (log the intent)
    ┌──────────────────────────────────────────────────────────┐
    │ WAL (on disk, append-only)                               │
    │ [LSN 1001: UPDATE accounts SET balance=900 WHERE id=1]   │
    │ [LSN 1002: UPDATE accounts SET balance=1100 WHERE id=2]  │
    │ [LSN 1003: COMMIT Transaction 42]                        │
    └──────────────────────────────────────────────────────────┘
           │
           │ fsync() - Force to disk!
           ▼

    2. THEN: Return success to client
           │
           ▼

    3. LATER: Write data pages to disk (background)
    ┌──────────────────────────────────────────────────────┐
    │ Data Files (on disk)                                  │
    │ Account 1: $900                                       │
    │ Account 2: $1100                                      │
    └──────────────────────────────────────────────────────┘

Key insight: WAL writes are sequential and append-only—fast even on spinning disks. Data page writes can be deferred and batched.
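
To make the protocol concrete, here is a minimal Python sketch of the write-ahead rule. The WriteAheadLog class and the record format are illustrative inventions, not any particular engine's API:

import os

class WriteAheadLog:
    """Append-only log file: a toy sketch of the WAL rule."""

    def __init__(self, path: str):
        self.f = open(path, "ab")

    def append(self, record: bytes) -> None:
        self.f.write(record + b"\n")

    def sync(self) -> None:
        self.f.flush()             # push Python's buffer to the OS
        os.fsync(self.f.fileno())  # force the OS page cache to disk

def commit(wal: WriteAheadLog, changes: list[bytes]) -> None:
    # 1. FIRST: log the intent, then the commit record...
    for change in changes:
        wal.append(change)
    wal.append(b"COMMIT")
    wal.sync()  # ...and make it all durable with a single fsync
    # 2. THEN: success may be returned to the client.
    # 3. LATER: dirty data pages are flushed in the background.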


7.3 WAL Record Structure

Each WAL record contains enough information to redo (or undo) a change:

                    WAL RECORD FORMAT

    ┌─────────────────────────────────────────────────────────────┐
    │ LSN (Log Sequence Number): Unique identifier for this record│
    ├─────────────────────────────────────────────────────────────┤
    │ Transaction ID: Which transaction made this change          │
    ├─────────────────────────────────────────────────────────────┤
    │ Previous LSN: For chaining records in same transaction      │
    ├─────────────────────────────────────────────────────────────┤
    │ Page ID: Which page is affected                             │
    ├─────────────────────────────────────────────────────────────┤
    │ Record Type: INSERT, UPDATE, DELETE, COMMIT, ABORT, etc.   │
    ├─────────────────────────────────────────────────────────────┤
    │ Before Image: Original data (for UNDO)                      │
    ├─────────────────────────────────────────────────────────────┤
    │ After Image: New data (for REDO)                            │
    ├─────────────────────────────────────────────────────────────┤
    │ Checksum: Data integrity verification                       │
    └─────────────────────────────────────────────────────────────┘
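
One hypothetical way to encode this format in Python; the field widths and the CRC32 checksum are illustrative choices, not a real engine's on-disk layout:

import struct
import zlib
from dataclasses import dataclass

@dataclass
class WALRecord:
    lsn: int           # Log Sequence Number
    txn_id: int        # owning transaction
    prev_lsn: int      # previous record of the same transaction
    page_id: int       # affected page
    record_type: int   # e.g. 1=INSERT, 2=UPDATE, 3=DELETE, 4=COMMIT
    before: bytes      # before image (for UNDO)
    after: bytes       # after image (for REDO)

    HEADER = struct.Struct("<QQQQBII")  # fixed fields + payload lengths

    def serialize(self) -> bytes:
        header = self.HEADER.pack(
            self.lsn, self.txn_id, self.prev_lsn, self.page_id,
            self.record_type, len(self.before), len(self.after))
        body = header + self.before + self.after
        return body + struct.pack("<I", zlib.crc32(body))  # checksum last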

LSN (Log Sequence Number)

Every WAL record has a unique, monotonically increasing LSN:

    LSN 1001: BEGIN Transaction 42
    LSN 1002: UPDATE on page 7
    LSN 1003: UPDATE on page 12
    LSN 1004: COMMIT Transaction 42
    LSN 1005: BEGIN Transaction 43
    ...

Pages track which LSN they were last modified at:

    Page 7 header: page_lsn = 1002
    Page 12 header: page_lsn = 1003

This enables recovery to know which changes have been applied.
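
This page_lsn comparison is also what makes redo idempotent during recovery. A sketch, reusing the illustrative record shape above (the page object is assumed to expose data and page_lsn):

def apply_change(page, after_image: bytes) -> None:
    page.data = after_image        # simplistic: replace the page body

def redo(page, record) -> None:
    """Reapply a change only if the page does not already contain it."""
    if page.page_lsn >= record.lsn:
        return                     # page already reflects this LSN: skip
    apply_change(page, record.after)
    page.page_lsn = record.lsn     # record how far this page has been redone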


7.4 WAL and Checkpoints

Writing every change to WAL is great for durability, but:

  1. WAL files grow forever
  2. Recovery would replay from the beginning of time

Checkpoints solve this:

                    CHECKPOINT PROCESS

    Before checkpoint:
    ┌─────────────────────────────────────────────────────────────┐
    │ Buffer Pool: dirty pages scattered across time              │
    │ [Page 3: LSN 950] [Page 7: LSN 1002] [Page 12: LSN 1003]   │
    └─────────────────────────────────────────────────────────────┘

    Checkpoint (LSN 1100):
    1. Write all dirty pages to disk
    2. Write checkpoint record to WAL
    3. Record: "All changes up to LSN 1100 are on disk"

    After checkpoint:
    - WAL records before LSN 1100 can be deleted
    - Recovery only needs to replay from LSN 1100

    ┌──────────────────────────────────────────────────────────────┐
    │ WAL:                                                         │
    │ [OLD - can delete] │ CHECKPOINT │ [NEEDED for recovery]      │
    │  LSN < 1100        │  LSN 1100  │  LSN > 1100                │
    └──────────────────────────────────────────────────────────────┘
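
In code, a simple "sharp" checkpoint is just three steps. This sketch assumes buffer_pool and wal objects with the obvious methods; real systems such as PostgreSQL spread the page writes over time instead of flushing everything at once:

def checkpoint(buffer_pool, wal) -> None:
    checkpoint_lsn = wal.current_lsn()
    # 1. Write all dirty pages to disk
    for page in buffer_pool.dirty_pages():
        buffer_pool.flush(page)
    # 2. Durably log the checkpoint itself
    wal.append(f"CHECKPOINT {checkpoint_lsn}".encode())
    wal.sync()
    # 3. Everything before checkpoint_lsn is now redundant
    wal.delete_segments_before(checkpoint_lsn)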

Checkpoint Trade-offs

Frequent checkpoints:

  • Faster recovery (less WAL to replay)
  • More I/O during normal operation
  • Can cause latency spikes

Infrequent checkpoints:

  • Longer recovery time
  • Less normal I/O
  • Risk of running out of WAL space

# PostgreSQL checkpoint configuration
checkpoint_timeout = 5min           # Max time between checkpoints
checkpoint_completion_target = 0.9  # Spread writes over 90% of the interval
max_wal_size = 1GB                  # Trigger a checkpoint if WAL reaches this size

7.5 ARIES: The Industry Standard

Most modern databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM in the early 1990s. ARIES rests on three key principles:

1. Write-Ahead Logging

Log before data. We’ve covered this.

2. Repeating History During Redo

On recovery, replay ALL changes (even for aborted transactions) to restore the exact pre-crash state, then undo incomplete transactions.

3. Logging Changes During Undo

When rolling back a transaction, log the undo operations too. This ensures idempotent recovery.

                    ARIES RECOVERY PHASES

    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 1: ANALYSIS                                            │
    │   - Scan WAL from last checkpoint                            │
    │   - Build dirty page table (pages that might need redo)      │
    │   - Build active transaction table (incomplete transactions) │
    └──────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 2: REDO                                                │
    │   - Scan forward from oldest dirty page LSN                  │
    │   - Reapply all changes (committed or not)                   │
    │   - Restore database to exact pre-crash state                │
    └──────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 3: UNDO                                                │
    │   - Roll back uncommitted transactions                       │
    │   - Process in reverse LSN order                             │
    │   - Log compensation records (CLRs)                          │
    └──────────────────────────────────────────────────────────────┘

Compensation Log Records (CLRs)

When undoing a change, we log the undo:

    Original: LSN 1002 - UPDATE balance from 1000 to 900

    Undo: LSN 1050 - CLR: UPDATE balance from 900 to 1000
                    (undoes LSN 1002)

    If crash during undo:
    - Redo phase reapplies CLR 1050
    - Undo phase sees CLR, skips LSN 1002
    - No infinite undo loop!
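
Putting the three phases and CLRs together, here is a compressed, illustrative sketch of ARIES-style recovery. The record fields follow the hypothetical WALRecord above; the type constants (INSERT, UPDATE, CLR, ...) and helpers (scan_from, fetch, load_page, write_clr_and_apply) are assumed:

def aries_recover(log, checkpoint_lsn):
    # PHASE 1: ANALYSIS - rebuild crash-time state from the log
    dirty_pages = {}   # page_id -> first LSN that dirtied it (recLSN)
    active_txns = {}   # txn_id  -> last LSN written by that transaction
    for rec in log.scan_from(checkpoint_lsn):
        active_txns[rec.txn_id] = rec.lsn
        if rec.record_type in (INSERT, UPDATE, DELETE):
            dirty_pages.setdefault(rec.page_id, rec.lsn)
        elif rec.record_type in (COMMIT, ABORT):
            active_txns.pop(rec.txn_id, None)

    # PHASE 2: REDO - repeat history, committed or not
    redo_start = min(dirty_pages.values(), default=checkpoint_lsn)
    for rec in log.scan_from(redo_start):
        if rec.page_id in dirty_pages:
            redo(load_page(rec.page_id), rec)  # page_lsn check keeps this idempotent

    # PHASE 3: UNDO - roll back losers in reverse LSN order, logging CLRs
    for txn_id, last_lsn in active_txns.items():
        lsn = last_lsn
        while lsn is not None:
            rec = log.fetch(lsn)
            if rec.record_type == CLR:
                lsn = rec.undo_next       # skip what this CLR already undid
            else:
                write_clr_and_apply(rec)  # log compensation, restore before image
                lsn = rec.prev_lsn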

7.6 Physical vs Logical Logging

Physical Logging

Log exact bytes changed:

    Page 7, offset 1024, old bytes: [00 00 03 E8], new bytes: [00 00 03 84]

Pros: Simple, exact replay
Cons: Large logs, tied to physical layout

Logical Logging

Log the operation:

    UPDATE accounts SET balance = 900 WHERE id = 1

Pros: Smaller logs, replication-friendly
Cons: Must handle schema changes, non-determinism

Physiological Logging (Most Common)

Hybrid: Physical page reference, logical within page:

    Page 7: UPDATE row in slot 3, set column 2 to 900

This is what PostgreSQL and most databases use:

  • Page-level addressing (physical)
  • Row-level operations (logical)
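
The difference is easiest to see as record payloads. A purely illustrative comparison, using the examples above:

# Physical: exact bytes at an exact location
physical = {
    "page": 7, "offset": 1024,
    "old": bytes([0x00, 0x00, 0x03, 0xE8]),   # 1000
    "new": bytes([0x00, 0x00, 0x03, 0x84]),   # 900
}

# Logical: the operation, independent of storage layout
logical = {"sql": "UPDATE accounts SET balance = 900 WHERE id = 1"}

# Physiological: physical page address, logical change within the page
physiological = {"page": 7, "slot": 3, "set": {"balance": 900}}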

7.7 Full Page Writes

What if a crash happens mid-page-write, leaving a torn page?

    8KB page write in progress:
    ┌────────────────────────────────────────────────────┐
    │ First 4KB written (new data)                       │
    │────────────────────────────────────────────────────│
    │ Last 4KB NOT written (old data)                    │ ← Crash here!
    └────────────────────────────────────────────────────┘

    Page is now corrupted: Half old, half new

Solution: Full Page Writes (FPW)

After each checkpoint, the first modification to any page writes the entire page to WAL:

    Checkpoint at LSN 1000

    First modification to Page 7 after checkpoint:
    WAL record LSN 1050:
    ┌─────────────────────────────────────────────────────────────┐
    │ Type: FULL PAGE IMAGE                                        │
    │ Page: 7                                                      │
    │ Content: [entire 8KB page content]                           │
    │ Plus: The actual change                                      │
    └─────────────────────────────────────────────────────────────┘

    Subsequent modifications to Page 7 before next checkpoint:
    Only log the delta (normal small records)

On recovery, if a page is found torn, it is restored from the full page image in the WAL.

Trade-off: Full page writes increase WAL volume significantly but ensure recoverability.
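
As a sketch, the decision lives in the write path: the first touch of a page after a checkpoint logs the whole page, and later touches log only deltas (full_page_image_record and delta_record are hypothetical helpers):

def log_update(wal, page, change, last_checkpoint_lsn) -> None:
    if page.page_lsn < last_checkpoint_lsn:
        # First modification since the checkpoint: torn-page protection
        wal.append(full_page_image_record(page))  # entire 8KB page image
    wal.append(delta_record(page, change))        # the actual small change
    wal.sync()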


7.8 WAL Archiving and PITR

WAL enables Point-in-Time Recovery (PITR):

                    POINT-IN-TIME RECOVERY

    ┌─────────────────────────────────────────────────────────────┐
    │ Full Backup (Sunday midnight)                                │
    │ [Snapshot of all data files]                                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Apply WAL...
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ WAL Segments (Sunday → Today)                                │
    │ [Mon] → [Tue] → [Wed] → [Thu] → [Fri] → [Sat]               │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Stop at specific point
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ "Restore database as of Thursday 3:47:23 PM"                 │
    └─────────────────────────────────────────────────────────────┘

Setting Up WAL Archiving (PostgreSQL)

# postgresql.conf
wal_level = replica          # or 'logical'
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

Performing PITR

# 1. Restore the base backup (a file-level copy, e.g. taken with
#    pg_basebackup) into the data directory
tar -xf /backup/base_backup.tar -C "$PGDATA"

# 2. Configure recovery
# postgresql.conf (v12+; older versions use recovery.conf)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 15:47:23'

# On v12+, also create recovery.signal in the data directory:
touch "$PGDATA/recovery.signal"

# 3. Start the server - it will replay WAL up to the target time

7.9 WAL and Replication

WAL is the foundation of database replication:

                    STREAMING REPLICATION

    Primary Server                      Standby Server
    ┌───────────────────┐              ┌───────────────────┐
    │ Transactions      │              │                   │
    │      │            │              │                   │
    │      ▼            │              │                   │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │   WAL   │───────│─ Stream ────│►│   WAL   │       │
    │ └─────────┘       │   WAL        │ └─────────┘       │
    │      │            │              │      │            │
    │      ▼            │              │      ▼            │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │  Data   │       │              │ │  Data   │       │
    │ └─────────┘       │              │ └─────────┘       │
    └───────────────────┘              └───────────────────┘

    Standby applies WAL records to stay in sync

Synchronous vs Asynchronous Replication

Asynchronous (default):

  • Primary doesn’t wait for standby
  • Possible data loss on failover
  • Better performance

Synchronous:

  • Primary waits for standby acknowledgment
  • No data loss on failover
  • Higher latency

# PostgreSQL synchronous replication
# postgresql.conf on the primary:
synchronous_standby_names = 'standby1'

# Commit waits for standby confirmation
synchronous_commit = remote_apply  # or: on, remote_write

7.10 WAL Performance Considerations

WAL Write Latency

Every commit requires:

  1. Write WAL record(s) to buffer
  2. fsync() WAL to disk
  3. Return to client

The fsync() is the bottleneck:

    HDD: fsync ≈ 5-10ms  → ~100-200 commits/second
    SSD: fsync ≈ 0.1-1ms → ~1,000-10,000 commits/second
    NVMe: fsync ≈ 20-50μs → ~20,000-50,000 commits/second

Group Commit

Batch multiple transactions’ fsync into one:

    Without group commit:
    T1: write → fsync (5ms)
    T2: write → fsync (5ms)
    T3: write → fsync (5ms)
    Total: 15ms for 3 transactions

    With group commit:
    T1: write → wait
    T2: write → wait
    T3: write → wait
    All: single fsync (5ms)
    Total: 5ms for 3 transactions

# PostgreSQL group commit tuning
commit_delay = 10        # Microseconds to wait so more commits can batch
commit_siblings = 5      # Only delay if 5+ other transactions are active
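
Here is a runnable Python sketch of the leader/follower pattern behind group commit; it illustrates the idea rather than how PostgreSQL's WAL writer is actually structured:

import os
import threading

class GroupCommitLog:
    """Concurrent committers share one fsync (group commit)."""

    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.write_pos = 0     # bytes appended so far
        self.flush_pos = 0     # bytes known to be durable
        self.flushing = False  # is some thread currently flush leader?

    def commit(self, record: bytes) -> None:
        with self.lock:
            self.f.write(record + b"\n")
            self.write_pos += len(record) + 1
            my_pos = self.write_pos
            while self.flush_pos < my_pos:
                if self.flushing:
                    self.flushed.wait()    # follower: leader fsyncs for us
                    continue
                self.flushing = True       # become the leader
                target = self.write_pos    # cover everyone queued so far
                self.lock.release()
                try:
                    self.f.flush()
                    os.fsync(self.f.fileno())  # one fsync for the group
                finally:
                    self.lock.acquire()
                    self.flushing = False
                    self.flush_pos = max(self.flush_pos, target)
                    self.flushed.notify_all()

With several threads committing concurrently, the leader's single fsync covers all of them: exactly the 15ms versus 5ms difference in the timing example above.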

WAL Compression

Compress WAL to reduce I/O:

# PostgreSQL 15+
wal_compression = lz4    # or: pglz, zstd, on (equivalent to pglz)

This often cuts WAL volume roughly in half at a small CPU cost; the exact savings depend on the workload, since full page images compress best.


7.11 WAL Segment Management

WAL is stored in fixed-size segment files:

    PostgreSQL WAL directory:
    pg_wal/
    ├── 000000010000000000000001   (16MB segment)
    ├── 000000010000000000000002   (16MB segment)
    ├── 000000010000000000000003   (16MB segment)
    └── ...

    Filename encodes: timeline + segment number
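
The segment a given LSN falls into can be computed directly. This Python sketch mirrors PostgreSQL's naming convention for the default 16MB segment size (compare the built-in pg_walfile_name() function):

def wal_file_name(timeline: int, lsn: int,
                  seg_size: int = 16 * 1024 * 1024) -> str:
    segno = lsn // seg_size               # global segment number
    per_id = 0x1_0000_0000 // seg_size    # segments per 4GB "xlog id" (256)
    return f"{timeline:08X}{segno // per_id:08X}{segno % per_id:08X}"

# LSN 0/16B3740 falls in the second 16MB segment on timeline 1:
print(wal_file_name(1, 0x016B3740))   # -> 000000010000000000000001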

WAL Retention

Old segments are recycled or removed based on:

  • wal_keep_size: Minimum WAL to retain
  • Replication slots: Keep WAL needed by standbys
  • Archive status: Keep until archived

-- Check WAL usage
SELECT * FROM pg_stat_wal;

-- Check replication slot WAL retention
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots;

WAL Bloat Issues

WAL can grow unexpectedly due to:

  • Inactive replication slots
  • Slow archive commands
  • Long-running transactions

-- Find inactive slots consuming WAL
SELECT slot_name, active, pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) as retained
FROM pg_replication_slots;

-- Drop problematic slot
SELECT pg_drop_replication_slot('unused_slot');

7.12 Double-Write Buffer (InnoDB)

MySQL InnoDB uses a different approach to torn pages:

                    INNODB DOUBLE-WRITE BUFFER

    Dirty Page in Buffer Pool
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. Write to Double-Write Buffer (sequential, 2MB area)       │
    │    [Page1][Page2][Page3]...                                  │
    │    fsync()                                                    │
    └─────────────────────────────────────────────────────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. Write to actual page locations (random I/O)               │
    └─────────────────────────────────────────────────────────────┘

    On crash:
    - If torn page detected, restore from double-write buffer
    - Then apply redo log

Trade-off: Doubles write I/O for data pages, but provides torn page protection without full page writes in the redo log.
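
A sketch of the two-step write path (the file handles and page objects are illustrative):

import os

def flush_with_doublewrite(pages, dwb_file, data_file) -> None:
    # 1. Sequential write of the whole batch into the doublewrite area
    dwb_file.seek(0)
    for page in pages:
        dwb_file.write(page.data)
    dwb_file.flush()
    os.fsync(dwb_file.fileno())       # batch is now safely duplicated
    # 2. Random writes to each page's real location
    for page in pages:
        data_file.seek(page.offset)
        data_file.write(page.data)
    data_file.flush()
    os.fsync(data_file.fileno())
    # A crash during step 2 may tear a page, but recovery restores the
    # intact copy from the doublewrite area, then applies the redo log.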


7.13 Summary

Write-Ahead Logging is fundamental to database durability:

  • WAL rule: Log before data, always
  • LSN: Unique identifier for log ordering
  • Checkpoints: Bound recovery time, enable WAL recycling
  • ARIES: Industry-standard recovery algorithm
  • Full page writes: Handle torn pages
  • PITR: Restore to any point in time
  • Replication: Stream WAL to standbys
  • Group commit: Batch fsyncs for performance

Understanding WAL helps you configure durability guarantees and diagnose recovery issues.


What’s Next

In Chapter 8, we’ll explore MVCC (Multi-Version Concurrency Control) and transaction isolation—how databases let multiple users read and write simultaneously without stepping on each other.


“The WAL is your safety net. Everything else is just performance.”