Chapter 7: Write-Ahead Logging (WAL)

“First, log it. Then, do it. Always in that order.”

What happens when your database server crashes mid-transaction? Without careful engineering, you could lose committed data or end up with corrupted state. Write-Ahead Logging (WAL) is the fundamental technique that makes databases durable and recoverable.


7.1 The Durability Problem

Consider this scenario:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Without WAL, the database might:

  1. Modify Account 1’s page in memory
  2. CRASH before modifying Account 2
  3. After restart: Account 1 lost $100, Account 2 gained nothing!

Money has vanished. The transaction was supposed to be atomic.

                    THE DURABILITY PROBLEM

    Transaction commits
           │
           ▼
    ┌────────────────────────────────────────────────────┐
    │ Buffer Pool (Memory)                                │
    │ ┌────────────┐  ┌────────────┐                     │
    │ │ Account 1  │  │ Account 2  │  ← Changes are here │
    │ │ $900       │  │ $1100      │                     │
    │ │ (dirty)    │  │ (dirty)    │                     │
    │ └────────────┘  └────────────┘                     │
    └────────────────────────────────────────────────────┘
                    │
                    │ Not yet written to disk!
                    ▼
              [POWER FAILURE]
                    │
                    ▼
    ┌────────────────────────────────────────────────────┐
    │ Disk (Persistent Storage)                           │
    │ ┌────────────┐  ┌────────────┐                     │
    │ │ Account 1  │  │ Account 2  │  ← Old values!      │
    │ │ $1000      │  │ $1000      │                     │
    │ └────────────┘  └────────────┘                     │
    └────────────────────────────────────────────────────┘

    Data loss!

7.2 The WAL Protocol

Write-Ahead Logging solves this with a simple rule:

Before any change is written to a data page on disk, the log record describing that change must first be written to the WAL on disk.

                    THE WAL PROTOCOL

    1. FIRST: Write to WAL (log the intent)
    ┌──────────────────────────────────────────────────────────┐
    │ WAL (on disk, append-only)                               │
    │ [LSN 1001: UPDATE accounts SET balance=900 WHERE id=1]   │
    │ [LSN 1002: UPDATE accounts SET balance=1100 WHERE id=2]  │
    │ [LSN 1003: COMMIT Transaction 42]                        │
    └──────────────────────────────────────────────────────────┘
           │
           │ fsync() - Force to disk!
           ▼

    2. THEN: Return success to client
           │
           ▼

    3. LATER: Write data pages to disk (background)
    ┌──────────────────────────────────────────────────────┐
    │ Data Files (on disk)                                  │
    │ Account 1: $900                                       │
    │ Account 2: $1100                                      │
    └──────────────────────────────────────────────────────┘

Key insight: WAL writes are sequential and append-only—fast even on spinning disks. Data page writes can be deferred and batched.
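
To make the protocol concrete, here is a minimal Python sketch of the write-ahead rule. The WriteAheadLog class and the record format are illustrative inventions, not any particular engine's API:

import os

class WriteAheadLog:
    """Append-only log file: a toy sketch of the WAL rule."""

    def __init__(self, path: str):
        self.f = open(path, "ab")

    def append(self, record: bytes) -> None:
        self.f.write(record + b"\n")

    def sync(self) -> None:
        self.f.flush()             # push Python's buffer to the OS
        os.fsync(self.f.fileno())  # force the OS page cache to disk

def commit(wal: WriteAheadLog, changes: list[bytes]) -> None:
    # 1. FIRST: log the intent, then the commit record...
    for change in changes:
        wal.append(change)
    wal.append(b"COMMIT")
    wal.sync()  # ...and make it all durable with a single fsync
    # 2. THEN: success may be returned to the client.
    # 3. LATER: dirty data pages are flushed in the background.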


7.3 WAL Record Structure

Each WAL record contains enough information to redo (or undo) a change:

                    WAL RECORD FORMAT

    ┌─────────────────────────────────────────────────────────────┐
    │ LSN (Log Sequence Number): Unique identifier for this record│
    ├─────────────────────────────────────────────────────────────┤
    │ Transaction ID: Which transaction made this change          │
    ├─────────────────────────────────────────────────────────────┤
    │ Previous LSN: For chaining records in same transaction      │
    ├─────────────────────────────────────────────────────────────┤
    │ Page ID: Which page is affected                             │
    ├─────────────────────────────────────────────────────────────┤
    │ Record Type: INSERT, UPDATE, DELETE, COMMIT, ABORT, etc.   │
    ├─────────────────────────────────────────────────────────────┤
    │ Before Image: Original data (for UNDO)                      │
    ├─────────────────────────────────────────────────────────────┤
    │ After Image: New data (for REDO)                            │
    ├─────────────────────────────────────────────────────────────┤
    │ Checksum: Data integrity verification                       │
    └─────────────────────────────────────────────────────────────┘
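
One hypothetical way to encode this format in Python; the field widths and the CRC32 checksum are illustrative choices, not a real engine's on-disk layout:

import struct
import zlib
from dataclasses import dataclass

@dataclass
class WALRecord:
    lsn: int           # Log Sequence Number
    txn_id: int        # owning transaction
    prev_lsn: int      # previous record of the same transaction
    page_id: int       # affected page
    record_type: int   # e.g. 1=INSERT, 2=UPDATE, 3=DELETE, 4=COMMIT
    before: bytes      # before image (for UNDO)
    after: bytes       # after image (for REDO)

    HEADER = struct.Struct("<QQQQBII")  # fixed fields + payload lengths

    def serialize(self) -> bytes:
        header = self.HEADER.pack(
            self.lsn, self.txn_id, self.prev_lsn, self.page_id,
            self.record_type, len(self.before), len(self.after))
        body = header + self.before + self.after
        return body + struct.pack("<I", zlib.crc32(body))  # checksum last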

LSN (Log Sequence Number)

Every WAL record has a unique, monotonically increasing LSN:

    LSN 1001: BEGIN Transaction 42
    LSN 1002: UPDATE on page 7
    LSN 1003: UPDATE on page 12
    LSN 1004: COMMIT Transaction 42
    LSN 1005: BEGIN Transaction 43
    ...

Pages track which LSN they were last modified at:

    Page 7 header: page_lsn = 1002
    Page 12 header: page_lsn = 1003

This enables recovery to know which changes have been applied.
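
This page_lsn comparison is also what makes redo idempotent during recovery. A sketch, reusing the illustrative record shape above (the page object is assumed to expose data and page_lsn):

def apply_change(page, after_image: bytes) -> None:
    page.data = after_image        # simplistic: replace the page body

def redo(page, record) -> None:
    """Reapply a change only if the page does not already contain it."""
    if page.page_lsn >= record.lsn:
        return                     # page already reflects this LSN: skip
    apply_change(page, record.after)
    page.page_lsn = record.lsn     # record how far this page has been redone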


7.4 WAL and Checkpoints

Writing every change to WAL is great for durability, but:

  1. WAL files grow forever
  2. Recovery would replay from the beginning of time

Checkpoints solve this:

                    CHECKPOINT PROCESS

    Before checkpoint:
    ┌─────────────────────────────────────────────────────────────┐
    │ Buffer Pool: dirty pages scattered across time              │
    │ [Page 3: LSN 950] [Page 7: LSN 1002] [Page 12: LSN 1003]   │
    └─────────────────────────────────────────────────────────────┘

    Checkpoint (LSN 1100):
    1. Write all dirty pages to disk
    2. Write checkpoint record to WAL
    3. Record: "All changes up to LSN 1100 are on disk"

    After checkpoint:
    - WAL records before LSN 1100 can be deleted
    - Recovery only needs to replay from LSN 1100

    ┌──────────────────────────────────────────────────────────────┐
    │ WAL:                                                         │
    │ [OLD - can delete] │ CHECKPOINT │ [NEEDED for recovery]      │
    │  LSN < 1100        │  LSN 1100  │  LSN > 1100                │
    └──────────────────────────────────────────────────────────────┘
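
In code, a simple "sharp" checkpoint is just three steps. This sketch assumes buffer_pool and wal objects with the obvious methods; real systems such as PostgreSQL spread the page writes over time instead of flushing everything at once:

def checkpoint(buffer_pool, wal) -> None:
    checkpoint_lsn = wal.current_lsn()
    # 1. Write all dirty pages to disk
    for page in buffer_pool.dirty_pages():
        buffer_pool.flush(page)
    # 2. Durably log the checkpoint itself
    wal.append(f"CHECKPOINT {checkpoint_lsn}".encode())
    wal.sync()
    # 3. Everything before checkpoint_lsn is now redundant
    wal.delete_segments_before(checkpoint_lsn)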

Checkpoint Trade-offs

Frequent checkpoints:

  • Faster recovery (less WAL to replay)
  • More I/O during normal operation
  • Can cause latency spikes

Infrequent checkpoints:

  • Longer recovery time
  • Less normal I/O
  • Risk of running out of WAL space

# PostgreSQL checkpoint configuration
checkpoint_timeout = 5min           # Max time between checkpoints
checkpoint_completion_target = 0.9  # Spread writes over 90% of the interval
max_wal_size = 1GB                  # Trigger a checkpoint if WAL reaches this size

7.5 ARIES: The Industry Standard

Most modern databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM in the early 1990s. ARIES rests on three key principles:

1. Write-Ahead Logging

Log before data. We’ve covered this.

2. Repeating History During Redo

On recovery, replay ALL changes (even for aborted transactions) to restore the exact pre-crash state, then undo incomplete transactions.

3. Logging Changes During Undo

When rolling back a transaction, log the undo operations too. This ensures idempotent recovery.

                    ARIES RECOVERY PHASES

    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 1: ANALYSIS                                            │
    │   - Scan WAL from last checkpoint                            │
    │   - Build dirty page table (pages that might need redo)      │
    │   - Build active transaction table (incomplete transactions) │
    └──────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 2: REDO                                                │
    │   - Scan forward from oldest dirty page LSN                  │
    │   - Reapply all changes (committed or not)                   │
    │   - Restore database to exact pre-crash state                │
    └──────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌──────────────────────────────────────────────────────────────┐
    │ PHASE 3: UNDO                                                │
    │   - Roll back uncommitted transactions                       │
    │   - Process in reverse LSN order                             │
    │   - Log compensation records (CLRs)                          │
    └──────────────────────────────────────────────────────────────┘

Compensation Log Records (CLRs)

When undoing a change, we log the undo:

    Original: LSN 1002 - UPDATE balance from 1000 to 900

    Undo: LSN 1050 - CLR: UPDATE balance from 900 to 1000
                    (undoes LSN 1002)

    If crash during undo:
    - Redo phase reapplies CLR 1050
    - Undo phase sees CLR, skips LSN 1002
    - No infinite undo loop!
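
Putting the three phases and CLRs together, here is a compressed, illustrative sketch of ARIES-style recovery. The record fields follow the hypothetical WALRecord above; the type constants (INSERT, UPDATE, CLR, ...) and helpers (scan_from, fetch, load_page, write_clr_and_apply) are assumed:

def aries_recover(log, checkpoint_lsn):
    # PHASE 1: ANALYSIS - rebuild crash-time state from the log
    dirty_pages = {}   # page_id -> first LSN that dirtied it (recLSN)
    active_txns = {}   # txn_id  -> last LSN written by that transaction
    for rec in log.scan_from(checkpoint_lsn):
        active_txns[rec.txn_id] = rec.lsn
        if rec.record_type in (INSERT, UPDATE, DELETE):
            dirty_pages.setdefault(rec.page_id, rec.lsn)
        elif rec.record_type in (COMMIT, ABORT):
            active_txns.pop(rec.txn_id, None)

    # PHASE 2: REDO - repeat history, committed or not
    redo_start = min(dirty_pages.values(), default=checkpoint_lsn)
    for rec in log.scan_from(redo_start):
        if rec.page_id in dirty_pages:
            redo(load_page(rec.page_id), rec)  # page_lsn check keeps this idempotent

    # PHASE 3: UNDO - roll back losers in reverse LSN order, logging CLRs
    for txn_id, last_lsn in active_txns.items():
        lsn = last_lsn
        while lsn is not None:
            rec = log.fetch(lsn)
            if rec.record_type == CLR:
                lsn = rec.undo_next       # skip what this CLR already undid
            else:
                write_clr_and_apply(rec)  # log compensation, restore before image
                lsn = rec.prev_lsn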

7.6 Physical vs Logical Logging

Physical Logging

Log exact bytes changed:

    Page 7, offset 1024, old bytes: [00 00 03 E8], new bytes: [00 00 03 84]

Pros: Simple, exact replay
Cons: Large logs, tied to physical layout

Logical Logging

Log the operation:

    UPDATE accounts SET balance = 900 WHERE id = 1

Pros: Smaller logs, replication-friendly
Cons: Must handle schema changes, non-determinism

Physiological Logging (Most Common)

Hybrid: Physical page reference, logical within page:

    Page 7: UPDATE row in slot 3, set column 2 to 900

This is what PostgreSQL and most databases use:

  • Page-level addressing (physical)
  • Row-level operations (logical)
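
The difference is easiest to see as record payloads. A purely illustrative comparison, using the examples above:

# Physical: exact bytes at an exact location
physical = {
    "page": 7, "offset": 1024,
    "old": bytes([0x00, 0x00, 0x03, 0xE8]),   # 1000
    "new": bytes([0x00, 0x00, 0x03, 0x84]),   # 900
}

# Logical: the operation, independent of storage layout
logical = {"sql": "UPDATE accounts SET balance = 900 WHERE id = 1"}

# Physiological: physical page address, logical change within the page
physiological = {"page": 7, "slot": 3, "set": {"balance": 900}}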

7.7 Full Page Writes

What if a crash happens mid-page-write, leaving a torn page?

    8KB page write in progress:
    ┌────────────────────────────────────────────────────┐
    │ First 4KB written (new data)                       │
    │────────────────────────────────────────────────────│
    │ Last 4KB NOT written (old data)                    │ ← Crash here!
    └────────────────────────────────────────────────────┘

    Page is now corrupted: Half old, half new

Solution: Full Page Writes (FPW)

After each checkpoint, the first modification to any page writes the entire page to WAL:

    Checkpoint at LSN 1000

    First modification to Page 7 after checkpoint:
    WAL record LSN 1050:
    ┌─────────────────────────────────────────────────────────────┐
    │ Type: FULL PAGE IMAGE                                        │
    │ Page: 7                                                      │
    │ Content: [entire 8KB page content]                           │
    │ Plus: The actual change                                      │
    └─────────────────────────────────────────────────────────────┘

    Subsequent modifications to Page 7 before next checkpoint:
    Only log the delta (normal small records)

On recovery, if a page is found torn, it is restored from the full page image in the WAL.

Trade-off: Full page writes increase WAL volume significantly but ensure recoverability.
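
As a sketch, the decision lives in the write path: the first touch of a page after a checkpoint logs the whole page, and later touches log only deltas (full_page_image_record and delta_record are hypothetical helpers):

def log_update(wal, page, change, last_checkpoint_lsn) -> None:
    if page.page_lsn < last_checkpoint_lsn:
        # First modification since the checkpoint: torn-page protection
        wal.append(full_page_image_record(page))  # entire 8KB page image
    wal.append(delta_record(page, change))        # the actual small change
    wal.sync()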


7.8 WAL Archiving and PITR

WAL enables Point-in-Time Recovery (PITR):

                    POINT-IN-TIME RECOVERY

    ┌─────────────────────────────────────────────────────────────┐
    │ Full Backup (Sunday midnight)                                │
    │ [Snapshot of all data files]                                 │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Apply WAL...
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ WAL Segments (Sunday → Today)                                │
    │ [Mon] → [Tue] → [Wed] → [Thu] → [Fri] → [Sat]               │
    └─────────────────────────────────────────────────────────────┘
                               │
                               │ Stop at specific point
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ "Restore database as of Thursday 3:47:23 PM"                 │
    └─────────────────────────────────────────────────────────────┘

Setting Up WAL Archiving (PostgreSQL)

# postgresql.conf
wal_level = replica          # or 'logical'
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'

Performing PITR

# 1. Restore the base backup (a file-level copy, e.g. taken with
#    pg_basebackup) into the data directory
tar -xf /backup/base_backup.tar -C "$PGDATA"

# 2. Configure recovery
# postgresql.conf (v12+; older versions use recovery.conf)
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 15:47:23'

# On v12+, also create recovery.signal in the data directory:
touch "$PGDATA/recovery.signal"

# 3. Start the server - it will replay WAL up to the target time

7.9 WAL and Replication

WAL is the foundation of database replication:

                    STREAMING REPLICATION

    Primary Server                      Standby Server
    ┌───────────────────┐              ┌───────────────────┐
    │ Transactions      │              │                   │
    │      │            │              │                   │
    │      ▼            │              │                   │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │   WAL   │───────│─ Stream ────│►│   WAL   │       │
    │ └─────────┘       │   WAL        │ └─────────┘       │
    │      │            │              │      │            │
    │      ▼            │              │      ▼            │
    │ ┌─────────┐       │              │ ┌─────────┐       │
    │ │  Data   │       │              │ │  Data   │       │
    │ └─────────┘       │              │ └─────────┘       │
    └───────────────────┘              └───────────────────┘

    Standby applies WAL records to stay in sync

Synchronous vs Asynchronous Replication

Asynchronous (default):

  • Primary doesn’t wait for standby
  • Possible data loss on failover
  • Better performance

Synchronous:

  • Primary waits for standby acknowledgment
  • No data loss on failover
  • Higher latency

# PostgreSQL synchronous replication
# postgresql.conf on the primary:
synchronous_standby_names = 'standby1'

# Commit waits for standby confirmation
synchronous_commit = remote_apply  # or: on, remote_write

7.10 WAL Performance Considerations

WAL Write Latency

Every commit requires:

  1. Write WAL record(s) to buffer
  2. fsync() WAL to disk
  3. Return to client

The fsync() is the bottleneck:

    HDD: fsync ≈ 5-10ms  → ~100-200 commits/second
    SSD: fsync ≈ 0.1-1ms → ~1,000-10,000 commits/second
    NVMe: fsync ≈ 20-50μs → ~20,000-50,000 commits/second

Group Commit

Batch multiple transactions’ fsync into one:

    Without group commit:
    T1: write → fsync (5ms)
    T2: write → fsync (5ms)
    T3: write → fsync (5ms)
    Total: 15ms for 3 transactions

    With group commit:
    T1: write → wait
    T2: write → wait
    T3: write → wait
    All: single fsync (5ms)
    Total: 5ms for 3 transactions

# PostgreSQL group commit tuning
commit_delay = 10        # Microseconds to wait so more commits can batch
commit_siblings = 5      # Only delay if 5+ other transactions are active
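
Here is a runnable Python sketch of the leader/follower pattern behind group commit; it illustrates the idea rather than how PostgreSQL's WAL writer is actually structured:

import os
import threading

class GroupCommitLog:
    """Concurrent committers share one fsync (group commit)."""

    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.write_pos = 0     # bytes appended so far
        self.flush_pos = 0     # bytes known to be durable
        self.flushing = False  # is some thread currently flush leader?

    def commit(self, record: bytes) -> None:
        with self.lock:
            self.f.write(record + b"\n")
            self.write_pos += len(record) + 1
            my_pos = self.write_pos
            while self.flush_pos < my_pos:
                if self.flushing:
                    self.flushed.wait()    # follower: leader fsyncs for us
                    continue
                self.flushing = True       # become the leader
                target = self.write_pos    # cover everyone queued so far
                self.lock.release()
                try:
                    self.f.flush()
                    os.fsync(self.f.fileno())  # one fsync for the group
                finally:
                    self.lock.acquire()
                    self.flushing = False
                    self.flush_pos = max(self.flush_pos, target)
                    self.flushed.notify_all()

With several threads committing concurrently, the leader's single fsync covers all of them: exactly the 15ms versus 5ms difference in the timing example above.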

WAL Compression

Compress WAL to reduce I/O:

# PostgreSQL 15+
wal_compression = lz4    # or: pglz, zstd, on (equivalent to pglz)

This often cuts WAL volume roughly in half at a small CPU cost; the exact savings depend on the workload, since full page images compress best.


7.11 WAL Segment Management

WAL is stored in fixed-size segment files:

    PostgreSQL WAL directory:
    pg_wal/
    ├── 000000010000000000000001   (16MB segment)
    ├── 000000010000000000000002   (16MB segment)
    ├── 000000010000000000000003   (16MB segment)
    └── ...

    Filename encodes: timeline + segment number
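
The segment a given LSN falls into can be computed directly. This Python sketch mirrors PostgreSQL's naming convention for the default 16MB segment size (compare the built-in pg_walfile_name() function):

def wal_file_name(timeline: int, lsn: int,
                  seg_size: int = 16 * 1024 * 1024) -> str:
    segno = lsn // seg_size               # global segment number
    per_id = 0x1_0000_0000 // seg_size    # segments per 4GB "xlog id" (256)
    return f"{timeline:08X}{segno // per_id:08X}{segno % per_id:08X}"

# LSN 0/16B3740 falls in the second 16MB segment on timeline 1:
print(wal_file_name(1, 0x016B3740))   # -> 000000010000000000000001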

WAL Retention

Old segments are recycled or removed based on:

  • wal_keep_size: Minimum WAL to retain
  • Replication slots: Keep WAL needed by standbys
  • Archive status: Keep until archived

-- Check WAL usage
SELECT * FROM pg_stat_wal;

-- Check replication slot WAL retention
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots;

WAL Bloat Issues

WAL can grow unexpectedly due to:

  • Inactive replication slots
  • Slow archive commands
  • Long-running transactions

-- Find inactive slots consuming WAL
SELECT slot_name, active, pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) as retained
FROM pg_replication_slots;

-- Drop problematic slot
SELECT pg_drop_replication_slot('unused_slot');

7.12 Double-Write Buffer (InnoDB)

MySQL InnoDB uses a different approach to torn pages:

                    INNODB DOUBLE-WRITE BUFFER

    Dirty Page in Buffer Pool
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. Write to Double-Write Buffer (sequential, 2MB area)       │
    │    [Page1][Page2][Page3]...                                  │
    │    fsync()                                                    │
    └─────────────────────────────────────────────────────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. Write to actual page locations (random I/O)               │
    └─────────────────────────────────────────────────────────────┘

    On crash:
    - If torn page detected, restore from double-write buffer
    - Then apply redo log

Trade-off: Doubles write I/O for data pages, but provides torn page protection without full page writes in the redo log.
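
A sketch of the two-step write path (the file handles and page objects are illustrative):

import os

def flush_with_doublewrite(pages, dwb_file, data_file) -> None:
    # 1. Sequential write of the whole batch into the doublewrite area
    dwb_file.seek(0)
    for page in pages:
        dwb_file.write(page.data)
    dwb_file.flush()
    os.fsync(dwb_file.fileno())       # batch is now safely duplicated
    # 2. Random writes to each page's real location
    for page in pages:
        data_file.seek(page.offset)
        data_file.write(page.data)
    data_file.flush()
    os.fsync(data_file.fileno())
    # A crash during step 2 may tear a page, but recovery restores the
    # intact copy from the doublewrite area, then applies the redo log.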


7.13 Summary

Write-Ahead Logging is fundamental to database durability:

  • WAL rule: Log before data, always
  • LSN: Unique identifier for log ordering
  • Checkpoints: Bound recovery time, enable WAL recycling
  • ARIES: Industry-standard recovery algorithm
  • Full page writes: Handle torn pages
  • PITR: Restore to any point in time
  • Replication: Stream WAL to standbys
  • Group commit: Batch fsyncs for performance

Understanding WAL helps you configure durability guarantees and diagnose recovery issues.


What’s Next

In Chapter 8, we’ll explore MVCC (Multi-Version Concurrency Control) and transaction isolation—how databases let multiple users read and write simultaneously without stepping on each other.


“The WAL is your safety net. Everything else is just performance.”