Chapter 13: Recovery and Crash Safety
“Crashes are not a matter of if, but when. The question is: what happens to your data?”
Servers crash. Power fails. Disks corrupt. Kubernetes restarts your pod. A database must survive all of these and emerge with data intact. This chapter explores how databases achieve crash safety and recover from failures.
13.1 The Recovery Challenge
Consider what’s in flight when a database crashes:
State at crash:
┌─────────────────────────────────────────────────────────────┐
│ Buffer Pool: │
│ Page 42: Modified by committed Txn 100 (not on disk!) │
│ Page 67: Modified by uncommitted Txn 101 │
│ │
│ WAL: │
│ Txn 100: COMMIT record written and fsynced │
│ Txn 101: INSERT record written, no COMMIT │
│ Txn 102: Several records, no COMMIT │
│ │
│ Disk: │
│ Page 42: Contains old data (before Txn 100) │
│ Page 67: Contains old data (before Txn 101) │
└─────────────────────────────────────────────────────────────┘
After restart, we need:
- Txn 100 changes: RESTORED (it committed!)
- Txn 101 changes: UNDONE (never committed)
- Txn 102 changes: UNDONE (never committed)
13.2 The ARIES Recovery Algorithm
Most relational databases base their recovery on ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) or a variant of it:
ARIES RECOVERY PHASES
┌─────────────────────────────────────────────────────────────┐
│ 1. ANALYSIS │
│ Scan WAL from last checkpoint │
│ Determine: │
│ - Which transactions were active at crash │
│ - Which pages might be dirty (need redo) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. REDO │
│ Replay WAL forward from oldest dirty page │
│ Reapply ALL changes (committed or not) │
│ Restore database to exact pre-crash state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. UNDO │
│ Roll back uncommitted transactions │
│ Process in reverse order │
│ Log compensation records for idempotence │
└─────────────────────────────────────────────────────────────┘
Phase 1: Analysis
Scan the WAL starting from the last checkpoint:
Checkpoint Record (LSN 1000):
- Active transactions: {Txn 100, Txn 101}
- Dirty pages: {Page 42, Page 67}
Continue scanning WAL:
- LSN 1001: Txn 100 INSERT page 42
- LSN 1002: Txn 100 COMMIT ← Txn 100 is done!
- LSN 1003: Txn 101 UPDATE page 67
- LSN 1004: Txn 102 BEGIN
- LSN 1005: Txn 102 INSERT page 89
[CRASH]
Analysis results:
- Winner (committed): Txn 100
- Losers (uncommitted): Txn 101, Txn 102
- Dirty pages to check: 42, 67, 89
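To make the bookkeeping concrete, here is a minimal Python sketch of the analysis pass. It assumes a simplified record format (dicts with lsn, type, txn, and page keys) and a checkpoint record that carries the active-transaction and dirty-page snapshots; the names are illustrative, not any real engine's format.
def analysis(wal_records, checkpoint):
    """Rebuild the transaction table and dirty page table from the WAL."""
    active_txns = set(checkpoint["active_txns"])    # transactions running at the checkpoint
    dirty_pages = dict(checkpoint["dirty_pages"])   # page_id -> recLSN (first dirtying LSN)
    for rec in wal_records:                         # records after the checkpoint, in LSN order
        if rec["type"] == "BEGIN":
            active_txns.add(rec["txn"])
        elif rec["type"] in ("COMMIT", "ABORT"):
            active_txns.discard(rec["txn"])         # finished: a winner (or already rolled back)
        else:                                       # a "DATA" record describing a page change
            active_txns.add(rec["txn"])
            dirty_pages.setdefault(rec["page"], rec["lsn"])  # page may be dirty from this LSN on
    losers = active_txns                            # still unfinished at the crash
    redo_start = min(dirty_pages.values(), default=None)    # where the redo pass must begin
    return losers, dirty_pages, redo_start
Run over the example above, this yields losers {101, 102} and dirty pages {42, 67, 89}, matching the analysis results.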
Phase 2: Redo
Replay all logged changes to restore pre-crash state:
For each WAL record from oldest dirty page LSN:
If page_lsn < record_lsn: (change not yet applied)
Reapply the change
Set page_lsn = record_lsn
This handles:
- Committed changes not yet on disk (REDO them)
- Uncommitted changes not yet on disk (REDO them too!)
Wait, why redo uncommitted changes?
→ Restores exact pre-crash state
→ Then undo phase can cleanly roll back
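A hedged sketch of that loop, continuing the same simplified format and assuming each in-memory page tracks the LSN of the last change applied to it (its pageLSN):
from dataclasses import dataclass

@dataclass
class Page:
    lsn: int = 0          # LSN of the last change applied to this page (pageLSN)
    data: bytes = b""

def redo(wal_records, dirty_pages, pages, redo_start):
    """Replay every logged data change, committed or not, in LSN order."""
    for rec in wal_records:
        if rec["type"] != "DATA" or rec["lsn"] < redo_start:
            continue
        if rec["page"] not in dirty_pages:
            continue                          # page was already clean on disk at the crash
        page = pages[rec["page"]]
        if page.lsn < rec["lsn"]:             # change not yet reflected on the page
            page.data = rec["after_image"]    # reapply it exactly as logged
            page.lsn = rec["lsn"]
        # else: the page already holds this change; skipping keeps replay idempotent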
Phase 3: Undo
Roll back uncommitted transactions:
Loser transactions: {Txn 101, Txn 102}
Process in reverse LSN order:
- LSN 1005: Txn 102 INSERT → UNDO: Delete the row
- LSN 1003: Txn 101 UPDATE → UNDO: Restore old value
For each undo, write Compensation Log Record (CLR):
- CLR for LSN 1005: "Deleted row inserted by 1005"
- CLR for LSN 1003: "Restored value changed by 1003"
CLRs ensure idempotence:
If crash during undo, redo phase replays CLRs,
then undo skips already-undone records
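A sketch of the undo pass in the same style. Real ARIES walks each loser's records backwards through per-record prevLSN/undoNextLSN pointers; this version flattens that into one list sorted by LSN, which is enough to show the role of CLRs:
def undo(wal_records, losers, pages, append_to_wal):
    """Roll back loser transactions in reverse LSN order, logging a CLR per undone change."""
    to_undo = sorted((r for r in wal_records
                      if r["txn"] in losers and r["type"] == "DATA"),
                     key=lambda r: r["lsn"], reverse=True)
    for rec in to_undo:
        page = pages[rec["page"]]
        page.data = rec["before_image"]               # restore the pre-change value
        clr_lsn = append_to_wal({                     # compensation log record
            "type": "CLR", "txn": rec["txn"], "page": rec["page"],
            "undoes": rec["lsn"],                     # which record this CLR compensates
            "after_image": rec["before_image"]})      # CLRs are redo-only: never undone
        page.lsn = clr_lsn
    for txn in losers:
        append_to_wal({"type": "ABORT", "txn": txn})  # mark each loser as finished
If the system crashes during undo, the next recovery's redo pass replays the CLRs, and the "undoes" pointers tell the new undo pass how far the previous one got.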
13.3 Checkpoints
Checkpoints limit how much WAL must be replayed:
CHECKPOINT OPERATION
1. Write CHECKPOINT BEGIN to WAL
2. Note active transactions and dirty pages
3. Flush dirty pages to disk (spread over time)
4. Write CHECKPOINT END to WAL with:
- Active transaction list
- Dirty page list
- Oldest LSN needed for recovery
Recovery starts from checkpoint, not beginning of time
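A simplified, synchronous sketch of that sequence (a real fuzzy checkpoint hands the flushing to the background writer, as described next):
import time

def take_checkpoint(append_to_wal, active_txns, dirty_pages, flush_page):
    """Write BEGIN/END checkpoint records around a flush of the dirty pages."""
    append_to_wal({"type": "CHECKPOINT_BEGIN", "time": time.time()})
    snapshot_txns = sorted(active_txns)            # who was running when we started
    snapshot_dirty = dict(dirty_pages)             # page_id -> recLSN at checkpoint start
    for page_id in snapshot_dirty:                 # in practice, spread over time
        flush_page(page_id)
    append_to_wal({
        "type": "CHECKPOINT_END",
        "active_txns": snapshot_txns,
        "dirty_pages": snapshot_dirty,
        "redo_start": min(snapshot_dirty.values(), default=None),  # oldest LSN recovery needs
    })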
Fuzzy Checkpoints
Don’t stop the world to checkpoint:
Fuzzy Checkpoint:
- Mark checkpoint start
- Continue normal operations
- Background writer flushes dirty pages
- Eventually every page dirtied before the checkpoint started has been flushed
- Mark checkpoint complete
Database never stops serving requests
Checkpoint Frequency Trade-offs
Frequent checkpoints:
+ Faster recovery (less WAL to replay)
- More I/O during normal operation
- Potential performance spikes
Infrequent checkpoints:
+ Less I/O during normal operation
+ Smoother performance
- Longer recovery time
- More WAL space needed
# PostgreSQL checkpoint tuning (postgresql.conf)
checkpoint_timeout = 5min              # Max time between checkpoints
checkpoint_completion_target = 0.9     # Spread I/O over 90% of the interval
13.4 WAL Record Types
Different record types serve different purposes:
Data Records:
- INSERT: New row data
- UPDATE: Before and after images
- DELETE: Deleted row data
Transaction Records:
- BEGIN: Transaction started
- COMMIT: Transaction committed
- ABORT: Transaction rolled back
- PREPARE: For two-phase commit
Control Records:
- CHECKPOINT: Checkpoint marker
- CLR: Compensation log record (undo result)
DDL Records:
- CREATE/DROP TABLE
- CREATE/DROP INDEX
- ALTER TABLE
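One way to picture these variants is as a single tagged record structure; the fields below are illustrative rather than any engine's on-disk layout:
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class RecordType(Enum):
    # Data records
    INSERT = auto()
    UPDATE = auto()
    DELETE = auto()
    # Transaction records
    BEGIN = auto()
    COMMIT = auto()
    ABORT = auto()
    PREPARE = auto()
    # Control records
    CHECKPOINT = auto()
    CLR = auto()
    # DDL records
    DDL = auto()

@dataclass
class WalRecord:
    lsn: int                        # position of this record in the log
    rec_type: RecordType
    txn_id: Optional[int] = None    # owning transaction, if any
    page_id: Optional[int] = None   # affected page, for data records
    before_image: bytes = b""       # old value, needed for undo
    after_image: bytes = b""        # new value, needed for redo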
13.5 Logical vs Physical Logging
Physical Logging
Record exact bytes changed:
Page 42, offset 1024:
Old bytes: 00 00 03 E8
New bytes: 00 00 03 84
Pros: Simple, exact replay
Cons: Large logs; doesn't survive page reorganization
Logical Logging
Record the operation:
UPDATE accounts SET balance = 900 WHERE id = 1
Pros: Compact, works across page changes
Cons: Must be deterministic; harder to implement
Physiological Logging
Most databases use a hybrid:
Page 42: UPDATE row in slot 3, set column 2 to 900
- Physical: References specific page
- Logical: Describes row-level operation
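The difference is easiest to see by writing the same balance update (1000 → 900 on account 1) in all three styles; the structures below are purely illustrative:
physical_record = {                 # exact bytes at an exact location
    "page": 42, "offset": 1024,
    "old_bytes": bytes.fromhex("000003e8"),   # 1000
    "new_bytes": bytes.fromhex("00000384"),   # 900
}

logical_record = {                  # the operation itself; replay must be deterministic
    "sql": "UPDATE accounts SET balance = 900 WHERE id = 1",
}

physiological_record = {            # physical to a page, logical within it
    "page": 42, "slot": 3,
    "column": 2, "old_value": 1000, "new_value": 900,
}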
13.6 Torn Page Protection
What if a crash occurs mid-page-write?
8KB page write:
First 4KB written → [CRASH] → Last 4KB not written
Page is now "torn" - partially old, partially new
Checksum will fail, but how to recover?
Solution 1: Full Page Writes
Write complete page to WAL on first modification after checkpoint:
After checkpoint, first write to page 42:
1. Write full 8KB page image to WAL
2. Write actual change to WAL
3. Modify page in buffer pool
On recovery:
- If page is torn, restore from WAL full page image
- Then apply subsequent changes
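A sketch of how the redo path might use those images, with CRC-32 standing in for the engine's real page checksum:
import zlib

def repair_torn_page(page_id, disk_page, stored_crc, wal_records, write_page):
    """If the on-disk page is torn (checksum mismatch), restore it from the full-page
    image logged at its first modification after the checkpoint; normal redo then
    replays any later changes on top."""
    if zlib.crc32(disk_page) == stored_crc:
        return                                            # page is intact; nothing to do
    image = next((r["image"] for r in wal_records
                  if r["type"] == "FULL_PAGE_IMAGE" and r["page"] == page_id), None)
    if image is not None:
        write_page(page_id, image)                        # overwrite the torn page whole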
Solution 2: Double-Write Buffer (InnoDB)
1. Write dirty pages to double-write buffer (sequential)
2. fsync double-write buffer
3. Write pages to actual locations (random)
On recovery:
- If page is torn, restore from double-write buffer
- Then apply redo log
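A sketch of the corresponding recovery check, with the double-write area modeled as a simple page_id → copy mapping (the real layout is InnoDB-specific):
def recover_from_doublewrite(doublewrite_pages, read_page, page_checksum_ok, write_page):
    """Every page in the double-write area was fsynced before its final write, so any
    page torn at its real location can be overwritten from the intact copy."""
    for page_id, copy in doublewrite_pages.items():
        if not page_checksum_ok(copy):
            continue                          # the copy itself is torn; the final write never began
        if not page_checksum_ok(read_page(page_id)):
            write_page(page_id, copy)         # real location torn mid-write: restore the copy
    # The redo log is then applied on top of the repaired pages.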
13.7 Crash Recovery Scenarios
Scenario 1: Clean Shutdown
CHECKPOINT completes
All dirty pages flushed
WAL ends with CHECKPOINT record
Recovery: Check WAL, confirm clean shutdown, done!
(Almost instant)
Scenario 2: Crash After Commit, Before Flush
Txn 100: INSERT, UPDATE, COMMIT (in WAL)
Dirty pages still in buffer pool
[CRASH]
Recovery:
1. Analysis: Txn 100 committed
2. Redo: Replay Txn 100 changes to disk
3. Undo: Nothing to undo
→ Data preserved!
Scenario 3: Crash During Transaction
Txn 101: INSERT, UPDATE (in WAL)
No COMMIT record
[CRASH]
Recovery:
1. Analysis: Txn 101 uncommitted (loser)
2. Redo: Replay Txn 101 changes (restore state)
3. Undo: Roll back Txn 101 changes
→ Transaction never happened!
Scenario 4: Crash During Recovery
Recovering from crash #1
In undo phase, rolling back Txn 101
[CRASH #2]
Recovery from crash #2:
1. Analysis: See CLRs from previous recovery
2. Redo: Replay CLRs
3. Undo: Continue where we left off (CLRs mark progress)
→ Eventually completes!
13.8 Point-in-Time Recovery (PITR)
Restore to any point in time:
PITR ARCHITECTURE
Base Backup (Sunday)
┌─────────────────────────────────────────────────────────────┐
│ pg_basebackup or similar │
│ Contains: All data files as of Sunday 00:00 │
└─────────────────────────────────────────────────────────────┘
+
WAL Archive (Sunday → Friday)
┌─────────────────────────────────────────────────────────────┐
│ Every WAL segment archived │
│ Contains: All changes since Sunday │
└─────────────────────────────────────────────────────────────┘
=
Restore to Wednesday 14:30
┌─────────────────────────────────────────────────────────────┐
│ 1. Restore base backup │
│ 2. Replay WAL up to Wednesday 14:30 │
│ 3. Stop recovery, open database │
└─────────────────────────────────────────────────────────────┘
Use Cases
- Recover from accidental DELETE
- Restore before corrupting UPDATE
- Create database clone at specific time
- Test “what if” scenarios
PostgreSQL PITR
# Take base backup
pg_basebackup -D /backup/base -Fp -Xs -P
# Configure WAL archiving
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
# Restore to a point in time
# postgresql.conf on the restore server (PostgreSQL 12+;
# also create an empty recovery.signal file in the data directory):
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 14:30:00'
13.9 Backup Strategies
Full Backup
Copy everything:
Pros: Simple restore (just copy back)
Cons: Large, slow to create
Frequency: Weekly or less
Incremental Backup
Copy only changed pages since last backup:
Pros: Fast, small
Cons: Restore requires all increments
Frequency: Daily
Restore: Full + Inc1 + Inc2 + Inc3 + ...
Continuous Archiving (WAL)
Archive every WAL segment:
Pros: Point-in-time recovery
Cons: Requires base backup + all WAL
Frequency: Continuous
Backup Strategy Example
Sunday 00:00: Full backup
Mon-Sat 00:00: Incremental backup
Continuous: WAL archiving
Restore options:
- Sunday data: Full backup
- Wednesday data: Full + Mon + Tue + Wed incrementals
- Wednesday 14:30: Full + incrementals + WAL to 14:30
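The restore-planning logic is mechanical enough to sketch; the backup objects here are plain dicts with a 'time' key, purely for illustration:
from datetime import datetime

def restore_plan(full_backups, incrementals, wal_segments, target_time: datetime):
    """Pick the newest full backup before the target, every incremental between it and
    the target, and the WAL needed to roll forward to the exact timestamp."""
    base = max((b for b in full_backups if b["time"] <= target_time),
               key=lambda b: b["time"])
    incs = sorted((i for i in incrementals if base["time"] < i["time"] <= target_time),
                  key=lambda i: i["time"])
    roll_forward_from = incs[-1]["time"] if incs else base["time"]
    wal = [w for w in wal_segments if roll_forward_from <= w["time"] <= target_time]
    return base, incs, wal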
13.10 Testing Recovery
If you haven’t tested recovery, you don’t have backups.
What to Test
1. Full restore from backup
2. Point-in-time recovery to specific timestamp
3. Recovery from simulated crash
4. Recovery time (is it acceptable?)
5. Incremental restore chain
6. Restore to different hardware
Automated Testing
#!/bin/bash
set -euo pipefail   # stop on the first failure so a broken restore is noticed
# Monthly recovery test
# 1. Restore latest backup into a fresh test database
dropdb --if-exists test_restore
createdb test_restore
pg_restore -d test_restore /backup/latest.dump
# 2. Verify data integrity
psql test_restore -c "SELECT count(*) FROM users"
# 3. Run application smoke tests
./run_smoke_tests.sh test_restore
# 4. Record recovery time ($SECONDS is bash's built-in elapsed-time counter)
echo "Recovery completed in $SECONDS seconds"
13.11 High Availability and Failover
Beyond crash recovery: survive hardware failures.
Streaming Replication
PRIMARY → STANDBY
Primary Standby
┌───────────────────┐ ┌───────────────────┐
│ Active database │ │ Hot standby │
│ │ WAL │ (read-only) │
│ ┌─────────────┐ │ ──────► │ ┌─────────────┐ │
│ │ WAL │ │ stream │ │ WAL │ │
│ └─────────────┘ │ │ └─────────────┘ │
└───────────────────┘ └───────────────────┘
If primary fails:
1. Promote standby to primary
2. Redirect clients to new primary
Synchronous vs Asynchronous
Asynchronous:
- Primary doesn't wait for standby
- Possible data loss on failover (standby behind)
- Lower latency
Synchronous:
- Primary waits for standby acknowledgment
- Zero data loss on failover
- Higher latency
Failover Process
1. Detect primary failure
- Heartbeat timeout
- Failed health check
2. Verify primary is really down
- Avoid split-brain!
- Use fencing/STONITH if needed
3. Promote standby
- pg_ctl promote
- or trigger file
4. Redirect clients
- Update DNS/load balancer
- Connection pooler reconfiguration
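A hedged sketch of a failover controller tying these steps together; all four callables are placeholders for environment-specific actions (health check, fencing, promotion, client re-routing):
import time

def monitor_and_failover(primary_healthy, fence_primary, promote_standby,
                         update_routing, checks=3, interval=5):
    """Require several consecutive failed health checks, fence the old primary
    before promoting, then repoint clients at the new primary."""
    failures = 0
    while True:
        failures = 0 if primary_healthy() else failures + 1
        if failures >= checks:                  # confident the primary is really down
            fence_primary()                     # avoid split-brain (STONITH, drop its VIP, ...)
            promote_standby()                   # e.g. pg_ctl promote on the standby
            update_routing()                    # DNS / load balancer / pooler reconfiguration
            return
        time.sleep(interval)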
13.12 Corruption Detection and Handling
Checksum Validation
-- PostgreSQL: enable checksums at initdb (--data-checksums) or later with pg_checksums
-- Detects: disk corruption, memory errors, torn pages
-- Check for corruption (PostgreSQL 12+)
SELECT * FROM pg_stat_database WHERE checksum_failures > 0;
When Corruption Is Detected
Options:
1. Restore from backup (safest)
2. Use pg_resetwal to skip corrupted records (data loss!)
3. Attempt to read good data, skip bad pages
Prevention:
- Enable checksums
- Use ECC RAM
- Use reliable storage
- Regular backups
- Monitor for errors
13.13 Recovery Time Objectives
RTO (Recovery Time Objective)
How long can you be down?
RTO = 1 hour:
- Crash recovery acceptable
- Maybe PITR restore acceptable
- Single server might be OK
RTO = 1 minute:
- Need hot standby
- Automatic failover
- Pre-warmed cache
RTO = 0:
- Multi-master or distributed database
- No single point of failure
RPO (Recovery Point Objective)
How much data can you lose?
RPO = 1 hour:
- Hourly backups sufficient
- Some data loss acceptable
RPO = 1 minute:
- Continuous WAL archiving
- Frequent checkpoints
RPO = 0:
- Synchronous replication
- Every commit acknowledged by standby
13.14 Summary
Database recovery ensures durability and consistency:
- ARIES provides industry-standard recovery
- Three phases: Analysis, Redo, Undo
- Checkpoints bound recovery time
- Full page writes handle torn pages
- PITR enables point-in-time recovery
- Replication provides high availability
- Test your backups or they’re worthless
Recovery is the foundation of database reliability—everything else depends on it working correctly.
What’s Next
In Chapter 14, we’ll compare column stores and row stores—different storage architectures optimized for different workloads.
“The best time to test your recovery process was before you needed it. The second best time is now.”