Chapter 13: Recovery and Crash Safety
“Crashes are not a matter of if, but when. The question is: what happens to your data?”
Servers crash. Power fails. Disks corrupt. Kubernetes restarts your pod. A database must survive all of these and emerge with data intact. This chapter explores how databases achieve crash safety and recover from failures.
13.1 The Recovery Challenge
Consider what’s in flight when a database crashes:
State at crash:
┌─────────────────────────────────────────────────────────────┐
│ Buffer Pool: │
│ Page 42: Modified by committed Txn 100 (not on disk!) │
│ Page 67: Modified by uncommitted Txn 101 │
│ │
│ WAL: │
│ Txn 100: COMMIT record written and fsynced │
│ Txn 101: INSERT record written, no COMMIT │
│ Txn 102: Several records, no COMMIT │
│ │
│ Disk: │
│ Page 42: Contains old data (before Txn 100) │
│ Page 67: Contains old data (before Txn 101) │
└─────────────────────────────────────────────────────────────┘
After restart, we need:
- Txn 100 changes: RESTORED (it committed!)
- Txn 101 changes: UNDONE (never committed)
- Txn 102 changes: UNDONE (never committed)
13.2 The ARIES Recovery Algorithm
Most relational databases base their recovery on ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) or a variant of it:
ARIES RECOVERY PHASES
┌─────────────────────────────────────────────────────────────┐
│ 1. ANALYSIS │
│ Scan WAL from last checkpoint │
│ Determine: │
│ - Which transactions were active at crash │
│ - Which pages might be dirty (need redo) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. REDO │
│ Replay WAL forward from oldest dirty page │
│ Reapply ALL changes (committed or not) │
│ Restore database to exact pre-crash state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. UNDO │
│ Roll back uncommitted transactions │
│ Process in reverse order │
│ Log compensation records for idempotence │
└─────────────────────────────────────────────────────────────┘
Phase 1: Analysis
Scan the WAL starting from the last checkpoint:
Checkpoint Record (LSN 1000):
- Active transactions: {Txn 100, Txn 101}
- Dirty pages: {Page 42, Page 67}
Continue scanning WAL:
- LSN 1001: Txn 100 INSERT page 42
- LSN 1002: Txn 100 COMMIT ← Txn 100 is done!
- LSN 1003: Txn 101 UPDATE page 67
- LSN 1004: Txn 102 BEGIN
- LSN 1005: Txn 102 INSERT page 89
[CRASH]
Analysis results:
- Winner (committed): Txn 100
- Losers (uncommitted): Txn 101, Txn 102
- Dirty pages to check: 42, 67, 89
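To make the bookkeeping concrete, here is a minimal Python sketch of the analysis pass. It assumes a simplified record format (dicts with lsn, type, txn, and page keys) and a checkpoint record that carries the active-transaction and dirty-page snapshots; the names are illustrative, not any real engine's format.
def analysis(wal_records, checkpoint):
    """Rebuild the transaction table and dirty page table from the WAL."""
    active_txns = set(checkpoint["active_txns"])    # transactions running at the checkpoint
    dirty_pages = dict(checkpoint["dirty_pages"])   # page_id -> recLSN (first dirtying LSN)
    for rec in wal_records:                         # records after the checkpoint, in LSN order
        if rec["type"] == "BEGIN":
            active_txns.add(rec["txn"])
        elif rec["type"] in ("COMMIT", "ABORT"):
            active_txns.discard(rec["txn"])         # finished: a winner (or already rolled back)
        else:                                       # a "DATA" record describing a page change
            active_txns.add(rec["txn"])
            dirty_pages.setdefault(rec["page"], rec["lsn"])  # page may be dirty from this LSN on
    losers = active_txns                            # still unfinished at the crash
    redo_start = min(dirty_pages.values(), default=None)    # where the redo pass must begin
    return losers, dirty_pages, redo_start
Run over the example above, this yields losers {101, 102} and dirty pages {42, 67, 89}, matching the analysis results.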
Phase 2: Redo
Replay all logged changes to restore pre-crash state:
For each WAL record from oldest dirty page LSN:
If page_lsn < record_lsn: (change not yet applied)
Reapply the change
Set page_lsn = record_lsn
This handles:
- Committed changes not yet on disk (REDO them)
- Uncommitted changes not yet on disk (REDO them too!)
Wait, why redo uncommitted changes?
→ Restores exact pre-crash state
→ Then undo phase can cleanly roll back
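A hedged sketch of that loop, continuing the same simplified format and assuming each in-memory page tracks the LSN of the last change applied to it (its pageLSN):
from dataclasses import dataclass

@dataclass
class Page:
    lsn: int = 0          # LSN of the last change applied to this page (pageLSN)
    data: bytes = b""

def redo(wal_records, dirty_pages, pages, redo_start):
    """Replay every logged data change, committed or not, in LSN order."""
    for rec in wal_records:
        if rec["type"] != "DATA" or rec["lsn"] < redo_start:
            continue
        if rec["page"] not in dirty_pages:
            continue                          # page was already clean on disk at the crash
        page = pages[rec["page"]]
        if page.lsn < rec["lsn"]:             # change not yet reflected on the page
            page.data = rec["after_image"]    # reapply it exactly as logged
            page.lsn = rec["lsn"]
        # else: the page already holds this change; skipping keeps replay idempotent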
Phase 3: Undo
Roll back uncommitted transactions:
Loser transactions: {Txn 101, Txn 102}
Process in reverse LSN order:
- LSN 1005: Txn 102 INSERT → UNDO: Delete the row
- LSN 1003: Txn 101 UPDATE → UNDO: Restore old value
For each undo, write Compensation Log Record (CLR):
- CLR for LSN 1005: "Deleted row inserted by 1005"
- CLR for LSN 1003: "Restored value changed by 1003"
CLRs ensure idempotence:
If crash during undo, redo phase replays CLRs,
then undo skips already-undone records
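A sketch of the undo pass in the same style. Real ARIES walks each loser's records backwards through per-record prevLSN/undoNextLSN pointers; this version flattens that into one list sorted by LSN, which is enough to show the role of CLRs:
def undo(wal_records, losers, pages, append_to_wal):
    """Roll back loser transactions in reverse LSN order, logging a CLR per undone change."""
    to_undo = sorted((r for r in wal_records
                      if r["txn"] in losers and r["type"] == "DATA"),
                     key=lambda r: r["lsn"], reverse=True)
    for rec in to_undo:
        page = pages[rec["page"]]
        page.data = rec["before_image"]               # restore the pre-change value
        clr_lsn = append_to_wal({                     # compensation log record
            "type": "CLR", "txn": rec["txn"], "page": rec["page"],
            "undoes": rec["lsn"],                     # which record this CLR compensates
            "after_image": rec["before_image"]})      # CLRs are redo-only: never undone
        page.lsn = clr_lsn
    for txn in losers:
        append_to_wal({"type": "ABORT", "txn": txn})  # mark each loser as finished
If the system crashes during undo, the next recovery's redo pass replays the CLRs, and the "undoes" pointers tell the new undo pass how far the previous one got.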
13.3 Checkpoints
Checkpoints limit how much WAL must be replayed:
CHECKPOINT OPERATION
1. Write CHECKPOINT BEGIN to WAL
2. Note active transactions and dirty pages
3. Flush dirty pages to disk (spread over time)
4. Write CHECKPOINT END to WAL with:
- Active transaction list
- Dirty page list
- Oldest LSN needed for recovery
Recovery starts from checkpoint, not beginning of time
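A simplified, synchronous sketch of that sequence (a real fuzzy checkpoint hands the flushing to the background writer, as described next):
import time

def take_checkpoint(append_to_wal, active_txns, dirty_pages, flush_page):
    """Write BEGIN/END checkpoint records around a flush of the dirty pages."""
    append_to_wal({"type": "CHECKPOINT_BEGIN", "time": time.time()})
    snapshot_txns = sorted(active_txns)            # who was running when we started
    snapshot_dirty = dict(dirty_pages)             # page_id -> recLSN at checkpoint start
    for page_id in snapshot_dirty:                 # in practice, spread over time
        flush_page(page_id)
    append_to_wal({
        "type": "CHECKPOINT_END",
        "active_txns": snapshot_txns,
        "dirty_pages": snapshot_dirty,
        "redo_start": min(snapshot_dirty.values(), default=None),  # oldest LSN recovery needs
    })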
Fuzzy Checkpoints
Don’t stop the world to checkpoint:
Fuzzy Checkpoint:
- Mark checkpoint start
- Continue normal operations
- Background writer flushes dirty pages
- Eventually every page dirtied before the checkpoint started has been flushed
- Mark checkpoint complete
Database never stops serving requests
Checkpoint Frequency Trade-offs
Frequent checkpoints:
+ Faster recovery (less WAL to replay)
- More I/O during normal operation
- Potential performance spikes
Infrequent checkpoints:
+ Less I/O during normal operation
+ Smoother performance
- Longer recovery time
- More WAL space needed
# PostgreSQL checkpoint tuning (postgresql.conf)
checkpoint_timeout = 5min              # Max time between checkpoints
checkpoint_completion_target = 0.9     # Spread I/O over 90% of the interval
13.4 WAL Record Types
Different record types serve different purposes:
Data Records:
- INSERT: New row data
- UPDATE: Before and after images
- DELETE: Deleted row data
Transaction Records:
- BEGIN: Transaction started
- COMMIT: Transaction committed
- ABORT: Transaction rolled back
- PREPARE: For two-phase commit
Control Records:
- CHECKPOINT: Checkpoint marker
- CLR: Compensation log record (undo result)
DDL Records:
- CREATE/DROP TABLE
- CREATE/DROP INDEX
- ALTER TABLE
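One way to picture these variants is as a single tagged record structure; the fields below are illustrative rather than any engine's on-disk layout:
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class RecordType(Enum):
    # Data records
    INSERT = auto()
    UPDATE = auto()
    DELETE = auto()
    # Transaction records
    BEGIN = auto()
    COMMIT = auto()
    ABORT = auto()
    PREPARE = auto()
    # Control records
    CHECKPOINT = auto()
    CLR = auto()
    # DDL records
    DDL = auto()

@dataclass
class WalRecord:
    lsn: int                        # position of this record in the log
    rec_type: RecordType
    txn_id: Optional[int] = None    # owning transaction, if any
    page_id: Optional[int] = None   # affected page, for data records
    before_image: bytes = b""       # old value, needed for undo
    after_image: bytes = b""        # new value, needed for redo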
13.5 Logical vs Physical Logging
Physical Logging
Record exact bytes changed:
Page 42, offset 1024:
Old bytes: 00 00 03 E8
New bytes: 00 00 03 84
Pros: Simple, exact replay
Cons: Large logs; doesn't survive page reorganization
Logical Logging
Record the operation:
UPDATE accounts SET balance = 900 WHERE id = 1
Pros: Compact, works across page changes
Cons: Must be deterministic; harder to implement
Physiological Logging
Most databases use a hybrid:
Page 42: UPDATE row in slot 3, set column 2 to 900
- Physical: References specific page
- Logical: Describes row-level operation
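The difference is easiest to see by writing the same balance update (1000 → 900 on account 1) in all three styles; the structures below are purely illustrative:
physical_record = {                 # exact bytes at an exact location
    "page": 42, "offset": 1024,
    "old_bytes": bytes.fromhex("000003e8"),   # 1000
    "new_bytes": bytes.fromhex("00000384"),   # 900
}

logical_record = {                  # the operation itself; replay must be deterministic
    "sql": "UPDATE accounts SET balance = 900 WHERE id = 1",
}

physiological_record = {            # physical to a page, logical within it
    "page": 42, "slot": 3,
    "column": 2, "old_value": 1000, "new_value": 900,
}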
13.6 Torn Page Protection
What if a crash occurs mid-page-write?
8KB page write:
First 4KB written → [CRASH] → Last 4KB not written
Page is now "torn" - partially old, partially new
Checksum will fail, but how to recover?
Solution 1: Full Page Writes
Write complete page to WAL on first modification after checkpoint:
After checkpoint, first write to page 42:
1. Write full 8KB page image to WAL
2. Write actual change to WAL
3. Modify page in buffer pool
On recovery:
- If page is torn, restore from WAL full page image
- Then apply subsequent changes
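A sketch of how the redo path might use those images, with CRC-32 standing in for the engine's real page checksum:
import zlib

def repair_torn_page(page_id, disk_page, stored_crc, wal_records, write_page):
    """If the on-disk page is torn (checksum mismatch), restore it from the full-page
    image logged at its first modification after the checkpoint; normal redo then
    replays any later changes on top."""
    if zlib.crc32(disk_page) == stored_crc:
        return                                            # page is intact; nothing to do
    image = next((r["image"] for r in wal_records
                  if r["type"] == "FULL_PAGE_IMAGE" and r["page"] == page_id), None)
    if image is not None:
        write_page(page_id, image)                        # overwrite the torn page whole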
Solution 2: Double-Write Buffer (InnoDB)
1. Write dirty pages to double-write buffer (sequential)
2. fsync double-write buffer
3. Write pages to actual locations (random)
On recovery:
- If page is torn, restore from double-write buffer
- Then apply redo log
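A sketch of the corresponding recovery check, with the double-write area modeled as a simple page_id → copy mapping (the real layout is InnoDB-specific):
def recover_from_doublewrite(doublewrite_pages, read_page, page_checksum_ok, write_page):
    """Every page in the double-write area was fsynced before its final write, so any
    page torn at its real location can be overwritten from the intact copy."""
    for page_id, copy in doublewrite_pages.items():
        if not page_checksum_ok(copy):
            continue                          # the copy itself is torn; the final write never began
        if not page_checksum_ok(read_page(page_id)):
            write_page(page_id, copy)         # real location torn mid-write: restore the copy
    # The redo log is then applied on top of the repaired pages.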
13.7 Crash Recovery Scenarios
Scenario 1: Clean Shutdown
CHECKPOINT completes
All dirty pages flushed
WAL ends with CHECKPOINT record
Recovery: Check WAL, confirm clean shutdown, done!
(Almost instant)
Scenario 2: Crash After Commit, Before Flush
Txn 100: INSERT, UPDATE, COMMIT (in WAL)
Dirty pages still in buffer pool
[CRASH]
Recovery:
1. Analysis: Txn 100 committed
2. Redo: Replay Txn 100 changes to disk
3. Undo: Nothing to undo
→ Data preserved!
Scenario 3: Crash During Transaction
Txn 101: INSERT, UPDATE (in WAL)
No COMMIT record
[CRASH]
Recovery:
1. Analysis: Txn 101 uncommitted (loser)
2. Redo: Replay Txn 101 changes (restore state)
3. Undo: Roll back Txn 101 changes
→ Transaction never happened!
Scenario 4: Crash During Recovery
Recovering from crash #1
In undo phase, rolling back Txn 101
[CRASH #2]
Recovery from crash #2:
1. Analysis: See CLRs from previous recovery
2. Redo: Replay CLRs
3. Undo: Continue where we left off (CLRs mark progress)
→ Eventually completes!
13.8 Point-in-Time Recovery (PITR)
Restore to any point in time:
PITR ARCHITECTURE
Base Backup (Sunday)
┌─────────────────────────────────────────────────────────────┐
│ pg_basebackup or similar │
│ Contains: All data files as of Sunday 00:00 │
└─────────────────────────────────────────────────────────────┘
+
WAL Archive (Sunday → Friday)
┌─────────────────────────────────────────────────────────────┐
│ Every WAL segment archived │
│ Contains: All changes since Sunday │
└─────────────────────────────────────────────────────────────┘
=
Restore to Wednesday 14:30
┌─────────────────────────────────────────────────────────────┐
│ 1. Restore base backup │
│ 2. Replay WAL up to Wednesday 14:30 │
│ 3. Stop recovery, open database │
└─────────────────────────────────────────────────────────────┘
Use Cases
- Recover from accidental DELETE
- Restore before corrupting UPDATE
- Create database clone at specific time
- Test “what if” scenarios
PostgreSQL PITR
# Take base backup
pg_basebackup -D /backup/base -Fp -Xs -P
# Configure WAL archiving
# postgresql.conf:
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
# Restore to a point in time
# postgresql.conf on the restore server (PostgreSQL 12+;
# also create an empty recovery.signal file in the data directory):
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 14:30:00'
13.9 Backup Strategies
Full Backup
Copy everything:
Pros: Simple restore (just copy back)
Cons: Large, slow to create
Frequency: Weekly or less
Incremental Backup
Copy only changed pages since last backup:
Pros: Fast, small
Cons: Restore requires all increments
Frequency: Daily
Restore: Full + Inc1 + Inc2 + Inc3 + ...
Continuous Archiving (WAL)
Archive every WAL segment:
Pros: Point-in-time recovery
Cons: Requires base backup + all WAL
Frequency: Continuous
Backup Strategy Example
Sunday 00:00: Full backup
Mon-Sat 00:00: Incremental backup
Continuous: WAL archiving
Restore options:
- Sunday data: Full backup
- Wednesday data: Full + Mon + Tue + Wed incrementals
- Wednesday 14:30: Full + incrementals + WAL to 14:30
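The restore-planning logic is mechanical enough to sketch; the backup objects here are plain dicts with a 'time' key, purely for illustration:
from datetime import datetime

def restore_plan(full_backups, incrementals, wal_segments, target_time: datetime):
    """Pick the newest full backup before the target, every incremental between it and
    the target, and the WAL needed to roll forward to the exact timestamp."""
    base = max((b for b in full_backups if b["time"] <= target_time),
               key=lambda b: b["time"])
    incs = sorted((i for i in incrementals if base["time"] < i["time"] <= target_time),
                  key=lambda i: i["time"])
    roll_forward_from = incs[-1]["time"] if incs else base["time"]
    wal = [w for w in wal_segments if roll_forward_from <= w["time"] <= target_time]
    return base, incs, wal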
13.10 Testing Recovery
If you haven’t tested recovery, you don’t have backups.
What to Test
1. Full restore from backup
2. Point-in-time recovery to specific timestamp
3. Recovery from simulated crash
4. Recovery time (is it acceptable?)
5. Incremental restore chain
6. Restore to different hardware
Automated Testing
#!/bin/bash
set -euo pipefail   # stop on the first failure so a broken restore is noticed
# Monthly recovery test
# 1. Restore latest backup into a fresh test database
dropdb --if-exists test_restore
createdb test_restore
pg_restore -d test_restore /backup/latest.dump
# 2. Verify data integrity
psql test_restore -c "SELECT count(*) FROM users"
# 3. Run application smoke tests
./run_smoke_tests.sh test_restore
# 4. Record recovery time ($SECONDS is bash's built-in elapsed-time counter)
echo "Recovery completed in $SECONDS seconds"
13.11 High Availability and Failover
Beyond crash recovery: survive hardware failures.
Streaming Replication
PRIMARY → STANDBY
Primary Standby
┌───────────────────┐ ┌───────────────────┐
│ Active database │ │ Hot standby │
│ │ WAL │ (read-only) │
│ ┌─────────────┐ │ ──────► │ ┌─────────────┐ │
│ │ WAL │ │ stream │ │ WAL │ │
│ └─────────────┘ │ │ └─────────────┘ │
└───────────────────┘ └───────────────────┘
If primary fails:
1. Promote standby to primary
2. Redirect clients to new primary
Synchronous vs Asynchronous
Asynchronous:
- Primary doesn't wait for standby
- Possible data loss on failover (standby behind)
- Lower latency
Synchronous:
- Primary waits for standby acknowledgment
- Zero data loss on failover
- Higher latency
Failover Process
1. Detect primary failure
- Heartbeat timeout
- Failed health check
2. Verify primary is really down
- Avoid split-brain!
- Use fencing/STONITH if needed
3. Promote standby
- pg_ctl promote
- or trigger file
4. Redirect clients
- Update DNS/load balancer
- Connection pooler reconfiguration
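A hedged sketch of a failover controller tying these steps together; all four callables are placeholders for environment-specific actions (health check, fencing, promotion, client re-routing):
import time

def monitor_and_failover(primary_healthy, fence_primary, promote_standby,
                         update_routing, checks=3, interval=5):
    """Require several consecutive failed health checks, fence the old primary
    before promoting, then repoint clients at the new primary."""
    failures = 0
    while True:
        failures = 0 if primary_healthy() else failures + 1
        if failures >= checks:                  # confident the primary is really down
            fence_primary()                     # avoid split-brain (STONITH, drop its VIP, ...)
            promote_standby()                   # e.g. pg_ctl promote on the standby
            update_routing()                    # DNS / load balancer / pooler reconfiguration
            return
        time.sleep(interval)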
13.12 Corruption Detection and Handling
Checksum Validation
-- PostgreSQL: enable checksums at initdb (--data-checksums) or later with pg_checksums
-- Detects: disk corruption, memory errors, torn pages
-- Check for corruption (PostgreSQL 12+)
SELECT * FROM pg_stat_database WHERE checksum_failures > 0;
When Corruption Is Detected
Options:
1. Restore from backup (safest)
2. Use pg_resetwal to skip corrupted records (data loss!)
3. Attempt to read good data, skip bad pages
Prevention:
- Enable checksums
- Use ECC RAM
- Use reliable storage
- Regular backups
- Monitor for errors
13.13 Recovery Time Objectives
RTO (Recovery Time Objective)
How long can you be down?
RTO = 1 hour:
- Crash recovery acceptable
- Maybe PITR restore acceptable
- Single server might be OK
RTO = 1 minute:
- Need hot standby
- Automatic failover
- Pre-warmed cache
RTO = 0:
- Multi-master or distributed database
- No single point of failure
RPO (Recovery Point Objective)
How much data can you lose?
RPO = 1 hour:
- Hourly backups sufficient
- Some data loss acceptable
RPO = 1 minute:
- Continuous WAL archiving
- Frequent checkpoints
RPO = 0:
- Synchronous replication
- Every commit acknowledged by standby
13.14 Summary
Database recovery ensures durability and consistency:
- ARIES provides industry-standard recovery
- Three phases: Analysis, Redo, Undo
- Checkpoints bound recovery time
- Full page writes handle torn pages
- PITR enables point-in-time recovery
- Replication provides high availability
- Test your backups or they’re worthless
Recovery is the foundation of database reliability—everything else depends on it working correctly.
What’s Next
In Chapter 14, we’ll compare column stores and row stores—different storage architectures optimized for different workloads.
“The best time to test your recovery process was before you needed it. The second best time is now.”