Chapter 7: Write-Ahead Logging (WAL)
“First, log it. Then, do it. Always in that order.”
What happens when your database server crashes mid-transaction? Without careful engineering, you could lose committed data or end up with corrupted state. Write-Ahead Logging (WAL) is the fundamental technique that makes databases durable and recoverable.
7.1 The Durability Problem
Consider this scenario:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
Without WAL, the database might:
- Modify Account 1’s page in memory, and have the buffer pool flush that page to disk (for example, to evict it)
- CRASH before Account 2’s page is modified or written
- After restart: Account 1 lost $100, Account 2 gained nothing!
Money has vanished. The transaction was supposed to be atomic.
THE DURABILITY PROBLEM
Transaction commits
│
▼
┌──────────────────────────────────────────────────────┐
│ Buffer Pool (Memory) │
│ ┌────────────┐ ┌────────────┐ │
│ │ Account 1 │ │ Account 2 │ ← Changes are here │
│ │ $900 │ │ $1100 │ │
│ │ (dirty) │ │ (dirty) │ │
│ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
│
│ Not yet written to disk!
▼
[POWER FAILURE]
│
▼
┌──────────────────────────────────────────────────────┐
│ Disk (Persistent Storage) │
│ ┌────────────┐ ┌────────────┐ │
│ │ Account 1 │ │ Account 2 │ ← Old values! │
│ │ $1000 │ │ $1000 │ │
│ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────┘
Data loss!
7.2 The WAL Protocol
Write-Ahead Logging solves this with a simple rule:
Before any change is written to a data page on disk, the log record describing that change must first be written to the WAL on disk.
THE WAL PROTOCOL
1. FIRST: Write to WAL (log the intent)
┌──────────────────────────────────────────────────────┐
│ WAL (on disk, append-only) │
│ [LSN 1001: UPDATE accounts SET balance=900 WHERE id=1]│
│ [LSN 1002: UPDATE accounts SET balance=1100 WHERE id=2]│
│ [LSN 1003: COMMIT Transaction 42] │
└──────────────────────────────────────────────────────┘
│
│ fsync() - Force to disk!
▼
2. THEN: Return success to client
│
▼
3. LATER: Write data pages to disk (background)
┌──────────────────────────────────────────────────────┐
│ Data Files (on disk) │
│ Account 1: $900 │
│ Account 2: $1100 │
└──────────────────────────────────────────────────────┘
Key insight: WAL writes are sequential and append-only—fast even on spinning disks. Data page writes can be deferred and batched.
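You can watch the protocol from SQL. In PostgreSQL, pg_current_wal_insert_lsn() reports how far the WAL has advanced; a committed transaction moves it forward before the client ever sees success. A minimal sketch using the running accounts example (the LSN values shown are illustrative):
-- WAL position before the transaction
SELECT pg_current_wal_insert_lsn();   -- e.g. 0/16B3D80

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;                                -- returns only after the WAL flush

-- WAL position afterwards: a higher LSN
SELECT pg_current_wal_insert_lsn();   -- e.g. 0/16B3F08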
7.3 WAL Record Structure
Each WAL record contains enough information to redo (or undo) a change:
WAL RECORD FORMAT
┌─────────────────────────────────────────────────────────────┐
│ LSN (Log Sequence Number): Unique identifier for this record│
├─────────────────────────────────────────────────────────────┤
│ Transaction ID: Which transaction made this change │
├─────────────────────────────────────────────────────────────┤
│ Previous LSN: For chaining records in same transaction │
├─────────────────────────────────────────────────────────────┤
│ Page ID: Which page is affected │
├─────────────────────────────────────────────────────────────┤
│ Record Type: INSERT, UPDATE, DELETE, COMMIT, ABORT, etc. │
├─────────────────────────────────────────────────────────────┤
│ Before Image: Original data (for UNDO) │
├─────────────────────────────────────────────────────────────┤
│ After Image: New data (for REDO) │
├─────────────────────────────────────────────────────────────┤
│ Checksum: Data integrity verification │
└─────────────────────────────────────────────────────────────┘
LSN (Log Sequence Number)
Every WAL record has a unique, monotonically increasing LSN:
LSN 1001: BEGIN Transaction 42
LSN 1002: UPDATE on page 7
LSN 1003: UPDATE on page 12
LSN 1004: COMMIT Transaction 42
LSN 1005: BEGIN Transaction 43
...
Pages track which LSN they were last modified at:
Page 7 header: page_lsn = 1002
Page 12 header: page_lsn = 1003
This enables recovery to know which changes have been applied.
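In PostgreSQL you can inspect both sides of this relationship, assuming the pageinspect and pg_walinspect contrib extensions are installed (pg_walinspect requires version 15 or later); the table name comes from the running example:
-- The LSN stamped in a page header (pageinspect extension)
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lsn FROM page_header(get_raw_page('accounts', 0));

-- The WAL records themselves, from the last checkpoint's redo point onward
CREATE EXTENSION IF NOT EXISTS pg_walinspect;
SELECT *
FROM pg_get_wal_records_info(
  (SELECT redo_lsn FROM pg_control_checkpoint()),
  pg_current_wal_lsn()
)
LIMIT 5;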
7.4 WAL and Checkpoints
Writing every change to WAL is great for durability, but:
- WAL files grow forever
- Recovery would replay from the beginning of time
Checkpoints solve this:
CHECKPOINT PROCESS
Before checkpoint:
┌─────────────────────────────────────────────────────────────┐
│ Buffer Pool: dirty pages scattered across time │
│ [Page 3: LSN 950] [Page 7: LSN 1002] [Page 12: LSN 1003] │
└─────────────────────────────────────────────────────────────┘
Checkpoint (LSN 1100):
1. Write all dirty pages to disk
2. Write checkpoint record to WAL
3. Record: "All changes up to LSN 1100 are on disk"
After checkpoint:
- WAL records before LSN 1100 can be deleted
- Recovery only needs to replay from LSN 1100
┌─────────────────────────────────────────────────────────────┐
│ WAL: │
│ [OLD - can delete] │ CHECKPOINT │ [NEEDED for recovery] │
│ LSN < 1100 │ LSN 1100 │ LSN > 1100 │
└─────────────────────────────────────────────────────────────┘
Checkpoint Trade-offs
Frequent checkpoints:
- Faster recovery (less WAL to replay)
- More I/O during normal operation
- Can cause latency spikes
Infrequent checkpoints:
- Longer recovery time
- Less normal I/O
- Risk of running out of WAL space
# PostgreSQL checkpoint configuration (postgresql.conf)
checkpoint_timeout = 5min              # Max time between checkpoints
checkpoint_completion_target = 0.9     # Spread writes over 90% of the interval
max_wal_size = 1GB                     # Trigger a checkpoint once this much WAL accumulates
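You can also force a checkpoint by hand and see how often they happen. The counters below live in pg_stat_bgwriter on versions before PostgreSQL 17 (they moved to pg_stat_checkpointer in 17); a quick check:
-- Force an immediate checkpoint (requires superuser or the pg_checkpoint role)
CHECKPOINT;

-- How many checkpoints were time-driven vs. requested (e.g. by WAL volume)
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;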
7.5 ARIES: The Industry Standard
Most modern databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM in the early 1990s. ARIES rests on three key principles:
1. Write-Ahead Logging
Log before data. We’ve covered this.
2. Repeating History During Redo
On recovery, replay ALL changes (even for aborted transactions) to restore the exact pre-crash state, then undo incomplete transactions.
3. Logging Changes During Undo
When rolling back a transaction, log the undo operations too. This ensures idempotent recovery.
ARIES RECOVERY PHASES
┌───────────────────────────────────────────────────────────┐
│ PHASE 1: ANALYSIS │
│ - Scan WAL from last checkpoint │
│ - Build dirty page table (pages that might need redo) │
│ - Build active transaction table (incomplete transactions)│
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ PHASE 2: REDO │
│ - Scan forward from oldest dirty page LSN │
│ - Reapply all changes (committed or not) │
│ - Restore database to exact pre-crash state │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ PHASE 3: UNDO │
│ - Roll back uncommitted transactions │
│ - Process in reverse LSN order │
│ - Log compensation records (CLRs) │
└───────────────────────────────────────────────────────────┘
Compensation Log Records (CLRs)
When undoing a change, we log the undo:
Original: LSN 1002 - UPDATE balance from 1000 to 900
Undo: LSN 1050 - CLR: UPDATE balance from 900 to 1000
(undoes LSN 1002)
If crash during undo:
- Redo phase reapplies CLR 1050
- Undo phase sees CLR, skips LSN 1002
- No infinite undo loop!
7.6 Physical vs Logical Logging
Physical Logging
Log exact bytes changed:
Page 7, offset 1024, old bytes: [00 00 03 E8], new bytes: [00 00 03 84]
Pros: Simple, exact replay
Cons: Large logs, tied to physical layout
Logical Logging
Log the operation:
UPDATE accounts SET balance = 900 WHERE id = 1
Pros: Smaller logs, replication-friendly
Cons: Must handle schema changes, non-determinism
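PostgreSQL's logical decoding exposes exactly this kind of logical view of the WAL stream. A small sketch, assuming wal_level = logical and the built-in test_decoding output plugin, again using the example accounts table:
-- Create a temporary slot that decodes WAL into logical changes
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

UPDATE accounts SET balance = 900 WHERE id = 1;

-- Changes rendered as row-level operations rather than page deltas
SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

SELECT pg_drop_replication_slot('demo_slot');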
Physiological Logging (Most Common)
Hybrid: Physical page reference, logical within page:
Page 7: UPDATE row in slot 3, set column 2 to 900
This is what PostgreSQL and most databases use:
- Page-level addressing (physical)
- Row-level operations (logical)
7.7 Full Page Writes
What if a crash happens mid-page-write, leaving a torn page?
8KB page write in progress:
┌────────────────────────────────────────────────────┐
│ First 4KB written (new data) │
│────────────────────────────────────────────────────│
│ Last 4KB NOT written (old data) │ ← Crash here!
└────────────────────────────────────────────────────┘
Page is now corrupted: Half old, half new
Solution: Full Page Writes (FPW)
After each checkpoint, the first modification to any page writes the entire page to WAL:
Checkpoint at LSN 1000
First modification to Page 7 after checkpoint:
WAL record LSN 1050:
┌─────────────────────────────────────────────────────────────┐
│ Type: FULL PAGE IMAGE │
│ Page: 7 │
│ Content: [entire 8KB page content] │
│ Plus: The actual change │
└─────────────────────────────────────────────────────────────┘
Subsequent modifications to Page 7 before next checkpoint:
Only log the delta (normal small records)
On recovery, if page is torn, restore from full page image in WAL.
Trade-off: Full page writes increase WAL volume significantly but ensure recoverability.
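In PostgreSQL this behavior is controlled by the full_page_writes setting (on by default). You can gauge how much of your WAL volume comes from full-page images via pg_stat_wal, available since version 14:
-- wal_fpi counts full-page images written since the stats were last reset
SELECT wal_records, wal_fpi, pg_size_pretty(wal_bytes) AS total_wal
FROM pg_stat_wal;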
7.8 WAL Archiving and PITR
WAL enables Point-in-Time Recovery (PITR):
POINT-IN-TIME RECOVERY
┌─────────────────────────────────────────────────────────────┐
│ Full Backup (Sunday midnight) │
│ [Snapshot of all data files] │
└─────────────────────────────────────────────────────────────┘
│
│ Apply WAL...
▼
┌─────────────────────────────────────────────────────────────┐
│ WAL Segments (Sunday → Today) │
│ [Mon] → [Tue] → [Wed] → [Thu] → [Fri] → [Sat] │
└─────────────────────────────────────────────────────────────┘
│
│ Stop at specific point
▼
┌─────────────────────────────────────────────────────────────┐
│ "Restore database as of Thursday 3:47:23 PM" │
└─────────────────────────────────────────────────────────────┘
Setting Up WAL Archiving (PostgreSQL)
# postgresql.conf
wal_level = replica # or 'logical'
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
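Once archiving is enabled, the pg_stat_archiver view shows whether segments are actually reaching the archive; a failing archive_command otherwise lets WAL accumulate silently:
-- Verify that WAL segments are being archived successfully
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal
FROM pg_stat_archiver;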
Performing PITR
# 1. Restore the base backup into the empty data directory
#    (a file-level copy, e.g. taken with pg_basebackup; pg_restore is only for logical dumps)
cp -r /backup/base_backup/* $PGDATA/
# 2. Configure recovery
#    recovery.conf before v12; in v12+ set these in postgresql.conf
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2024-03-15 15:47:23'
#    ...and, in v12+, create an empty recovery.signal file in the data directory
touch $PGDATA/recovery.signal
# 3. Start server - it will replay WAL up to the target time
7.9 WAL and Replication
WAL is the foundation of database replication:
STREAMING REPLICATION
Primary Server Standby Server
┌───────────────────┐ ┌───────────────────┐
│ Transactions │ │ │
│ │ │ │ │
│ ▼ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ WAL │───────│─ Stream ────│►│ WAL │ │
│ └─────────┘ │ WAL │ └─────────┘ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Data │ │ │ │ Data │ │
│ └─────────┘ │ │ └─────────┘ │
└───────────────────┘ └───────────────────┘
Standby applies WAL records to stay in sync
Synchronous vs Asynchronous Replication
Asynchronous (default):
- Primary doesn’t wait for standby
- Possible data loss on failover
- Better performance
Synchronous:
- Primary waits for standby acknowledgment
- No data loss on failover
- Higher latency
# PostgreSQL synchronous replication (postgresql.conf on the primary)
synchronous_standby_names = 'standby1'

# Commit waits for standby confirmation
synchronous_commit = remote_apply      # or: on, remote_write
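On the primary, pg_stat_replication shows each standby's progress; the difference between the current WAL position and the standby's replay position is its lag in bytes:
-- Replication progress per standby, as seen from the primary
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;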
7.10 WAL Performance Considerations
WAL Write Latency
Every commit requires:
- Write WAL record(s) to buffer
- fsync() the WAL to disk
- Return success to the client
The fsync() is the bottleneck:
HDD: fsync ≈ 5-10ms → ~100-200 commits/second
SSD: fsync ≈ 0.1-1ms → ~1,000-10,000 commits/second
NVMe: fsync ≈ 20-50μs → ~20,000-50,000 commits/second
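If some workloads can tolerate losing the last few moments of committed transactions after a crash (the data files are never corrupted, only the most recent commits may be missing), PostgreSQL lets you skip the commit-time flush per session or per transaction:
-- Trade durability of the most recent commits for lower commit latency
SET synchronous_commit = off;   -- can also be set globally in postgresql.conf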
Group Commit
Batch multiple transactions’ fsync into one:
Without group commit:
T1: write → fsync (5ms)
T2: write → fsync (5ms)
T3: write → fsync (5ms)
Total: 15ms for 3 transactions
With group commit:
T1: write → wait
T2: write → wait
T3: write → wait
All: single fsync (5ms)
Total: 5ms for 3 transactions
# PostgreSQL group commit tuning (postgresql.conf)
commit_delay = 10        # Wait 10μs for more commits to batch
commit_siblings = 5      # Only delay if at least 5 other transactions are active
WAL Compression
Compress WAL to reduce I/O:
# postgresql.conf, PostgreSQL 15+
wal_compression = lz4        # or: pglz, zstd; 'on' means pglz
This can substantially reduce WAL volume, since full-page images compress especially well, at a modest CPU cost; the exact savings depend on the workload.
7.11 WAL Segment Management
WAL is stored in fixed-size segment files:
PostgreSQL WAL directory:
pg_wal/
├── 000000010000000000000001 (16MB segment)
├── 000000010000000000000002 (16MB segment)
├── 000000010000000000000003 (16MB segment)
└── ...
Filename encodes: timeline + segment number
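You can map the current WAL position to its segment file name directly from SQL:
-- Which segment file holds the current WAL position?
SELECT pg_walfile_name(pg_current_wal_lsn());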
WAL Retention
Old segments are recycled or removed based on:
- wal_keep_size: Minimum WAL to retain
- Replication slots: Keep WAL needed by standbys
- Archive status: Keep until archived
-- Check WAL usage
SELECT * FROM pg_stat_wal;
-- Check replication slot WAL retention
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots;
WAL Bloat Issues
WAL can grow unexpectedly due to:
- Inactive replication slots
- Slow archive commands
- Long-running transactions
-- Find inactive slots consuming WAL
SELECT slot_name, active, pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) as retained
FROM pg_replication_slots;
-- Drop problematic slot
SELECT pg_drop_replication_slot('unused_slot');
7.12 Double-Write Buffer (InnoDB)
MySQL InnoDB uses a different approach to torn pages:
INNODB DOUBLE-WRITE BUFFER
Dirty Page in Buffer Pool
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Write to Double-Write Buffer (sequential, 2MB area) │
│ [Page1][Page2][Page3]... │
│ fsync() │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Write to actual page locations (random I/O) │
└─────────────────────────────────────────────────────────────┘
On crash:
- If torn page detected, restore from double-write buffer
- Then apply redo log
Trade-off: Doubles write I/O for data pages, but provides torn page protection without full page writes in the redo log.
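In MySQL you can confirm the mechanism is enabled (it is by default); the variable name below assumes MySQL 5.7/8.0:
-- Check that InnoDB's doublewrite buffer is active
SHOW VARIABLES LIKE 'innodb_doublewrite';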
7.13 Summary
Write-Ahead Logging is fundamental to database durability:
- WAL rule: Log before data, always
- LSN: Unique identifier for log ordering
- Checkpoints: Bound recovery time, enable WAL recycling
- ARIES: Industry-standard recovery algorithm
- Full page writes: Handle torn pages
- PITR: Restore to any point in time
- Replication: Stream WAL to standbys
- Group commit: Batch fsyncs for performance
Understanding WAL helps you configure durability guarantees and diagnose recovery issues.
What’s Next
In Chapter 8, we’ll explore MVCC (Multi-Version Concurrency Control) and transaction isolation—how databases let multiple users read and write simultaneously without stepping on each other.
“The WAL is your safety net. Everything else is just performance.”