Backup and Recovery
A database you can’t restore is a database you don’t have. Backup is not about the backup — it’s about the restore. The question is not “do we have backups?” but “have we tested restoring from them recently?” and “what’s our actual recovery time in an emergency?”
This chapter covers the full spectrum of Postgres backup strategies: logical dumps with pg_dump, continuous archiving with WAL, point-in-time recovery, and the production-grade tools that make this manageable.
Terminology: RPO and RTO
Two metrics define your backup requirements:
RPO (Recovery Point Objective): How much data can you afford to lose? An RPO of 5 minutes means you can tolerate losing up to 5 minutes of transactions in a catastrophic failure. An RPO of 0 means you need synchronous replication with no data loss.
RTO (Recovery Time Objective): How long can the database be unavailable during recovery? An RTO of 4 hours means you can take up to 4 hours to restore from a backup and catch up. An RTO of 5 minutes requires a hot standby.
Your backup strategy flows from these numbers. A startup with moderate data loss tolerance might accept RPO=24h (daily backups) and RTO=4h. A financial system might require RPO=0 and RTO=minutes. Most production applications land somewhere in between.
pg_dump: Logical Backups
pg_dump is Postgres’s built-in tool for creating logical backups. It connects to the database and exports the schema and data either as plain SQL statements or as an archive (custom, directory, or tar format) that pg_restore can read.
# Dump to SQL file (text format)
pg_dump --dbname=mydb --file=backup.sql
# Dump to custom binary format (recommended: smaller, parallelizable restore)
pg_dump --dbname=mydb --format=custom --file=backup.dump
# Dump with connection string
pg_dump "postgresql://user:pass@host:5432/mydb" --format=custom --file=backup.dump
# Dump only specific tables
pg_dump --dbname=mydb --table=orders --table=users --format=custom --file=partial_backup.dump
# Dump schema only (no data)
pg_dump --schema-only --format=custom --file=schema_only.dump
# Dump data only (no schema)
pg_dump --data-only --format=custom --file=data_only.dump
Restoring with pg_restore
# Restore from custom format
pg_restore --dbname=mydb --jobs=4 backup.dump # parallel restore with 4 workers
# Restore to a new database
createdb mydb_restored
pg_restore --dbname=mydb_restored --jobs=4 backup.dump
# Restore a specific table
pg_restore --dbname=mydb --table=orders backup.dump
# Restore schema only
pg_restore --schema-only --dbname=mydb backup.dump
# From SQL format:
psql --dbname=mydb < backup.sql
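One pg_restore feature worth knowing: custom-format archives carry a table of contents, so you can inspect an archive and cherry-pick what to restore. A sketch (file names are placeholders):

```shell
# List the archive's table of contents
pg_restore --list backup.dump > backup.list

# Edit backup.list, commenting out entries you don't want (lines starting
# with ';' are skipped), then restore only the remaining items:
pg_restore --dbname=mydb --use-list=backup.list backup.dump
```

This is handy when you need one table or one schema out of a full dump without restoring everything.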
pg_dumpall
Dumps the entire PostgreSQL cluster — all databases, global objects (roles, tablespaces):
pg_dumpall --globals-only --file=globals.sql # Just roles/tablespaces
pg_dumpall --file=cluster_backup.sql # Everything
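Note that pg_dumpall produces plain SQL, so its output is restored with psql, not pg_restore:

```shell
# Restore a pg_dumpall file into a fresh cluster; connect to the
# always-present postgres database, since the script creates the others
psql --dbname=postgres --file=cluster_backup.sql
```

A common pattern is to use pg_dumpall only for globals (roles, tablespaces) and pg_dump per database, so each database can be restored independently.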
Limitations of Logical Backups
- Slow: Dumping a large database takes hours; restoring takes even longer (all data must be reloaded and every index rebuilt from scratch)
- Point-in-time is limited: A dump reflects the state at one moment. No PITR without WAL archiving.
- Not suitable for very large databases: For databases > several hundred GB, dump/restore is too slow for realistic RTO
For small databases (<50GB), pg_dump with daily automation is often sufficient. For larger databases or stricter RPO/RTO requirements, WAL archiving is necessary.
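For the small-database case, daily automation can be as simple as a cron job. A minimal sketch, assuming the paths, database name, and 14-day retention shown here (all illustrative):

```shell
#!/bin/bash
# daily_backup.sh - nightly pg_dump with 14-day retention (example values)
set -euo pipefail

BACKUP_DIR=/var/backups/postgres
DB=mydb

mkdir -p "$BACKUP_DIR"

# Custom format: compressed, and restorable in parallel with pg_restore
pg_dump --dbname="$DB" --format=custom \
    --file="$BACKUP_DIR/${DB}_$(date +%Y%m%d).dump"

# Delete dumps older than 14 days
find "$BACKUP_DIR" -name "${DB}_*.dump" -mtime +14 -delete
```

Scheduled with a crontab entry such as `0 2 * * * /usr/local/bin/daily_backup.sh`. Make sure the job's exit status is monitored — a cron job that fails silently is one of the classic ways backups quietly stop happening.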
WAL Archiving: Continuous Backup
Physical backup (PITR) works by:
- Taking a base backup (a copy of the data directory)
- Archiving WAL segments continuously as they’re generated
- At recovery time: restore the base backup, then replay WAL segments up to the desired point in time
This allows recovery to any point in time since the base backup, with granularity down to an individual transaction.
Configuring WAL Archiving
In postgresql.conf:
wal_level = replica # Minimum for archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
# Or to S3:
# archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
# Or with WAL-G:
# archive_command = 'wal-g wal-push %p'
%p is the path to the WAL file, %f is just the filename. The archive command must return 0 on success and non-zero on failure.
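The plain cp shown above is the minimal example; in production it is safer to refuse to overwrite an existing archive file, since a repeated filename usually indicates a misconfiguration (for example, two clusters archiving to the same directory). The Postgres documentation recommends this form:

```
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
```

The `test ! -f` guard makes the command fail (return non-zero) if the file already exists, so Postgres retains the WAL segment instead of silently clobbering the archive.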
Taking a Base Backup
# Simple base backup with pg_basebackup
pg_basebackup \
--host=localhost \
--username=replicator \
--pgdata=/var/lib/postgresql/base_backup \
--format=tar \
--gzip \
--compress=9 \
--wal-method=stream \
--progress
# Or to a custom location:
pg_basebackup -h localhost -U replicator -D /backup/base -Ft -z -Xs -P
--wal-method=stream streams WAL during the backup, ensuring the backup is immediately usable without waiting for WAL segments to be archived.
Point-in-Time Recovery
To recover to a specific point in time:
- Restore the base backup to the data directory
- Configure recovery in postgresql.conf:
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
# Or from S3: 'aws s3 cp s3://my-bucket/wal/%f %p'
recovery_target_time = '2024-06-15 14:30:00+00'
recovery_target_action = 'promote' # Promote to primary after reaching target
- Create recovery.signal in the data directory
- Start Postgres
Postgres will restore WAL segments up to the target time, then promote the instance to read-write mode.
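Put together, a PITR run looks roughly like this. A sketch, assuming a gzipped tar base backup from pg_basebackup and the example paths used above:

```shell
# 1. Stop Postgres and move the damaged data directory aside
pg_ctl stop -D /var/lib/postgresql/data
mv /var/lib/postgresql/data /var/lib/postgresql/data.broken

# 2. Restore the base backup into a fresh data directory
mkdir -p /var/lib/postgresql/data
tar -xzf /var/lib/postgresql/base_backup/base.tar.gz -C /var/lib/postgresql/data

# 3. Set restore_command and recovery_target_time in postgresql.conf,
#    then request recovery mode:
touch /var/lib/postgresql/data/recovery.signal

# 4. Start Postgres; it replays archived WAL up to the target, then promotes
pg_ctl start -D /var/lib/postgresql/data
```

Keeping the broken data directory around (step 1) costs disk space but preserves evidence — you may need it if the recovery target turns out to be wrong.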
Recovery targets:
- recovery_target_time: Recover to a specific timestamp
- recovery_target_lsn: Recover to a specific WAL position (LSN)
- recovery_target_xid: Recover to the end of a specific transaction
- recovery_target_name: Recover to a named restore point (created with pg_create_restore_point())
PITR is invaluable for “logical corruption” disasters — a bad deployment drops the wrong column, a bug deletes the wrong rows. You can recover the database to just before the accident.
pgBackRest: The Production Standard
pgBackRest is an enterprise-grade backup tool for Postgres, widely considered the most complete solution available. It handles:
- Full, differential, and incremental backups
- WAL archiving and PITR
- Parallel backup and restore (dramatically faster than pg_basebackup)
- Backup catalog management
- Encryption and compression
- Remote repositories (S3, GCS, Azure, SFTP)
- Standby backup (don’t stress the primary)
pgBackRest Configuration
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=very_secure_passphrase
# Or S3:
# repo1-type=s3
# repo1-s3-bucket=my-postgres-backups
# repo1-s3-region=us-east-1
# repo1-s3-endpoint=s3.amazonaws.com
[main]
pg1-path=/var/lib/postgresql/data
pg1-user=postgres
PostgreSQL configuration (for archiving):
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
pgBackRest Operations
# Create the stanza (repository initialization)
pgbackrest --stanza=main stanza-create
# Take a full backup
pgbackrest --stanza=main backup --type=full
# Take a differential backup (since last full)
pgbackrest --stanza=main backup --type=diff
# Take an incremental backup (since last backup)
pgbackrest --stanza=main backup --type=incr
# List backups
pgbackrest --stanza=main info
# Restore to latest
pgbackrest --stanza=main restore
# PITR to a specific time
pgbackrest --stanza=main restore --type=time --target="2024-06-15 14:30:00+00"
# Restore to a new path
pgbackrest --stanza=main restore --pg1-path=/var/lib/postgresql/restored
Backup Schedule
A typical schedule:
- Full backup: Weekly (Sunday at 2am)
- Differential backup: Daily at 2am on the other six days
- WAL archiving: Continuous
This gives you the ability to restore to any point in time within the retention window, with a restore time of (base backup restore time) + (WAL replay time since the most recent backup). With daily differentials, the WAL replay window is at most about a day.
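This schedule maps directly onto two crontab entries (stanza name and times follow the example configuration above):

```
# Full backup Sunday at 2am; differential at 2am the other six days
0 2 * * 0   pgbackrest --stanza=main backup --type=full
0 2 * * 1-6 pgbackrest --stanza=main backup --type=diff
```

pgBackRest applies the retention settings (repo1-retention-full and friends) automatically after each backup, expiring old full backups and the WAL that depends on them.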
Barman: Another Production Option
Barman (Backup and Recovery Manager) is another mature backup tool, popular in enterprises. It focuses on remote backup management — Barman runs on a dedicated backup server and fetches backups from Postgres servers.
For teams that want Barman’s centralized management model, it’s an excellent choice. pgBackRest and Barman have roughly equivalent capabilities; the choice is often based on operational preference.
Testing Your Backups
Here is a hard rule: if you haven’t tested a restore, you don’t have a backup.
Databases fail in ways that expose backup configuration bugs you didn’t know existed. Backup processes fail silently. WAL archiving gets misconfigured. Encryption keys get lost. The restore command in your documentation is wrong.
A practical test regime:
Monthly: Restore the latest backup to a test environment. Verify the database starts, data is present, and basic queries work.
After major configuration changes: Any time you change the backup configuration, test a restore before assuming it works.
After Postgres upgrades: Major version upgrades may require updating backup tool versions. Test the restore process after any upgrade.
Restore test script:
#!/bin/bash
set -e
# Restore the latest backup to a test instance
# (by default pgBackRest restores the most recent backup and replays
# all archived WAL during recovery)
pgbackrest --stanza=main restore \
--pg1-path=/var/lib/postgresql/test_restore
# Start the test instance on a different port
pg_ctl start -D /var/lib/postgresql/test_restore -o "-p 5433"
# Wait for startup
until pg_isready -p 5433; do sleep 1; done
# Verify critical data
psql -p 5433 -d mydb -c "SELECT COUNT(*) FROM orders;" | grep -q "[0-9]"
psql -p 5433 -d mydb -c "SELECT max(created_at) FROM orders;"
# Verify WAL applied (check that recent transactions are present)
psql -p 5433 -d mydb -c "SELECT id FROM orders ORDER BY id DESC LIMIT 1;"
echo "Restore test passed"
# Cleanup
pg_ctl stop -D /var/lib/postgresql/test_restore
rm -rf /var/lib/postgresql/test_restore
Run this in CI or as a scheduled job. The 30 minutes this takes each month is cheap compared to discovering your backup is broken during an actual disaster.
The 3-2-1 Rule
Apply the 3-2-1 backup rule to Postgres:
- 3 copies of the data
- 2 different storage media
- 1 offsite copy
In practice:
- Copy 1: The live database
- Copy 2: A hot standby (streaming replication)
- Copy 3: WAL archives and base backups in cloud object storage (S3, GCS)
The standby gives you fast failover (RTO in minutes). The cloud backup gives you PITR and protection against logical corruption and catastrophic datacenter failures.
Monitoring Backup Health
Critical metrics to alert on:
-- Is WAL archiving working?
-- (Check pg_stat_archiver)
SELECT archived_count, last_archived_wal, last_archived_time,
failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
Alert if:
- last_archived_time is more than 5 minutes old
- failed_count increased since the last check
- Base backup hasn’t run in the configured interval
- Backup storage usage is growing unexpectedly
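The first two archiving conditions can be checked with a single query against pg_stat_archiver. Tracking an increase in failed_count requires storing the previous value between checks; comparing last_failed_time against last_archived_time is a common proxy. A sketch, using the 5-minute threshold from above:

```
-- Returns a row only when archiving looks unhealthy
SELECT last_archived_time,
       failed_count,
       now() - last_archived_time AS archive_lag
FROM pg_stat_archiver
WHERE last_archived_time < now() - interval '5 minutes'
   OR last_failed_time > last_archived_time;
```

Note the lag check only fires on a busy system; a quiet database that generates no WAL will legitimately have an old last_archived_time, so pair this with a periodic pg_switch_wal() or a write-heartbeat if your workload is bursty.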
pgBackRest has built-in check and verify commands:
pgbackrest --stanza=main check # Verify archiving is working
pgbackrest --stanza=main verify # Verify backup files are intact
Backup and recovery is the insurance policy for everything in this book. You can optimize performance, harden security, and design elegant schemas — but without working backups, a single hardware failure or operator error can erase it all. Invest in your backup strategy proportionally to the value of your data. For almost any production system, that means WAL archiving, regular base backups to offsite storage, and tested restores.