Migrations Done Right
Schema migrations are the most dangerous routine operation in a production database. Done wrong, they cause downtime, lock tables, and corrupt data. Done right, they’re invisible — the schema evolves while the application serves traffic, and nobody notices.
This chapter covers the tools, techniques, and mental models for migrating a live Postgres database without pain.
The Migration Mindset
Every migration you write will run against a live database. Your application will be serving requests while Postgres executes your SQL. This changes how you think about schema changes.
The fundamental constraints:
- Locking: DDL operations take locks. Some locks block reads and writes for the duration. A migration that takes ACCESS EXCLUSIVE on a busy table will queue behind all active transactions, then block everything that comes after it until it completes. If it takes 10 minutes to complete, everything waits 10 minutes.
- Application compatibility: During a deployment, you may have old application code running against the new schema, or new application code running against the old schema — or both simultaneously (during a rolling deployment). Your migration must not break either.
- Irreversibility: Dropping a column is immediate and permanent. Getting the data back requires restoring from backup. Don’t drop things you might need.
- Performance: Rebuilding an index on a 500-million-row table takes hours. Adding NOT NULL to an existing column forces a full-table scan, and changing a column’s type rewrites the table. These operations have real time costs.
Locking and Dangerous Operations
PostgreSQL takes different lock levels for different DDL operations. The ones that cause production problems:
ACCESS EXCLUSIVE Lock (Blocks Everything)
-- These take ACCESS EXCLUSIVE:
ALTER TABLE orders DROP COLUMN legacy_field; -- locks entire table
ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT; -- rewrites table
ALTER TABLE orders ADD CONSTRAINT check_amount CHECK (amount > 0) NOT VALID; -- still ACCESS EXCLUSIVE, but only briefly: NOT VALID skips the scan
TRUNCATE TABLE orders;
DROP TABLE orders;
The danger: ALTER TABLE takes ACCESS EXCLUSIVE and waits for all existing transactions to finish first. On a busy production table, there may always be active transactions. When the ALTER TABLE sits waiting, it also blocks all new queries on the table. This creates a backlog that looks like an outage.
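When a DDL statement appears stuck, you can see the queue directly. A diagnostic sketch using pg_blocking_pids (available since PostgreSQL 9.6):

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

If the blocker is one long-running transaction, cancelling it with pg_cancel_backend may be faster than waiting for it to finish.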
Fix: Set a lock_timeout before dangerous DDL:
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
-- If it can't acquire the lock within 5 seconds, it fails (doesn't wait and block)
Retry the migration during lower-traffic periods if it times out.
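If your migration tool wraps each file in a transaction, SET LOCAL scopes the timeout to that transaction so it can’t leak into the rest of the session:

BEGIN;
SET LOCAL lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
COMMIT;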
Adding a Column
PostgreSQL 11+: Adding a column is instant when it has no default or a constant default — Postgres doesn’t rewrite the table.
-- Instant in PG11+:
ALTER TABLE orders ADD COLUMN notes TEXT;
ALTER TABLE orders ADD COLUMN processed BOOLEAN NOT NULL DEFAULT false;
Pre-PostgreSQL 11: Adding a column with a non-null default rewrites the entire table. This can take hours on large tables and locks the table the entire time.
Modern Postgres (11+) uses a metadata trick: it stores the default value in the system catalog and returns it for any row that doesn’t have the column physically written. Only new rows get the column written. Queries see no difference, and the table rewrite is avoided.
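You can observe this trick in the catalog: pg_attribute records whether a “missing” default is being served for rows that predate the column. A diagnostic sketch, assuming the processed column added above:

SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
  AND attname = 'processed';
-- atthasmissing = true means old rows read the stored default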
Adding a NOT NULL Constraint
Adding NOT NULL to an existing column with a plain ALTER COLUMN ... SET NOT NULL requires a full table scan to verify no NULL values exist, and Postgres holds ACCESS EXCLUSIVE for the entire scan. On large tables, that blocks all traffic.
The safe pattern:
-- Step 1: Add the column (nullable, so no scan or rewrite)
ALTER TABLE orders ADD COLUMN customer_email TEXT;
-- Step 2: Backfill existing rows
UPDATE orders SET customer_email = '' WHERE customer_email IS NULL;
-- Step 3: Add the constraint as NOT VALID (doesn't check existing rows)
ALTER TABLE orders ADD CONSTRAINT orders_customer_email_not_null
CHECK (customer_email IS NOT NULL) NOT VALID;
-- Step 4: Validate the constraint (scans the table, but doesn't block reads or writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_email_not_null;
NOT VALID creates the constraint but skips the validation scan. VALIDATE CONSTRAINT then checks existing rows while holding only SHARE UPDATE EXCLUSIVE, so it doesn’t block reads or writes. In PostgreSQL 12+, a subsequent ALTER COLUMN ... SET NOT NULL can use the validated constraint to skip its own scan.
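In PostgreSQL 12+, the validated CHECK constraint can then be promoted to a real NOT NULL without another scan, because Postgres proves the property from the existing constraint:

-- PG 12+: uses the validated constraint, no table scan
ALTER TABLE orders ALTER COLUMN customer_email SET NOT NULL;
-- The CHECK constraint is now redundant and can be dropped
ALTER TABLE orders DROP CONSTRAINT orders_customer_email_not_null;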
Creating Indexes
CREATE INDEX (without CONCURRENTLY) takes SHARE lock — it blocks writes but not reads. For large tables, this is still a problem.
CREATE INDEX CONCURRENTLY builds the index in the background, only taking brief locks at the start and end:
-- Always use CONCURRENTLY on production:
CREATE INDEX CONCURRENTLY idx_orders_customer_email ON orders(customer_email);
Caveats:
- CREATE INDEX CONCURRENTLY cannot run inside a transaction block (BEGIN/COMMIT)
- It takes longer than a blocking CREATE INDEX
- If it fails, it leaves behind an INVALID index that must be dropped manually:

-- Find and drop invalid indexes:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
DROP INDEX CONCURRENTLY idx_orders_customer_email; -- then retry
Renaming a Column
Renaming a column is fast (it only updates the system catalog) but requires ACCESS EXCLUSIVE. More importantly, it breaks any application code that uses the old name.
Never rename a column while application code is deployed that uses the old name. Use the expand/contract pattern instead (see below).
The Expand/Contract Pattern
The expand/contract pattern (also called the parallel change pattern) is the safe way to rename columns, change types, or do any transformation that requires old and new states to coexist.
Example: Renaming user_id to customer_id
Phase 1: Expand — add the new column, start writing to both:
ALTER TABLE orders ADD COLUMN customer_id BIGINT REFERENCES customers(id);
Update application code to write to both user_id and customer_id, and read from customer_id (falling back to user_id if null).
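If some writers can’t be updated to dual-write right away, a trigger can keep the two columns in sync during the expand phase. A sketch — the function and trigger names are illustrative:

CREATE OR REPLACE FUNCTION orders_sync_customer_id() RETURNS trigger AS $$
BEGIN
  -- Mirror whichever column the writer populated
  NEW.customer_id := COALESCE(NEW.customer_id, NEW.user_id);
  NEW.user_id := COALESCE(NEW.user_id, NEW.customer_id);
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync_customer_id
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION orders_sync_customer_id();

Remember to drop the trigger in the contract phase, once no deployed code writes the old column.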
Phase 2: Backfill — populate the new column for existing rows:
UPDATE orders
SET customer_id = user_id
WHERE customer_id IS NULL;
On very large tables, do this in batches:
DO $$
DECLARE
  batch_size INT := 10000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE orders
    SET customer_id = user_id
    WHERE customer_id IS NULL
      AND ctid IN (
        SELECT ctid FROM orders WHERE customer_id IS NULL LIMIT batch_size
      );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    PERFORM pg_sleep(0.1); -- brief pause so autovacuum and replication can keep up
  END LOOP;
END $$;
Phase 3: Cut over — once all rows are populated and application code only uses customer_id, add the NOT NULL constraint and deploy the final application version that removes the old column reference:
ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;
Phase 4: Contract — remove the old column:
ALTER TABLE orders DROP COLUMN user_id;
This can be done in a later migration, after confirming the new column is correct.
Migration Tools
golang-migrate
Simple, widely-used, language-agnostic. Migrations are plain SQL files with up/down versions.
File naming: 001_create_users.up.sql, 001_create_users.down.sql
-- 001_create_users.up.sql
CREATE TABLE users (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- 001_create_users.down.sql
DROP TABLE users;
migrate -path ./migrations -database $DATABASE_URL up
migrate -path ./migrations -database $DATABASE_URL down 1
golang-migrate tracks migration state in a schema_migrations table. Simple, reliable, integrates easily into CI/CD pipelines.
Flyway
JVM-based, enterprise-grade, widely used in Java ecosystems. Supports complex migration scenarios, callbacks, undo migrations, and dry runs. Excellent integration with Spring Boot.
-- V1__Create_users.sql
CREATE TABLE users (...);
-- V2__Add_email_index.sql
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- note: CONCURRENTLY can't run inside a transaction; mark this script
-- with executeInTransaction=false in its script configuration
Flyway is the right choice if you’re in the Java ecosystem or need enterprise features.
Liquibase
Similar to Flyway but also supports XML/YAML/JSON migration formats in addition to SQL. Stronger focus on database-agnostic migrations. Overkill if you’re Postgres-only.
Atlas
A newer tool focused on declarative schema management. Instead of writing incremental migrations, you describe the desired schema state, and Atlas generates the migration SQL automatically.
# schema.hcl
table "users" {
  schema = schema.public
  column "id" {
    type = bigint
    identity {
      generated = ALWAYS
    }
  }
  column "email" {
    type = text
  }
  primary_key {
    columns = [column.id]
  }
  index "email_unique" {
    unique  = true
    columns = [column.email]
  }
}
atlas schema apply --url $DATABASE_URL --to file://schema.hcl
Atlas has excellent Postgres support and generates safe migration SQL. Its --dry-run flag shows what changes would be made without applying them.
ORM Migration Systems
ORMs like SQLAlchemy, Diesel, Prisma, and GORM include migration systems. They work well for simple cases but often generate unsafe SQL for production (blocking index creation, non-concurrent operations). Review and adjust ORM-generated migrations for large tables.
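A typical adjustment: the ORM emits a blocking index build, and you swap in the concurrent form before running it against production (the index name here is hypothetical):

-- ORM-generated (takes SHARE lock, blocks writes for the whole build):
-- CREATE INDEX idx_orders_status ON orders (status);
-- Hand-adjusted for production:
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);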
Zero-Downtime Deployment Checklist
For a migration to be safe to run against a live production database:
- All index creation uses CREATE INDEX CONCURRENTLY where applicable
- All potentially long-running constraint additions use the NOT VALID / VALIDATE CONSTRAINT pattern
- lock_timeout is set before operations that take ACCESS EXCLUSIVE
- New columns are nullable or have constant defaults (not computed values)
- No RENAME COLUMN on a column currently in use by running application code
- No DROP COLUMN unless application code has been deployed that no longer uses it
- Large backfills are batched, not a single massive UPDATE
- The migration has been tested against a production-like data volume
Rolling Back
DDL in Postgres is transactional, so a migration that fails partway through rolls back cleanly. But there is no built-in undo for DDL that committed. If a migration runs successfully and you then realize it was wrong:
- Additive changes (new columns, new tables, new indexes) can be rolled back by removing them
- Destructive changes (dropped columns, dropped tables) require restoring from backup
- Type changes may require recreating the column and backfilling
This is why down migrations are important: they document what “undo” looks like. But they’re only useful if the data hasn’t changed since the up migration ran. A down migration that adds back a dropped column doesn’t restore the data that was in it.
The safest rollback strategy: never drop a column in the same deployment that stops using it. Leave unused columns in place for at least one deployment cycle. This gives you time to roll back the application code if something is wrong, before the data is gone.
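The deferred drop fits naturally into a migration pair, which also documents why the down migration can’t fully undo it (file names illustrative):

-- 005_drop_legacy_field.up.sql
-- Run only after a full deployment cycle with no code referencing legacy_field
ALTER TABLE orders DROP COLUMN legacy_field;

-- 005_drop_legacy_field.down.sql
-- Restores the column, NOT the data that was in it
ALTER TABLE orders ADD COLUMN legacy_field TEXT;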
Migration in Practice: The Workflow
1. Write the migration. Consider what lock it takes, how long it will run on your production data volume, and whether the new schema is backward-compatible with currently-deployed application code.

2. Test on a production-sized clone. Many migrations that seem fast in development are slow against 100 million rows. Test with realistic data volumes.

3. Deploy in two phases if needed. If old and new code can’t coexist:
- Phase 1: Expand (add new column, no constraints yet)
- Deploy new application code (reads new column, writes both)
- Phase 2: Contract (add constraints, drop old column)

4. Monitor lock wait times. During the migration, watch pg_stat_activity for blocked queries:

SELECT pid, wait_event_type, wait_event, query,
       now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type = 'Lock'
ORDER BY duration DESC;

5. Verify completion. Check that pg_stat_progress_create_index, pg_stat_progress_copy, etc. show the expected progress and finish.
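For long index builds, pg_stat_progress_create_index (PostgreSQL 12+) shows which phase the build is in and how far along it is:

SELECT relid::regclass AS table_name,
       index_relid::regclass AS index_name,
       phase,
       blocks_done,
       blocks_total
FROM pg_stat_progress_create_index;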
Safe migrations aren’t complicated — they require discipline, a few patterns (expand/contract, NOT VALID, CONCURRENTLY), and respect for what’s running in production while you work. The teams that get burned by migrations are almost always the teams that wrote the SQL in development and ran it directly on production without considering the implications at scale.