The Case for Enough
There is a moment every engineering team knows. You are sitting in a planning meeting, and someone — usually the person who just came back from a conference, or who just finished a blog post, or who has been waiting for the right opportunity — says “we should use Redis for that.” Or Elasticsearch. Or Kafka. Or DynamoDB. Or MongoDB. The proposal is made with the best of intentions. The new tool does do something well. The demo is compelling.
And so you add it.
Six months later, you have five databases to monitor, five different failure modes to understand, five backup strategies to maintain, five sets of client libraries to keep updated, five mental models to context-switch between, and a team that is increasingly not sure what lives where.
This is the trap. And almost every team falls into it.
The Database Proliferation Problem
The modern data infrastructure stack at a midsize startup often looks something like this:
- PostgreSQL (or MySQL) for the core relational data
- Redis for caching, sessions, rate limiting, and sometimes a job queue
- Elasticsearch (or OpenSearch) for full-text search
- MongoDB (or DynamoDB) for “flexible” document storage
- ClickHouse (or BigQuery, or Redshift) for analytics
- Kafka (or RabbitMQ) for message passing and event streaming
- Pinecone (or Weaviate) for vector search, now that everyone is building AI features
Each of these databases was added for a reason. Redis is legitimately fast. Elasticsearch legitimately has good search capabilities. MongoDB legitimately has a flexible schema. None of these choices were irrational in isolation.
But the aggregate is a disaster.
Consider what “adding a database” actually means for a team:
Operational burden. Each database needs to be provisioned, configured, monitored, backed up, and eventually upgraded. If you are running on Kubernetes, each one needs manifests, persistent volume claims, and health checks. If you are using managed services, each one is a line item on your cloud bill. These costs are not one-time — they compound indefinitely.
Failure surface. Each database is a potential outage. Each one has its own failure modes, its own recovery procedures, its own operational quirks that take years to fully understand. The probability of something being degraded at any given moment increases with each addition.
Knowledge fragmentation. Your team can only go deep on so many things. Every hour spent learning Elasticsearch internals is an hour not spent going deeper on Postgres. Expertise is finite. When you spread it across five databases, you end up with shallow knowledge of all of them and deep knowledge of none. This is when bugs become incidents, and incidents become disasters.
Consistency hazards. Data that spans multiple databases is data that is eventually inconsistent. You cannot do a two-phase commit across Postgres and Redis. You cannot get a transaction that atomically updates Elasticsearch and PostgreSQL. The moment you spread a logical entity across multiple stores, you have accepted eventual consistency — whether you meant to or not.
Onboarding friction. Every new engineer who joins your team has to learn not just your codebase but your entire data topology. “The products are in Postgres but the search index is in Elasticsearch and the product embeddings are in Pinecone and sessions are in Redis” is a sentence that should make you wince.
The Specialization Illusion
The argument for specialized databases is seductive: use the right tool for the job. Redis is faster than Postgres for simple key lookups. Elasticsearch has richer text analysis than Postgres’s full-text search. MongoDB’s flexible schema accommodates unstructured data more naturally than Postgres’s rigid tables.
These claims were more true in 2012 than they are today.
PostgreSQL has been on a relentless march of capability expansion. The JSONB type (introduced in PostgreSQL 9.4, released in 2014) made Postgres a genuinely competitive document store. The GIN index type makes JSONB queries fast. The jsonpath language gives you MongoDB-style query expressiveness. Jacob Kaplan-Moss summarized this cleanly: Postgres is a better Mongo than Mongo — and has been for years.
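As a sketch of what that looks like in practice (table and column names here are illustrative, not from a real schema):

```sql
-- A hypothetical events table: a schemaless JSONB payload next to relational columns
CREATE TABLE events (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    type    text NOT NULL,
    payload jsonb NOT NULL
);

-- A GIN index makes containment queries on the document fast
CREATE INDEX events_payload_idx ON events USING GIN (payload);

-- MongoDB-style access: containment, path extraction, jsonpath predicates
SELECT * FROM events WHERE payload @> '{"user": {"plan": "pro"}}';
SELECT payload->'user'->>'email' FROM events WHERE type = 'signup';
SELECT * FROM events WHERE payload @@ '$.items[*].qty > 10';
```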
Full-text search with tsvector, tsquery, ts_rank, and proper language dictionaries handles the vast majority of search use cases. Not every app needs Elasticsearch. Many apps that use Elasticsearch would be better served by Postgres.
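A minimal shape for that, assuming a hypothetical articles table (the generated-column form requires PostgreSQL 12 or later):

```sql
-- Maintain the search vector automatically, weighting title above body
CREATE TABLE articles (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title  text NOT NULL,
    body   text NOT NULL,
    search tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body), 'B')
    ) STORED
);

CREATE INDEX articles_search_idx ON articles USING GIN (search);

-- Ranked search with web-style query syntax
SELECT id, title, ts_rank(search, query) AS rank
FROM articles, websearch_to_tsquery('english', 'postgres "full text"') AS query
WHERE search @@ query
ORDER BY rank DESC
LIMIT 10;
```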
Extensions like pgvector have turned Postgres into a capable vector database. The job queue use case — once the canonical Redis argument — is handled beautifully by libraries like River and pg-boss, which use SKIP LOCKED and advisory locks to deliver reliable, transactional job queues. Unlogged tables can push key-value throughput into Redis territory for many workloads.
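The core of the SKIP LOCKED pattern those libraries build on fits in a few lines; a sketch with a hypothetical jobs table:

```sql
CREATE TABLE jobs (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    queued_at timestamptz NOT NULL DEFAULT now(),
    payload   jsonb NOT NULL
);

-- Each worker claims one job. SKIP LOCKED means workers never wait on
-- rows another worker has already locked -- they just take the next one.
BEGIN;
DELETE FROM jobs
WHERE id = (
    SELECT id FROM jobs
    ORDER BY queued_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING payload;
-- ...process the job; COMMIT removes it, ROLLBACK requeues it atomically...
COMMIT;
```

The transactional property is the point: if the worker crashes mid-job, the rollback puts the job back, with no visibility timeouts or dead-letter plumbing.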
None of this means Postgres does everything better than every specialized tool. Chapter 19 is specifically about where Postgres is not enough, and it’s written honestly. But the question is not “does Postgres beat Redis at every benchmark?” The question is “does Postgres do this well enough that avoiding the operational complexity of Redis is worth it?” Far more often than people realize, the answer is yes.
What “Enough” Actually Means
“Enough” is not a consolation prize. It is a precision claim.
When we say Postgres is enough, we mean that for the overwhelming majority of use cases engineers reach for specialized databases to solve, Postgres can solve them well — with less operational overhead, with stronger consistency guarantees, with a single mental model, and with decades of battle-tested reliability.
“Enough” means:
- Fast enough (Postgres can handle tens of thousands of queries per second on commodity hardware with proper tuning)
- Flexible enough (JSONB, arrays, hstore, custom types)
- Searchable enough (full-text search, trigram matching, vector similarity)
- Queuing enough (SKIP LOCKED, advisory locks, transactional outboxes)
- Analyzable enough (window functions, CTEs, lateral joins, aggregations)
“Enough” also means: stop asking whether Postgres can do a thing, and start asking whether the alternative is worth the cost.
The Hidden Costs of Polyglot Persistence
The term “polyglot persistence” — using multiple different database technologies within a single system — was coined as a design principle, not a warning. The idea was that different data has different characteristics and should be stored in stores optimized for those characteristics.
In practice, it is usually a liability.
Here is what the hidden costs look like in practice:
Cross-cutting queries become ETL pipelines. The moment your data lives in two places, any query that touches both requires either a data warehouse, a synchronization job, or an application-layer join. Simple questions like “show me all users who clicked this button and then purchased within 24 hours” require a pipeline when clicks are in Kafka and purchases are in Postgres.
Transactions become distributed transactions. If you want to debit an account in Postgres and send a notification via a Redis pub/sub in the same atomic operation, you cannot. You have to accept that one of these operations might fail while the other succeeds, and build compensating logic to handle that case. This is hard to get right and even harder to test.
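The standard mitigation is the transactional outbox mentioned earlier in this chapter's list: keep the side effect in the same database so the commit is atomic. A sketch, with hypothetical accounts and outbox tables:

```sql
-- The notification intent is written in the same transaction as the debit,
-- so both happen or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
INSERT INTO outbox (topic, payload)
VALUES ('account.debited', '{"account_id": 42, "amount": 100}');
COMMIT;

-- A separate relay process polls the outbox, publishes each row to the
-- external system, and deletes it only after delivery is confirmed.
```

You still need the relay, but the hard problem — partial failure between the debit and the notification — is gone.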
The development environment gets complicated. A docker-compose.yml that spins up Postgres, Redis, Elasticsearch, and Kafka is a heavy thing. Local development requires more RAM. CI pipelines run slower. New engineers spend their first day getting the environment to start.
Vendor lock-in multiplies. Every managed database service you adopt is a commitment. Migrating off any one of them is painful. Migrating off five simultaneously is project-defining work.
A Brief History of “Just Postgres”
The “just use Postgres” movement is not new, but it has been gaining momentum as Postgres’s capabilities have expanded. A few landmark moments:
2012–2014: The JSONB moment. MongoDB had become enormously popular as the “flexible schema” alternative to relational databases. Then PostgreSQL shipped JSONB — a binary JSON storage format with full indexing support, operators, and functions. The argument for MongoDB suddenly required much more justification.
2016–2018: The job queue case gets made. Engineers began publishing case studies showing Postgres-backed job queues handling millions of jobs per day, reliably, with full ACID guarantees. The SKIP LOCKED clause (PostgreSQL 9.5, released in 2016) was the key primitive that made this practical.
2019–2021: The extension explosion. PostGIS had long been the gold standard for geospatial data. TimescaleDB made Postgres a credible time-series store. pg_partman made partition management practical. The ecosystem started looking more like a platform than a database.
2023–present: The vector moment. pgvector brings approximate nearest neighbor vector search to Postgres with HNSW and IVFFlat indexes. Suddenly the argument for keeping AI feature data in Postgres alongside your application data becomes very practical.
Each of these moments is a case where the specialized-database argument got harder to make.
Who This Book Is For
This book is for engineers who are tired of managing five databases. It is for architects who have been burned by polyglot persistence and want to make the case for simplicity. It is for teams that are starting fresh and want to resist the pull toward premature complexity.
It is not for engineers who have already made the right choice for genuinely specialized requirements. If you are building a globally distributed system with Spanner semantics, use Spanner. If you are doing real-time analytics on petabytes of streaming data, use ClickHouse. Chapter 19 will tell you exactly when to reach for something else, and mean it.
But if you are building a typical web application, an API, an internal tool, a data pipeline of reasonable scale — and you are wondering whether you really need that Redis cluster, that Elasticsearch instance, that MongoDB collection — this book will help you answer that question with confidence.
The answer, more often than you think, is: Postgres is enough.
How to Read This Book
The chapters are designed to be read in order, but most can stand alone if you already have a working Postgres foundation.
Chapters 1–4 establish the foundation: the case for simplicity, how Postgres works internally, schema design principles, and indexes. These chapters matter even if you never read another page — the most common Postgres performance problems come from misunderstanding these fundamentals.
Chapters 5–11 make the case for Postgres as a replacement for specialized databases. Each chapter covers one use case in depth, explains the relevant Postgres features, and compares honestly against the specialized alternative.
Chapters 12–18 cover operations: migrations, tuning, replication, backup, security, observability, and operational ergonomics. These chapters are where the “one database” dividend pays off.
Chapter 19 is the honest chapter: where Postgres isn’t enough, and what to reach for instead.
Chapter 20 ties everything together with a reference architecture.
Let’s start with understanding what Postgres actually is — not just how to query it, but how it works.
How Postgres Actually Works
You don’t need to know how a database engine works to use it. But you do need to know how Postgres works to use Postgres well. The decisions you make about schema design, indexing, query structure, and configuration all become more coherent when you understand what’s happening underneath. The bugs that confuse you — bloated tables, unexpected lock contention, query plans that make no sense — suddenly have obvious explanations.
This chapter is not an academic treatment of database internals. It’s a working engineer’s guide to the parts of Postgres that show up in production.
Process Architecture
PostgreSQL is a process-per-connection database. When a client connects, Postgres forks a new process (called a backend) to handle that connection. This is different from MySQL’s thread-per-connection model or the event-loop models used in databases like MongoDB.
The key processes in a running Postgres instance:
Postmaster. The root process. It listens for incoming connections, forks backends to handle them, and manages the background worker processes. If the postmaster dies, the whole instance goes down.
Backend processes. One per client connection. Each backend has its own memory for sorting, hash joins, and other operations — bounded by work_mem. Each backend can access shared memory (the buffer pool, etc.) but does its own query execution independently.
Checkpointer. Periodically writes all dirty pages from the shared buffer cache to disk, which bounds crash-recovery time and lets old WAL segments be recycled. The longer between checkpoints, the more WAL replay is required after a crash.
Background Writer. Writes dirty pages to disk more continuously than the checkpointer, reducing the spike of I/O during checkpoints.
WAL Writer. Flushes the write-ahead log to disk. Without this, transactions are not durable.
Autovacuum launcher and workers. Manages the vacuum process across all tables. More on this in the performance chapter — autovacuum is one of the most important and most misunderstood parts of Postgres.
Archiver. If WAL archiving is configured, copies completed WAL segments to an archive location. Essential for point-in-time recovery.
Stats collector. Gathers statistics about table and index usage, query execution counts, and more — surfaced via pg_stat_* views. (Since PostgreSQL 15, the separate collector process is gone and these statistics live in shared memory, but the views are unchanged.)
The process model has implications for how Postgres scales. Because each connection is a full process, connections are expensive — they consume memory and create OS overhead. This is why connection pooling (PgBouncer, pgpool) is essentially mandatory for high-connection-count applications. The max_connections setting isn’t just a configuration knob — it’s a statement about how many concurrent backend processes you want the OS to manage.
Shared Memory and the Buffer Cache
All backends share a region of memory called the shared buffer cache — configured by shared_buffers. When Postgres needs to read a page, it first checks this cache. If the page is there (a cache hit), it serves it from memory. If not (a cache miss), it reads it from disk into the cache.
The buffer cache uses a clock-sweep eviction algorithm. Pages that are frequently accessed get a higher “usage count” and are kept around longer. Pages that haven’t been accessed recently get evicted to make room for new pages.
shared_buffers should typically be set to about 25% of available RAM for a dedicated Postgres server. The reason it’s not higher is that the OS page cache also helps — Linux (and other operating systems) will cache frequently-read files in the OS page cache, which Postgres also benefits from. The total effective caching is shared_buffers plus whatever the OS is caching on top.
effective_cache_size is a hint to the query planner about how much total caching (shared_buffers + OS cache) is available. Setting it too low causes the planner to prefer sequential scans over index scans when it shouldn’t.
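On a dedicated 16GB server, these guidelines might translate to something like the following — illustrative starting points, not tuning advice:

```sql
ALTER SYSTEM SET shared_buffers = '4GB';        -- ~25% of RAM
ALTER SYSTEM SET effective_cache_size = '12GB'; -- shared_buffers + expected OS page cache

-- shared_buffers requires a restart to take effect;
-- effective_cache_size only needs a config reload
SELECT pg_reload_conf();
```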
Storage: Heap Files and Pages
Postgres stores table data in heap files — ordinary files on disk, named after the table's relfilenode (which starts out equal to the table's OID but can diverge after operations like TRUNCATE or VACUUM FULL). Each table lives in one or more files in the data directory, under $PGDATA/base/<database_oid>/.
These files are divided into pages (also called blocks), each exactly 8KB by default. A page contains:
- A 24-byte header with metadata (checksum, flags, LSN, etc.)
- An array of item pointers (4 bytes each) that point to tuples within the page
- Free space in the middle (where new tuples go)
- Tuples at the end of the page, growing toward the middle
A tuple is Postgres’s internal name for a row. Each tuple includes:
- A tuple header (HeapTupleHeaderData): visibility information, OID if applicable, number of attributes, null bitmap
- The actual column data

The separation of item pointers from tuples lets Postgres move tuples within a page during vacuuming without breaking anything: indexes reference the item pointer's slot number, not the tuple's physical offset within the page.
MVCC: The Heart of Concurrency
Multi-Version Concurrency Control is the mechanism by which Postgres allows readers and writers to run concurrently without blocking each other. Understanding MVCC explains a lot of behaviors that otherwise seem mysterious.
The core idea: instead of updating a row in place (which would require locking readers out), Postgres creates a new version of the row. The old version remains visible to transactions that started before the update. The new version is only visible to transactions that start after the update commits.
Transaction IDs and Visibility
Every transaction gets a transaction ID (XID), a 32-bit integer. Postgres uses XIDs to track which transactions created and invalidated each tuple.
Each tuple has two system columns:
- xmin: the XID of the transaction that created this tuple
- xmax: the XID of the transaction that deleted (or updated) this tuple; 0 if the tuple is “live”
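You can watch MVCC at work by selecting these hidden columns explicitly (users here is any table you have handy):

```sql
-- xmin/xmax are system columns; they don't appear in SELECT * output
SELECT xmin, xmax, ctid, * FROM users LIMIT 5;

-- In session A:  BEGIN; UPDATE users SET name = 'x' WHERE id = 1;
-- In session B, the old tuple is still visible -- but its xmax now
-- carries session A's XID, marking it dead-on-commit.
```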
When you update a row, Postgres:
- Marks the old tuple’s xmax with the current transaction XID
- Creates a new tuple with the updated data, stamping its xmin with the current transaction XID
When you delete a row, Postgres:
- Marks the tuple’s xmax with the current transaction XID
- Does not immediately reclaim the space — the tuple remains on disk until VACUUM removes it
This is why Postgres tables can grow even when you’re not adding rows. Heavy UPDATE and DELETE workloads create dead tuples that accumulate until VACUUM cleans them up.
Snapshots
When a transaction starts, Postgres takes a snapshot of the current state of the transaction ID space. The snapshot records:
- The current maximum assigned XID (xmax)
- A list of XIDs that are currently in progress (active but not committed)
Using this snapshot, a transaction can determine whether any given tuple is visible to it:
- If tuple.xmin > snapshot.xmax: the tuple was created after the snapshot — not visible
- If tuple.xmin is in the active list: the tuple was created by a transaction that wasn’t committed when the snapshot was taken — not visible (with exceptions for read-committed isolation)
- If tuple.xmax is committed, less than snapshot.xmax, and not in the active list: the tuple was deleted before the snapshot — not visible
- Otherwise: visible
This logic is evaluated for every tuple during a scan. It’s not free, but it’s fast, and it enables the most important Postgres property: readers never block writers, and writers never block readers.
Isolation Levels
Postgres supports four isolation levels, though it implements them in interesting ways:
Read Committed (default). Each statement within a transaction sees a fresh snapshot — committed rows from all transactions that committed before the statement started. This means within a single transaction, two identical SELECT statements can return different results.
Repeatable Read. The snapshot is taken at the start of the first statement in the transaction and held for the duration. All statements in the transaction see the same consistent view of the data. In Postgres, this is implemented using the same snapshot mechanism as serializable — there’s no separate “lock all reads” implementation.
Serializable (Serializable Snapshot Isolation, SSI). Full serializable isolation, implemented with predicate locks that detect serialization anomalies and abort transactions that would violate them. Postgres’s SSI implementation is genuinely impressive — it provides true serializability without the throughput collapse of traditional two-phase locking.
Read Uncommitted. Postgres doesn’t actually implement this — it falls back to Read Committed. PostgreSQL never shows uncommitted data from other transactions.
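The Read Committed behavior described above is easy to observe directly; a sketch using a hypothetical orders table:

```sql
BEGIN;  -- default: READ COMMITTED
SELECT count(*) FROM orders;  -- say this returns N
-- (another session inserts a row and commits here)
SELECT count(*) FROM orders;  -- returns N + 1: each statement gets a fresh snapshot
COMMIT;

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM orders;  -- returns M
-- (another session inserts a row and commits here)
SELECT count(*) FROM orders;  -- still M: one snapshot for the whole transaction
COMMIT;
```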
The XID Wraparound Problem
Transaction IDs are 32-bit integers, which means Postgres has only about 4 billion unique XIDs. If the counter were allowed to wrap around, tuples stamped with old XIDs would suddenly appear to come from the future, and Postgres would treat all of your existing data as invisible.
Postgres prevents this with a concept called freezing. VACUUM periodically updates old tuples to mark them as “frozen” — visible to all transactions regardless of XID. This essentially removes them from the XID visibility system. The autovacuum_freeze_max_age setting controls how old a transaction can be before Postgres forces a freeze vacuum on the table.
XID wraparound is a real operational hazard. If autovacuum is disabled or blocked and the oldest unfrozen XID approaches the 2^31 limit, Postgres will:
- Start emitting increasingly urgent warnings as the remaining XID headroom shrinks
- Stop assigning new transaction IDs entirely at approximately 3 million transactions before wraparound, refusing writes until the database is vacuumed (historically by running VACUUM in single-user mode)
Watch pg_stat_user_tables.n_dead_tup and age(relfrozenxid) in pg_class. Keep autovacuum healthy.
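A monitoring query along these lines covers both signals at once:

```sql
-- Tables with the oldest unfrozen XIDs -- the closest to forcing an
-- aggressive (anti-wraparound) vacuum -- plus their dead-tuple counts
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       s.n_dead_tup
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```

Alert if xid_age approaches autovacuum_freeze_max_age (200 million by default) and keeps climbing — that means autovacuum is not keeping up.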
The Write-Ahead Log (WAL)
The WAL is Postgres’s mechanism for ensuring durability. Every modification to database state — inserts, updates, deletes, commits — is first written to the WAL before it is applied to the actual data files. If Postgres crashes mid-operation, it can replay the WAL on startup to recover to a consistent state.
The WAL is an append-only sequence of WAL records, each describing a change to database state. WAL records are written to WAL segments — 16MB files (by default) in $PGDATA/pg_wal/. As segments fill up, new ones are created. Old segments can be archived (for point-in-time recovery) or recycled once they’re no longer needed for recovery or replication.
WAL is also the mechanism for streaming replication. Standby servers connect to the primary and stream WAL records, applying them in order to maintain a replicated copy of the primary’s state. This is why WAL and replication are inseparable concepts in Postgres.
WAL and Durability
The fsync and synchronous_commit settings control durability vs. performance trade-offs:
fsync = on (default): After each transaction commits, Postgres calls fsync() to ensure the WAL is durable on disk. Without this, a crash could lose committed transactions.
synchronous_commit = on (default): The transaction doesn’t return to the client until the WAL record is flushed to disk (and, if using synchronous replication, to the standbys). Setting this to off allows commits to return before fsync, improving throughput at the cost of a risk of losing the last few seconds of committed transactions on crash.
synchronous_commit = local: The transaction waits for the local WAL flush but not for standbys. A reasonable middle ground for some workloads.
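Usefully, synchronous_commit can be set per transaction, so low-value bulk writes can trade durability for throughput without touching critical paths. A sketch with a hypothetical page_views table:

```sql
-- Relax durability only for this transaction. A crash can lose these
-- commits, but it cannot corrupt the database.
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO page_views (url, viewed_at) VALUES ('/home', now());
COMMIT;
```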
The wal_level setting controls how much information is written to WAL:
- minimal: just enough for crash recovery
- replica: enough for streaming replication (the default since PostgreSQL 10)
- logical: enough for logical decoding and logical replication (required for tools like Debezium and pglogical)
The Query Planner
Postgres’s query planner is one of its defining features — and one of the things that requires the most understanding when things go wrong.
When you execute a query, Postgres goes through several phases:
- Parse: Convert the SQL text into an abstract syntax tree
- Analyze/Rewrite: Apply rewrite rules (views are expanded here), resolve names, type-check
- Plan: Generate candidate execution plans, estimate costs, choose the best one
- Execute: Run the chosen plan
The planner’s job is to find the cheapest execution plan for your query. “Cheapest” means a cost estimate based on a model of:
- How many pages of data need to be read from disk
- How many tuples need to be processed
- How expensive each operation (sort, hash join, nested loop) is
Statistics
The planner makes cost estimates using statistics collected by ANALYZE. For each column in each table, Postgres stores:
- n_distinct: estimated number of distinct values
- correlation: how well the physical order of the column matches the logical order (relevant for index scans)
- most_common_vals and most_common_freqs: the most common values and how often they appear
- A histogram of value distribution
Statistics are stored in pg_statistic (raw data, for internal use) and pg_stats (a friendlier view for humans).
The default_statistics_target setting (default: 100) controls how much detail to collect. For columns with high cardinality or skewed distributions where the planner keeps making bad estimates, increase the per-column statistics target:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
Join Strategies
When joining tables, the planner has three strategies available:
Nested Loop Join. For each row in the outer relation, scan the inner relation. Efficient when the outer relation is small and the inner relation has a good index.
Hash Join. Build a hash table from the smaller relation, then probe it for each row in the larger relation. Great for large equality joins where both sides are big. Requires work_mem to hold the hash table.
Merge Join. If both relations are sorted on the join key (or can be sorted), merge them together in one pass. Efficient for large sorted datasets.
The planner chooses based on cost estimates. When the planner makes a wrong choice, it’s usually because the statistics are stale or inaccurate. EXPLAIN (ANALYZE, BUFFERS) tells you what the planner estimated vs. what actually happened.
Examining Plans
The EXPLAIN command is your window into the planner. Without ANALYZE, it shows estimated costs. With ANALYZE, it actually executes the query and shows real numbers.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.email, count(o.id)
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > now() - interval '30 days'
GROUP BY u.email;
Key things to look for in a plan:
- Row estimates vs. actual rows: large discrepancies indicate bad statistics
- Seq Scans on large tables: often a sign of a missing index
- Nested loops on large outer relations: usually a red flag
- Buffers: hit vs. read: the cache hit rate for this query
- Sort operations that spill to disk: work_mem may be too low
EXPLAIN (ANALYZE, BUFFERS) is the most important debugging tool in Postgres. Learn to read it fluently.
TOAST
Postgres has a maximum row size of approximately 8KB (one page). For columns that can hold larger values — text, bytea, jsonb, arrays — Postgres uses TOAST (The Oversized-Attribute Storage Technique).
TOAST works by:
- Compressing the value inline if compression makes it fit in a page
- Chunking and storing the value in a separate TOAST table if it’s still too large
The TOAST table has a name like pg_toast_<table_oid> and is managed automatically. When you query a TOASTed column, Postgres automatically decompresses/reassembles the value.
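You can see how much of a table's footprint lives out-of-line with a size query — documents here is a hypothetical table name:

```sql
-- Heap size vs. TOAST size for one table
SELECT pg_size_pretty(pg_relation_size(oid))            AS heap,
       pg_size_pretty(pg_relation_size(reltoastrelid))  AS toast
FROM pg_class
WHERE relname = 'documents'
  AND reltoastrelid <> 0;  -- only tables that actually have a TOAST table
```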
TOAST has performance implications:
- Large JSONB documents, text fields, or binary data will be TOASTed, adding decompression overhead
- Selecting a TOASTed column from a wide table requires going to the TOAST table, not just the heap
- Queries that never reference a TOASTed column never touch the TOAST table — selecting only the columns you actually need can skip that work entirely
The STORAGE attribute on a column controls TOAST behavior: PLAIN (never TOAST), EXTERNAL (TOAST without compression), EXTENDED (the default — compress first, then TOAST), MAIN (try to keep in heap but compress if needed).
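One common adjustment, sketched against a hypothetical documents table:

```sql
-- Already-compressed data (images, gzipped blobs) gains nothing from
-- Postgres's own compression; store it out-of-line without compressing.
ALTER TABLE documents ALTER COLUMN pdf_data SET STORAGE EXTERNAL;
```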
Locking
Postgres has a rich locking system that operates at multiple levels. Understanding it prevents production surprises.
Table-Level Locks
Postgres has eight table-level lock modes, from weakest to strongest:
- ACCESS SHARE: taken by SELECT — compatible with almost everything
- ROW SHARE: taken by SELECT FOR UPDATE/SHARE
- ROW EXCLUSIVE: taken by INSERT, UPDATE, DELETE
- SHARE UPDATE EXCLUSIVE: taken by VACUUM, ANALYZE, some ALTER TABLE
- SHARE: taken by CREATE INDEX (not concurrently)
- SHARE ROW EXCLUSIVE: taken by some CREATE TRIGGER, ALTER TABLE
- EXCLUSIVE: taken by REFRESH MATERIALIZED VIEW CONCURRENTLY
- ACCESS EXCLUSIVE: taken by most ALTER TABLE, DROP TABLE, TRUNCATE, VACUUM FULL
The critical one for operations is ACCESS EXCLUSIVE — it blocks everything, including SELECT. Any DDL operation that takes ACCESS EXCLUSIVE will queue behind all active queries on the table and will block all subsequent queries while it waits. This is why even a fast, metadata-only ALTER TABLE ADD COLUMN can take down production if it queues behind one long-running query — and why, before PostgreSQL 11, adding a column with a DEFAULT was worse still: it rewrote the entire table while holding that lock.
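A common defensive pattern is to bound how long a migration will wait in that lock queue; a sketch, assuming an orders table:

```sql
-- If the ALTER can't acquire its ACCESS EXCLUSIVE lock within 2 seconds,
-- fail fast and retry later, instead of blocking all traffic queued behind it.
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN notes text;
```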
Row-Level Locks
Row-level locks are taken implicitly by DML. UPDATE and DELETE take a row-level exclusive lock on the affected rows. SELECT FOR UPDATE takes an explicit row-level lock for use in pessimistic locking patterns.
Row-level locks conflict only with each other — concurrent UPDATE on different rows does not block.
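The pessimistic pattern looks like this, with a hypothetical accounts table:

```sql
-- Read-then-update without racing a concurrent withdrawal
BEGIN;
SELECT balance FROM accounts WHERE id = 42
FOR UPDATE;  -- blocks other FOR UPDATE / UPDATE on this row until commit
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
```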
Advisory Locks
Postgres also supports advisory locks — locks you explicitly acquire and release in application code. They’re not tied to table rows; they’re just named locks with a 64-bit integer key.
-- Acquire an advisory lock (blocks if another session holds it)
SELECT pg_advisory_lock(12345);
-- ... do work ...
SELECT pg_advisory_unlock(12345);
-- Try-acquire (returns false immediately if lock is held)
SELECT pg_try_advisory_lock(12345);
Advisory locks are per-session and are automatically released when the connection closes. They’re the building block for distributed locking patterns and are used extensively in job queue implementations.
Deadlocks
When two transactions each hold a lock the other wants, Postgres detects the deadlock and aborts one of the transactions (the one that’s cheaper to abort). You can’t prevent deadlocks entirely, but you can minimize them by always acquiring locks in the same order.
Putting It Together
What you’ve learned in this chapter shapes every decision in the rest of the book:
- MVCC means reads don’t block writes and vice versa — but dead tuples accumulate, and VACUUM is not optional
- The buffer cache means random I/O to recently-accessed data is fast — but cold data costs
- The query planner makes smart choices when statistics are accurate — keep statistics fresh with regular ANALYZE
- The WAL makes Postgres durable and replicable — it’s also the foundation for logical replication and change data capture
- TOAST means you can store large values, but there are performance trade-offs
- Locking is fine-grained and avoids contention in most cases — but DDL operations can cause surprises
The single most important habit you can develop is running EXPLAIN (ANALYZE, BUFFERS) on slow queries and understanding what it tells you. Everything else in this book will make more sense because of that habit.
Next: schema design. The choices you make before the first row is inserted will shape your system for years.
Schema Design That Won’t Haunt You
Schema design is the decision you’ll regret the most if you get it wrong, and benefit from the most if you get it right. Unlike code, which can be refactored, schemas leave fingerprints on everything: your queries, your indexes, your migrations, your application logic, your team’s mental model of the system. A good schema is nearly invisible — it just works, naturally. A bad schema generates friction at every turn.
This chapter covers the design principles that hold up over years of production use.
Start With Normalization, Deviate Intentionally
Third Normal Form (3NF) is not just a textbook abstraction — it’s the starting point for schemas that stay maintainable. In practice, normalization means:
- Every column depends on the primary key, the whole primary key, and nothing but the primary key
- Each fact is stored in exactly one place
- Update anomalies are impossible: you change a value once and it’s changed everywhere
The practical upshot: if you’re storing a user’s email address in the users table and also in the audit_log table and also in the email_preferences table, you have three places to update when an email changes. You will eventually update two of them and miss the third. You will then spend an afternoon debugging why your system thinks a user has two different email addresses.
Normalize first. Denormalize later, only where the performance evidence demands it, and only in well-understood ways.
Common normalization failures to avoid:
Repeating groups in a single column: storing comma-separated values like "tag1,tag2,tag3" in a tags column. Use a proper junction table or a text[] array. Comma-separated values cannot be indexed, cannot be queried efficiently, and will make your future self unhappy.
Non-atomic values: storing a person’s full name in one column when you need to sort by last name. If you ever need to decompose a value, it should have been two columns.
Transitive dependencies: an orders table with customer_id, customer_name, and customer_email. If customer_name depends on customer_id rather than on the order, it belongs in the customers table, not the orders table. Orders should reference customers by ID.
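Fixing that transitive dependency might look like this (column names are illustrative):

```sql
-- Customer facts live with the customer...
CREATE TABLE customers (
    id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL
);

-- ...and orders carry only the reference.
CREATE TABLE orders (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL
);
```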
Primary Keys
The question of what to use as a primary key is more consequential than it appears.
Serial vs. Identity
The old pattern — id SERIAL PRIMARY KEY — creates an integer sequence and sets the default. This works, but SERIAL is unofficial shorthand. The proper SQL-standard approach since PostgreSQL 10 is identity columns:
id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY
-- or
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
GENERATED ALWAYS prevents accidental insertion of explicit values (you have to use OVERRIDING SYSTEM VALUE to override it). GENERATED BY DEFAULT allows explicit values when needed.
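A quick sketch of the difference, with a hypothetical widgets table:

```sql
CREATE TABLE widgets (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL
);

-- Rejected: the identity column is GENERATED ALWAYS, so an explicit
-- id raises "cannot insert a non-DEFAULT value into column":
-- INSERT INTO widgets (id, name) VALUES (42, 'nope');

-- Allowed when you explicitly opt in (e.g., restoring a dump):
INSERT INTO widgets (id, name) OVERRIDING SYSTEM VALUE VALUES (42, 'ok');
```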
Use BIGINT (8 bytes, range up to ~9.2 × 10^18) rather than INTEGER (4 bytes, range up to ~2.1 × 10^9) for anything expected to grow. Running out of integer space in a high-traffic primary key table is a terrible emergency.
UUIDs
UUIDs (UUID type, 16 bytes) are popular for several legitimate reasons:
- They can be generated client-side without a database round-trip
- They don’t leak row counts or insertion order to external parties
- They work naturally in distributed systems where multiple sources generate IDs
The traditional gen_random_uuid() (v4) generates fully random UUIDs. The problem with random UUIDs as primary keys is index fragmentation. Because v4 UUIDs are random, insertions go to random positions in the index B-tree, causing page splits and poor cache locality. This matters at scale — tables with hundreds of millions of rows and random UUID primary keys will have noticeably worse performance than sequential alternatives.
The better choice for most applications: UUIDv7.
UUIDv7 is a time-ordered UUID — the first 48 bits encode the millisecond timestamp, followed by random bits. This preserves the benefits of UUIDs (opaque, distributable) while maintaining roughly sequential insertion order, which dramatically reduces B-tree fragmentation.
In PostgreSQL 17, gen_random_uuid() still generates v4. For v7, you currently need either an extension or application-level generation:
-- uuid-ossp provides alternative UUID versions (v1, v3, v5 — but not v7):
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
-- For v7, generate the UUID in the application and pass it in:
-- e.g., in Go: github.com/google/uuid (v7 support)
-- in Python: the uuid7 package
As of PostgreSQL 17, there’s ongoing work to add native UUIDv7 generation.
Practical recommendation:
- Small to medium tables with simple access patterns: BIGINT GENERATED ALWAYS AS IDENTITY
- Tables that need externally-opaque IDs or multi-source ID generation: UUIDv7
Composite Primary Keys
Junction tables for many-to-many relationships should use composite primary keys:
CREATE TABLE user_roles (
user_id BIGINT NOT NULL REFERENCES users(id),
role_id BIGINT NOT NULL REFERENCES roles(id),
granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, role_id)
);
Adding a surrogate id column to a junction table is usually unnecessary and adds space overhead. The composite key is natural and enforces uniqueness.
Data Types: Use the Right Ones
PostgreSQL’s type system is rich. Using the right types is not just about correctness — it’s about storage efficiency, index performance, and the queries the database can help you write.
Text Types
TEXT and VARCHAR(n) are stored identically in Postgres — there is no performance advantage to VARCHAR(n) over TEXT. CHAR(n) is padded with spaces and is almost never what you want.
Use TEXT for variable-length strings unless you have a business rule that enforces a length, in which case use a CHECK constraint:
-- Not this (VARCHAR limit has no storage advantage):
name VARCHAR(255)
-- This (with explicit constraint if needed):
name TEXT NOT NULL CHECK (length(name) BETWEEN 1 AND 255)
The one case for VARCHAR(n): it enforces a hard length cap with less ceremony than a CHECK constraint, and the limit documents intent directly in the schema. But this is a minor consideration compared to readability.
Numeric Types
- SMALLINT (2 bytes): -32768 to 32767. Almost never worth the savings.
- INTEGER (4 bytes): -2.1B to 2.1B. Good for most counters and IDs that won’t grow large.
- BIGINT (8 bytes): -9.2 × 10^18 to 9.2 × 10^18. Default for IDs on large tables.
- NUMERIC(p, s) / DECIMAL(p, s): Exact arbitrary-precision arithmetic. Use for money and anything where rounding matters. Slow compared to integer/float types.
- REAL (4 bytes, float): Approximate. Use for measurements where small rounding errors are acceptable.
- DOUBLE PRECISION (8 bytes, float): Approximate, larger range. Use for scientific computation.
Money: Use NUMERIC(12, 2) or NUMERIC(19, 4) depending on required precision. Never use floating-point types for monetary values — the rounding errors are real and will cause auditing nightmares. Postgres has a MONEY type but it’s locale-dependent and problematic; prefer NUMERIC.
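The rounding difference is easy to demonstrate, and a money column is a one-liner (table name illustrative):

```sql
-- Floats accumulate binary rounding error:
SELECT 0.1::float8 + 0.2::float8;   -- 0.30000000000000004

-- NUMERIC arithmetic is exact:
SELECT 0.1::numeric + 0.2::numeric; -- 0.3

-- A money column, with a sanity check:
CREATE TABLE invoices (
    id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total NUMERIC(12, 2) NOT NULL CHECK (total >= 0)
);
```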
Timestamps
Always use TIMESTAMPTZ (timestamp with time zone) rather than TIMESTAMP (without time zone).
Despite the name, TIMESTAMPTZ doesn’t actually store the timezone. It stores a UTC instant and converts to/from the session timezone on display. TIMESTAMP stores whatever you give it with no timezone context — it’s a footgun when your server timezone changes or when your team works across time zones.
-- This is what you want:
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
-- This will bite you eventually:
created_at TIMESTAMP NOT NULL DEFAULT now()
For date-only values: DATE. For time-only: TIME (without tz) or TIMETZ (with tz, rarely needed). For intervals: INTERVAL.
now() returns the transaction start time (same within a transaction). clock_timestamp() returns the actual current time and changes within a transaction. For DEFAULT values on insert, now() is almost always what you want.
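You can watch the difference inside a single transaction:

```sql
BEGIN;
SELECT now(), clock_timestamp();
SELECT pg_sleep(1);
SELECT now(), clock_timestamp();
-- now() returns the same value in both queries (transaction start time);
-- clock_timestamp() has advanced by about a second.
COMMIT;
```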
Booleans
Use BOOLEAN, not SMALLINT or CHAR(1). PostgreSQL’s BOOLEAN type stores TRUE, FALSE, or NULL. It’s unambiguous, indexable, and understood by every ORM.
Enumerations
For columns with a small, fixed set of values, Postgres offers ENUM types:
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');
CREATE TABLE orders (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
status order_status NOT NULL DEFAULT 'pending'
);
ENUMs are stored efficiently (4 bytes) and enforced at the type level. The drawback: adding values to an ENUM requires ALTER TYPE ... ADD VALUE, which is a DDL operation (though a fast one — it doesn’t lock the table). Removing or reordering values requires a full type rebuild.
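Adding a value to the order_status type from above looks like this:

```sql
-- Appending a value is cheap:
ALTER TYPE order_status ADD VALUE 'refunded';

-- You can also position the new value explicitly:
ALTER TYPE order_status ADD VALUE IF NOT EXISTS 'on_hold' BEFORE 'shipped';
```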
An alternative is a TEXT column with a CHECK constraint:
status TEXT NOT NULL CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled'))
This is easier to migrate (just update the constraint) but gives up the storage efficiency. For most applications, the TEXT + CHECK approach is preferable because of its migration flexibility. The storage difference is negligible unless you have billions of rows.
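With the TEXT + CHECK approach, adding a value means swapping the constraint (the constraint name here is illustrative):

```sql
-- Add 'refunded' by replacing the constraint:
ALTER TABLE orders DROP CONSTRAINT orders_status_check;
ALTER TABLE orders ADD CONSTRAINT orders_status_check
    CHECK (status IN ('pending', 'processing', 'shipped',
                      'delivered', 'cancelled', 'refunded'));
```

On a large table, adding the constraint scans every row to validate it; you can add it NOT VALID and run VALIDATE CONSTRAINT separately to shorten the lock.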
Arrays
Postgres has native array support. You can have TEXT[], INTEGER[], UUID[], etc.
Arrays are useful for storing a small, ordered list of values where the list isn’t a relational entity. Tags, permissions flags, phone numbers, and small sets work well as arrays.
CREATE TABLE articles (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
title TEXT NOT NULL,
tags TEXT[] NOT NULL DEFAULT '{}'
);
-- Querying array containment:
SELECT * FROM articles WHERE tags @> ARRAY['postgres', 'performance'];
-- Index support:
CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
For larger sets or when you need to query individual elements relationally (join on them, count them, etc.), a proper junction table is usually better than an array.
JSONB
JSONB is Postgres’s binary JSON type. It’s useful for storing semi-structured data — configuration, metadata, external API payloads, optional attributes that vary by record type.
Chapter 5 covers JSONB in depth. The key rule here: don’t use JSONB as a crutch to avoid schema design. “I’ll just put everything in a JSONB column” is often a sign that you haven’t thought hard enough about your data model. JSONB shines for genuinely variable data; it’s a poor substitute for columns.
Constraints: Your Data’s Last Line of Defense
Constraints are not overhead — they’re the database enforcing your business rules at the layer closest to the data. Bugs that would otherwise corrupt your data fail loudly at the constraint level.
NOT NULL
Apply NOT NULL aggressively. Null is a special value that infects every query that touches it — NULL = NULL is NULL (not true), NULL + 5 is NULL, COUNT(col) ignores NULLs. Every nullable column makes your queries more complex.
Some columns are legitimately nullable — deleted_at in a soft-delete pattern, an optional middle name, a shipping address that might not exist yet. But “nullable because I’m not sure yet” is a schema smell. Be explicit about which NULLs are intentional.
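The NULL semantics described above are easy to verify directly:

```sql
SELECT NULL = NULL;   -- NULL, not true
SELECT NULL + 5;      -- NULL

-- COUNT(col) skips NULLs; COUNT(*) does not:
SELECT count(*), count(phone)
FROM (VALUES ('555-0101'), (NULL), ('555-0102')) AS t(phone);
-- count(*) = 3, count(phone) = 2
```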
CHECK Constraints
CHECK constraints let you encode business rules in the schema:
ALTER TABLE products ADD CONSTRAINT check_price_positive CHECK (price_cents > 0);
ALTER TABLE users ADD CONSTRAINT check_email_valid CHECK (email LIKE '%@%.%');
ALTER TABLE events ADD CONSTRAINT check_end_after_start CHECK (end_at > start_at);
CHECK constraints are evaluated on every insert and update. If a constraint is expensive to evaluate, consider whether it belongs in the schema or in application code. But for simple range checks and invariant enforcement, they’re invaluable.
UNIQUE Constraints
UNIQUE constraints prevent duplicate values and create an index automatically. Use them for columns that must be unique at the business level:
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);
ALTER TABLE products ADD CONSTRAINT products_sku_unique UNIQUE (sku);
For multi-column uniqueness:
ALTER TABLE team_members ADD CONSTRAINT team_members_unique UNIQUE (team_id, user_id);
Note that UNIQUE allows NULLs — two NULL values do not violate a UNIQUE constraint (because NULL ≠ NULL). If you need a column to be unique among non-null values while allowing multiple NULLs, that is the default behavior. If you instead want NULLs to count as duplicates (at most one NULL), use UNIQUE NULLS NOT DISTINCT (PostgreSQL 15+), or on older versions a unique partial index over a constant expression with WHERE col IS NULL.
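A sketch of both behaviors, using a hypothetical sso_id column:

```sql
-- Default: NULLs are distinct, so many rows may have sso_id IS NULL
-- while non-NULL values stay unique.
ALTER TABLE users ADD CONSTRAINT users_sso_id_unique UNIQUE (sso_id);

-- PostgreSQL 15+ alternative: NULLs compare as equal for uniqueness,
-- so at most one row with sso_id IS NULL is allowed:
-- ALTER TABLE users ADD CONSTRAINT users_sso_id_unique
--     UNIQUE NULLS NOT DISTINCT (sso_id);
```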
Foreign Key Constraints
Foreign keys enforce referential integrity — a row in orders cannot reference a non-existent user_id. They also communicate the relationship structure to anyone reading the schema.
CREATE TABLE orders (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
...
);
The ON DELETE clause matters:
- RESTRICT: prevent deletion of the referenced row if child rows exist
- CASCADE: automatically delete child rows when the parent is deleted
- SET NULL: set the foreign key column to NULL when the parent is deleted
- SET DEFAULT: set the foreign key column to its default when the parent is deleted
- NO ACTION (the default): like RESTRICT, but the check can be deferred to the end of the transaction if the constraint is deferrable
Choose deliberately. CASCADE is often convenient but can cause surprising bulk deletes. RESTRICT is safer but requires explicit cleanup.
Foreign key columns should almost always be indexed separately (the foreign key constraint does not create an index automatically — only the referenced side gets an index). Missing indexes on foreign key columns cause slow deletes and joins:
CREATE INDEX idx_orders_user_id ON orders(user_id);
Naming Conventions
Consistent naming is a form of documentation. These conventions are widely used in the Postgres community:
- Tables: plural snake_case — users, orders, product_categories
- Columns: snake_case — created_at, user_id, first_name
- Primary keys: id (or <table_singular>_id for clarity in joins)
- Foreign keys: <referenced_table_singular>_id — user_id, order_id
- Timestamps: created_at, updated_at, deleted_at
- Indexes: idx_<table>_<columns> — idx_orders_user_id, idx_users_email
- Constraints: <table>_<columns>_<type> — orders_user_id_fkey, users_email_unique
- Sequences: <table>_<column>_seq — users_id_seq
Postgres folds unquoted identifiers to lowercase. Never use quoted mixed-case identifiers (like "userId") — it requires quoting everywhere and makes SQL in psql and other tools annoying.
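A quick illustration of the folding rule, with a hypothetical table:

```sql
-- Unquoted identifiers fold to lowercase — these all name the same column:
SELECT user_id, USER_ID, "user_id" FROM orders;

-- A quoted mixed-case identifier is a different name, forever:
CREATE TABLE folding_demo ("userId" BIGINT);
-- SELECT userid FROM folding_demo;  -- ERROR: column "userid" does not exist
SELECT "userId" FROM folding_demo;   -- must be quoted everywhere
```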
Soft Deletes
Many applications implement soft deletes — instead of actually deleting a row, they set a deleted_at timestamp. This preserves the audit trail, allows undeletion, and avoids cascading foreign key deletes.
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ;
The operational challenge with soft deletes is that every query must filter out deleted rows. Forgetting WHERE deleted_at IS NULL is a persistent bug source.
Strategies:
- Use a view that filters deleted rows, and query through the view
- Use row-level security to hide deleted rows for non-admin roles
- Use a partial index that covers only non-deleted rows:
CREATE INDEX idx_users_active ON users(email) WHERE deleted_at IS NULL;
The partial index keeps queries on non-deleted rows fast without the overhead of including deleted rows in the index.
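The first strategy, a filtering view, can be sketched like this (the view name is illustrative):

```sql
-- All reads go through a view that hides soft-deleted rows:
CREATE VIEW active_users AS
SELECT * FROM users
WHERE deleted_at IS NULL;

SELECT id, email FROM active_users WHERE email = 'alice@example.com';

-- Simple single-table views like this are auto-updatable in Postgres,
-- so writes through the view work too:
UPDATE active_users SET email = 'new@example.com' WHERE id = 1;
```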
Audit Columns and Triggers
Every table that represents mutable domain objects should have created_at and updated_at columns. created_at is set on insert and never changed. updated_at is maintained automatically.
CREATE TABLE products (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL,
price_cents INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Trigger to auto-update updated_at
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = now();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_products_updated_at
BEFORE UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
Some ORMs maintain updated_at in application code. This is less reliable than a trigger — application code can be bypassed (direct database access, migrations, admin scripts), and a trigger ensures the column is always correct.
Partitioning
Table partitioning splits a large table into smaller child tables while presenting a unified interface. Postgres supports:
- Range partitioning: by a range of values (dates are the most common use case)
- List partitioning: by discrete values (e.g., partition by region or tenant)
- Hash partitioning: by hash of a column value (distributes rows evenly)
CREATE TABLE events (
id BIGINT NOT NULL,
event_type TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
Partitioning is useful when:
- You need to drop old data quickly (drop a partition instead of deleting millions of rows)
- Your queries consistently filter on the partition key (allowing partition pruning)
- The table is large enough that index scans across the entire table are slow
Partitioning is not a solution for poor query performance in general. A well-indexed table of 50 million rows often performs fine without partitioning. Adding partitioning to a table that doesn’t need it adds complexity without benefit.
The pg_partman extension (Chapter 11) automates partition creation and maintenance.
Table Design Anti-patterns
Entity-attribute-value (EAV). A table like (entity_id, attribute_name, attribute_value) that stores arbitrary key-value metadata. This is almost always a mistake. EAV tables are impossible to index effectively, produce horrible query syntax, and eliminate all type safety. If you need flexible attributes, use JSONB or a proper polymorphic schema.
Polymorphic associations. A comments table with a commentable_type TEXT and commentable_id BIGINT that references different tables based on commentable_type. You can’t add a foreign key constraint, can’t do efficient joins, and the relationship semantics are implicit. Consider table inheritance, separate junction tables, or a single comments table with nullable foreign keys.
GOD tables. A single table that stores multiple conceptually different entities with a type discriminator column, and lots of nullable columns that only apply to some types. This is almost always a schema that was designed to avoid creating new tables. Split it.
Natural keys as primary keys (with exceptions). Using a business-meaningful value (email address, ISBN, social security number) as a primary key creates coupling between business rules and physical data layout. When the “immutable” business key turns out to be mutable (and it always does eventually), migrating is expensive. Use surrogate keys (BIGINT or UUID) and enforce uniqueness of natural keys with a UNIQUE constraint.
The exception: genuinely immutable, truly unique identifiers like UPC codes or ISO country codes in lookup tables where the natural key will never change.
Designing for Migration
Every schema you design today will be migrated tomorrow. Design with that in mind:
- Prefer adding columns to changing them. Adding a nullable column is instant in modern Postgres. Changing a column’s type requires rewriting the table.
- Be careful with NOT NULL on new columns. On a non-empty table, ADD COLUMN ... NOT NULL fails unless you supply a default. Since PostgreSQL 11, a constant default is metadata-only; before that, it forced a table rewrite. Either use a default, or add the column nullable, backfill, then set NOT NULL.
- Name indexes and constraints explicitly. Auto-generated names like orders_pkey are predictable, but custom names clarify intent.
- Use the expand/contract pattern for renames. Add the new column, backfill, make the old column nullable, stop writing to it, drop it later. See Chapter 12.
Schema design is an exercise in empathy — for your future self, your teammates, and the engineers who will maintain this system after you’re gone. Make the schema communicate the domain clearly, enforce invariants aggressively, and leave room for the changes you know are coming.
Indexes — The Full Picture
If schema design is the decision you’ll regret for years, index design is the decision you’ll regret for months. Bad indexes mean slow queries, angry users, and 3am pages. Good indexes are invisible — queries are fast, the planner makes sensible choices, and the only reason you look at the index configuration is to add a new one.
Postgres has a richer index ecosystem than any other major relational database. B-tree is just the beginning. Understanding when and how to use GIN, GiST, BRIN, hash, partial, and expression indexes is the difference between a database that struggles at scale and one that handles it gracefully.
How Indexes Work (The Part That Matters)
Before diving into index types, it’s worth being precise about what an index is and what it costs.
An index is a separate data structure that maintains a sorted (or hashed) reference to rows in a table. When Postgres evaluates a query, the planner can choose to use an index instead of reading the entire table (a sequential scan). An index scan finds the relevant rows in the index, then fetches them from the heap.
Indexes are not free. They cost:
- Space: An index typically uses 10–30% of the table’s data size, sometimes more
- Write overhead: Every INSERT, UPDATE, and DELETE must update all applicable indexes on the table
- Maintenance: Indexes accumulate dead entries (dead tuples) that must be vacuumed; heavily updated indexes bloat
- Planning time: More indexes means more time spent by the planner evaluating which to use
A table with 20 indexes does not perform 20 times better than a table with 1 index. Adding indexes beyond what your actual query patterns require makes writes slower and doesn’t help reads. Index ruthlessly but purposefully.
The selectivity principle: Indexes are most valuable when they are selective — when they narrow down the search space significantly. An index on status in an orders table where 95% of orders are 'delivered' will be ignored by the planner for queries that filter on status = 'delivered' (because a sequential scan would be faster). The same index is invaluable for queries filtering on status = 'processing' (2% of rows). Selectivity is the key concept.
B-tree Indexes
The default. When you write CREATE INDEX, you get a B-tree. When you add a PRIMARY KEY or UNIQUE constraint, Postgres creates a B-tree index automatically.
A B-tree (balanced tree) maintains sorted values, allowing:
- Equality lookups: WHERE email = 'alice@example.com'
- Range scans: WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01'
- Ordering: ORDER BY last_name (if the index is on last_name, Postgres can avoid sorting)
- Prefix matching: WHERE name LIKE 'foo%' (but NOT WHERE name LIKE '%foo')
- IS NULL and IS NOT NULL checks
B-trees support all comparison operators: <, <=, =, >=, >, BETWEEN, IN.
Multi-column B-tree Indexes
A multi-column index (composite index) is a B-tree ordered by multiple columns:
CREATE INDEX idx_orders_user_id_status ON orders(user_id, status);
The critical rule: a composite index is used from left to right. This index supports:
- WHERE user_id = $1 (uses just the first column)
- WHERE user_id = $1 AND status = $2 (uses both columns)
- WHERE user_id = $1 ORDER BY status (equality on the leading column, then the index order avoids a sort)

But NOT:

- WHERE status = $1 (the leading column is user_id, not status)
The column order in a composite index matters. Generally, put the most selective column first, and match the order to your most common query patterns.
Index Scan vs. Bitmap Index Scan vs. Sequential Scan
Postgres has three ways to use an index:
Index Scan: Walks the B-tree to find matching entries, then fetches each heap row individually. Fast for small result sets, slow for large ones (many random I/O operations).
Bitmap Index Scan: Scans the index for all matching TIDs (tuple IDs), builds a bitmap of heap pages that need to be visited, then fetches those pages in order. More cache-friendly than a plain index scan for larger result sets. Can combine multiple indexes with AND/OR logic.
Sequential Scan: Reads the entire table in physical order. Wins when a large fraction of the table matches the query.
The planner chooses based on cost estimates. If your query returns more than roughly 5–15% of the table, the planner often prefers a sequential scan. This is correct behavior — don’t fight it by forcing index use on queries that match a large fraction of the table.
Covering Indexes (Index-Only Scans)
If an index contains all the columns a query needs, Postgres can satisfy the query from the index alone — no heap visit required. This is an index-only scan and is significantly faster.
-- This query:
SELECT email, created_at FROM users WHERE created_at > now() - interval '7 days';
-- Is satisfied entirely by this index (no heap access):
CREATE INDEX idx_users_created_covering ON users(created_at) INCLUDE (email);
The INCLUDE clause (PostgreSQL 11+) adds columns to the index leaf pages without sorting by them. This makes them available for index-only scans without affecting the index’s sort order or usability as a regular index.
Covering indexes can dramatically speed up read-heavy queries. The trade-off: they’re larger than plain indexes, since they store additional column data.
Caveat: Index-only scans still need to check the visibility map. If many pages are marked not all-visible (because of recent changes), Postgres falls back to heap access even with a covering index. Regular VACUUM helps by updating the visibility map.
GIN Indexes: For Multi-Valued Data
GIN (Generalized Inverted Index) indexes are designed for data types that contain multiple values per row, where you want to find rows that contain a particular value. The canonical use cases are:
- Arrays: find all rows where the array contains a value
- JSONB: find all rows where the JSON document contains a key or value
- Full-text search: find all rows where the tsvector contains a lexeme
- pg_trgm: trigram-based fuzzy string matching (requires the extension)
GIN indexes maintain a mapping from “value” to “list of rows containing that value” — exactly like the inverted indexes used in search engines. This is why they’re expensive to update (every insert or update that changes the indexed values must update multiple postings lists) but fast for containment queries.
-- GIN index for JSONB:
CREATE INDEX idx_products_metadata ON products USING GIN (metadata);
-- GIN index for arrays:
CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
-- GIN index for full-text search:
CREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', title || ' ' || body));
GIN operators include:
- @>: contains (does the JSON document / array / tsvector contain this value?)
- <@: is contained by
- &&: overlap (do two arrays share any elements?)
- @@: text search match (tsvector @@ tsquery)
- ?: does this key exist in the JSONB?
- ?|, ?&: does any / do all of these keys exist?
GIN vs. B-tree for JSONB: B-tree indexes on JSONB work for equality comparisons on specific paths (metadata->>'status' = 'active'). GIN indexes support the containment and existence operators (metadata @> '{"status": "active"}'). For querying specific paths with ->> operators, a B-tree index on the expression is often better. For containment queries, GIN is the right choice.
gin_pending_list_limit
GIN indexes maintain a “pending list” of recently-added entries that haven’t been fully integrated into the main index structure. Reads trigger a cleanup if the pending list is too large. You can control this with gin_pending_list_limit. VACUUM and ANALYZE also clean up the pending list.
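Both knobs are available as per-index storage parameters; a sketch on a full-text index like the one shown earlier:

```sql
-- Disable the pending list entirely for one index
-- (slower writes, more predictable reads):
CREATE INDEX idx_articles_fts ON articles
    USING GIN (to_tsvector('english', body))
    WITH (fastupdate = off);

-- Or tune the pending-list threshold for that index (in kilobytes):
ALTER INDEX idx_articles_fts SET (gin_pending_list_limit = 512);
```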
GiST Indexes: Generalized Search Trees
GiST (Generalized Search Tree) is a framework for building custom index types. It’s used for:
- Geometric types: POINT, LINE, BOX, POLYGON, CIRCLE — spatial queries like “find all rows within a bounding box”
- Range types: INT4RANGE, TSTZRANGE, etc. — “find all ranges that overlap with this range”
- pg_trgm: trigram similarity (GIN is also supported)
- PostGIS: geospatial data
Unlike GIN (which is an exact inverted index), GiST uses a lossy index with false positives — the index narrows down candidates, but a recheck against the actual data is required. This makes GiST slightly slower for final results but allows it to handle much more complex data types.
-- Range overlap query using GiST:
CREATE INDEX idx_reservations_period ON reservations USING GIST (period);
SELECT * FROM reservations
WHERE period && '[2024-06-01, 2024-06-15)'::TSTZRANGE;
-- PostGIS spatial index:
CREATE INDEX idx_stores_location ON stores USING GIST (location);
SELECT * FROM stores
WHERE ST_DWithin(location, ST_MakePoint(-73.935, 40.730)::geography, 5000);
GiST vs. GIN for text similarity: pg_trgm supports both index types. GIN is generally faster for LIKE/ILIKE and similarity filters; GiST additionally supports nearest-neighbor ordering by similarity (ORDER BY col <-> 'query'), which GIN cannot do. For plain fuzzy matching, GIN is usually the better choice.
BRIN Indexes: Block Range Indexes
BRIN (Block Range Index) is a small, fast index type designed for columns where the physical storage order correlates with the column values. The canonical use case: a created_at timestamp on an append-only table.
BRIN works by dividing the table into ranges of pages (blocks) and storing the minimum and maximum value of the indexed column within each range. To find rows matching a range query, Postgres identifies which block ranges might contain matching rows and only reads those.
CREATE INDEX idx_events_created_at ON events USING BRIN (created_at);
BRIN indexes are tiny — often 100–1000x smaller than a B-tree on the same column. The trade-off is precision: BRIN only excludes whole block ranges, so it still reads more data than a B-tree in most cases.
When to use BRIN:
- Very large, append-only tables (logs, events, time series) where sequential reads are acceptable
- When disk space is critical
- When the table is so large that a B-tree index is itself massive and slow to build
When NOT to use BRIN:
- Randomly ordered data (BRIN provides no benefit if the data isn’t physically correlated with the column)
- High-selectivity queries on small tables (B-tree wins)
- Frequent updates (BRIN performs poorly when rows are updated non-sequentially)
For time-series data specifically, BRIN is often the right choice for timestamp columns. A B-tree index on created_at for a billion-row events table is gigantic; a BRIN index on the same column is a few megabytes.
Hash Indexes
Hash indexes use a hash table to map column values to heap locations. They’re fast for equality comparisons (=) and don’t support range queries or ordering.
CREATE INDEX idx_users_session_token ON users USING HASH (session_token);
Before PostgreSQL 10, hash indexes were not WAL-logged and couldn’t be replicated. Since PostgreSQL 10, they’re fully supported. That said, B-tree indexes support equality comparisons too, and also support ranges and ordering. The performance advantage of hash over B-tree for equality-only lookups is typically small — hash indexes may be slightly faster for very long key values (since they hash them rather than comparing byte-for-byte), but for normal string and integer keys, the difference is negligible.
Verdict: Hash indexes are rarely the right choice. Use B-tree unless you have specific benchmark evidence that hash is faster for your workload.
Partial Indexes
A partial index is an index that covers only a subset of rows — those satisfying a WHERE condition:
-- Only index active users:
CREATE INDEX idx_users_active_email ON users(email) WHERE deleted_at IS NULL;
-- Only index pending jobs:
CREATE INDEX idx_jobs_pending ON jobs(scheduled_at) WHERE status = 'pending';
Partial indexes are powerful for several reasons:
Smaller and faster: A partial index on 10% of a table is 10% the size of a full index, and queries that include the matching condition can use it with great selectivity.
Conditional uniqueness: Enforce uniqueness only for specific cases:
-- Only one active subscription per user:
CREATE UNIQUE INDEX idx_subscriptions_active_per_user
ON subscriptions(user_id)
WHERE status = 'active';
Exclude rare values from an index: If 99% of rows have status = 'processed' and 1% have status = 'pending', an index on status is nearly useless for the common case. A partial index on the rare case is tiny and precise.
The query must include the partial index’s WHERE condition (or imply it) for Postgres to use it. The planner is smart about this — it recognizes that status = 'pending' AND scheduled_at < now() is compatible with a partial index defined as WHERE status = 'pending'.
Expression Indexes
Index an expression rather than a column directly:
-- Case-insensitive email lookup:
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- Index on JSONB field extraction:
CREATE INDEX idx_products_category ON products((metadata->>'category'));
-- Index on computed value:
CREATE INDEX idx_orders_total ON orders((subtotal + tax_cents));
For the planner to use an expression index, the query must use the same expression:
-- Uses the index:
SELECT * FROM users WHERE lower(email) = lower($1);
-- Does NOT use the index (different expression):
SELECT * FROM users WHERE email ILIKE $1;
Expression indexes are especially useful for:
- Case-insensitive lookups (lower())
- JSONB field extraction (metadata->>'field')
- Date truncation (date_trunc('day', created_at))
- Computed business keys
Index Bloat and Maintenance
Indexes accumulate dead entries just like tables do. When rows are updated or deleted, the old index entries become “dead” — they still take up space but don’t point to live tuples. VACUUM cleans them up.
Heavily written tables (lots of UPDATE and DELETE) tend to have bloated indexes. You can check index bloat with queries against pg_stat_user_indexes and pgstattuple (requires the extension):
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('idx_orders_user_id');
-- Returns: avg_leaf_density, leaf_fragmentation, etc.
When an index becomes badly fragmented, you have two options:
- REINDEX INDEX CONCURRENTLY idx_orders_user_id: rebuilds the index without blocking reads or writes (PostgreSQL 12+)
- DROP INDEX CONCURRENTLY + CREATE INDEX CONCURRENTLY: same effect, slightly more control
REINDEX CONCURRENTLY (and CREATE INDEX CONCURRENTLY) build the index in the background, allowing reads and writes to continue. The initial build takes longer than a blocking REINDEX, but it doesn’t cause downtime.
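Both options look like this in practice, using the index name from the earlier examples:

```sql
-- Option 1: rebuild in place (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_user_id;

-- Option 2: drop and recreate, which also lets you change the definition
DROP INDEX CONCURRENTLY idx_orders_user_id;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
```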
Checking for unused indexes:
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(indexrelid) DESC;
Indexes with idx_scan = 0 have never been used since the last statistics reset. They’re taking up space and slowing down writes for no benefit. Drop them (carefully, after confirming they’re truly unused).
Note that these counters accumulate since the last statistics reset, not since table creation. To start a fresh measurement window, reset all counters (this is database-wide, so use with care):
SELECT pg_stat_reset();
Index Design Patterns
The Three-Column Guideline: Most queries can be satisfied by an index covering 1–3 columns. If you find yourself creating 5- or 6-column indexes, your query is probably doing too much, or your schema design needs work.
Equality before range: In composite indexes, put equality columns before range columns:
-- Good: user_id is equality, created_at is range
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
-- Less good: range before equality rarely helps
CREATE INDEX idx_orders_date_user ON orders(created_at, user_id);
One index per query pattern: Design indexes around your actual query patterns, not hypothetical ones. Add an index when a slow query requires it, not preemptively.
The EXPLAIN ANALYZE feedback loop: The only reliable way to know if an index helps is to:
1. Run EXPLAIN (ANALYZE, BUFFERS) on the query without the index
2. Create the index
3. Run EXPLAIN (ANALYZE, BUFFERS) again
4. Compare execution time and buffer reads
Don’t guess. Measure.
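A sketch of the loop, assuming an orders table with a user_id column:

```sql
-- 1. Baseline: note the plan ("Seq Scan"?) and the buffer counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;

-- 2. Create the candidate index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

-- 3. Re-run and compare: the plan should switch to an index scan,
--    with lower execution time and fewer shared buffers read
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;
```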
Putting It Together: An Index Audit
When auditing a production system’s indexes, check:
-- Indexes with no scans (candidates for removal)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Tables with high seq scan rates (may need indexes)
SELECT relname, seq_scan, idx_scan,
round(idx_scan::numeric / nullif(seq_scan + idx_scan, 0) * 100, 2) AS idx_pct
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC;
-- Duplicate indexes
SELECT indrelid::regclass AS table_name,
array_agg(indexrelid::regclass) AS duplicate_indexes
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;
The last query finds indexes on the same table with identical column lists — a common result of ORMs creating indexes and DBAs creating the same indexes manually. It compares only the column list (indkey), so verify before dropping: two indexes can share columns but differ in uniqueness, operator class, or partial predicate.
Indexes are one of the highest-leverage tools in Postgres. The investment in understanding them pays dividends every time you look at a slow query and immediately know what index to create — or what index is being ignored, and why.
Postgres as a Document Store
In 2014, Jacob Kaplan-Moss published a post arguing that since the release of PostgreSQL 9.4 and its JSONB type, Postgres had become a better document database than MongoDB. He wasn’t trolling. He was making a precise technical argument: Postgres could store, index, and query semi-structured JSON documents with a richer feature set, stronger consistency guarantees, and better performance characteristics than MongoDB at the time.
A decade later, that argument has only gotten stronger. Postgres’s JSONB type has gained operators, functions, and a path language (jsonpath) that rivals or exceeds what any document database offers. And it comes with everything else Postgres provides — ACID transactions, joins against relational tables, mature replication, row-level security, and decades of operational knowledge.
This chapter explains how to use Postgres as a document store for real workloads.
JSONB vs. JSON
Postgres has two JSON types:
JSON: Stores the JSON as-is, as text. Validates that it’s well-formed JSON, then stores it verbatim. This means it preserves formatting, duplicate keys, and key order. It also means every time you query it, Postgres has to re-parse the text. You cannot index JSON columns (other than a B-tree index on the whole serialized string, which is useless).
JSONB: Stores JSON in a decomposed binary format. Parsing happens once on write, not on every read. Key order is not preserved (keys are stored sorted). Duplicate keys are deduplicated (last one wins). JSONB supports all the rich indexing and operator support. This is what you want.
Always use JSONB. There is almost no reason to use JSON over JSONB unless you specifically need to preserve duplicate keys or exact formatting.
CREATE TABLE products (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
sku TEXT NOT NULL UNIQUE,
name TEXT NOT NULL,
attributes JSONB NOT NULL DEFAULT '{}'
);
The JSONB Operator Zoo
JSONB comes with a large set of operators. The important ones:
Navigation Operators
-- -> extracts a JSON object field by key, returns JSON
SELECT attributes -> 'color' FROM products WHERE id = 1;
-- Returns: "red" (JSON, quoted)
-- ->> extracts a JSON object field by key, returns TEXT
SELECT attributes ->> 'color' FROM products WHERE id = 1;
-- Returns: red (text, unquoted)
-- -> with integer extracts from a JSON array
SELECT attributes -> 'sizes' -> 0 FROM products WHERE id = 1;
-- Returns: "S" (first element, as JSON)
-- #> extracts a path, returns JSON
SELECT attributes #> '{dimensions, width}' FROM products WHERE id = 1;
-- #>> extracts a path, returns TEXT
SELECT attributes #>> '{dimensions, width}' FROM products WHERE id = 1;
Containment Operators
These are the operators that make JSONB indexes useful:
-- @> contains: does the left side contain the right side?
SELECT * FROM products WHERE attributes @> '{"color": "red"}';
-- <@ is contained by
SELECT * FROM products WHERE '{"color": "red"}' <@ attributes;
Containment works recursively for nested objects:
-- Finds products where dimensions.width is 10
SELECT * FROM products WHERE attributes @> '{"dimensions": {"width": 10}}';
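Containment also works on arrays: the right-hand array matches when every element it lists appears somewhere in the left-hand array, regardless of position. This makes tag queries natural:

```sql
-- Products whose tags array contains "sale" (in any position)
SELECT * FROM products WHERE attributes @> '{"tags": ["sale"]}';

-- All listed tags must be present
SELECT * FROM products WHERE attributes @> '{"tags": ["sale", "featured"]}';
```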
Existence Operators
-- ? does this key exist?
SELECT * FROM products WHERE attributes ? 'color';
-- ?| does any of these keys exist?
SELECT * FROM products WHERE attributes ?| ARRAY['color', 'size'];
-- ?& do all of these keys exist?
SELECT * FROM products WHERE attributes ?& ARRAY['color', 'size'];
Modification Operators (PostgreSQL 9.5+)
-- || concatenate (merge) two JSONB objects
UPDATE products
SET attributes = attributes || '{"in_stock": true}'
WHERE id = 1;
-- - remove a key
UPDATE products
SET attributes = attributes - 'discontinued'
WHERE id = 1;
-- #- remove at a path
UPDATE products
SET attributes = attributes #- '{dimensions, depth}'
WHERE id = 1;
jsonb_set and jsonb_insert
For targeted updates to nested values:
-- jsonb_set(target, path, new_value, create_missing)
UPDATE products
SET attributes = jsonb_set(attributes, '{price, usd}', '29.99'::jsonb, true)
WHERE id = 1;
-- jsonb_insert(target, path, new_value, insert_after)
-- Inserts into an array at a position
UPDATE products
SET attributes = jsonb_insert(attributes, '{tags, 0}', '"featured"'::jsonb)
WHERE id = 1;
Indexing JSONB
Raw JSONB queries without indexes require Postgres to scan every row and parse the JSONB for every document — equivalent to a MongoDB collection scan. Indexing makes JSONB queries fast.
GIN Index on the Whole Column
A GIN index on a JSONB column indexes every key-value path in every document:
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);
This single index supports:
- WHERE attributes @> '{"color": "red"}'
- WHERE attributes ? 'color'
- WHERE attributes ?| ARRAY['color', 'size']
- WHERE attributes ?& ARRAY['color', 'size']
The trade-off: a GIN index on a JSONB column with deeply nested, high-cardinality values can be large. If you have documents with hundreds of keys or deeply nested objects, the index can be many times larger than the data itself.
Expression Index on a Specific Path
If you always query on a specific field, an expression index on that field is smaller and faster:
-- Index on a specific path as JSONB
CREATE INDEX idx_products_color ON products ((attributes -> 'color'));
-- Index on a specific path as TEXT (for ->> queries)
CREATE INDEX idx_products_color_text ON products ((attributes ->> 'color'));
Use these when:
- You have a high-cardinality field you frequently query (user_id, order_number)
- The GIN index would be too large for your document structure
- You want ORDER BY on a JSONB field to use an index
GIN with jsonb_path_ops
The default GIN operator class for JSONB is jsonb_ops, which supports all the operators above. An alternative is jsonb_path_ops:
CREATE INDEX idx_products_attrs_path ON products USING GIN (attributes jsonb_path_ops);
jsonb_path_ops creates a smaller index (it hashes entire paths rather than indexing keys and values separately) and is faster for the @> containment operator. It also supports the jsonpath operators @? and @@. What it doesn’t support are the key-existence operators ?, ?|, and ?&. If you primarily use containment queries, jsonb_path_ops is more efficient.
jsonpath: The Query Language
PostgreSQL 12 introduced the jsonpath language — a powerful path expression language for navigating and filtering JSONB documents. If you’re familiar with XPath for XML, jsonpath is the JSON equivalent.
-- @? Does the path exist? (like the ? operator)
SELECT * FROM products WHERE attributes @? '$.tags[*] ? (@ == "featured")';
-- @@ Does the path expression return true?
SELECT * FROM products WHERE attributes @@ '$.price.usd > 20';
-- jsonb_path_query returns all matches
SELECT jsonb_path_query(attributes, '$.variants[*].size')
FROM products
WHERE id = 1;
-- jsonb_path_exists
SELECT * FROM products
WHERE jsonb_path_exists(attributes, '$.tags[*] ? (@ starts with "eco")');
jsonpath supports:
- Path navigation: $.field, $.nested.field, $.array[0], $.array[*]
- Filters: ? (@ > 5), ? (@ like_regex "pattern")
- Methods: .size(), .type(), .floor(), .ceiling(), .double(), .keyvalue()
- Arithmetic: +, -, *, /, %
- Boolean: &&, ||, !
jsonpath queries via @? and @@ can also use a GIN index, with one caveat: the planner extracts only equality clauses from the path expression for the index search. Range filters are applied as a recheck on the candidate rows rather than served from the index:
CREATE INDEX idx_products_attrs_path ON products USING GIN (attributes jsonb_path_ops);
-- Indexable: the equality clause is extracted for the index search
SELECT * FROM products WHERE attributes @? '$.dimensions ? (@.width == 10)';
-- A range filter like @ < 30 cannot be extracted, so this query
-- cannot narrow the index search and falls back to rechecking rows:
SELECT * FROM products WHERE attributes @? '$.price.usd ? (@ < 30)';
Functions for Working With JSONB
Beyond operators, Postgres has a rich set of JSONB functions:
-- Convert a table row to JSON
SELECT row_to_json(p.*) FROM products p WHERE id = 1;
-- Build a JSON object from key-value pairs
SELECT json_build_object('id', id, 'name', name) FROM products;
-- Build a JSON array from rows
SELECT json_agg(json_build_object('id', id, 'name', name)) FROM products;
-- Expand a JSONB object to key/value rows (jsonb_each)
SELECT key, value FROM jsonb_each(
'{"color": "red", "size": "M", "weight": 150}'::jsonb
);
-- Expand a JSONB array to rows (jsonb_array_elements)
SELECT value FROM jsonb_array_elements('[1, 2, 3, 4, 5]'::jsonb);
-- Get all keys of a JSON object
SELECT jsonb_object_keys('{"a": 1, "b": 2}'::jsonb);
-- Returns: a, b
-- Pretty-print
SELECT jsonb_pretty('{"a":1,"b":{"c":2}}'::jsonb);
-- Strip nulls
SELECT jsonb_strip_nulls('{"a": 1, "b": null, "c": 3}'::jsonb);
-- Returns: {"a": 1, "c": 3}
Real-World Document Store Patterns
Product Catalog With Variable Attributes
Classic e-commerce problem: products have different attributes depending on category. Shirts have size and color. Laptops have RAM and screen size. Books have ISBN and author. Storing all possible attributes in a single table with nullable columns creates a maintenance nightmare.
JSONB handles this elegantly:
CREATE TABLE products (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
category TEXT NOT NULL,
name TEXT NOT NULL,
price_cents INTEGER NOT NULL,
attributes JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Shirts
INSERT INTO products (category, name, price_cents, attributes)
VALUES ('apparel', 'Classic Tee', 2999,
'{"color": "blue", "sizes": ["S", "M", "L", "XL"], "material": "cotton"}');
-- Laptops
INSERT INTO products (category, name, price_cents, attributes)
VALUES ('electronics', 'ProBook 15', 129999,
'{"ram_gb": 16, "storage_gb": 512, "cpu": "Intel i7", "display_inches": 15.6}');
-- Index for category-specific queries
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);
-- Find all blue apparel
SELECT name, price_cents
FROM products
WHERE category = 'apparel'
AND attributes @> '{"color": "blue"}';
-- Find all laptops with at least 16GB RAM
SELECT name, price_cents
FROM products
WHERE category = 'electronics'
AND (attributes ->> 'ram_gb')::integer >= 16;
Schema-on-Read Configuration
A common use case for document storage: store configuration or settings objects where the schema evolves:
CREATE TABLE app_settings (
tenant_id BIGINT NOT NULL REFERENCES tenants(id),
settings JSONB NOT NULL DEFAULT '{}',
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (tenant_id)
);
-- Read a specific setting with a default
SELECT COALESCE(settings -> 'notification' ->> 'email_digest_frequency', 'daily')
FROM app_settings
WHERE tenant_id = $1;
-- Update just one setting key
UPDATE app_settings
SET settings = jsonb_set(settings, '{notification, email_digest_frequency}', '"weekly"'::jsonb),
updated_at = now()
WHERE tenant_id = $1;
Event Sourcing / Audit Logs
Events and audit records often have variable payload structures that are natural to store as JSONB:
CREATE TABLE audit_events (
id BIGINT GENERATED ALWAYS AS IDENTITY,
event_type TEXT NOT NULL,
actor_id BIGINT NOT NULL,
entity_type TEXT NOT NULL,
entity_id BIGINT NOT NULL,
payload JSONB NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- On a partitioned table, a primary key must include the partition key
PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);
CREATE INDEX idx_audit_events_actor ON audit_events(actor_id, occurred_at DESC);
CREATE INDEX idx_audit_events_entity ON audit_events(entity_type, entity_id, occurred_at DESC);
CREATE INDEX idx_audit_events_payload ON audit_events USING GIN (payload);
-- Find all events where a specific field changed
SELECT * FROM audit_events
WHERE entity_type = 'order'
AND payload @> '{"changed_fields": ["status"]}';
Mixing Relational and Document Models
The real power of JSONB in Postgres is that you don’t have to choose between relational and document models — you can use both in the same database, with joins between them.
-- A user has a relational identity, but flexible profile data
SELECT
u.id,
u.email,
u.profile ->> 'display_name' AS display_name,
u.profile ->> 'bio' AS bio,
count(o.id) AS order_count,
sum(o.total_cents) AS lifetime_value
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.profile @> '{"newsletter_opted_in": true}'
GROUP BY u.id;
This query joins relational and JSONB data in a single pass. No data synchronization, no ETL pipeline, no application-layer join. In MongoDB, the equivalent requires a separate orders collection and either a $lookup aggregation stage or an application-side join.
Postgres vs. MongoDB: An Honest Comparison
Jacob Kaplan-Moss’s claim deserves examination. Where does each tool actually win?
Postgres wins:
- ACID transactions across multiple documents
- Joining documents with relational data
- Rich indexing options (multiple index types, partial indexes, expression indexes)
- SQL — a query language everyone knows, with window functions, CTEs, aggregates
- Full-text search alongside documents (Chapter 6)
- RLS for document-level access control
- One less database to operate
- Mature ecosystem of monitoring, backup, and replication tools
- jsonpath is as expressive as MongoDB’s query language for most operations
MongoDB wins:
- Horizontal write scaling with sharding (Postgres requires Citus or manual sharding)
- Schema-less by default — no DDL required to add fields
- Native document-centric programming model — insert a nested object exactly as your application sees it
- Aggregation pipeline syntax (some find it more intuitive than SQL for document transformations)
- Easier to start with if your team doesn’t know SQL
- Change streams for reactive document subscriptions (logical replication can provide similar capabilities in Postgres, but it’s more complex)
The honest assessment: For most applications that chose MongoDB to avoid relational schema design or to store semi-structured data, Postgres + JSONB is the better choice today. The flexibility advantages of MongoDB are largely available in Postgres, and Postgres’s consistency and operational advantages are substantial.
The cases where MongoDB makes more sense: write-heavy workloads at extreme scale that would require sharding, teams that are genuinely more productive with a document-centric model, or systems where the aggregation pipeline is a better fit than SQL for complex document transformations.
For the vast majority of CRUD applications storing semi-structured data: Postgres is enough.
Performance Considerations
Document Size
Large JSONB values get TOASTed: once a row exceeds roughly 2KB, Postgres compresses its large values and, if they are still too big, moves them out-of-line into the TOAST table. This is transparent but adds overhead on reads. Very large documents (hundreds of KB or larger) should prompt you to reconsider whether they belong in a JSONB column or in a separate table.
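To see whether documents are crossing that threshold, pg_column_size reports the stored (possibly compressed) size of each value:

```sql
-- Largest stored documents in the table
SELECT id, pg_column_size(attributes) AS stored_bytes
FROM products
ORDER BY pg_column_size(attributes) DESC
LIMIT 10;
```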
Partial Updates
Postgres doesn’t support in-place partial update of JSONB — every update rewrites the entire column value. For frequently-updated JSONB columns with large documents, this creates significant write amplification. Consider splitting frequently-updated fields into their own columns.
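A sketch of that fix, using a hypothetical view_count field that is updated far more often than the rest of the document:

```sql
-- Promote the hot field to a real column
ALTER TABLE products ADD COLUMN view_count BIGINT NOT NULL DEFAULT 0;

-- One-time backfill from the document, then stop writing it there
UPDATE products
SET view_count = coalesce((attributes ->> 'view_count')::bigint, 0),
    attributes = attributes - 'view_count';

-- The hot-path update now touches an 8-byte integer instead of
-- rewriting the entire JSONB value
UPDATE products SET view_count = view_count + 1 WHERE id = 1;
```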
Index Size
GIN indexes on JSONB can be large — easily larger than the data itself for documents with many unique values. Monitor index sizes and use expression indexes on specific paths when a full GIN index is overkill.
-- Check index sizes
SELECT
relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'products'
ORDER BY pg_relation_size(indexrelid) DESC;
The key insight is this: JSONB is a power tool. It’s excellent for genuinely variable or schema-less data. It’s a poor substitute for proper columns when you know the schema. Use it where it shines, and the result is a flexible, performant document store that lives right next to your relational data, with no synchronization required.
Full Text Search
Full text search is the capability that teams add Elasticsearch for more than almost anything else. And it’s the capability that, more often than not, Postgres can handle just as well — sometimes better — at a fraction of the operational complexity.
Elasticsearch is a remarkable piece of engineering. It’s also a remarkable operational burden: a JVM cluster to provision and tune, heap sizes to configure, shard counts to get right, index mappings to manage, a query DSL to learn, and a synchronization pipeline to maintain between it and your primary database. If you’re doing 100 million documents with complex relevance tuning and faceted navigation, Elasticsearch earns its keep. If you’re doing full-text search over your application’s content — blog posts, products, tickets, documents — Postgres might be all you need.
How Postgres Full Text Search Works
Postgres full-text search is built around two types: tsvector and tsquery.
tsvector is a sorted list of lexemes — normalized word forms. When you convert text to a tsvector, Postgres:
- Tokenizes the text into words
- Normalizes each word (lowercasing, removing punctuation)
- Applies the configured text search dictionary (which handles stop words, stemming, synonyms)
- Stores the result as a sorted list of (lexeme, position_list) pairs
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
-- Returns: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
-- "The" and "over" are stop words (removed)
-- "jumps" → "jump", "lazy" → "lazi" (stemmed)
-- Positions are stored for phrase search
tsquery is a search query that can be matched against a tsvector:
SELECT to_tsquery('english', 'quick & fox');
-- Returns: 'quick' & 'fox'
SELECT to_tsquery('english', 'quick | fast');
-- Returns: 'quick' | 'fast'
SELECT to_tsquery('english', '!dog');
-- Returns: !'dog' (NOT)
SELECT to_tsquery('english', 'quick <-> fox');
-- Returns: 'quick' <-> 'fox' (phrase: "quick" followed immediately by "fox")
SELECT to_tsquery('english', 'quick <2> fox');
-- Returns: 'quick' <2> 'fox' (proximity: "quick" within 2 positions of "fox")
The @@ operator tests whether a tsvector matches a tsquery:
SELECT to_tsvector('english', 'The quick brown fox') @@ to_tsquery('english', 'quick & fox');
-- Returns: true
Text Search Configurations
A text search configuration tells Postgres how to tokenize and normalize text. Configurations are language-specific — english handles English stemming and stop words, spanish handles Spanish, etc.
-- List available configurations
SELECT cfgname FROM pg_ts_config;
-- Show what a configuration does to text
SELECT * FROM ts_debug('english', 'The quick brown foxes are jumping');
You can create custom configurations for specialized vocabularies:
-- Create a custom configuration based on English
CREATE TEXT SEARCH CONFIGURATION custom_english (COPY = english);
-- Add a custom dictionary for domain-specific terms
-- (e.g., "PostgreSQL" → "postgres" + "sql")
For most applications, english (or the appropriate language configuration) is sufficient.
plainto_tsquery and websearch_to_tsquery
Users typing search queries don’t write to_tsquery syntax. Two alternatives convert natural language input:
plainto_tsquery: Converts a string to a tsquery treating all words as AND terms:
SELECT plainto_tsquery('english', 'quick brown fox');
-- Returns: 'quick' & 'brown' & 'fox'
websearch_to_tsquery (PostgreSQL 11+): Parses a web-search-style query with "quoted phrases", -excluded terms, and OR operators:
SELECT websearch_to_tsquery('english', '"quick fox" OR lazy dog -cat');
-- Returns: 'quick' <-> 'fox' | ( 'lazi' & 'dog' ) & !'cat'
websearch_to_tsquery is the best choice for user-facing search — it handles the query syntax users already expect.
Storing and Indexing tsvectors
Computing to_tsvector() on every query is expensive. The standard approach is to store the tsvector in a generated column and index it:
CREATE TABLE articles (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
title TEXT NOT NULL,
body TEXT NOT NULL,
author TEXT NOT NULL,
published_at TIMESTAMPTZ,
-- Generated tsvector column (automatically maintained)
search_vector TSVECTOR GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(author, '')), 'B') ||
setweight(to_tsvector('english', coalesce(body, '')), 'C')
) STORED
);
-- GIN index on the stored tsvector
CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);
Generated columns (PostgreSQL 12+) maintain the tsvector automatically — you don’t need triggers. When title or body changes, Postgres recomputes search_vector automatically.
For older Postgres versions, use a trigger:
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS TRIGGER AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(NEW.body, '')), 'C');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_articles_search
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION update_search_vector();
Ranking Results
Postgres can rank search results by relevance using ts_rank and ts_rank_cd:
ts_rank(vector, query): Computes a relevance score based on how often and in how many fields the query terms appear.
ts_rank_cd(vector, query): Cover density ranking — considers how close together the query terms appear in the document.
SELECT
id,
title,
ts_rank(search_vector, query) AS rank
FROM articles,
websearch_to_tsquery('english', 'postgres performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
setweight is how you boost certain fields. The weights are ‘A’ (highest), ‘B’, ‘C’, ‘D’:
-- Title matches count more than body matches
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'C')
The normalization parameter to ts_rank controls how document length affects ranking:
- 0 (default): rank is not adjusted for document length
- 1: divides the rank by 1 + the logarithm of the document length
- 2: divides the rank by the document length
- 4: divides the rank by the mean harmonic distance between extents (ts_rank_cd only)
- 8: divides the rank by the number of unique words in the document
- 16: divides the rank by 1 + the logarithm of the number of unique words
- 32: divides the rank by itself + 1
For most applications, ts_rank(vector, query, 1) (length-normalized) produces better results than the default.
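The earlier ranking query with length normalization applied:

```sql
SELECT
    id,
    title,
    ts_rank(search_vector, query, 1) AS rank  -- 1 = divide by 1 + log(doc length)
FROM articles,
     websearch_to_tsquery('english', 'postgres performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
```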
Highlighting Matches
ts_headline generates highlighted snippets showing where the query terms appear in the original text:
SELECT
id,
title,
ts_headline('english', body, query,
'StartSel=<mark>, StopSel=</mark>, MaxWords=30, MinWords=15, MaxFragments=3'
) AS snippet
FROM articles,
websearch_to_tsquery('english', 'postgres indexing') AS query
WHERE search_vector @@ query
ORDER BY ts_rank(search_vector, query) DESC
LIMIT 10;
ts_headline is deliberately run against the original text (not the tsvector), which is why it can show context and formatting. Configuration options:
- StartSel / StopSel: HTML tags to wrap matching terms
- MaxWords / MinWords: length of each fragment
- MaxFragments: how many non-contiguous fragments to return
- ShortWord: words of this length or shorter are dropped at the start and end of a fragment
- HighlightAll: return the entire document as the headline instead of fragments
ts_headline is computationally expensive — it’s parsing the original text, not the indexed tsvector. Call it only for the documents you’re displaying, not in the WHERE clause.
Trigram Search: Fuzzy Matching and LIKE
Standard full-text search doesn’t handle typos or partial matches well. The pg_trgm extension adds trigram-based similarity, which does.
A trigram is a 3-character sequence. The word “postgres” generates trigrams: pos, ost, stg, tgr, gre, res. Similarity between two strings is measured by how many trigrams they share.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Similarity score (0.0 to 1.0)
SELECT similarity('postgres', 'Postgre');
-- Returns: ~0.7 (the strings share most of their trigrams)
-- Distance (1.0 - similarity)
SELECT 'postgres' <-> 'Postgre';
-- Find similar strings
SELECT name FROM products
WHERE name % 'potsgres' -- % is the similarity operator
ORDER BY name <-> 'potsgres';
GIN (or GiST) indexes on trigrams make LIKE and ILIKE queries fast — something normal B-tree indexes can’t do for non-prefix patterns:
CREATE INDEX idx_products_name_trgm ON products USING GIN (name gin_trgm_ops);
-- These now use the index:
SELECT * FROM products WHERE name ILIKE '%postgres%';
SELECT * FROM products WHERE name LIKE '%laptop%';
-- Note: the index makes pattern matching fast; typo tolerance
-- still comes from the % similarity operator shown above.
Trigram indexes are the right tool for:
- User-facing “search as you type” autocomplete
- Fuzzy matching (tolerating typos)
- Substring matching (LIKE '%foo%')
- Similarity-based “did you mean?” suggestions
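A minimal autocomplete query, assuming the trigram index above and a user-typed fragment bound as $1:

```sql
SELECT name, similarity(name, $1) AS score
FROM products
WHERE name % $1          -- matches above pg_trgm.similarity_threshold (default 0.3)
ORDER BY name <-> $1     -- closest matches first
LIMIT 10;
```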
Multi-table Full Text Search
Real applications search across multiple tables. The options:
Union Approach
SELECT id, 'article' AS type, title, ts_rank(search_vector, query) AS rank
FROM articles, websearch_to_tsquery('english', 'postgres') AS query
WHERE search_vector @@ query
UNION ALL
SELECT id, 'product' AS type, name AS title, ts_rank(search_vector, query) AS rank
FROM products, websearch_to_tsquery('english', 'postgres') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
Search Index Table
For complex multi-entity search, a dedicated search index table often works best:
CREATE TABLE search_index (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
entity_type TEXT NOT NULL,
entity_id BIGINT NOT NULL,
title TEXT NOT NULL,
search_vector TSVECTOR NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_search_index_vector ON search_index USING GIN (search_vector);
CREATE INDEX idx_search_index_entity ON search_index(entity_type, entity_id);
-- Maintain via triggers or application logic
-- Query:
SELECT entity_type, entity_id, title,
ts_rank(search_vector, query) AS rank
FROM search_index,
websearch_to_tsquery('english', 'postgres performance') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
This pattern centralizes search logic and makes cross-entity search fast. The trade-off is maintaining the index table when source data changes.
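A sketch of trigger-based maintenance for the articles table from earlier. The ON CONFLICT upsert assumes a unique index on (entity_type, entity_id), added here on top of the schema above:

```sql
-- Required for the upsert below (the plain index above is non-unique)
CREATE UNIQUE INDEX idx_search_index_entity_unique
    ON search_index(entity_type, entity_id);

CREATE OR REPLACE FUNCTION sync_article_search()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO search_index (entity_type, entity_id, title, search_vector)
    VALUES ('article', NEW.id, NEW.title,
            setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
            setweight(to_tsvector('english', coalesce(NEW.body, '')), 'C'))
    ON CONFLICT (entity_type, entity_id)
    DO UPDATE SET title = EXCLUDED.title,
                  search_vector = EXCLUDED.search_vector,
                  updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_article_search_sync
AFTER INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION sync_article_search();
```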
Dictionaries and Customization
Postgres’s text search is highly customizable through dictionaries:
Thesaurus: Map terms to canonical equivalents. “Postgres” → “postgresql”, “PG” → “postgresql”.
Synonym dictionaries: Replace words with a set of synonyms during indexing.
Ispell dictionaries: Full morphological analysis dictionaries for specific languages.
Stop word lists: Words to ignore (articles, prepositions).
-- Create a thesaurus entry
CREATE TEXT SEARCH DICTIONARY my_thesaurus (
TEMPLATE = thesaurus,
DictFile = my_thesaurus,
Dictionary = english_stem
);
-- Create a custom configuration using it
ALTER TEXT SEARCH CONFIGURATION custom_english
ALTER MAPPING FOR asciiword WITH my_thesaurus, english_stem;
For most applications, the built-in language configurations are sufficient. Customization pays off for domain-specific vocabulary — medical, legal, technical terms.
Practical Search API
Putting it together into a real search endpoint:
-- Full search with pagination, ranking, and snippets
WITH query AS (
SELECT websearch_to_tsquery('english', $1) AS q
)
SELECT
a.id,
a.title,
a.published_at,
ts_rank_cd(a.search_vector, q.q, 1) AS rank,
ts_headline(
'english',
left(a.body, 1000), -- headline from first 1000 chars
q.q,
'MaxWords=20, MinWords=10, MaxFragments=2, StartSel=<em>, StopSel=</em>'
) AS snippet
FROM articles a, query q
WHERE a.search_vector @@ q.q
AND a.published_at IS NOT NULL
ORDER BY rank DESC, a.published_at DESC
LIMIT $2
OFFSET $3;
Postgres vs. Elasticsearch: An Honest Assessment
Postgres full-text search wins when:
- Your dataset fits on a single machine (up to hundreds of millions of documents)
- You want consistent search with your transactional data (no sync lag)
- You need to filter on relational data alongside search (user permissions, category filters)
- Search is one feature among many, not the core product
- You want to minimize operational complexity
- You need ACID-consistent search (immediately visible after insert)
Elasticsearch wins when:
- Horizontal scaling of search is required (multi-terabyte, billions of documents)
- Complex relevance tuning is a core product requirement (BM25 tuning, learning-to-rank)
- You need real-time faceted navigation with aggregations across large datasets
- Your team is already deeply invested in the Elastic ecosystem
- You have specific language analysis requirements that exceed Postgres’s configurations
The honest truth: Most applications that use Elasticsearch don’t need it. The typical use case — searching products, articles, tickets, comments across a few million records — works perfectly in Postgres. The teams that truly need Elasticsearch are running search as a primary product feature at scale. For everyone else, websearch_to_tsquery + a GIN index is enough.
The synchronization problem alone should give you pause. The moment you add Elasticsearch, you have two sources of truth — your Postgres primary and your Elasticsearch index. They will diverge. You’ll have sync jobs that sometimes fail, search results that don’t reflect recent changes, and bugs that only appear in the search path. None of that complexity exists when search lives in the same database as your data.
That said: Postgres full-text search is good, not magical. If you need deep relevance tuning, advanced language analysis for non-Latin scripts, or complex aggregated facets over billions of documents, Elasticsearch is the right tool. Acknowledge that threshold honestly and stay on the right side of it.
Postgres as a Job Queue
The job queue is the canonical Redis use case. Redis is fast, Redis has pub/sub, Redis has sorted sets for priority queues. Every tutorial adds Redis as the “obvious” choice for background job processing. And so teams end up with Postgres for their data and Redis for their job queue — two databases to operate, two failure modes to handle, and crucially, no way to enqueue a job inside the same transaction that creates the data the job needs.
That last point is what makes Postgres-backed job queues not just viable but better for most applications. When you enqueue a job inside a Postgres transaction, the job is only visible to workers if and when the transaction commits. If the transaction rolls back, the job disappears too. This atomicity is something Redis queues fundamentally cannot provide.
Brandur Leach and Blake Gentry built River — a Postgres-native job queue in Go — to prove this point definitively. River handles millions of jobs per day in production, with sub-millisecond job insertion latency, reliable at-least-once delivery, and zero additional infrastructure dependencies.
The Primitives: SKIP LOCKED and Advisory Locks
The key innovation in modern Postgres-backed job queues is SKIP LOCKED, introduced in PostgreSQL 9.5.
The naive approach to a job queue in any database is:
-- Worker picks up a job
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1 FOR UPDATE;
-- (update status to 'processing')
The problem: FOR UPDATE locks the selected row. When multiple workers run this query simultaneously, they all select the same first pending row and queue up waiting for its lock — the workers serialize on a single row, giving poor concurrency under load.
SKIP LOCKED tells Postgres to skip rows that are currently locked by another transaction, instead of waiting for them:
-- Worker picks up a job, skipping rows other workers are processing
SELECT * FROM jobs
WHERE status = 'pending'
AND scheduled_at <= now()
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
Now multiple workers can run this query concurrently with no contention. Each worker gets a different pending job. If there are no unlocked pending jobs, the query returns zero rows immediately — no waiting.
SKIP LOCKED is the foundation of every serious Postgres job queue implementation.
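The semantics are easy to see in a small in-memory sketch (plain Python, no database — the job records and lock set are illustrative stand-ins for rows and row locks):

```python
# Minimal in-memory sketch of FOR UPDATE SKIP LOCKED semantics.
# Each worker scans pending jobs in order and claims the first one
# whose row is not already locked by another worker.

def fetch_job(jobs, locked):
    """Return the id of the first unlocked pending job, locking it."""
    for job in jobs:
        if job["status"] == "pending" and job["id"] not in locked:
            locked.add(job["id"])   # FOR UPDATE: take the row lock
            return job["id"]
    return None                     # zero rows returned, no waiting

jobs = [{"id": i, "status": "pending"} for i in (1, 2, 3)]
locked = set()

# Three concurrent workers each get a different job; a fourth gets none.
claims = [fetch_job(jobs, locked) for _ in range(4)]
print(claims)  # [1, 2, 3, None]
```

Without SKIP LOCKED, every worker would block on job 1’s lock; with it, each one simply skips ahead to the next unclaimed row.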
Advisory Locks
Advisory locks provide an alternative coordination mechanism. Instead of row-level locks, advisory locks are named locks identified by a 64-bit integer. They’re lighter than row locks and can be used to coordinate across processes:
-- Try to acquire a per-job advisory lock on the oldest pending job.
-- pg_try_advisory_xact_lock returns false if the lock is already held,
-- true if acquired; the lock is released automatically at transaction end.
SELECT id, pg_try_advisory_xact_lock(id) AS acquired
FROM jobs
WHERE status = 'pending'
AND scheduled_at <= now()
ORDER BY created_at
LIMIT 1;
-- If acquired is false, another worker holds the job; move on to the next one.
Some job queue implementations combine these: SKIP LOCKED for fast selection among many workers, advisory locks for longer-running jobs where holding a row lock is expensive.
Building a Job Queue From Scratch
Understanding the fundamentals is useful even if you use a library. Here’s a minimal job queue implementation:
CREATE TYPE job_status AS ENUM ('pending', 'processing', 'completed', 'failed', 'discarded');
CREATE TABLE jobs (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
queue TEXT NOT NULL DEFAULT 'default',
kind TEXT NOT NULL,
args JSONB NOT NULL DEFAULT '{}',
status job_status NOT NULL DEFAULT 'pending',
priority INTEGER NOT NULL DEFAULT 0,
max_attempts INTEGER NOT NULL DEFAULT 3,
attempt INTEGER NOT NULL DEFAULT 0,
attempted_at TIMESTAMPTZ,
scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
finalized_at TIMESTAMPTZ,
errors JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Index for workers to efficiently find work
CREATE INDEX idx_jobs_pending ON jobs(queue, priority DESC, scheduled_at)
WHERE status = 'pending';
Worker fetch query:
-- Fetch and lock a job in one atomic operation
WITH selected AS (
SELECT id FROM jobs
WHERE queue = $1
AND status = 'pending'
AND scheduled_at <= now()
ORDER BY priority DESC, scheduled_at
LIMIT $2 -- batch size
FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing',
attempt = attempt + 1,
attempted_at = now()
FROM selected
WHERE jobs.id = selected.id
RETURNING jobs.*;
Complete the job:
UPDATE jobs
SET status = 'completed',
finalized_at = now()
WHERE id = $1 AND status = 'processing';
Retry a failed job:
UPDATE jobs
SET status = CASE
WHEN attempt >= max_attempts THEN 'discarded'::job_status
ELSE 'pending'::job_status
END,
errors = coalesce(errors, '[]'::jsonb) || jsonb_build_object(
'attempt', attempt,
'error', $2::text,
'at', now()
)::jsonb,
scheduled_at = now() + (power(2, attempt) * interval '1 second'), -- exponential backoff
finalized_at = CASE WHEN attempt >= max_attempts THEN now() END
WHERE id = $1;
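The `power(2, attempt) * interval '1 second'` expression above produces a doubling delay between retries. A quick sketch of the schedule it generates:

```python
# Mirror of the SQL backoff expression: delay = 2^attempt seconds,
# where attempt is the count after the failed try (1, 2, 3, ...).

def backoff_seconds(attempt: int) -> int:
    return 2 ** attempt

schedule = [backoff_seconds(a) for a in range(1, 6)]
print(schedule)  # [2, 4, 8, 16, 32]
```

With max_attempts = 3, a job that keeps failing waits 2s, then 4s, then is discarded; raising max_attempts stretches the tail of this schedule.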
Transactional enqueue (the key advantage):
-- In the same transaction as your business logic:
BEGIN;
WITH new_order AS (
  INSERT INTO orders (user_id, total_cents) VALUES ($1, $2) RETURNING id
)
INSERT INTO jobs (kind, args)
SELECT 'send_order_confirmation', jsonb_build_object('order_id', id)
FROM new_order;
COMMIT;
-- The email job is only enqueued if the order insert commits.
-- No order → no job. No partial state.
River: The Gold Standard
River (riverqueue.com) is a Postgres-native job queue built by Brandur Leach and Blake Gentry. It’s written in Go, though the concepts apply broadly.
What River does well:
High performance. River is designed for throughput. It handles tens of thousands of jobs per second on a single Postgres instance. It batches inserts, uses SKIP LOCKED efficiently, and minimizes roundtrips.
Transactional enqueue. River’s client accepts a pgx.Tx or database/sql.Tx and enqueues jobs within the caller’s transaction. This is the killer feature.
// In Go, with River:
tx, err := db.Begin(ctx)
if err != nil { return err }
defer tx.Rollback(ctx)
// Insert your business data
order, err := queries.WithTx(tx).InsertOrder(ctx, params)
if err != nil { return err }
// Enqueue job in same transaction
_, err = riverClient.InsertTx(ctx, tx, &OrderConfirmationArgs{OrderID: order.ID}, nil)
if err != nil { return err }
return tx.Commit(ctx)
// Job is only enqueued if both the order insert AND commit succeed.
Reliability guarantees. River provides at-least-once delivery. Jobs are not acknowledged until the worker completes them, and failed jobs are retried with exponential backoff. A configurable number of max attempts sends jobs to the “discarded” state.
Multi-queue and priority. Jobs can be assigned to named queues (e.g., “critical”, “bulk”, “email”) with separate worker pools. Priority within a queue is configurable per job.
Job uniqueness. River supports unique jobs — inserting a job with a key that’s already pending is a no-op. Useful for debouncing or deduplication.
Scheduled jobs. Insert a job with a future scheduled_at to delay execution. Build cron-like recurring jobs using the scheduled_at field or River’s periodic job functionality.
Schema and migrations. River ships its own migration system and manages its own schema (river_job, river_leader, etc.).
The River project demonstrates that a Postgres-backed job queue can be a first-class infrastructure primitive, not a stopgap.
pg-boss: The Node.js Contender
For Node.js applications, pg-boss is the equivalent of River. It provides:
- A rich API for enqueuing, scheduling, and handling jobs
- Transactional job creation via boss.sendDebounced() and boss.insert()
- Job state management, retry, exponential backoff
- Scheduled and deferred jobs
- Job completion handlers and result storage
const boss = new PgBoss(connectionString);
await boss.start();
// Enqueue a job
await boss.send('order-confirmation', { orderId: 123 }, { priority: 2 });
// Worker
boss.work('order-confirmation', async ([job]) => {
await sendConfirmationEmail(job.data.orderId);
});
pg-boss has been battle-tested in production at considerable scale. It’s a mature, actively maintained library.
The Outbox Pattern: Guaranteed Event Publishing
A related pattern — the transactional outbox — uses the same Postgres-based atomicity for reliable event publishing:
CREATE TABLE outbox_events (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
aggregate_type TEXT NOT NULL,
aggregate_id BIGINT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
published_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Within a transaction:
BEGIN;
-- Business operation
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Record the event in the same transaction
INSERT INTO outbox_events (aggregate_type, aggregate_id, event_type, payload)
VALUES ('transfer', 1, 'MoneyTransferred', '{"from": 1, "to": 2, "amount": 100}');
COMMIT;
A separate process (or Debezium, listening to WAL changes) reads the outbox table and publishes events to Kafka or another message broker. Because the event is written in the same transaction as the business operation, you’re guaranteed that either both happen or neither does.
This is how you bridge the gap between Postgres and a message queue like Kafka: the outbox table is your reliable bridge.
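The relay process itself is just a poll-publish-mark loop. A minimal in-memory sketch (the event shape and the publish callback are illustrative, not a real broker client):

```python
# Outbox relay sketch: read unpublished events in insertion order,
# publish each one, then mark it published. If a publish raises,
# the event stays unpublished and is retried on the next pass —
# giving at-least-once delivery to the downstream broker.

def relay(outbox, publish):
    for event in outbox:
        if event["published"]:
            continue
        publish(event["payload"])   # e.g. produce to Kafka
        event["published"] = True   # UPDATE ... SET published_at = now()

sent = []
outbox = [
    {"payload": "MoneyTransferred:1", "published": False},
    {"payload": "MoneyTransferred:2", "published": False},
]
relay(outbox, sent.append)
print(sent)  # ['MoneyTransferred:1', 'MoneyTransferred:2']
```

Because the mark happens after the publish, a crash between the two produces a duplicate, not a loss — consumers should be idempotent.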
Dead Letter Queue and Observability
Production job queues need dead letter queues — a place where permanently-failed jobs land for human review:
-- Jobs that have exhausted all retries
SELECT id, kind, args, attempt, errors, finalized_at
FROM jobs
WHERE status = 'discarded'
ORDER BY finalized_at DESC;
-- Queue depth by queue and priority
SELECT queue, count(*) FILTER (WHERE status = 'pending') AS pending,
count(*) FILTER (WHERE status = 'processing') AS processing,
count(*) FILTER (WHERE status = 'failed') AS failed,
count(*) FILTER (WHERE status = 'discarded') AS discarded
FROM jobs
GROUP BY queue
ORDER BY queue;
-- Average wait time (time from enqueue to processing start)
SELECT queue, kind,
avg(extract(epoch from (attempted_at - created_at))) AS avg_wait_seconds
FROM jobs
WHERE status IN ('processing', 'completed')
AND attempted_at IS NOT NULL
AND created_at > now() - interval '1 hour'
GROUP BY queue, kind;
Maintenance: Cleaning Up Completed Jobs
Completed and discarded jobs accumulate. Set up a periodic cleanup:
-- Delete jobs completed more than 7 days ago
DELETE FROM jobs
WHERE status IN ('completed', 'discarded')
AND finalized_at < now() - interval '7 days';
River handles this automatically via its maintenance system. If you’re rolling your own, this cleanup should run regularly — either via a cron job, a pg_cron schedule (Chapter 11), or a maintenance worker.
Postgres vs. Redis/RabbitMQ: The Real Comparison
Where Postgres job queues win:
- Transactional enqueue — insert a job and business data atomically
- Joins — query jobs alongside business data in one query
- No sync required — jobs are visible immediately, same transaction
- ACID semantics — at-least-once delivery with strong consistency
- No additional infrastructure — one less database to operate, monitor, and back up
- Rich query capabilities — complex job routing, priority queues, scheduled jobs
- Point-in-time recovery works for jobs too
Where Redis/RabbitMQ win:
- Raw throughput at very high rates (millions of messages per second)
- Pub/sub broadcast patterns (Postgres has LISTEN/NOTIFY but it’s not a full pub/sub)
- Streaming workloads (Redis Streams, RabbitMQ exchanges)
- Jobs that need sub-millisecond enqueue latency at extreme scale
- Interoperability with other systems that don’t know about Postgres
The honest threshold: If your workload is in the hundreds or low thousands of jobs per second, a Postgres job queue is not your bottleneck. River has been benchmarked at tens of thousands of jobs/second on a single instance — that covers the vast majority of workloads. The teams that genuinely need Redis for job queueing have extreme throughput requirements, not the typical “send a welcome email” background job.
The ACID transaction story is not just convenience — it prevents entire categories of bugs. With a Postgres job queue, it’s impossible to insert a job for a record that doesn’t exist (because the insert would roll back). It’s impossible to miss a job because the process crashed between the database write and the enqueue. These are real bugs that Redis-based job queues require application-level workarounds to handle.
For most applications: Postgres is enough for job queuing, and the atomic transactional enqueue makes it better.
Postgres as a Key-Value Store
Redis is fast. There’s no disputing that. A single Redis instance can handle a million operations per second on simple key-value workloads — that’s an order of magnitude faster than what most Postgres setups deliver on random single-row reads. If you need that throughput profile, Redis is the right tool.
But that’s not most Redis usage. The actual Redis usage in most production applications looks like this: sessions, rate limiting counters, feature flags, user preferences, configuration values, temporary tokens, and caches. Most of these workloads have a few hundred or a few thousand operations per second, not millions. And for those workloads, Postgres — with the right configuration — is competitive, consistent, and eliminates a database dependency.
The HSTORE Type
PostgreSQL has a native key-value type: HSTORE. It stores a flat map of string key-value pairs as a single column value.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE user_preferences (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
prefs HSTORE NOT NULL DEFAULT ''::hstore
);
-- Insert
INSERT INTO user_preferences (user_id, prefs)
VALUES (1, 'theme => dark, language => en, timezone => UTC');
-- Update a single key
UPDATE user_preferences
SET prefs = prefs || 'notifications_enabled => false'::hstore
WHERE user_id = 1;
-- Read a key
SELECT prefs -> 'theme' FROM user_preferences WHERE user_id = 1;
-- Check key existence
SELECT * FROM user_preferences WHERE prefs ? 'notifications_enabled';
-- Delete a key
UPDATE user_preferences
SET prefs = delete(prefs, 'deprecated_key')
WHERE user_id = 1;
HSTORE supports GIN and GiST indexing for key existence and containment queries:
-- Find all users with dark theme
CREATE INDEX idx_user_prefs ON user_preferences USING GIN (prefs);
SELECT user_id FROM user_preferences WHERE prefs @> 'theme => dark';
HSTORE vs. JSONB: HSTORE is a flat string→string map. JSONB supports nested structures, mixed types (numbers, booleans, arrays, objects), and richer query operators. For simple flat key-value storage, HSTORE is slightly more efficient. For anything with structure or mixed types, JSONB is better. In practice, most teams use JSONB for everything and HSTORE is rarely worth reaching for.
Unlogged Tables: Trading Durability for Speed
The single most impactful Postgres feature for key-value-like performance is unlogged tables.
Normal Postgres tables write all changes to the WAL before writing to the heap. This durability guarantee is what makes Postgres crash-safe — but WAL writes have a cost. For data that’s inherently ephemeral (sessions, caches, rate limiting counters, temporary computation results), WAL durability is pure overhead.
Unlogged tables bypass WAL writes entirely:
CREATE UNLOGGED TABLE sessions (
token TEXT PRIMARY KEY,
user_id BIGINT NOT NULL,
data JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_sessions_user_id ON sessions(user_id);
CREATE INDEX idx_sessions_expires_at ON sessions(expires_at);
The performance difference is significant — unlogged tables are typically 5-10x faster for write-heavy workloads. The trade-off:
On crash recovery: Unlogged tables are truncated on database startup after an unclean shutdown. Any data in them is gone. This is acceptable for sessions (users just have to log in again), caches (they get rebuilt from the source of truth), and rate limiting counters (a brief reset is usually tolerable).
Replication: Unlogged tables are not replicated via streaming replication. Standbys have empty versions of unlogged tables. If you fail over to a standby, the unlogged data is gone on the standby.
Unlogged tables are a deliberate durability trade-off. Use them when the data is truly ephemeral.
-- A rate limiter using an unlogged table
CREATE UNLOGGED TABLE rate_limits (
key TEXT NOT NULL,
window_start TIMESTAMPTZ NOT NULL,
request_count INTEGER NOT NULL DEFAULT 0,
PRIMARY KEY (key, window_start)
);
-- Increment counter, upsert-style
INSERT INTO rate_limits (key, window_start, request_count)
VALUES ($1, date_trunc('minute', now()), 1)
ON CONFLICT (key, window_start)
DO UPDATE SET request_count = rate_limits.request_count + 1
RETURNING request_count;
-- If request_count > limit, reject the request
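The fixed-window logic in that upsert — truncate the timestamp to the minute, increment a per-(key, window) counter — can be sketched in plain Python, with a dict standing in for the unlogged table:

```python
# Fixed-window rate limiter: one counter per (key, minute-window).
# Mirrors the INSERT ... ON CONFLICT (key, window_start) DO UPDATE upsert.

def hit(counters, key, window, limit):
    """Record a request; return True if allowed, False if over the limit."""
    counters[(key, window)] = counters.get((key, window), 0) + 1
    return counters[(key, window)] <= limit

counters = {}
results = [hit(counters, "user:1", "12:00", limit=3) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

When the clock rolls into the next minute, the window key changes and the count effectively resets — the same behavior you get from `date_trunc('minute', now())` in the SQL version.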
A Simple Key-Value API in Postgres
For applications that want a true key-value store in Postgres, a purpose-built table works well:
CREATE UNLOGGED TABLE kv_store (
namespace TEXT NOT NULL DEFAULT 'default',
key TEXT NOT NULL,
value JSONB NOT NULL,
expires_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (namespace, key)
);
CREATE INDEX idx_kv_expires ON kv_store(expires_at)
WHERE expires_at IS NOT NULL;
Get:
SELECT value FROM kv_store
WHERE namespace = $1 AND key = $2
AND (expires_at IS NULL OR expires_at > now());
Set:
INSERT INTO kv_store (namespace, key, value, expires_at)
VALUES ($1, $2, $3::jsonb, $4)
ON CONFLICT (namespace, key)
DO UPDATE SET value = EXCLUDED.value,
expires_at = EXCLUDED.expires_at,
updated_at = now();
Delete:
DELETE FROM kv_store WHERE namespace = $1 AND key = $2;
Expire cleanup (run periodically via pg_cron or a background worker):
DELETE FROM kv_store WHERE expires_at < now();
This is a fully functional key-value store with TTL support, namespacing, and JSON values. Connection pooling (PgBouncer in transaction mode) makes the per-operation overhead minimal.
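For reference, the same get/set/delete-with-TTL semantics can be sketched as an in-memory class (illustrative only — the point of the SQL version is that the data survives application restarts):

```python
import time

# In-memory mirror of the kv_store semantics: namespaced keys,
# arbitrary values, optional TTL checked lazily on read.

class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, ns, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[(ns, key)] = (value, expires)   # upsert, like ON CONFLICT

    def get(self, ns, key):
        entry = self._data.get((ns, key))
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and expires <= time.monotonic():
            return None                            # expired: treated as missing
        return value

    def delete(self, ns, key):
        self._data.pop((ns, key), None)

kv = KVStore()
kv.set("default", "theme", {"mode": "dark"})
print(kv.get("default", "theme"))  # {'mode': 'dark'}
```

Note that, like the SQL `SELECT ... AND expires_at > now()`, expiry is enforced on read; the periodic `DELETE` is only about reclaiming space.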
Connection Pooling Is Mandatory
Postgres’s process-per-connection model means connections are expensive. Each connection uses ~5-10MB of memory and involves OS process management overhead. For a key-value workload with many short operations, the bottleneck quickly becomes connection management, not query execution.
PgBouncer, running in transaction mode, solves this:
Application → PgBouncer → Postgres
(1000 app connections) → (20-50 server connections)
In transaction mode, PgBouncer assigns a Postgres connection to an application for the duration of a transaction, then returns it to the pool. For key-value operations (which are typically single-statement transactions), this means the connection is held for milliseconds, enabling many more effective operations per second.
Configure PgBouncer for a key-value workload:
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 10000 ; High — many app connections are fine
default_pool_size = 50 ; Low — fewer actual Postgres connections
min_pool_size = 10
reserve_pool_size = 5
server_idle_timeout = 600
With PgBouncer in transaction mode, a Postgres instance with 50 server connections can handle thousands of concurrent application connections with sub-millisecond key-value operations.
Benchmark Reality
The throughput gap between Redis and Postgres narrows significantly with the right configuration:
- Redis: ~500,000–1,000,000 simple GET/SET ops/second on a single server
- Postgres (standard table, no pooling): ~5,000–15,000 ops/second on a single connection
- Postgres (unlogged table + PgBouncer transaction mode): ~50,000–200,000+ ops/second
The Postgres numbers vary widely with hardware, connection count, and query complexity. On modern NVMe SSDs with proper tuning, the gap is smaller than the benchmarks of five years ago suggest.
For most application use cases — sessions, feature flags, rate limiting, user preferences — the difference between 50,000 and 500,000 ops/second is irrelevant. Your application will never come close to saturating either system.
Feature Flags in Postgres
Feature flags are a classic Redis use case. In practice, feature flag reads are infrequent relative to other database operations, and the latency is dominated by network round-trips rather than database processing time.
CREATE TABLE feature_flags (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
enabled BOOLEAN NOT NULL DEFAULT false,
rollout_percentage SMALLINT CHECK (rollout_percentage BETWEEN 0 AND 100),
allowed_user_ids BIGINT[],
allowed_groups TEXT[],
metadata JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Check if a feature is enabled for a user ($1 = flag name, $2 = user id)
SELECT
enabled AND (
rollout_percentage = 100 OR
(rollout_percentage IS NOT NULL AND
-- abs(): hashtext() can return negative values
abs(hashtext(name || $2::text)) % 100 < rollout_percentage) OR
(allowed_user_ids IS NOT NULL AND $2::bigint = ANY(allowed_user_ids))
) AS is_enabled
FROM feature_flags
WHERE name = $1;
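The rollout check works because the hash is deterministic: the same user always lands in the same 0–99 bucket for a given flag, so a 30% rollout is sticky rather than random per request. A Python sketch of the same idea (hashlib stands in for Postgres’s hashtext(); the buckets differ, the behavior matches):

```python
import hashlib

# Deterministic percentage rollout: hash(flag_name + user_id) into a
# 0-99 bucket, enable the flag if the bucket falls below the percentage.

def bucket(flag_name: str, user_id: int) -> int:
    digest = hashlib.sha256(f"{flag_name}{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag_name: str, user_id: int, rollout_percentage: int) -> bool:
    return bucket(flag_name, user_id) < rollout_percentage

# 0% is off for everyone, 100% is on for everyone, and repeated checks
# for the same user always give the same answer.
print(is_enabled("new-ui", 42, 0), is_enabled("new-ui", 42, 100))  # False True
```

Raising the percentage only ever adds users to the enabled set — users in buckets below the old threshold stay enabled.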
Most applications read feature flags at application startup or cache them at the application layer with a short TTL — avoiding per-request database reads entirely. The point is that Postgres stores the authoritative state, and your application layer caches it. This is simpler and more reliable than Redis.
Sessions
Storing sessions in Postgres (rather than Redis or in-process memory) provides:
- Session persistence across application restarts
- Consistent session state across multiple application instances
- The ability to query and manage sessions (force-logout a user, list active sessions)
CREATE UNLOGGED TABLE http_sessions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
data JSONB NOT NULL DEFAULT '{}',
ip_address INET,
user_agent TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
last_accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL DEFAULT now() + interval '30 days'
);
-- now() is not immutable, so it cannot appear in a partial index
-- predicate; index the columns directly instead.
CREATE INDEX idx_sessions_user_id ON http_sessions(user_id);
CREATE INDEX idx_sessions_expires ON http_sessions(expires_at);
-- Update last accessed (touch the session)
UPDATE http_sessions
SET last_accessed_at = now(),
expires_at = now() + interval '30 days'
WHERE id = $1 AND expires_at > now()
RETURNING user_id, data;
Most session libraries for Node.js, Python, Go, and Ruby have Postgres adapters that implement this pattern. Using an unlogged table gives you Redis-like performance with no additional infrastructure.
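The touch query gives sessions a sliding expiration — every successful read pushes expires_at another 30 days out. The logic in miniature (in-memory dict standing in for the row):

```python
from datetime import datetime, timedelta

# Sliding-expiration session touch: reading a live session extends it.
# Mirrors UPDATE ... SET expires_at = now() + interval '30 days'
#         WHERE id = $1 AND expires_at > now() RETURNING ...

def touch(session, now, ttl=timedelta(days=30)):
    """Return session data if still live (extending it), else None."""
    if session["expires_at"] <= now:
        return None                       # expired: behaves as if deleted
    session["last_accessed_at"] = now
    session["expires_at"] = now + ttl     # slide the expiry forward
    return session["data"]

now = datetime(2024, 1, 1)
session = {"data": {"user_id": 7}, "expires_at": now + timedelta(days=1)}
print(touch(session, now))   # {'user_id': 7}
print(session["expires_at"]) # 2024-01-31 00:00:00
```

The single UPDATE ... RETURNING in the SQL version does the read, the liveness check, and the extension atomically — one round-trip per request.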
Caching Patterns in Postgres
Postgres itself is not a general-purpose cache (that’s what shared_buffers and the OS page cache are). But application-level caching tables — storing the result of expensive computations or aggregations — can be implemented in Postgres:
CREATE UNLOGGED TABLE computed_cache (
cache_key TEXT PRIMARY KEY,
value JSONB NOT NULL,
computed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL
);
-- Cache aside pattern
-- 1. Check cache
SELECT value FROM computed_cache
WHERE cache_key = $1 AND expires_at > now();
-- 2. If miss, compute and store
INSERT INTO computed_cache (cache_key, value, expires_at)
VALUES ($1, $2::jsonb, now() + interval '5 minutes')
ON CONFLICT (cache_key) DO UPDATE
SET value = EXCLUDED.value,
computed_at = now(),
expires_at = EXCLUDED.expires_at;
This pattern is useful for expensive aggregations, denormalized read models, or computed statistics that change slowly. It’s not a replacement for a full caching layer, but it eliminates a Redis dependency for many workloads.
LISTEN/NOTIFY: The Lightweight Pub/Sub
Redis pub/sub is often used for simple real-time notification patterns — “notify workers that new data is available,” “send a signal when a job finishes.” Postgres has a built-in equivalent: LISTEN / NOTIFY.
-- Notify a channel
NOTIFY job_available, 'queue:default';
-- Or call pg_notify() from a trigger:
CREATE OR REPLACE FUNCTION notify_job_available()
RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify('job_available', NEW.queue);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_notify_jobs
AFTER INSERT ON jobs
FOR EACH ROW EXECUTE FUNCTION notify_job_available();
In application code, a worker can LISTEN on the channel and wake up immediately when a job is inserted, rather than polling:
// Go example with pgx
conn.Exec(ctx, "LISTEN job_available")
for {
notification, err := conn.WaitForNotification(ctx)
// Process notification, fetch a job
}
LISTEN/NOTIFY is not a full pub/sub system — notifications are not durable, can be lost during reconnection, and don’t support message history. But for “wake up a worker when there’s work” patterns, it eliminates Redis entirely.
Postgres vs. Redis for KV: The Honest Assessment
Postgres wins when:
- You need atomic operations with other relational data
- You need querying (find all sessions for a user, flag expiry, TTL management)
- Operational simplicity matters (one database to manage)
- Your throughput requirement is under ~100k ops/second
- You need persistence across database restarts (use a regular table, not unlogged)
Redis wins when:
- You need consistent sub-millisecond latency at extreme scale
- You need millions of operations per second
- You need Redis-specific data structures (sorted sets, hyperloglogs, streams, bitmaps)
- You need pub/sub broadcast to many consumers
- You need LRU eviction policies that automatically manage memory
The key insight: most applications use Redis because it’s easy to add and fast to demo, not because they’ve hit the throughput wall of a properly-configured Postgres. Unlogged tables + PgBouncer transaction mode cover the vast majority of KV workloads. The applications that genuinely need Redis’s throughput profile are the exception, not the rule.
Start with Postgres. Add Redis when you can demonstrate the bottleneck.
Postgres as a Vector Database
Every AI-powered feature generates embeddings. Your RAG pipeline encodes documents as vectors. Your semantic search bar converts queries to vectors. Your recommendation engine represents users and items as vectors. And the conventional wisdom says you need a specialized vector database — Pinecone, Weaviate, Milvus, Qdrant — to store and search them.
You don’t. Not for most workloads.
pgvector is a PostgreSQL extension that adds a native vector type, vector operations, and multiple index types for approximate nearest neighbor (ANN) search. It integrates seamlessly with your existing Postgres tables — vectors live alongside the data they describe, joinable with no synchronization required, secured with row-level security, and backed up with everything else.
pgvector Basics
Install and enable:
CREATE EXTENSION IF NOT EXISTS vector;
Define a vector column with a fixed dimension:
CREATE TABLE documents (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small dimensions
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Insert a vector:
INSERT INTO documents (content, embedding)
VALUES ('PostgreSQL is a powerful open-source relational database', '[0.02, -0.15, 0.08, ...]');
In practice, you’ll generate embeddings in your application code and pass them as arrays:
# Python example with openai + psycopg2
from openai import OpenAI
import psycopg2
client = OpenAI()
def embed(text: str) -> list[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
# Insert
embedding = embed("PostgreSQL is a powerful database")
cur.execute(
"INSERT INTO documents (content, embedding) VALUES (%s, %s)",
(content, embedding)
)
Vector Operators
pgvector supports three distance metrics:
-- L2 distance (Euclidean) — straight-line distance between vectors
SELECT content, embedding <-> '[0.02, -0.15, 0.08, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 10;
-- Negative inner product — same ordering as cosine on normalized vectors, slightly cheaper
SELECT content, embedding <#> '[0.02, -0.15, 0.08, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 10;
-- Cosine distance — measures angle, invariant to magnitude
SELECT content, embedding <=> '[0.02, -0.15, 0.08, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 10;
Which distance metric to use depends on how the embeddings were generated:
- OpenAI embeddings: Cosine similarity (or inner product on normalized vectors) is recommended by OpenAI
- Sentence transformers: Cosine similarity is standard
- Custom embeddings: Check your model documentation
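For unit-length embeddings (OpenAI normalizes theirs), the choice matters less than it looks: L2 distance, negative inner product, and cosine distance are monotonic transforms of one another, so they produce the same nearest-neighbor ordering. A quick check in plain Python:

```python
import math

# For unit vectors: l2^2 = 2 - 2*cos_sim and inner product = cos_sim,
# so all three pgvector metrics rank neighbors identically.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.9], [-1.0, 0.0, 1.0])]

by_l2     = sorted(range(3), key=lambda i: l2(query, docs[i]))
by_neg_ip = sorted(range(3), key=lambda i: -dot(query, docs[i]))
by_cosine = sorted(range(3), key=lambda i: 1 - dot(query, docs[i]))
print(by_l2 == by_neg_ip == by_cosine)  # True
```

The practical consequence: on normalized vectors you can pick the operator (and matching index opclass) by performance, not by ranking behavior.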
You can also compute vector functions:
-- L2 norm (magnitude)
SELECT vector_norm(embedding) FROM documents WHERE id = 1;
-- Element-wise average of multiple vectors
SELECT avg(embedding) FROM documents WHERE category = 'database';
Index Types: IVFFlat and HNSW
Without an index, pgvector does exact nearest neighbor search — scanning every row and computing the distance to the query vector. This is exact but O(n). For large tables, it’s impractical.
pgvector provides two approximate nearest neighbor (ANN) index types:
IVFFlat (Inverted File with Flat Compression)
IVFFlat divides the vector space into clusters (Voronoi cells) and, at query time, searches only the most relevant clusters. It trades some accuracy for much faster queries.
-- Create IVFFlat index
-- lists = number of cluster centroids; typically sqrt(n_rows) is a good starting point
CREATE INDEX idx_documents_embedding_ivfflat
ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Tuning IVFFlat:
- lists: Number of clusters. More lists = higher build cost but better accuracy at query time. Rule of thumb: sqrt(n) to n/1000.
- probes (per-query): How many clusters to search. Higher = more accurate but slower. Set per session: SET ivfflat.probes = 10;
IVFFlat limitations:
- Must be built after data is inserted (it learns cluster positions from existing data). Adding data after index creation may degrade accuracy unless the index is rebuilt.
- The number of lists should be set based on expected data volume.
HNSW (Hierarchical Navigable Small World)
HNSW is a graph-based ANN index. It builds a multi-layer graph where each layer is a progressively coarser approximation of the data. Queries start at the top (coarse) layer and navigate toward the query vector, descending to finer layers.
-- Create HNSW index
CREATE INDEX idx_documents_embedding_hnsw
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW parameters:
- m: Number of connections per node in the graph. Higher = better recall, larger index, slower build. Default 16; 8–64 is typical.
- ef_construction: Search breadth during index construction. Higher = better recall, slower build. Default 64; 64–200 is typical.
- ef_search (per-query): Search breadth at query time (SET hnsw.ef_search = 100;). Higher = better recall, slower queries. Default 40.
HNSW vs. IVFFlat:
| Aspect | IVFFlat | HNSW |
|---|---|---|
| Build time | Faster | Slower |
| Query time | Slower | Faster |
| Recall at low lists/probes | Lower | Higher |
| Incrementally updated? | Degrades over time | Handles inserts well |
| Memory usage | Lower | Higher |
| Best for | Large static datasets, memory-constrained | Dynamic datasets, high-throughput queries |
For most production workloads with continuous data ingestion, HNSW is the better choice. For batch-loaded, rarely-updated datasets, IVFFlat is more efficient.
Measuring Recall
ANN search is approximate — it won’t always return the true nearest neighbors. Recall measures what fraction of the true top-k neighbors are found:
-- Exact nearest neighbors (no index, full scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <=> $1::vector AS distance
FROM documents
ORDER BY distance
LIMIT 10;
-- Approximate (with index)
SET enable_seqscan = off; -- Force index use for comparison
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <=> $1::vector AS distance
FROM documents
ORDER BY distance
LIMIT 10;
Comparing results from both gives you a recall estimate. Acceptable recall depends on your use case — for recommendation systems, 90% recall is often fine; for factual RAG pipelines, you may want 99%+.
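The recall computation itself is a one-liner: the fraction of the exact top-k ids that also appear in the approximate top-k. A sketch:

```python
# recall@k: |exact ∩ approx| / k, comparing the id sets returned by
# the exact (sequential scan) and approximate (ANN index) queries.

def recall_at_k(exact_ids, approx_ids):
    k = len(exact_ids)
    return len(set(exact_ids) & set(approx_ids)) / k

exact  = [11, 4, 9, 2, 7]          # true nearest neighbors
approx = [11, 4, 2, 7, 13]         # ANN result: missed id 9
print(recall_at_k(exact, approx))  # 0.8
```

Run this over a sample of representative query vectors, then tune probes or ef_search upward until the average recall clears your threshold.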
Hybrid Search: Vectors + Full Text + Filters
The most powerful pgvector feature is seamless integration with the rest of Postgres. Real-world searches almost always combine semantic similarity with metadata filters.
-- Semantic search within a specific category, by a specific author
SELECT
d.id,
d.title,
d.content,
d.embedding <=> $1::vector AS semantic_distance
FROM documents d
WHERE d.category = $2
AND d.published = true
AND d.author_id = ANY($3::bigint[])
ORDER BY semantic_distance
LIMIT 20;
This runs the vector search and applies the filters in one query. In a dedicated vector database, you'd need to over-fetch candidates and filter post-retrieval, or depend on whatever metadata-filtering support the database offers, which is less efficient and more complex.
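A brute-force model makes the semantics of that query concrete: filter the candidate rows first, then rank the survivors by distance. A sketch (field names follow the example schema above):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, matching the <=> operator with vector_cosine_ops."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def filtered_search(rows, query_vec, category, limit=20):
    """Apply the metadata filters, then rank by semantic distance --
    conceptually what the single SQL query above does."""
    candidates = [r for r in rows if r["category"] == category and r["published"]]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return candidates[:limit]
```

Postgres can do better than this brute-force pass (it can combine the ANN index with filter predicates), but the result is the same: one query, one consistent answer.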
Reciprocal Rank Fusion (RRF) for Hybrid Search
Combining semantic search with full-text search using RRF gives better results than either alone:
WITH semantic AS (
SELECT id, row_number() OVER (ORDER BY embedding <=> $1::vector) AS rank
FROM documents
WHERE published = true
LIMIT 100
),
fulltext AS (
SELECT id, row_number() OVER (ORDER BY ts_rank(search_vector, query) DESC) AS rank
FROM documents,
websearch_to_tsquery('english', $2) AS query
WHERE search_vector @@ query
AND published = true
LIMIT 100
),
rrf AS (
SELECT
COALESCE(s.id, f.id) AS id,
COALESCE(1.0 / (60 + s.rank), 0) +
COALESCE(1.0 / (60 + f.rank), 0) AS score
FROM semantic s
FULL OUTER JOIN fulltext f ON f.id = s.id
)
SELECT d.id, d.title, rrf.score
FROM rrf
JOIN documents d ON d.id = rrf.id
ORDER BY rrf.score DESC
LIMIT 20;
Reciprocal Rank Fusion with a constant of 60 is a well-established algorithm for combining rankings. This query does full-text and semantic search simultaneously and merges the results — in a single SQL query, in a single database.
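The RRF formula itself is tiny: score(d) = sum over rankings of 1 / (k + rank(d)), with k = 60. A sketch that mirrors what the CTE computes:

```python
def rrf_merge(rankings, k=60):
    """Combine multiple ranked ID lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["b", "a", "c"]  # IDs ranked by vector distance
fulltext = ["a", "d", "b"]  # IDs ranked by ts_rank
# "a" ranks high in both lists, so it wins overall
assert rrf_merge([semantic, fulltext]) == ["a", "b", "d", "c"]
```

The constant k = 60 damps the influence of any single ranking's top positions; documents that appear in both lists accumulate score from each.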
RAG (Retrieval-Augmented Generation) with pgvector
pgvector is an excellent foundation for RAG pipelines:
-- Store document chunks with their embeddings
CREATE TABLE document_chunks (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
token_count INTEGER,
embedding vector(1536),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (document_id, chunk_index)
);
CREATE INDEX idx_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Retrieve relevant chunks for a query
SELECT
dc.content,
d.title,
d.url,
dc.embedding <=> $1::vector AS distance
FROM document_chunks dc
JOIN documents d ON d.id = dc.document_id
WHERE d.accessible_to_user_id = $2 -- Row-level access control
ORDER BY distance
LIMIT 5;
This is a complete RAG retrieval layer in Postgres. Access control via RLS or join conditions ensures users only see documents they’re allowed to see — something that requires additional application logic in dedicated vector databases.
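The ingestion side of this pipeline, splitting documents into overlapping chunks before embedding them, is plain application code. A word-based sketch (the sizes are illustrative; production pipelines usually chunk by tokens from the embedding model's tokenizer):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks, yielding
    (chunk_index, content) pairs ready for the document_chunks table."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        piece = words[start:start + chunk_size]
        if not piece:
            break
        chunks.append((i, " ".join(piece)))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```

Each `(chunk_index, content)` pair maps onto a row in `document_chunks`; the embedding is computed by your model of choice and stored in the `embedding` column.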
Multi-Vector and Sparse-Dense Hybrid Search
Some embedding approaches use multiple vectors per document (e.g., ColBERT, which stores one vector per token). pgvector supports this naturally:
CREATE TABLE colbert_embeddings (
document_id BIGINT NOT NULL REFERENCES documents(id),
token_index INTEGER NOT NULL,
embedding vector(128),
PRIMARY KEY (document_id, token_index)
);
Sparse-dense hybrid models (like SPLADE + dense embeddings) can also be implemented by storing a JSONB column for sparse vectors alongside the dense vector column.
Dimensionality and Model Considerations
Vector dimensions range from 256 (compact models) to 3072 (OpenAI text-embedding-3-large). Larger dimensions generally give better semantic accuracy but have higher storage and compute costs.
Storage cost for a vector column:
- `vector(384)` (Sentence-BERT small): ~1.5 KB per row
- `vector(1536)` (OpenAI text-embedding-3-small): ~6 KB per row
- `vector(3072)` (OpenAI text-embedding-3-large): ~12 KB per row
At 1 million documents with vector(1536), the embedding column alone is ~6GB. At 100 million documents, it’s 600GB. Plan your storage accordingly.
HNSW index size is typically 1–3x the size of the vector data it indexes.
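The arithmetic behind those numbers: pgvector stores each dimension as a 4-byte float plus a small per-value header. A sketch (the 8-byte header is an approximation; TOAST and page overhead add more in practice):

```python
def vector_storage_bytes(dims, rows, header_bytes=8):
    """Approximate on-disk size of a vector column:
    4 bytes per float32 dimension plus a small per-value header."""
    return rows * (dims * 4 + header_bytes)

# ~6 GB for 1 million rows of vector(1536)
gb = vector_storage_bytes(1536, 1_000_000) / 1024**3
assert 5.5 < gb < 6.5
```

Multiply by the HNSW index's 1–3x factor on top of this when budgeting disk and, more importantly, RAM for the working set.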
pgvector vs. Dedicated Vector Databases
Postgres + pgvector wins when:
- You need to filter by metadata alongside vector similarity (no sync required)
- You need transactional consistency — vectors and relational data change together
- You’re storing < ~100M vectors (pgvector handles this range well)
- You need row-level security on vector search
- You want one database to operate
- You want to join vector results with relational data
Dedicated vector databases (Pinecone, Weaviate, Qdrant, Milvus) win when:
- You’re operating at 100M+ vectors and need horizontal scaling
- You need sub-10ms ANN search at very high query rates (thousands of QPS)
- You need multi-tenancy with strong isolation at scale
- You need advanced reranking or neural retrieval features
- Your team is building a search product, not using search as one feature
The practical threshold: For most AI-powered features — RAG pipelines, semantic search, recommendation widgets — pgvector handles the scale. The applications that need Pinecone are the ones where vector search is the core product, not a supporting feature. If you’re adding a “similar documents” feature to your blog platform or semantic search to your internal knowledge base, pgvector is enough.
The hybrid search story is where pgvector especially shines. Running semantic search, keyword search, and metadata filtering in a single query with ACID guarantees is genuinely difficult with dedicated vector databases. In Postgres, it’s a CTE.
Operational Notes
Maintenance: HNSW indexes don’t require explicit maintenance. IVFFlat indexes should be rebuilt periodically if data distribution changes significantly.
Parallel builds: pgvector supports parallel index builds. Set max_parallel_maintenance_workers appropriately.
Progress monitoring:
SELECT phase, blocks_done, blocks_total,
round(blocks_done::numeric / nullif(blocks_total, 0) * 100, 2) AS pct
FROM pg_stat_progress_create_index
WHERE relid = 'documents'::regclass;
ivfflat.probes and hnsw.ef_search can be set per session or per transaction for workload-specific tuning: low values for fast, lower-recall queries; high values for critical retrieval that needs high recall.
The emergence of pgvector represents the same arc as JSONB and SKIP LOCKED before it: a capability once requiring a specialized tool, now integrated into Postgres with a level of quality that meets production requirements for most workloads. The trend line is clear.
Postgres as a Time-Series Database
Time-series data is everywhere: application metrics, IoT sensor readings, financial tick data, audit logs, user events, system telemetry. The conventional wisdom is that relational databases are bad at time series and you should reach for InfluxDB, TimescaleDB, or ClickHouse.
Partially true, partially not. Vanilla Postgres handles time-series workloads better than most people realize. TimescaleDB — which is Postgres — handles them even better. And for most applications generating moderate volumes of time-stamped data, the native Postgres approach is sufficient.
What Makes Time-Series Data Different
Time-series data has characteristics that stress ordinary relational tables:
- Append-heavy writes: New data arrives continuously; old data is rarely updated
- Time-range queries: “Give me all readings between 9am and 10am yesterday”
- Recent data is hot: Queries almost always touch recent data; old data is accessed infrequently
- Data volume: Time-series data grows without bound; retention and downsampling are operational requirements
- Aggregation patterns: “Average temperature per 5-minute bucket over the last week”
A plain events table with no special handling will work fine at small scale but degrade as it grows — indexes become large, queries scan too much data, and writes slow down.
The Postgres solutions:
- Declarative partitioning by time range for large tables
- BRIN indexes for efficient time-range queries
- pg_partman for automated partition management
- Materialized views for pre-aggregated rollups
- TimescaleDB for production-grade time-series at scale
Native Postgres: Time-Range Partitioning
The most impactful optimization for time-series data in Postgres is range partitioning on the timestamp column.
-- Create a partitioned events table
CREATE TABLE metrics (
id BIGINT NOT NULL,
metric_name TEXT NOT NULL,
value DOUBLE PRECISION NOT NULL,
labels JSONB NOT NULL DEFAULT '{}',
recorded_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (id, recorded_at) -- partition key must be in PK
) PARTITION BY RANGE (recorded_at);
-- Create initial partitions (monthly)
CREATE TABLE metrics_2024_01 PARTITION OF metrics
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE metrics_2024_02 PARTITION OF metrics
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- ... etc.
-- Default partition catches anything outside defined ranges
CREATE TABLE metrics_default PARTITION OF metrics DEFAULT;
With range partitioning on recorded_at:
- Queries with time-range predicates benefit from partition pruning — Postgres only scans partitions that could contain matching rows
- Old data can be dropped instantly by dropping an entire partition (no DELETE, no VACUUM)
- Indexes on each partition are smaller and faster to scan
The partition granularity (daily, monthly, yearly) depends on your write volume and query patterns. Monthly partitions work well for most applications. Daily partitions work for very high write rates. Yearly partitions may be too coarse for retention purposes.
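Before reaching for pg_partman, teams often script partition creation. A minimal generator for monthly partitions (table and column layout follow the example above; this is a sketch, not a full management tool):

```python
from datetime import date

def monthly_partition_ddl(table, year, month):
    """Emit the CREATE TABLE ... PARTITION OF statement for one month."""
    start = date(year, month, 1)
    # Roll over to January of the next year after December
    end = date(year + (month == 12), month % 12 + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (f"CREATE TABLE {name} PARTITION OF {table}\n"
            f"    FOR VALUES FROM ('{start}') TO ('{end}');")

print(monthly_partition_ddl("metrics", 2024, 1))
```

A cron job that runs this a few months ahead of `now()` covers the common case; pg_partman (covered later) does the same thing with retention handling built in.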
BRIN Indexes on Partitions
Within each partition, a BRIN index on recorded_at is extremely efficient:
CREATE INDEX idx_metrics_2024_01_recorded_at
ON metrics_2024_01 USING BRIN (recorded_at);
Since time-series data is naturally inserted in roughly chronological order, the physical order of rows in each partition correlates with recorded_at. BRIN’s block-range min/max statistics are very accurate, allowing it to quickly identify which blocks to read.
For high-selectivity queries (short time ranges), you might prefer a B-tree index:
CREATE INDEX idx_metrics_2024_01_recorded_at_btree
ON metrics_2024_01(recorded_at, metric_name);
The choice depends on your query patterns. BRIN is tiny and fast for range scans; B-tree is better for point queries and high-selectivity filters.
Time-Bucket Aggregations
Time-series queries almost always involve grouping into time buckets:
-- Average metric value per 5-minute bucket
SELECT
date_trunc('minute', recorded_at) -
INTERVAL '1 minute' * (EXTRACT(MINUTE FROM recorded_at)::integer % 5) AS bucket,
metric_name,
avg(value) AS avg_value,
min(value) AS min_value,
max(value) AS max_value,
count(*) AS sample_count
FROM metrics
WHERE metric_name = 'cpu_usage'
AND recorded_at >= now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 1;
The time-bucket expression is verbose in vanilla Postgres. TimescaleDB provides time_bucket():
-- With TimescaleDB:
SELECT
time_bucket('5 minutes', recorded_at) AS bucket,
metric_name,
avg(value) AS avg_value
FROM metrics
WHERE recorded_at >= now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 1;
For vanilla Postgres, a helper function:
CREATE OR REPLACE FUNCTION time_bucket(bucket_size interval, ts timestamptz)
RETURNS timestamptz AS $$
  SELECT 'epoch'::timestamptz +
         (EXTRACT(epoch FROM ts)::bigint / EXTRACT(epoch FROM bucket_size)::bigint) *
         EXTRACT(epoch FROM bucket_size)::bigint * interval '1 second'
$$ LANGUAGE SQL IMMUTABLE;
Or more simply for fixed intervals:
-- Hourly buckets
date_trunc('hour', recorded_at)
-- 15-minute buckets
date_trunc('hour', recorded_at) +
INTERVAL '15 minutes' * floor(EXTRACT(MINUTE FROM recorded_at) / 15)
Materialized Views for Rollups
Querying raw time-series data for aggregations over long time ranges is expensive. Pre-aggregating into rollup tables via materialized views dramatically improves query performance at the cost of some write overhead and refresh latency.
-- Hourly rollup materialized view
CREATE MATERIALIZED VIEW metrics_hourly AS
SELECT
date_trunc('hour', recorded_at) AS hour,
metric_name,
avg(value) AS avg_value,
min(value) AS min_value,
max(value) AS max_value,
percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95,
count(*) AS sample_count
FROM metrics
WHERE recorded_at >= '2024-01-01'
GROUP BY 1, 2;
CREATE INDEX idx_metrics_hourly ON metrics_hourly(metric_name, hour DESC);
-- Refresh strategy: either on a schedule (via pg_cron) or incrementally
-- Full refresh:
REFRESH MATERIALIZED VIEW CONCURRENTLY metrics_hourly;
REFRESH MATERIALIZED VIEW CONCURRENTLY allows reads during refresh (it swaps in the new version atomically). Without CONCURRENTLY, the view is locked for the duration.
REFRESH CONCURRENTLY requires a UNIQUE index on the view, because Postgres diffs the old and new contents row by row. Note that vanilla Postgres has no true incremental refresh: REFRESH always recomputes the view in full.
CREATE UNIQUE INDEX idx_metrics_hourly_unique ON metrics_hourly(metric_name, hour);
Hierarchical Rollups
A common pattern is multi-level rollups: raw → 5-minute → 1-hour → 1-day. Each level is computed from the previous:
-- 5-minute rollup from raw
CREATE MATERIALIZED VIEW metrics_5min AS
SELECT time_bucket('5 minutes', recorded_at) AS bucket, ...;
-- 1-hour rollup from 5-minute
CREATE MATERIALIZED VIEW metrics_1hour AS
SELECT date_trunc('hour', bucket) AS hour, ...
FROM metrics_5min
GROUP BY ...;
This avoids reprocessing raw data for higher-level aggregations.
Data Retention
Time-series data needs retention policies. Keeping five years of second-resolution sensor data is expensive and usually unnecessary.
With partitioning, dropping old data is instant:
-- Drop a partition whose data has aged out of the retention window
DROP TABLE IF EXISTS metrics_2024_01; -- instant, no DELETE, no VACUUM
More carefully, using pg_partman (see Chapter 11):
-- Configure pg_partman retention
UPDATE partman.part_config
SET retention = '6 months',
retention_keep_table = false
WHERE parent_table = 'public.metrics';
Without partitioning, a scheduled DELETE:
-- Run via pg_cron or a maintenance job
DELETE FROM metrics WHERE recorded_at < now() - interval '6 months';
DELETE on large tables is slow and generates dead tuples that must be vacuumed. Partitioning is strongly preferred for high-volume time-series data.
TimescaleDB: Postgres for Time-Series at Scale
TimescaleDB is a PostgreSQL extension (from Timescale, Inc.) that adds time-series capabilities to Postgres. It’s not a separate database — it’s Postgres with additional time-series features.
Core concept: Hypertables
TimescaleDB abstracts partitioned tables as hypertables — regular-looking tables that are automatically partitioned into chunks under the hood:
-- Enable TimescaleDB
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- Create a regular table
CREATE TABLE metrics (
recorded_at TIMESTAMPTZ NOT NULL,
metric_name TEXT NOT NULL,
value DOUBLE PRECISION NOT NULL,
labels JSONB NOT NULL DEFAULT '{}'
);
-- Convert it to a hypertable, partitioned by 1-day chunks
SELECT create_hypertable('metrics', 'recorded_at', chunk_time_interval => INTERVAL '1 day');
TimescaleDB handles:
- Automatic chunk creation as time advances
- Chunk pruning for time-range queries
- Per-chunk indexes (small, fast)
- Retention policies and chunk deletion
- Compression of old chunks
Continuous Aggregates
TimescaleDB’s most powerful feature is continuous aggregates — materialized views that are incrementally refreshed automatically:
CREATE MATERIALIZED VIEW metrics_1hour
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', recorded_at) AS hour,
metric_name,
avg(value) AS avg_value,
min(value) AS min_value,
max(value) AS max_value,
count(*) AS sample_count
FROM metrics
GROUP BY 1, 2;
-- Policy: refresh continuously, covering new data
SELECT add_continuous_aggregate_policy('metrics_1hour',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
Unlike standard Postgres materialized views, continuous aggregates update incrementally — only processing new data since the last refresh, not the entire dataset. This makes them practical for real-time dashboards.
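The incremental-refresh idea can be modeled as a running aggregate keyed by bucket: only new points touch the state, and the historical buckets are never recomputed. A toy model of what a continuous aggregate maintains:

```python
from collections import defaultdict

class HourlyAggregate:
    """Maintain per-(hour, metric) running stats, updated point by point."""
    def __init__(self):
        self.state = defaultdict(lambda: {"count": 0, "total": 0.0})

    def ingest(self, hour, metric_name, value):
        # Only the bucket this point falls into is touched
        s = self.state[(hour, metric_name)]
        s["count"] += 1
        s["total"] += value

    def avg(self, hour, metric_name):
        s = self.state[(hour, metric_name)]
        return s["total"] / s["count"]

agg = HourlyAggregate()
for v in (10.0, 20.0, 30.0):
    agg.ingest("2024-01-01T10", "cpu_usage", v)
assert agg.avg("2024-01-01T10", "cpu_usage") == 20.0
```

TimescaleDB does this durably and transactionally, with the refresh policy deciding which recent window of buckets gets reconciled against late-arriving data.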
Compression
TimescaleDB can compress old chunks to dramatically reduce storage:
-- Enable compression on chunks older than 7 days
ALTER TABLE metrics SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'metric_name'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');
Compressed chunks typically achieve 10–20x compression ratios for time-series data. They remain queryable but are read-only and slower to scan (decompression happens on read). TimescaleDB is smart about querying mixed compressed/uncompressed chunks.
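Why time-series compresses so well: consecutive timestamps and slowly changing values have tiny deltas, which need far fewer bits than the raw values. A delta-encoding sketch (TimescaleDB's actual codecs are more sophisticated: delta-of-delta for timestamps, Gorilla-style XOR for floats):

```python
def delta_encode(values):
    """Store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Regularly spaced timestamps collapse to a constant, highly compressible delta
timestamps = [1700000000, 1700000010, 1700000020, 1700000030]
assert delta_encode(timestamps) == [1700000000, 10, 10, 10]
assert delta_decode(delta_encode(timestamps)) == timestamps
```

The `segmentby` setting matters for the same reason: grouping each metric's values together makes the deltas small and regular.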
Retention Policies
-- Automatically drop chunks older than 90 days
SELECT add_retention_policy('metrics', INTERVAL '90 days');
Write Performance
TimescaleDB is designed for high-ingest workloads. Tuning recommendations:
- Increase `max_wal_size` for batch inserts
- Set `timescaledb.max_background_workers` appropriately
- Use `COPY` for bulk loads
- Consider `synchronous_commit = off` for non-critical time-series (with understood trade-offs)
When to use TimescaleDB vs. vanilla Postgres:
Use TimescaleDB when:
- You’re storing millions of data points per day
- You need automatic chunk management (don’t want to manage partitions manually)
- Continuous aggregates are a core requirement
- Compression is important for storage costs
- You want time-series-specific query optimizations
Use vanilla Postgres when:
- Your time-series volume is modest (tens of thousands of rows per day)
- You want to minimize extensions and complexity
- You’re already using partitioning via pg_partman
- TimescaleDB's licensing model (the Apache 2.0 core vs. the Timescale License features) is a concern
Window Functions for Time-Series Analysis
Postgres’s window functions are powerful for time-series analysis:
-- Running average (7-period moving average)
SELECT
recorded_at,
metric_name,
value,
avg(value) OVER (
PARTITION BY metric_name
ORDER BY recorded_at
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7
FROM metrics
WHERE recorded_at >= now() - interval '1 day'
ORDER BY metric_name, recorded_at;
-- Compare to previous period
SELECT
date_trunc('hour', recorded_at) AS hour,
metric_name,
avg(value) AS current_avg,
lag(avg(value)) OVER (
PARTITION BY metric_name
ORDER BY date_trunc('hour', recorded_at)
) AS prev_hour_avg
FROM metrics
WHERE recorded_at >= now() - interval '2 days'
GROUP BY 1, 2
ORDER BY 1, 2;
-- Cumulative sum
SELECT
recorded_at,
metric_name,
value,
sum(value) OVER (
PARTITION BY metric_name
ORDER BY recorded_at
ROWS UNBOUNDED PRECEDING
) AS cumulative_sum
FROM metrics
ORDER BY metric_name, recorded_at;
Gaps and Islands: Detecting Missing Data
A common time-series challenge: detecting gaps in data (missing sensor readings, outages):
-- Find gaps larger than 5 minutes in metric data
WITH ordered AS (
SELECT
recorded_at,
lead(recorded_at) OVER (PARTITION BY metric_name ORDER BY recorded_at) AS next_recorded_at
FROM metrics
WHERE metric_name = 'temperature'
AND recorded_at >= now() - interval '24 hours'
),
gaps AS (
SELECT
recorded_at AS gap_start,
next_recorded_at AS gap_end,
next_recorded_at - recorded_at AS gap_duration
FROM ordered
WHERE next_recorded_at - recorded_at > interval '5 minutes'
)
SELECT * FROM gaps ORDER BY gap_start;
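The same logic outside SQL: pair each reading with its successor and flag intervals that exceed the threshold, mirroring the lead() approach above.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=5)):
    """Return (gap_start, gap_end, duration) for every gap exceeding max_gap."""
    ts = sorted(timestamps)
    gaps = []
    for a, b in zip(ts, ts[1:]):  # each reading paired with its successor
        if b - a > max_gap:
            gaps.append((a, b, b - a))
    return gaps

readings = [datetime(2024, 1, 1, 0, 0),
            datetime(2024, 1, 1, 0, 1),
            datetime(2024, 1, 1, 0, 20)]  # 19-minute gap after 00:01
assert len(find_gaps(readings)) == 1
```

For alerting, the SQL version wins: it runs where the data lives and can be scheduled with pg_cron.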
Postgres vs. InfluxDB: The Honest Comparison
Postgres/TimescaleDB wins when:
- You need SQL for time-series analysis (InfluxQL/Flux are less powerful)
- You need to join time-series with relational data
- You want one database to operate and understand
- Your scale is below ~100M data points per day per node
- You need ACID guarantees on time-series writes
- Operational simplicity is a priority
InfluxDB/specialized time-series databases win when:
- You’re ingesting billions of events per day across multiple nodes
- You need horizontal write scaling
- Your team’s workflow is built around Grafana/InfluxDB tooling
- You need line protocol for high-throughput IoT ingestion
- Your data is exclusively time-series with no relational requirements
The honest truth: TimescaleDB has been benchmarked at millions of data points per second on a single node. The threshold at which you need InfluxDB is genuinely high. For application metrics, user events, financial data, and IoT telemetry at typical startup or midsize company scale, TimescaleDB is an excellent choice. And since it’s Postgres, you get SQL, JOINs, window functions, and all the operational benefits of the Postgres ecosystem.
The architecture story also matters: InfluxDB requires a separate database to operate and synchronize. TimescaleDB is your Postgres instance — one backup strategy, one monitoring setup, one mental model.
Time-series data is one of the clearer “Postgres is enough” cases. The specialized time-series databases solve a real problem, but that problem is usually at a scale that most teams don’t face. Start with partitioned tables or TimescaleDB, and re-evaluate when you actually hit the limits.
The Extension Ecosystem
Extensions are what make PostgreSQL a platform rather than just a database. The core Postgres installation is already excellent. The extension ecosystem transforms it into a system that handles geospatial data, fuzzy string matching, full-text with synonyms, time series, vector search, cryptographic operations, job scheduling, and much more — all without leaving Postgres.
This chapter covers the extensions worth knowing and installing, how extensions work under the hood, and how to evaluate whether an extension is worth the dependency.
How Extensions Work
A PostgreSQL extension is a bundle of SQL objects (functions, types, operators, indexes, tables) and optional shared libraries (C code loaded into the Postgres process) that extend the database’s capabilities.
To install an extension, you first make the extension files available to the Postgres installation (usually by installing a package like postgresql-16-pgvector), then enable it in a specific database:
CREATE EXTENSION IF NOT EXISTS vector;
This runs the extension’s SQL install script, creating the types, functions, and other objects in the current database. Extensions are per-database, not per-cluster.
To see installed extensions:
SELECT name, default_version, installed_version, comment
FROM pg_available_extensions
WHERE installed_version IS NOT NULL;
To see what objects an extension created:
SELECT pg_describe_object(classid, objid, objsubid) AS object
FROM pg_depend
WHERE refobjid = (SELECT oid FROM pg_extension WHERE extname = 'pg_trgm')
  AND deptype = 'e'; -- 'e' marks extension membership
Extensions that use C code load a shared library into the Postgres backend processes. These extensions can crash the backend if they have bugs — this is a risk worth understanding. Stick to well-maintained, widely-used extensions.
The Essential Extensions
pg_stat_statements
What it does: Tracks statistics on all SQL statements executed — query text, execution count, total/mean/min/max execution time, rows processed, buffer usage.
Why you need it: Without pg_stat_statements, finding your slow queries means trawling through logs. With it, you can directly query which queries are taking the most time.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Must be loaded at startup (add to postgresql.conf):
-- shared_preload_libraries = 'pg_stat_statements'
-- Top 10 queries by total execution time:
SELECT
round(total_exec_time::numeric, 2) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
left(query, 100) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Install this on every Postgres instance. No exceptions.
pg_trgm
What it does: Trigram-based similarity functions and operators for fuzzy string matching, plus GIN/GiST index support for fast LIKE/ILIKE queries.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_products_name_trgm ON products USING GIN (name gin_trgm_ops);
-- Fast fuzzy search, handles typos
SELECT name FROM products WHERE name % 'postrges';
-- Fast LIKE query (normally requires full scan)
SELECT name FROM products WHERE name ILIKE '%laptop%';
Essential for any application with user-facing search. See Chapter 6.
pgcrypto
What it does: Cryptographic functions — hashing, symmetric and asymmetric encryption, random bytes.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Secure password hashing (bcrypt)
INSERT INTO users (email, password_hash)
VALUES ('alice@example.com', crypt('password123', gen_salt('bf', 12)));
-- Verify password
SELECT id FROM users
WHERE email = 'alice@example.com'
AND password_hash = crypt('password123', password_hash);
-- Random UUID (also available as gen_random_uuid() without pgcrypto in PG 13+)
SELECT gen_random_uuid();
-- Cryptographically random bytes
SELECT encode(gen_random_bytes(32), 'hex');
Note: For password hashing, prefer bcrypt (gen_salt('bf', cost)) over md5 or sha variants. The cost factor (default 6; 10–14 is reasonable on modern hardware) controls bcrypt's computational expense; each increment doubles the work.
uuid-ossp
What it does: Functions for generating various UUID versions.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
SELECT uuid_generate_v4(); -- random UUID
SELECT uuid_generate_v1(); -- timestamp-based UUID (leaks MAC address — avoid)
As of PostgreSQL 13, gen_random_uuid() (v4) is built-in without an extension. For most purposes, you don’t need uuid-ossp. Use gen_random_uuid() directly.
pg_partman
What it does: Automated partition management for range and list partitions. Creates new partitions on schedule, maintains partition sets, enforces retention policies.
CREATE EXTENSION IF NOT EXISTS pg_partman;
-- Register a partitioned table with pg_partman
SELECT partman.create_parent(
p_parent_table => 'public.metrics',
p_control => 'recorded_at',
p_type => 'native',
p_interval => '1 month',
p_premake => 3 -- Create 3 future partitions in advance
);
-- Configure retention
UPDATE partman.part_config
SET retention = '12 months',
retention_keep_table = false,
infinite_time_partitions = true
WHERE parent_table = 'public.metrics';
-- Run maintenance (typically via pg_cron)
SELECT partman.run_maintenance();
pg_partman eliminates the operational burden of manually creating monthly partitions. Instead of a cron job that creates next month’s partition, pg_partman handles it automatically via a background maintenance function.
Use pg_partman for any time-partitioned table that grows indefinitely.
pg_cron
What it does: A cron-style scheduler built into Postgres. Runs SQL statements or stored procedures on a schedule.
-- Must be added to shared_preload_libraries first:
-- shared_preload_libraries = 'pg_cron'
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Clean up expired sessions every hour
SELECT cron.schedule(
'cleanup-sessions',
'0 * * * *', -- cron expression: every hour at :00
$$DELETE FROM sessions WHERE expires_at < now()$$
);
-- Run partition maintenance daily at midnight
SELECT cron.schedule(
'partition-maintenance',
'0 0 * * *',
$$SELECT partman.run_maintenance()$$
);
-- Refresh a materialized view every 5 minutes
SELECT cron.schedule(
'refresh-metrics-view',
'*/5 * * * *',
$$REFRESH MATERIALIZED VIEW CONCURRENTLY metrics_hourly$$
);
-- List scheduled jobs
SELECT jobid, jobname, schedule, command, active
FROM cron.job;
-- View job execution history
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 20;
pg_cron eliminates cron entries on the host machine for database maintenance tasks. Everything lives in Postgres, visible and auditable from SQL.
PostGIS
What it does: The gold standard for geospatial data in any database. Adds geometry and geography types, hundreds of spatial functions, and GiST indexes for spatial queries.
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE locations (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL,
location GEOGRAPHY(POINT, 4326) -- WGS84 lat/lon
);
CREATE INDEX idx_locations_gist ON locations USING GIST (location);
-- Insert a point (longitude, latitude order for GeoJSON compatibility)
INSERT INTO locations (name, location)
VALUES ('Empire State Building', ST_MakePoint(-73.9857, 40.7484));
-- Find locations within 1 km of a point
SELECT name, ST_Distance(location, ST_MakePoint(-73.9857, 40.7484)::geography) AS distance_m
FROM locations
WHERE ST_DWithin(location, ST_MakePoint(-73.9857, 40.7484)::geography, 1000)
ORDER BY distance_m;
PostGIS is mature, extremely capable, and essential for any geospatial application. If your application uses geospatial data and you’re not using PostGIS, you’re doing it wrong.
pgvector
Covered extensively in Chapter 9. The extension for vector similarity search. Essential for any AI-powered feature.
timescaledb
Covered extensively in Chapter 10. The extension for production-grade time-series data management.
pg_repack
What it does: Online table and index reorganization — removes bloat, reclaims space, reorders rows — without locking. VACUUM FULL is the standard approach but locks the table. pg_repack does it concurrently.
-- pg_repack is run as a command-line tool, not a SQL function
-- pg_repack --table orders mydb
-- There is no user-facing SQL entry point; the repack extension only
-- provides internal helpers that the CLI drives. To rebuild only indexes:
-- pg_repack --table orders --only-indexes mydb
Use pg_repack when a table has severe bloat that VACUUM can’t reclaim (e.g., after massive bulk deletes) and you can’t afford VACUUM FULL’s table lock.
pgtap
What it does: A unit testing framework for Postgres. Write tests in SQL.
CREATE EXTENSION IF NOT EXISTS pgtap;
-- Test that users.email is unique
SELECT plan(2);
SELECT col_is_unique('public', 'users', ARRAY['email'], 'users.email is unique');
SELECT col_not_null('public', 'users', 'email', 'users.email is not null');
SELECT finish();
Useful for CI pipelines that need to verify schema constraints, function behavior, and data integrity.
pg_hint_plan
What it does: Allows query plan hints — telling the query planner which index to use, which join strategy to apply, etc. Similar to Oracle’s hint system.
CREATE EXTENSION IF NOT EXISTS pg_hint_plan;
-- Hint: use the specific index
/*+ IndexScan(users idx_users_email) */
SELECT * FROM users WHERE email = 'alice@example.com';
-- Hint: prefer nested loop join
/*+ NestLoop(orders users) */
SELECT * FROM orders JOIN users ON users.id = orders.user_id WHERE orders.id = 1;
pg_hint_plan is a debugging and emergency tool, not a crutch. If you find yourself needing hints regularly, the underlying problem is bad statistics, a missing index, or a schema design issue. Fix the root cause. Use hints as a temporary workaround when you’re debugging a production issue.
pgaudit
What it does: Detailed audit logging of database operations — which user ran which query, when, on which objects.
CREATE EXTENSION IF NOT EXISTS pgaudit;
-- Configure via postgresql.conf:
-- pgaudit.log = 'write, ddl' -- Log all writes and DDL
-- pgaudit.log_relation = on -- Log relation name in log
-- Or per-role:
ALTER ROLE audited_user SET pgaudit.log = 'all';
Essential for compliance workloads (PCI, HIPAA, SOC 2) that require detailed audit trails. See Chapter 16.
plpgsql (built-in) and Other Languages
PL/pgSQL is built-in and is the standard language for Postgres functions and triggers. Other languages are available as extensions:
- `plpython3u`: write functions in Python (untrusted — can access the filesystem)
- `plv8`: write functions in JavaScript (V8 engine)
- `plrust`: write functions in Rust (safe, high-performance)
- `plperl`: write functions in Perl
Most applications don’t need server-side code beyond PL/pgSQL. PL/pgSQL handles complex triggers, functions, and procedures well. Reach for other languages only when PL/pgSQL is genuinely limiting.
Evaluating an Extension
Before adding an extension to production, ask:
- Is it actively maintained? Check the GitHub repo. Are there recent commits? Open issues being responded to? An unmaintained extension is a liability for every Postgres major version upgrade.
- Does it require shared_preload_libraries? Extensions loaded at startup (pg_stat_statements, pg_cron, timescaledb, pgaudit) require a server restart to add or remove. Plan for this.
- Is it compiled C code or pure SQL? C extensions can crash the backend if they have bugs. Pure SQL extensions have no such risk. Prefer pure SQL or well-audited C extensions.
- What happens when it’s dropped? DROP EXTENSION removes the SQL objects. Any data stored in extension tables is lost. Any columns using extension types would need to be migrated first.
- Will it survive a major version upgrade? Extensions must be upgraded alongside Postgres. Some have better upgrade paths than others. Check the extension’s documentation for upgrade instructions.
- Does the cloud provider support it? AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL, and Supabase have curated extension lists. Some extensions require superuser privileges that managed providers don’t grant. Verify your target environment before committing.
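Before running through this checklist, it helps to inventory what a server already has. A quick sketch using the system catalogs:

```sql
-- Installed extensions and their versions:
SELECT extname, extversion FROM pg_extension;

-- Extensions available to install on this server (the catalog shipped
-- with the installation), with any installed version shown alongside:
SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;
```

Comparing default_version against installed_version is also a fast way to spot extensions that need an ALTER EXTENSION ... UPDATE after a Postgres upgrade.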
Extensions and Managed Postgres
Most managed Postgres services support a subset of extensions:
- AWS RDS/Aurora: Supports most common extensions. pg_stat_statements, pg_trgm, pgcrypto, pgvector, PostGIS, pg_partman — yes. pg_cron — yes, as of recent versions. timescaledb — no (use Timescale Cloud or self-hosted).
- Supabase: Excellent extension support. Most extensions in this chapter are available with one click.
- Google Cloud SQL: Good extension support. pg_stat_statements, pg_trgm, pgvector, PostGIS, others.
- Neon: Growing extension support, focused on serverless Postgres.
Always verify extension availability on your target platform before architecture decisions that depend on them.
The Right Philosophy
Extensions are a superpower, but they’re also dependencies. Each extension you add is a library your Postgres installation depends on — for upgrades, for security patches, for operational support. The extensions in this chapter are all mature and widely used. They’re worth the dependency.
The test for a new extension: is it solving a real problem, is it actively maintained, and is its maintenance burden less than the alternative (a separate database, a different approach)? Most of the extensions in this chapter pass this test decisively. For others, evaluate carefully before committing to them in production.
The core insight: Postgres’s extension system is what makes “Postgres is enough” possible at all. The base database is excellent. The extensions make it comprehensive.
Migrations Done Right
Schema migrations are the most dangerous routine operation in a production database. Done wrong, they cause downtime, lock tables, and corrupt data. Done right, they’re invisible — the schema evolves while the application serves traffic, and nobody notices.
This chapter covers the tools, techniques, and mental models for migrating a live Postgres database without pain.
The Migration Mindset
Every migration you write will run against a live database. Your application will be serving requests while Postgres executes your SQL. This changes how you think about schema changes.
The fundamental constraints:
- Locking: DDL operations take locks. Some locks block reads and writes for the duration. A migration that takes ACCESS EXCLUSIVE on a busy table will queue behind all active transactions, then block everything that comes after it until it completes. If it takes 10 minutes to complete, everything waits 10 minutes.
- Application compatibility: During a deployment, you may have old application code running against the new schema, or new application code running against the old schema — or both simultaneously (during a rolling deployment). Your migration must not break either.
- Irreversibility: Dropping a column is immediate and permanent. Getting the data back requires restoring from backup. Don’t drop things you might need.
- Performance: Rebuilding an index on a 500-million-row table takes hours. Adding a NOT NULL constraint forces a full-table scan to verify existing rows. These operations have real time costs.
Locking and Dangerous Operations
PostgreSQL takes different lock levels for different DDL operations. The ones that cause production problems:
ACCESS EXCLUSIVE Lock (Blocks Everything)
-- These take ACCESS EXCLUSIVE:
ALTER TABLE orders DROP COLUMN legacy_field; -- locks entire table
ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT; -- rewrites table
ALTER TABLE orders ADD CONSTRAINT check_amount CHECK (amount > 0) NOT VALID; -- brief ACCESS EXCLUSIVE only; NOT VALID skips the table scan
TRUNCATE TABLE orders;
DROP TABLE orders;
The danger: ALTER TABLE takes ACCESS EXCLUSIVE and waits for all existing transactions to finish first. On a busy production table, there may always be active transactions. When the ALTER TABLE sits waiting, it also blocks all new queries on the table. This creates a backlog that looks like an outage.
Fix: Set a lock_timeout before dangerous DDL:
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
-- If it can't acquire the lock within 5 seconds, it fails (doesn't wait and block)
Retry the migration during lower-traffic periods if it times out.
Adding a Column
PostgreSQL 11+: Adding a nullable column with no default or a constant default is instant — Postgres doesn’t rewrite the table.
-- Instant in PG11+:
ALTER TABLE orders ADD COLUMN notes TEXT;
ALTER TABLE orders ADD COLUMN processed BOOLEAN NOT NULL DEFAULT false;
Pre-PostgreSQL 11: Adding a column with a non-null default rewrites the entire table. This can take hours on large tables and locks the table the entire time.
Modern Postgres (11+) uses a metadata trick: it stores the default value in the system catalog and returns it for any row that doesn’t have the column physically written. Only new rows get the column written. This is “invisible” to queries but avoids the rewrite.
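You can see this trick in the catalog. For a column added with a constant default (here the processed column from the example above), atthasmissing is true and attmissingval holds the value returned for rows that predate the column:

```sql
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
  AND attname = 'processed';
```

A subsequent table rewrite (for example, VACUUM FULL) materializes the value into every row and clears the flag.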
Adding a NOT NULL Constraint
Adding NOT NULL to an existing column in PostgreSQL requires a table scan to verify no NULL values exist. On large tables, this is slow and takes ACCESS EXCLUSIVE.
The safe pattern:
-- Step 1: Add the column as nullable (no scan, no rewrite)
ALTER TABLE orders ADD COLUMN customer_email TEXT;
-- Step 2: Backfill existing rows
UPDATE orders SET customer_email = '' WHERE customer_email IS NULL;
-- Step 3: Add the constraint as NOT VALID (doesn't check existing rows)
ALTER TABLE orders ADD CONSTRAINT orders_customer_email_not_null
CHECK (customer_email IS NOT NULL) NOT VALID;
-- Step 4: Validate the constraint (light scan, doesn't block queries in PG 12+)
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_email_not_null;
NOT VALID creates the constraint but skips the validation scan. VALIDATE CONSTRAINT then checks existing rows while holding only a SHARE UPDATE EXCLUSIVE lock, so it doesn’t block normal reads and writes.
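On PostgreSQL 12 and later there’s an optional final step: SET NOT NULL can use an already-validated CHECK (col IS NOT NULL) constraint to skip its own table scan, leaving you with a true NOT NULL column. A sketch, continuing the example above:

```sql
-- The validated CHECK proves no NULLs exist, so this skips the scan (PG 12+):
ALTER TABLE orders ALTER COLUMN customer_email SET NOT NULL;

-- The CHECK constraint is now redundant and can be dropped:
ALTER TABLE orders DROP CONSTRAINT orders_customer_email_not_null;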
Creating Indexes
CREATE INDEX (without CONCURRENTLY) takes SHARE lock — it blocks writes but not reads. For large tables, this is still a problem.
CREATE INDEX CONCURRENTLY builds the index in the background, only taking brief locks at the start and end:
-- Always use CONCURRENTLY on production:
CREATE INDEX CONCURRENTLY idx_orders_customer_email ON orders(customer_email);
Caveats:
- CREATE INDEX CONCURRENTLY cannot run inside a transaction block (BEGIN/COMMIT)
- It takes longer than a blocking CREATE INDEX
- If it fails, it leaves behind an INVALID index that must be dropped manually:
-- Find and drop invalid indexes:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
DROP INDEX CONCURRENTLY idx_orders_customer_email; -- then retry
Renaming a Column
Renaming a column is fast (it only updates the system catalog) but requires ACCESS EXCLUSIVE. More importantly, it breaks any application code that uses the old name.
Never rename a column while application code is deployed that uses the old name. Use the expand/contract pattern instead (see below).
The Expand/Contract Pattern
The expand/contract pattern (also called the parallel change pattern) is the safe way to rename columns, change types, or do any transformation that requires old and new states to coexist.
Example: Renaming user_id to customer_id
Phase 1: Expand — add the new column, start writing to both:
ALTER TABLE orders ADD COLUMN customer_id BIGINT REFERENCES customers(id);
Update application code to write to both user_id and customer_id, and read from customer_id (falling back to user_id if null).
Phase 2: Backfill — populate the new column for existing rows:
UPDATE orders
SET customer_id = user_id
WHERE customer_id IS NULL;
On very large tables, do this in batches:
DO $$
DECLARE
batch_size INT := 10000;
rows_updated INT;
BEGIN
LOOP
UPDATE orders
SET customer_id = user_id
WHERE customer_id IS NULL
AND ctid IN (
SELECT ctid FROM orders WHERE customer_id IS NULL LIMIT batch_size
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
EXIT WHEN rows_updated = 0;
PERFORM pg_sleep(0.1); -- brief pause so autovacuum and replicas can keep up
END LOOP;
END$$;
Phase 3: Cut over — once all rows are populated and application code only uses customer_id, add the NOT NULL constraint and deploy the final application version that removes the old column reference:
ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;
Phase 4: Contract — remove the old column:
ALTER TABLE orders DROP COLUMN user_id;
This can be done in a later migration, after confirming the new column is correct.
Migration Tools
golang-migrate
Simple, widely-used, language-agnostic. Migrations are plain SQL files with up/down versions.
File naming: 001_create_users.up.sql, 001_create_users.down.sql
-- 001_create_users.up.sql
CREATE TABLE users (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- 001_create_users.down.sql
DROP TABLE users;
migrate -path ./migrations -database $DATABASE_URL up
migrate -path ./migrations -database $DATABASE_URL down 1
golang-migrate tracks migration state in a schema_migrations table. Simple, reliable, integrates easily into CI/CD pipelines.
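That bookkeeping table is worth knowing for incident response. It holds a single row with the current version and a dirty flag that is set if a migration failed partway (clear it only after manually reconciling the schema):

```sql
SELECT version, dirty FROM schema_migrations;
```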
Flyway
JVM-based, enterprise-grade, widely used in Java ecosystems. Supports complex migration scenarios, callbacks, undo migrations, and dry runs. Excellent integration with Spring Boot.
-- V1__Create_users.sql
CREATE TABLE users (...);
-- V2__Add_email_index.sql
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
Flyway is the right choice if you’re in the Java ecosystem or need enterprise features.
Liquibase
Similar to Flyway but also supports XML/YAML/JSON migration formats in addition to SQL. Stronger focus on database-agnostic migrations. Overkill if you’re Postgres-only.
Atlas
A newer tool focused on declarative schema management. Instead of writing incremental migrations, you describe the desired schema state, and Atlas generates the migration SQL automatically.
# schema.hcl
table "users" {
schema = schema.public
column "id" {
type = bigint
identity { generated = ALWAYS }
}
column "email" {
type = text
}
primary_key { columns = [column.id] }
unique "email_unique" { columns = [column.email] }
}
atlas schema apply --url $DATABASE_URL --to file://schema.hcl
Atlas has excellent Postgres support and generates safe migration SQL. Its --dry-run flag shows what changes would be made without applying them.
ORM Migration Systems
ORMs like SQLAlchemy, Diesel, Prisma, and GORM include migration systems. They work well for simple cases but often generate unsafe SQL for production (blocking index creation, non-concurrent operations). Review and adjust ORM-generated migrations for large tables.
Zero-Downtime Deployment Checklist
For a migration to be safe to run against a live production database:
- All DDL operations use CONCURRENTLY where applicable (indexes)
- All potentially long-running operations use the NOT VALID / VALIDATE CONSTRAINT pattern
- lock_timeout is set before operations that take ACCESS EXCLUSIVE
- New columns are nullable or have constant defaults (not computed values)
- No RENAME COLUMN on a column currently in use by running application code
- No DROP COLUMN unless application code has been deployed that no longer uses it
- Large backfills are batched, not a single massive UPDATE
- The migration has been tested against a production-like data volume
Rolling Back
Postgres has no built-in rollback for DDL that was committed. If a migration runs successfully but you realize it was wrong:
- Additive changes (new columns, new tables, new indexes) can be rolled back by removing them
- Destructive changes (dropped columns, dropped tables) require restoring from backup
- Type changes may require recreating the column and backfilling
This is why down migrations are important: they document what “undo” looks like. But they’re only useful if the data hasn’t changed since the up migration ran. A down migration that adds back a dropped column doesn’t restore the data that was in it.
The safest rollback strategy: never drop a column in the same migration as the one that stopped using it. Leave dropped columns in place for at least one deployment cycle. This gives you time to roll back the application code if something is wrong, before the data is gone.
Migration in Practice: The Workflow
1. Write the migration. Consider what lock it takes, how long it will run on your production data volume, and whether the new schema is backward-compatible with currently-deployed application code.
2. Test on a production-sized clone. Many migrations that seem fast in development are slow against 100 million rows. Test with realistic data volumes.
3. Deploy in two phases if needed. If old and new code can’t coexist:
- Phase 1: Expand (add new column, no constraints yet)
- Deploy new application code (reads new column, writes both)
- Phase 2: Contract (add constraints, drop old column)
4. Monitor lock wait times. During the migration, watch pg_stat_activity for blocked queries:
SELECT pid, wait_event_type, wait_event, query,
       now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type = 'Lock'
ORDER BY duration DESC;
5. Verify completion. Check that pg_stat_progress_create_index, pg_stat_progress_copy, etc. show the expected progress and finish.
Safe migrations aren’t complicated — they require discipline, a few patterns (expand/contract, NOT VALID, CONCURRENTLY), and respect for what’s running in production while you work. The teams that get burned by migrations are almost always the teams that wrote the SQL in development and ran it directly on production without considering the implications at scale.
Performance Tuning
A default Postgres installation is configured conservatively — appropriate for a development machine or a small VM, but leaving significant performance on the table for a production server. This chapter covers the configuration parameters that matter, the autovacuum behavior you need to understand, connection pooling as a force multiplier, and the query optimization habits that prevent most performance problems.
Postgres performance tuning is iterative. There is no single configuration change that makes everything fast. The process is: measure, identify bottlenecks, tune, measure again.
Memory Configuration
shared_buffers
The most important memory setting. Controls the size of Postgres’s shared buffer cache — the pool of memory that all backends share for caching data pages.
Default: 128MB (egregiously low for production)
Recommendation: ~25% of available RAM
shared_buffers = 8GB # on a 32GB server
Setting shared_buffers higher than 25% of RAM has diminishing returns because the OS page cache also caches frequently-read data. Going above 40% of RAM can actually hurt performance by reducing the OS page cache.
effective_cache_size
Not a memory allocation — it’s a hint to the query planner about how much total memory (shared_buffers + OS cache) is available for caching. The planner uses this to decide between index scans and sequential scans.
Recommendation: 50–75% of total RAM
effective_cache_size = 24GB # on a 32GB server
If this is set too low, the planner incorrectly thinks the disk is slow and prefers sequential scans when it should use indexes.
work_mem
Memory available per sort or hash operation. Each sort, hash join, and hash aggregate can use up to work_mem. A query with multiple operations can use work_mem multiple times. With many concurrent connections, total memory usage can be work_mem × connections × operations_per_query.
Default: 4MB (too low for complex queries)
Recommendation: Balance between query complexity and connection count. A formula:
work_mem = (RAM - shared_buffers) / (max_connections × 2)
For a 32GB server with 8GB shared_buffers and 100 max_connections:
work_mem = (32GB - 8GB) / (100 × 2) = 24GB / 200 ≈ 120MB
Setting work_mem too high with many connections causes OOM. Set conservatively globally and increase per-session for complex analytical queries:
SET work_mem = '256MB';
SELECT ... ORDER BY ... LIMIT ...; -- benefits from higher work_mem
RESET work_mem;
Watch for “external sort” operations in EXPLAIN output — they indicate the sort spilled to disk because work_mem was too low.
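A disk spill is easy to spot in EXPLAIN (ANALYZE) output. The query below is illustrative; the lines to look for are the Sort Method annotations:

```sql
EXPLAIN (ANALYZE) SELECT * FROM orders ORDER BY created_at;
-- Spilled to disk (work_mem too low):
--   Sort Method: external merge  Disk: 102400kB
-- Fit in memory:
--   Sort Method: quicksort  Memory: 2048kB
```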
maintenance_work_mem
Memory for maintenance operations: VACUUM, CREATE INDEX, ALTER TABLE, etc.
Recommendation: 1GB or more for systems with large tables. This speeds up index creation dramatically.
maintenance_work_mem = 2GB
max_wal_size
Controls when checkpoint processing forces a WAL flush. Too small causes frequent checkpoints, generating I/O spikes. Too large increases recovery time after a crash.
Recommendation: 2–4GB for most systems, higher for write-heavy workloads.
max_wal_size = 4GB
Watch pg_stat_bgwriter.checkpoints_req: if this counter climbs steadily, checkpoints are being triggered by WAL size rather than time. Increase max_wal_size.
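A quick check of the checkpoint counters (on PostgreSQL 16 and earlier they live in pg_stat_bgwriter; PostgreSQL 17 moved them to pg_stat_checkpointer as num_timed and num_requested):

```sql
-- A healthy system is dominated by timed checkpoints:
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
```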
Checkpoint Configuration
Checkpoints flush dirty pages from the buffer cache to disk. A checkpoint that runs too fast causes I/O spikes (all the dirty pages written in a burst). A checkpoint that runs too slow causes long recovery times after a crash.
checkpoint_completion_target = 0.9 # Spread I/O over 90% of checkpoint interval
checkpoint_timeout = 15min # Maximum time between checkpoints
checkpoint_completion_target = 0.9 (the default in recent Postgres versions) is good. It tells Postgres to spread checkpoint writes over 90% of the interval between checkpoints, smoothing I/O.
WAL Settings
wal_level = replica # Minimum for streaming replication
wal_compression = on # Compress WAL records (reduces WAL volume)
wal_buffers = 16MB # WAL write buffer (usually auto-configured from shared_buffers)
synchronous_commit = on # Don't change this unless you understand the trade-off
synchronous_commit = off gives a performance boost (no fsync on commit) at the cost of potentially losing the last few seconds of committed transactions on crash. This is acceptable for non-critical data (analytics events, logs, rate limit counters). Never set it off globally for transactional data.
Autovacuum: The Most Misunderstood Setting
Autovacuum is the background process that reclaims dead tuple space, updates table statistics, and prevents XID wraparound. It is not optional. Disabling autovacuum is not a performance optimization — it’s setting up a future disaster.
The most important autovacuum settings:
autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold
These control when autovacuum triggers on a table. The formula:
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
Defaults:
- autovacuum_vacuum_threshold = 50 (absolute minimum dead tuples)
- autovacuum_vacuum_scale_factor = 0.2 (20% of table)
For a table with 10 million rows, autovacuum triggers when there are 2,000,050 dead tuples. For a frequently-updated table, this means autovacuum runs rarely, and dead tuples accumulate heavily before cleanup.
Recommendation for large tables: Lower the scale factor significantly, or set per-table thresholds:
-- Per-table: vacuum after 1% dead tuples (instead of 20%)
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_scale_factor = 0.005,
autovacuum_analyze_threshold = 500
);
autovacuum_max_workers
Default: 3. The number of autovacuum workers that can run simultaneously. For a system with many large, frequently-updated tables, the default is often insufficient.
autovacuum_max_workers = 6
More workers means more CPU and I/O, but tables stay cleaner and queries stay fast.
autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit
Autovacuum throttles itself using a cost-based mechanism to avoid overwhelming I/O. Each buffer hit, page read, and page dirtied has a cost. When the accumulated cost reaches autovacuum_vacuum_cost_limit, autovacuum sleeps for autovacuum_vacuum_cost_delay milliseconds.
Defaults:
- autovacuum_vacuum_cost_delay = 2ms (recent versions)
- autovacuum_vacuum_cost_limit = -1, meaning it inherits vacuum_cost_limit (200)
For SSDs, autovacuum can be much more aggressive. The default throttling is designed for spinning disks:
# For NVMe SSDs:
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 2000 # 10x more aggressive
Monitoring Autovacuum
-- See when tables were last vacuumed and their bloat
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Tables approaching the autovacuum threshold, computed from the global
-- defaults (per-table reloptions overrides are not reflected here)
SELECT
  schemaname,
  relname,
  n_dead_tup,
  n_live_tup,
  current_setting('autovacuum_vacuum_threshold')::bigint
    + current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup
    AS vacuum_threshold
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
Connection Pooling with PgBouncer
Postgres connections are expensive: each backend process uses ~5-10MB of RAM and involves OS process creation overhead. Applications that open many short-lived connections — serverless functions, high-concurrency APIs — can overwhelm Postgres’s connection capacity.
PgBouncer is a lightweight connection pooler that sits between your application and Postgres, multiplexing many application connections onto a smaller number of Postgres connections.
Pool Modes
Session mode: A server connection is assigned to a client for the duration of the client session. No query-level multiplexing. Only useful for reducing connection overhead (not connection count).
Transaction mode: A server connection is assigned for the duration of each transaction. After the transaction commits or rolls back, the connection returns to the pool. This is the most useful mode — a pool of 50 server connections can handle thousands of concurrent application connections.
Statement mode: A server connection is assigned for a single statement, then released. Most restrictive — prepared statements, SET commands, and transactions spanning multiple statements don’t work.
For most applications, transaction mode is the right choice.
PgBouncer Configuration
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
# Server connections
max_client_conn = 10000 # Allow many application connections
default_pool_size = 25 # Postgres sees max 25 connections from this app
min_pool_size = 5
reserve_pool_size = 5
# Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0
# Logging
log_connections = 0
log_disconnections = 0
What doesn’t work in transaction mode: session-level SET (SET LOCAL inside a transaction is fine), advisory locks held across transactions, LISTEN/NOTIFY, server-side cursors outside transactions, and prepared statements (unless you enable PgBouncer’s protocol-level prepared statement tracking, max_prepared_statements, available since 1.21). Know these limitations before adopting.
How Many Postgres Connections?
The optimal number of Postgres server connections for throughput:
optimal_connections ≈ CPU_count * 2 + effective_spindle_count
For a 4-core server with an NVMe SSD:
optimal_connections ≈ 4 * 2 + 1 ≈ 9
This seems shockingly low but is supported by benchmarks. More connections than CPUs means context-switching overhead and lock contention outweigh concurrency benefits. For most servers, 20-50 Postgres connections is sufficient for high throughput. PgBouncer’s job is to make 1000 application clients share those efficiently.
max_connections
Default: 100. Set this based on what you can afford in terms of memory, not what you hope to use.
Each connection uses about 5-10MB of RAM. For a server with 32GB RAM and 8GB shared_buffers, the remaining 24GB can support roughly 2,400-4,800 connections. But you don’t want that many — use PgBouncer instead and keep max_connections low.
max_connections = 200 # With PgBouncer handling the fan-out
Query Performance: The EXPLAIN Habit
No amount of server configuration compensates for missing indexes or bad queries. The most impactful performance work is query-level.
The diagnostic workflow:
1. Find slow queries via pg_stat_statements:
SELECT query, calls, total_exec_time / calls AS avg_ms, rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY avg_ms DESC
LIMIT 20;
2. Explain the slow query:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > now() - interval '7 days'
AND o.status = 'pending';
3. Look for:
- Sequential scans on large tables → need an index
- Nested loop with large outer relation → bad join choice, likely bad statistics
- High Buffers: read → cache miss, data not in the buffer cache
- rows=X (actual rows=Y) with large discrepancy → bad statistics, run ANALYZE
- Sort operations with “external sort” → increase work_mem
- Filter: (Rows Removed by Filter: N) with large N → index doesn’t exist or isn’t selective enough
4. Fix the issue (add index, update statistics, rewrite query)
5. Verify improvement (run EXPLAIN ANALYZE again)
Statistics and ANALYZE
Run ANALYZE after large bulk loads to update table statistics before the planner has to guess:
ANALYZE orders; -- Update statistics for one table
ANALYZE; -- Update statistics for all tables in current database
For columns with high cardinality or very uneven distributions, increase per-column statistics:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
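To see what the planner currently believes about a column, and whether a statistics bump changed anything, inspect pg_stats:

```sql
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
```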
Parallel Query
Postgres can parallelize query execution across multiple CPU cores for sequential scans and some aggregations. This is controlled by max_parallel_workers_per_gather.
max_parallel_workers_per_gather = 4 # Up to 4 workers per parallel query
max_parallel_workers = 8 # Total parallel workers (across all queries)
max_worker_processes = 16 # Total background workers
Parallel query is automatic — the planner decides when to use it. It helps large analytical queries; it doesn’t help indexed lookups.
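You can confirm the planner chose a parallel plan by looking for a Gather node in EXPLAIN output. A sketch (the exact plan shape depends on your data and settings):

```sql
EXPLAIN SELECT count(*) FROM events;
-- Finalize Aggregate
--   ->  Gather  (Workers Planned: 4)
--         ->  Partial Aggregate
--               ->  Parallel Seq Scan on events
```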
Partitioning for Performance
As covered in Chapter 3, time-partitioned tables with recent-data access patterns benefit enormously from partition pruning:
-- This only scans the relevant monthly partition:
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
For queries that always filter on the partition key, partitioning is a very effective performance strategy for large tables.
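Pruning is also verifiable with EXPLAIN: the plan should mention only the matching partition (the partition name below is hypothetical):

```sql
EXPLAIN SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
-- Expect a scan of a single partition, e.g.:
--   Seq Scan on events_2024_01 events
```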
Identifying Bloat
Index and table bloat degrades performance and wastes disk space. A quick check:
-- Find tables with significant dead tuple bloat
SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;
-- pgstattuple for precise bloat measurement (requires extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
For severe bloat that autovacuum can’t reclaim (e.g., after a massive delete), consider VACUUM FULL (takes ACCESS EXCLUSIVE for the duration) or pg_repack (online, with only brief locks at the start and end).
Configuration Validation Tools
pgtune (pgtune.leopard.in.ua) generates a postgresql.conf based on your hardware profile. It’s a good starting point.
pgBadger analyzes Postgres log files and produces detailed reports on slow queries, wait events, and error patterns.
check_postgres is a Nagios/Icinga monitoring plugin that checks connection counts, bloat, vacuum age, and many other indicators.
The key principle: Postgres’s default configuration is a safe minimum, not a target. Every production Postgres instance should be tuned for its workload and hardware. The changes in this chapter — higher shared_buffers, lower autovacuum scale factors, PgBouncer in transaction mode — produce immediate, measurable improvements on almost every system.
Replication and High Availability
A single Postgres instance, no matter how well-tuned, is a single point of failure. High availability means the database continues to serve traffic when a hardware failure, network partition, or software crash takes down the primary server. Replication is the mechanism that makes this possible.
Postgres’s replication story is mature, well-understood, and capable of supporting serious production requirements — from simple read replicas to automatic failover to logical change data capture.
Physical (Streaming) Replication
Streaming replication is the primary replication mechanism in Postgres. The standby connects to the primary, streams WAL records, and applies them to maintain a byte-for-byte copy of the primary’s data.
How it works:
- The primary writes all changes to WAL (Write-Ahead Log)
- The standby’s WAL receiver process connects to the primary’s WAL sender process
- WAL records stream continuously to the standby
- The standby’s startup process applies WAL records, keeping the standby’s data in sync
- The standby is read-only — queries can run against it, but no writes are allowed
Setting Up Streaming Replication
On the primary, in postgresql.conf:
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB # Keep enough WAL for standbys to catch up
Create a replication role:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
In pg_hba.conf, allow the standby to connect for replication:
host replication replicator 10.0.0.2/32 scram-sha-256
On the standby, create a standby.signal file and configure postgresql.conf:
primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator password=secure_password'
restore_command = '' # Only needed for WAL archiving
The standby automatically starts replicating when it finds standby.signal in the data directory.
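To seed the standby’s data directory in the first place, pg_basebackup is the standard tool. A sketch (host, user, and data directory follow the example above; adjust paths for your installation). The -R flag writes primary_conninfo and standby.signal for you:

```
pg_basebackup -h 10.0.0.1 -U replicator -D /var/lib/postgresql/data \
    --wal-method=stream --checkpoint=fast -R
```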
Verify replication is working on the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
Replication lag is shown in replay_lag — the time difference between when the primary committed a transaction and when the standby applied it.
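Lag can also be measured in bytes, which is useful for alerting thresholds that don’t depend on write volume:

```sql
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```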
Synchronous vs. Asynchronous Replication
Asynchronous replication (default): The primary commits and returns to the client without waiting for the standby to acknowledge. The standby eventually applies the changes. Replication lag is typically milliseconds to seconds. If the primary crashes, recent transactions may not have reached the standby — replication lag = potential data loss.
Synchronous replication: The primary waits for one or more standbys to acknowledge writing the WAL before committing. Zero data loss on primary failure, at the cost of added commit latency (every commit waits for network round-trip to standby).
Configure synchronous replication on the primary:
synchronous_standby_names = 'standby1' # Specific standby name
# Or: 'ANY 1 (standby1, standby2)' # Any 1 of multiple standbys
On the standby, set application_name in primary_conninfo:
primary_conninfo = 'host=primary port=5432 user=replicator application_name=standby1'
For most applications, asynchronous replication with a well-monitored replica is sufficient. The typical replication lag of milliseconds means RPO (Recovery Point Objective) is measured in seconds, not minutes.
Replication Slots
A replication slot guarantees that the primary retains WAL segments until all subscribers have consumed them. Without a slot, if a standby falls behind (network outage, maintenance), the primary might delete WAL that the standby still needs, requiring a full resync.
-- Create a physical replication slot
SELECT pg_create_physical_replication_slot('standby1_slot');
-- List slots and their lag
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
Warning: Replication slots that fall behind and stop consuming WAL will cause WAL to accumulate indefinitely, eventually filling the disk. Monitor slot lag and set max_slot_wal_keep_size as a safety valve:
max_slot_wal_keep_size = 10GB # Discard WAL for lagging slots beyond this
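Sizing the safety valve is simple arithmetic: divide the retained-WAL budget by your WAL generation rate to see how long a disconnected standby can stall before its slot is invalidated. A quick sketch (plain Python; the 10 GB budget and 2 GB/hour rate are illustrative numbers, not recommendations):

```python
def slot_headroom_hours(max_slot_wal_keep_gb: float, wal_gb_per_hour: float) -> float:
    """Hours a lagging replication slot can stall before exceeding the WAL budget."""
    return max_slot_wal_keep_gb / wal_gb_per_hour

# With max_slot_wal_keep_size = 10GB and roughly 2 GB/hour of WAL generation:
print(slot_headroom_hours(10, 2))  # -> 5.0
```

If standbys routinely go down for maintenance windows longer than this headroom, raise the budget or plan on full resyncs.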
Read Replicas
A standby server in hot standby mode accepts read-only queries. This is a simple way to scale read traffic.
-- On primary: hot standby is the default
-- On standby, queries run normally (read-only)
SELECT COUNT(*) FROM orders; -- works on standby
-- Queries requiring writes fail:
INSERT INTO orders ...; -- ERROR: cannot execute INSERT in a read-only transaction
Read replica use cases:
- Long-running analytical queries that would impact primary performance
- Reporting and BI tools
- Full-text search queries
- Geographic distribution (read from a nearby region)
For application routing, you need to direct read traffic to replicas and write traffic to the primary. Tools like HAProxy, PgBouncer, or application-level connection routing handle this.
Standby hint bits: One subtlety — when a read query on the standby touches a tuple that needs its hint bits set (a tuple that hasn’t been frozen), the standby can’t update the hint bits (it’s read-only). Postgres handles this gracefully (it applies the visibility rules manually), but it means hot standby reads can be slightly slower than primary reads for data that’s never been frozen.
Logical Replication
Physical replication creates an exact byte-for-byte copy of the primary. Logical replication replicates changes at the row level — individual INSERT, UPDATE, DELETE events — allowing more flexibility:
- Replicate a subset of tables
- Replicate to a different major Postgres version
- Replicate to multiple downstream consumers
- Support bidirectional replication (with care)
- Use as a CDC (Change Data Capture) source
Logical replication uses a publish/subscribe model:
Publisher (primary):
-- Requires wal_level = logical in postgresql.conf
CREATE PUBLICATION my_publication
FOR TABLE users, orders, products;
-- Or: FOR ALL TABLES (replicates everything)
Subscriber (downstream Postgres):
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=primary port=5432 user=replicator password=xxx dbname=mydb'
PUBLICATION my_publication;
The subscription starts copying the initial data, then applies ongoing changes. Tables on the subscriber must exist with compatible schemas.
Logical Decoding and CDC
Logical replication is built on logical decoding — the ability to decode WAL into a stream of data changes. Tools like Debezium use logical decoding to stream Postgres changes into Kafka, enabling event sourcing and data integration patterns without polling:
-- Create a logical replication slot for an external consumer
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
-- Peek at changes (for debugging)
SELECT * FROM pg_logical_slot_peek_changes('debezium', NULL, NULL);
Debezium + Kafka is the gold standard for streaming Postgres changes to other systems. This is how you bridge Postgres (the source of truth) with Elasticsearch, Redis, data warehouses, and other consumers while maintaining Postgres as the authoritative store.
Automatic Failover with Patroni
Physical replication sets up the data flow, but failover — promoting a standby when the primary fails — requires additional tooling. Patroni is the most widely-used solution.
Patroni is a Postgres cluster manager that:
- Uses a distributed consensus store (etcd, Consul, or ZooKeeper) to elect a leader
- Automatically promotes a standby when the primary becomes unavailable
- Manages the Postgres configuration for the current role (primary vs. standby)
- Provides a REST API for cluster status and management
A Patroni cluster consists of:
- 2 or more Postgres instances with Patroni agents
- A distributed consensus store (3-node etcd cluster is typical)
- (Optionally) HAProxy or similar for client routing
Basic Patroni configuration (patroni.yml):
name: postgres1
scope: my-cluster
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.0.1:8008
etcd3:
hosts: 10.0.0.10:2379,10.0.0.11:2379,10.0.0.12:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB - max allowed lag for a standby to be eligible
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.0.1:5432
data_dir: /var/lib/postgresql/data
parameters:
wal_level: replica
hot_standby: on
max_wal_senders: 10
max_replication_slots: 10
Patroni handles:
- Leader election (which Postgres node is primary)
- Automatic failover (promotes best standby when primary fails)
- Replication configuration management
- Health checking
The patronictl CLI for cluster management:
patronictl -c /etc/patroni.yml list # Show cluster state
patronictl -c /etc/patroni.yml failover my-cluster # Manual failover
patronictl -c /etc/patroni.yml switchover my-cluster # Planned failover
Client Routing
When the primary changes (failover), your application needs to know the new primary’s address. Options:
DNS-based routing: Maintain a DNS record that always points to the current primary. Patroni updates this record on failover. Simple but has TTL lag.
HAProxy: A reverse proxy that routes connections based on health checks. Patroni’s REST API reports whether a node is primary or standby. HAProxy queries this to route traffic.
frontend postgres_primary
bind *:5432
default_backend primary
backend primary
option httpchk GET /primary
server postgres1 10.0.0.1:5432 check port 8008
server postgres2 10.0.0.2:5432 check port 8008
PgBouncer + Patroni: PgBouncer can be reconfigured to point to the new primary after failover, either via a patroni callback or by pointing to a virtual IP that moves with the primary.
Cluster-aware drivers: Some Postgres drivers support multiple hosts and automatically discover the current primary. For example, in Go with pgx:
connStr := "postgres://user:pass@host1,host2,host3/mydb?target_session_attrs=read-write"
This connects to the first host in the list that accepts read-write connections (i.e., the primary).
Replication Monitoring
Essential metrics to monitor:
-- Replication lag on primary
SELECT client_addr, replay_lag, sync_state
FROM pg_stat_replication;
-- On standby: how far behind is this standby?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- WAL sender activity
SELECT pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
Alert on:
- Replication lag > 60 seconds (normal is <100ms for async)
- Replication slot lag growing indefinitely
- Standby not connected at all
Cloud Managed HA
If you’re using a managed Postgres service (AWS RDS, Cloud SQL, Aurora, Supabase), high availability is typically handled for you:
- AWS RDS Multi-AZ: Synchronous standby in a different availability zone. Automatic failover in 60-120 seconds. Read replicas available separately.
- AWS Aurora: Multi-AZ by design with storage-level replication. Very fast failover (~30 seconds). Up to 15 read replicas.
- Google Cloud SQL: HA with an automatic standby in a different zone. Failover in ~60 seconds.
- Supabase: Built-in HA with read replicas.
Managed HA trades control for operational simplicity. For teams without dedicated DBAs, managed HA is often the right choice — let the cloud provider handle the Patroni equivalent.
The Availability Math
A single Postgres instance with good hardware and no HA has roughly 99.9% availability (about 8 hours of downtime per year from scheduled maintenance and unexpected failures). With HA and automatic failover, you can achieve 99.95% or better (under 5 hours per year). The gap between “I should add a standby” and “this will cause production downtime” is often just one hardware failure away.
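The percentages translate into downtime budgets mechanically; a small helper makes the arithmetic explicit (plain Python, for illustration only):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability_pct: float) -> float:
    """Maximum downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.95, 99.99):
    print(f"{pct}% -> {downtime_hours(pct):.2f} hours/year")
```

At 99.9% the budget is about 8.76 hours a year; at 99.95% it drops under 4.4 hours, which is where a well-run failover setup typically lands.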
For any production system with users, HA is not optional — it’s part of operating responsibly. The complexity of Patroni (or using a managed service) is a one-time investment that pays off the first time a disk fails or a host becomes unavailable and traffic just… fails over.
Backup and Recovery
A database you can’t restore is a database you don’t have. Backup is not about the backup — it’s about the restore. The question is not “do we have backups?” but “have we tested restoring from them recently?” and “what’s our actual recovery time in an emergency?”
This chapter covers the full spectrum of Postgres backup strategies: logical dumps with pg_dump, continuous archiving with WAL, point-in-time recovery, and the production-grade tools that make this manageable.
Terminology: RPO and RTO
Two metrics define your backup requirements:
RPO (Recovery Point Objective): How much data can you afford to lose? An RPO of 5 minutes means you can tolerate losing up to 5 minutes of transactions in a catastrophic failure. An RPO of 0 means you need synchronous replication with no data loss.
RTO (Recovery Time Objective): How long can the database be unavailable during recovery? An RTO of 4 hours means you can take up to 4 hours to restore from a backup and catch up. An RTO of 5 minutes requires a hot standby.
Your backup strategy flows from these numbers. A startup with moderate data loss tolerance might accept RPO=24h (daily backups) and RTO=4h. A financial system might require RPO=0 and RTO=minutes. Most production applications land somewhere in between.
pg_dump: Logical Backups
pg_dump is Postgres’s built-in tool for creating logical (SQL) backups. It connects to the database and dumps the schema and data as SQL commands or CSV.
# Dump to SQL file (text format)
pg_dump --dbname=mydb --file=backup.sql
# Dump to custom binary format (recommended: smaller, parallelizable restore)
pg_dump --dbname=mydb --format=custom --file=backup.dump
# Dump with connection string
pg_dump "postgresql://user:pass@host:5432/mydb" --format=custom --file=backup.dump
# Dump only specific tables
pg_dump --dbname=mydb --table=orders --table=users --format=custom --file=partial_backup.dump
# Dump schema only (no data)
pg_dump --schema-only --format=custom --file=schema_only.dump
# Dump data only (no schema)
pg_dump --data-only --format=custom --file=data_only.dump
Restoring with pg_restore
# Restore from custom format
pg_restore --dbname=mydb --jobs=4 backup.dump # parallel restore with 4 workers
# Restore to a new database
createdb mydb_restored
pg_restore --dbname=mydb_restored --jobs=4 backup.dump
# Restore a specific table
pg_restore --dbname=mydb --table=orders backup.dump
# Restore schema only
pg_restore --schema-only --dbname=mydb backup.dump
# From SQL format:
psql --dbname=mydb < backup.sql
pg_dumpall
Dumps the entire PostgreSQL cluster — all databases, global objects (roles, tablespaces):
pg_dumpall --file=cluster_backup.sql --globals-only # Just roles/tablespaces
pg_dumpall --file=cluster_backup.sql # Everything
Limitations of Logical Backups
- Slow: Dumping a large database takes hours; restoring takes longer (data must be reloaded and every index rebuilt from scratch)
- Point-in-time is limited: A dump reflects the state at one moment. No PITR without WAL archiving.
- Not suitable for very large databases: For databases > several hundred GB, dump/restore is too slow for realistic RTO
For small databases (<50GB), pg_dump with daily automation is often sufficient. For larger databases or stricter RPO/RTO requirements, WAL archiving is necessary.
WAL Archiving: Continuous Backup
Physical backup (PITR) works by:
- Taking a base backup (a copy of the data directory)
- Archiving WAL segments continuously as they’re generated
- At recovery time: restore the base backup, then replay WAL segments up to the desired point in time
This allows recovery to any point in time since the base backup, down to the granularity of a single transaction.
Configuring WAL Archiving
In postgresql.conf:
wal_level = replica # Minimum for archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
# Or to S3:
# archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
# Or with WAL-G:
# archive_command = 'wal-g wal-push %p'
%p is the path to the WAL file, %f is just the filename. The archive command must return 0 on success and non-zero on failure.
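The exit-code contract is worth internalizing: Postgres retries a segment until the command returns 0, and a command that silently overwrites or half-copies a file corrupts the archive. The sketch below (Python, with a hypothetical archive path; not a substitute for a real tool like pgBackRest or WAL-G) shows the two defensive moves a hand-rolled archive command should make: never overwrite an existing segment, and copy to a temporary name before renaming into place.

```python
#!/usr/bin/env python3
"""Hand-rolled WAL archiver, invoked as archive_wal.py %p %f (hypothetical script name)."""
import shutil
import sys
from pathlib import Path

ARCHIVE_DIR = Path("/var/lib/postgresql/wal_archive")  # hypothetical location

def archive_wal(src: str, name: str, archive_dir: Path = ARCHIVE_DIR) -> int:
    """Copy one WAL segment into the archive. Returns 0 on success, 1 on failure."""
    dest = archive_dir / name
    if dest.exists():
        # An existing segment with the same name usually signals a split-brain
        # or a reused archive -- refuse rather than overwrite.
        return 1
    tmp = archive_dir / (name + ".tmp")
    shutil.copy2(src, tmp)  # copy to a temporary name first...
    tmp.rename(dest)        # ...then rename, so readers never see a partial file
    return 0

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```

Wired up as archive_command = '/usr/local/bin/archive_wal.py %p %f' (path hypothetical), a non-zero exit makes Postgres keep the segment locally and retry, which is exactly the behavior you want when the archive destination is briefly unreachable.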
Taking a Base Backup
# Simple base backup with pg_basebackup
pg_basebackup \
--host=localhost \
--username=replicator \
--pgdata=/var/lib/postgresql/base_backup \
--format=tar \
--gzip \
--compress=9 \
--wal-method=stream \
--progress
# Or to a custom location:
pg_basebackup -h localhost -U replicator -D /backup/base -Ft -z -Xs -P
--wal-method=stream streams WAL during the backup, ensuring the backup is immediately usable without waiting for WAL segments to be archived.
Point-in-Time Recovery
To recover to a specific point in time:
- Restore the base backup to the data directory
- Configure recovery in postgresql.conf:
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
# Or from S3: 'aws s3 cp s3://my-bucket/wal/%f %p'
recovery_target_time = '2024-06-15 14:30:00+00'
recovery_target_action = 'promote' # Promote to primary after reaching target
- Create recovery.signal in the data directory
- Start Postgres
Postgres will restore WAL segments up to the target time, then promote the instance to read-write mode.
Recovery targets:
- recovery_target_time: Recover to a specific timestamp
- recovery_target_lsn: Recover to a specific WAL position
- recovery_target_xid: Recover to after a specific transaction
- recovery_target_name: Recover to a named restore point (created with pg_create_restore_point())
PITR is invaluable for “logical corruption” disasters — a bad deployment drops the wrong column, a bug deletes the wrong rows. You can recover the database to just before the accident.
pgBackRest: The Production Standard
pgBackRest is an enterprise-grade backup tool for Postgres, widely considered the most complete solution available. It handles:
- Full, differential, and incremental backups
- WAL archiving and PITR
- Parallel backup and restore (dramatically faster than pg_basebackup)
- Backup catalog management
- Encryption and compression
- Remote repositories (S3, GCS, Azure, SFTP)
- Standby backup (don’t stress the primary)
pgBackRest Configuration
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=very_secure_passphrase
# Or S3:
# repo1-type=s3
# repo1-s3-bucket=my-postgres-backups
# repo1-s3-region=us-east-1
# repo1-s3-endpoint=s3.amazonaws.com
[main]
pg1-path=/var/lib/postgresql/data
pg1-user=postgres
PostgreSQL configuration (for archiving):
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
pgBackRest Operations
# Create the stanza (repository initialization)
pgbackrest --stanza=main stanza-create
# Take a full backup
pgbackrest --stanza=main backup --type=full
# Take a differential backup (since last full)
pgbackrest --stanza=main backup --type=diff
# Take an incremental backup (since last backup)
pgbackrest --stanza=main backup --type=incr
# List backups
pgbackrest --stanza=main info
# Restore to latest
pgbackrest --stanza=main restore
# PITR to a specific time
pgbackrest --stanza=main restore --type=time --target="2024-06-15 14:30:00"
# Restore to a new path
pgbackrest --stanza=main restore --pg1-path=/var/lib/postgresql/restored
Backup Schedule
A typical schedule:
- Full backup: Weekly (Sunday at 2am)
- Differential backup: Daily (every other day at 2am)
- WAL archiving: Continuous
This gives you the ability to restore to any point in time within the retention window, with a restore time of (base backup restore time) + (WAL replay time since backup). With incremental backups, the WAL replay window is small.
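That restore-time formula can be sanity-checked with rough numbers. A sketch (plain Python; the sizes and throughputs below are made-up placeholders, so measure your own):

```python
def estimate_rto_minutes(base_backup_gb: float, restore_gb_per_min: float,
                         wal_since_backup_gb: float, replay_gb_per_min: float) -> float:
    """RTO ~= time to restore the base backup plus time to replay WAL since it."""
    return base_backup_gb / restore_gb_per_min + wal_since_backup_gb / replay_gb_per_min

# 200 GB base backup restored at 10 GB/min, plus 30 GB of WAL replayed at 2 GB/min:
print(estimate_rto_minutes(200, 10, 30, 2))  # -> 35.0
```

More frequent incremental backups shrink the WAL term, which is usually the volatile part of the estimate.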
Barman: Another Production Option
Barman (Backup and Recovery Manager) is another mature backup tool, popular in enterprises. It focuses on remote backup management — Barman runs on a dedicated backup server and fetches backups from Postgres servers.
For teams that want Barman’s centralized management model, it’s an excellent choice. pgBackRest and Barman have roughly equivalent capabilities; the choice is often based on operational preference.
Testing Your Backups
Here is a hard rule: if you haven’t tested a restore, you don’t have a backup.
Databases fail in ways that expose backup configuration bugs you didn’t know existed. Backup processes fail silently. WAL archiving gets misconfigured. Encryption keys get lost. The restore command in your documentation is wrong.
A practical test regime:
Monthly: Restore the latest backup to a test environment. Verify the database starts, data is present, and basic queries work.
After major configuration changes: Any time you change the backup configuration, test a restore before assuming it works.
After Postgres upgrades: Major version upgrades may require updating backup tool versions. Test the restore process after any upgrade.
Restore test script:
#!/bin/bash
set -e
# Restore to a test instance
pgbackrest --stanza=main restore \
--pg1-path=/var/lib/postgresql/test_restore
# Start the test instance on a different port
postgres -D /var/lib/postgresql/test_restore -p 5433 &
# Wait for startup
until pg_isready -p 5433; do sleep 1; done
# Verify critical data
psql -p 5433 -d mydb -c "SELECT COUNT(*) FROM orders;" | grep -q "[0-9]"
psql -p 5433 -d mydb -c "SELECT max(created_at) FROM orders;"
# Verify WAL applied (check that recent transactions are present)
psql -p 5433 -d mydb -c "SELECT id FROM orders ORDER BY id DESC LIMIT 1;"
echo "Restore test passed"
# Cleanup
pg_ctl stop -D /var/lib/postgresql/test_restore
rm -rf /var/lib/postgresql/test_restore
Run this in CI or as a scheduled job. The 30 minutes this takes each month is cheap compared to discovering your backup is broken during an actual disaster.
The 3-2-1 Rule
Apply the 3-2-1 backup rule to Postgres:
- 3 copies of the data
- 2 different storage media
- 1 offsite copy
In practice:
- Copy 1: The live database
- Copy 2: A hot standby (streaming replication)
- Copy 3: WAL archives and base backups in cloud object storage (S3, GCS)
The standby gives you fast failover (RTO in minutes). The cloud backup gives you PITR and protection against logical corruption and catastrophic datacenter failures.
Monitoring Backup Health
Critical metrics to alert on:
-- Is WAL archiving working?
-- (Check pg_stat_archiver)
SELECT archived_count, last_archived_wal, last_archived_time,
failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
Alert if:
- last_archived_time is more than 5 minutes old
- failed_count increased since the last check
- Base backup hasn't run in the configured interval
- Backup storage usage is growing unexpectedly
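The first two alert conditions are easy to encode against a pg_stat_archiver snapshot. A sketch of the check logic (plain Python; the 5-minute threshold mirrors the list above, and the dict keys mirror the view's columns):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=5)

def archiver_alerts(stats: dict, prev_failed_count: int, now: datetime) -> list:
    """Evaluate a pg_stat_archiver snapshot; returns a list of alert messages."""
    alerts = []
    last = stats.get("last_archived_time")
    if last is None or now - last > STALE_AFTER:
        alerts.append("WAL archiving stale: nothing archived in the last 5 minutes")
    if stats.get("failed_count", 0) > prev_failed_count:
        alerts.append("WAL archive failures increased since the last check")
    return alerts

now = datetime(2024, 6, 15, 12, 0)
healthy = {"last_archived_time": now - timedelta(minutes=1), "failed_count": 3}
print(archiver_alerts(healthy, prev_failed_count=3, now=now))  # -> []
```

Tracking prev_failed_count between runs (in your monitoring system's state) is what turns the cumulative counter into an "increased since last check" signal.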
pgBackRest has built-in check and verify commands:
pgbackrest --stanza=main check # Verify archiving is working
pgbackrest --stanza=main verify # Verify backup files are intact
Backup and recovery is the insurance policy for everything in this book. You can optimize performance, harden security, and design elegant schemas — but without working backups, a single hardware failure or operator error can erase it all. Invest in your backup strategy proportionally to the value of your data. For almost any production system, that means WAL archiving, regular base backups to offsite storage, and tested restores.
Security
PostgreSQL has one of the most sophisticated security models of any open-source database. Roles, row-level security, column-level permissions, encryption, audit logging — the tools are all there. The failure mode is not missing features; it’s not using them.
This chapter covers how to use Postgres’s security features to build a database that’s properly defended: least-privilege access, data-level access control, encrypted connections and storage, and comprehensive audit trails.
The Role System
Postgres security is organized around roles. A role is a named identity in the database that can hold privileges, own objects, and log in (or not). The distinction between “users” (roles that can log in) and “groups” (roles that can’t log in but grant privileges) is a convention, not a hard distinction.
Creating Roles
-- A login role (user)
CREATE ROLE app_user WITH LOGIN PASSWORD 'secure_password';
-- A login role with expiry
CREATE ROLE temp_analyst WITH LOGIN PASSWORD 'password'
VALID UNTIL '2025-01-01';
-- A non-login role (group)
CREATE ROLE readonly;
CREATE ROLE readwrite;
CREATE ROLE admin;
Granting Privileges
-- Grant schema usage (required to access objects within the schema)
GRANT USAGE ON SCHEMA public TO readonly;
GRANT USAGE ON SCHEMA public TO readwrite;
-- Grant SELECT on all existing tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
-- Grant SELECT/INSERT/UPDATE/DELETE on all existing tables
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO readwrite;
-- Grant sequence usage (for INSERT with serial/identity columns)
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO readwrite;
-- Default privileges for future objects
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO readwrite;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT USAGE, SELECT ON SEQUENCES TO readwrite;
-- Grant role membership (app_user gets readonly privileges)
GRANT readonly TO app_user;
The ALTER DEFAULT PRIVILEGES command is critical — without it, new tables created in the future won’t automatically grant the expected privileges to existing roles.
The Principle of Least Privilege
Every application connection should use a role with only the privileges it needs:
-- Application: reads and writes data, but never drops tables
CREATE ROLE myapp_user WITH LOGIN PASSWORD 'app_password';
GRANT readwrite TO myapp_user;
-- Read replica: SELECT only
CREATE ROLE myapp_reader WITH LOGIN PASSWORD 'reader_password';
GRANT readonly TO myapp_reader;
-- Migrations: need to create/alter tables
CREATE ROLE myapp_migrations WITH LOGIN PASSWORD 'migration_password';
GRANT readwrite TO myapp_migrations;
-- Also grant CREATE privileges:
GRANT CREATE ON SCHEMA public TO myapp_migrations;
-- Admin: SUPERUSER or carefully-scoped privileges
-- Prefer specific privileges over SUPERUSER in production
Separate credentials for migrations (run during deployments) vs. application connections (run all the time) is a security hygiene practice with real value: if the application credential is compromised, the attacker can’t run DDL.
pg_hba.conf: Client Authentication
pg_hba.conf (host-based authentication) controls which hosts can connect to which databases with which roles and which authentication methods.
# TYPE DATABASE USER ADDRESS METHOD
local all all peer # OS user = DB user
host all all 127.0.0.1/32 scram-sha-256
host mydb app_user 10.0.1.0/24 scram-sha-256
hostssl all all 0.0.0.0/0 scram-sha-256 # Require SSL
hostnossl all all 0.0.0.0/0 reject # Reject non-SSL
Key authentication methods:
- peer: Use the OS username (local connections only). The OS user must match the database role name.
- scram-sha-256: Password authentication using SCRAM. Modern and secure. Prefer over md5.
- md5: Password authentication using MD5. Deprecated — use scram-sha-256 instead.
- cert: Client certificate authentication. Most secure for automated connections.
- reject: Deny the connection outright.
Always use scram-sha-256 or cert, never md5 or trust in production.
trust means any connection from the matched host/user is allowed without a password. Appropriate only for local Unix socket connections on tightly controlled systems, and even then, risky.
SSL/TLS
All external connections should use SSL/TLS to encrypt data in transit. Any credential or sensitive data query over an unencrypted connection can be sniffed.
In postgresql.conf:
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
ssl_ca_file = 'ca.crt' # For client certificate verification
ssl_min_protocol_version = 'TLSv1.2'
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
Force SSL in pg_hba.conf by using hostssl entries (only allow SSL) and hostnossl ... reject (reject non-SSL):
hostssl all all 0.0.0.0/0 scram-sha-256
hostnossl all all 0.0.0.0/0 reject
In your application connection string, include sslmode=verify-full (or at minimum sslmode=require) to verify the server certificate.
Row-Level Security (RLS)
Row-Level Security is Postgres’s mechanism for restricting which rows a role can see or modify. It’s declarative, transparent to the application, and enforced at the database layer regardless of how the query arrives.
Enable RLS on a table:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
Create a policy:
-- Users can only see their own orders
CREATE POLICY orders_user_isolation ON orders
FOR ALL
TO app_user
USING (user_id = current_setting('app.current_user_id')::bigint);
The USING clause is checked for SELECT, UPDATE, and DELETE (which rows can you read/modify?). The WITH CHECK clause is checked for INSERT and UPDATE (which values can you write?):
-- Users can only insert orders for themselves
CREATE POLICY orders_insert_policy ON orders
FOR INSERT
TO app_user
WITH CHECK (user_id = current_setting('app.current_user_id')::bigint);
Setting RLS Context
The application must set the context before executing queries:
-- At the start of each request/transaction:
SET LOCAL app.current_user_id = '42';
-- Now all queries on `orders` only return rows for user_id = 42
SELECT * FROM orders; -- Only returns user 42's orders
SET LOCAL is scoped to the current transaction — it resets when the transaction ends. Safe for use in connection-pooled environments (transaction mode).
Multi-tenant RLS
RLS is the foundation of multi-tenant data isolation in a shared database:
-- Tenant isolation across all tables
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE products ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON users
USING (tenant_id = current_setting('app.tenant_id')::bigint);
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id')::bigint);
-- Admin role bypasses RLS
CREATE ROLE admin_user WITH LOGIN;
ALTER TABLE orders FORCE ROW LEVEL SECURITY; -- Applies RLS even to table owner
ALTER ROLE admin_user WITH BYPASSRLS; -- BYPASSRLS is a role attribute; such roles skip all RLS policies
Note: FORCE ROW LEVEL SECURITY applies RLS even to the table owner. Without it, the table owner bypasses RLS. In multi-tenant systems, you typically want FORCE ROW LEVEL SECURITY to prevent ownership-based bypass.
RLS Performance
RLS policies add a predicate to every query. For policies on indexed columns (like tenant_id), the query planner can use the index, and performance impact is minimal. For policies on unindexed columns or complex expressions, every query pays the cost of evaluating the policy.
Index the columns used in RLS policies:
CREATE INDEX idx_orders_tenant_id ON orders(tenant_id);
CREATE INDEX idx_orders_user_id ON orders(user_id);
Column-Level Security
Restrict access to specific columns using column-level privileges:
-- Grant SELECT on all columns except ssn and salary
GRANT SELECT (id, name, email, created_at) ON employees TO hr_viewer;
-- Or: grant all columns, then revoke the sensitive ones
GRANT SELECT ON employees TO hr_viewer;
REVOKE SELECT (ssn, salary) ON employees FROM hr_viewer;
Column-level security is useful for compliance requirements where certain columns (PII, financial data) must be visible only to specific roles.
Encryption at Rest
Postgres itself doesn’t encrypt data files. Encryption at rest is typically handled at the storage level:
- OS-level: Linux LUKS (dm-crypt) encrypts the entire filesystem, including Postgres data files
- Cloud: AWS RDS encrypts with AWS KMS, Google Cloud SQL with Cloud KMS — this is typically enabled by default
- Tablespace-level: Some storage systems provide per-volume encryption
- pgcrypto: Column-level encryption within the database (see below)
For cloud databases, ensure encryption at rest is enabled (it usually is by default).
Column-Level Encryption with pgcrypto
For particularly sensitive data that should be encrypted even from database administrators:
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Store an encrypted value
INSERT INTO medical_records (patient_id, diagnosis_encrypted)
VALUES (1, pgp_sym_encrypt('Type 2 Diabetes', 'encryption_key'));
-- Retrieve and decrypt
SELECT patient_id, pgp_sym_decrypt(diagnosis_encrypted, 'encryption_key')
FROM medical_records
WHERE patient_id = 1;
For public-key encryption (where different parties encrypt vs. decrypt):
-- Encrypt with public key
UPDATE users
SET ssn_encrypted = pgp_pub_encrypt(ssn, dearmor('-----BEGIN PGP PUBLIC KEY...'))
WHERE id = 1;
-- Decrypt requires private key (held outside the DB)
SELECT pgp_pub_decrypt(ssn_encrypted, dearmor('-----BEGIN PGP PRIVATE KEY...'), 'passphrase')
FROM users WHERE id = 1;
Column-level encryption has real costs: the data is not indexable, not queryable by value, and requires the application to manage keys. Use it only where the compliance requirement specifically demands it.
Audit Logging
For compliance and security incident investigation, knowing who did what and when is essential.
pgaudit
The pgaudit extension (Chapter 11) provides statement-level and object-level audit logging. Configure it in postgresql.conf:
shared_preload_libraries = 'pgaudit'
# Log all write operations and DDL
pgaudit.log = 'write, ddl'
# Make audit messages visible to the client session as well
pgaudit.log_client = on
# Create a separate log entry for each relation referenced in a statement
pgaudit.log_relation = on
Per-role audit configuration:
-- Audit all operations by this user
ALTER ROLE sensitive_user SET pgaudit.log = 'all';
Built-in Logging
Postgres’s standard logging can also provide audit trails:
log_connections = on
log_disconnections = on
log_duration = on
log_statement = 'mod' # Log all DML; or 'all' for everything
log_min_duration_statement = 0 # Log every statement with its duration (0 = no threshold)
Route logs to CSV files via log_destination = 'csvlog' and import them into a log analysis system (or expose them as a table with file_fdw).
Trigger-Based Audit Tables
For high-fidelity row-level audit trails:
CREATE TABLE audit_log (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
table_name TEXT NOT NULL,
operation TEXT NOT NULL CHECK (operation IN ('INSERT', 'UPDATE', 'DELETE')),
row_id BIGINT,
old_row JSONB,
new_row JSONB,
changed_by TEXT NOT NULL DEFAULT current_user,
changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE OR REPLACE FUNCTION audit_trigger()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO audit_log(table_name, operation, row_id, old_row, new_row)
VALUES (
TG_TABLE_NAME,
TG_OP,
CASE WHEN TG_OP = 'DELETE' THEN OLD.id ELSE NEW.id END,
CASE WHEN TG_OP != 'INSERT' THEN row_to_json(OLD)::jsonb END,
CASE WHEN TG_OP != 'DELETE' THEN row_to_json(NEW)::jsonb END
);
RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER audit_orders
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION audit_trigger();
This captures a complete before/after picture for every row change. Partition the audit_log table by date for manageability.
Security Checklist
A practical checklist for a production Postgres deployment:
- All external connections use SSL/TLS (hostssl in pg_hba.conf)
- Password authentication uses scram-sha-256, not md5 or trust
- Application uses a role with only necessary privileges (not superuser)
- Separate credentials for migration runner vs. application
- pg_hba.conf limits which hosts can connect
- Default passwords changed (especially the postgres role)
- RLS enabled for multi-tenant or user-scoped tables
- Column-level grants for sensitive data (SSN, salary, health data)
- pgaudit or trigger-based audit logging for compliance requirements
- Encryption at rest enabled (storage-level, or column-level for the most sensitive data)
- Postgres data directory accessible only to the postgres OS user
- pg_hba.conf and postgresql.conf not world-readable
PostgreSQL provides all the mechanisms needed for a properly secured database. The work is in applying them consistently. The most common security failures are not clever exploits — they’re misconfigured authentication, overly privileged application accounts, and missing audit trails for compliance requirements. The checklist above addresses all of them.
Observability
You can’t improve what you can’t measure. Postgres ships with a remarkably complete observability system built in — the pg_stat_* family of views, EXPLAIN ANALYZE, slow query logging, wait event tracking, and more. This chapter is about using those instruments effectively.
The goal of database observability is to be able to answer: “What is the database doing right now? What has it been doing? What’s slow? Why?” Without good observability, you’re flying blind. With it, performance investigations that used to take days take minutes.
The pg_stat_* Views
Postgres maintains statistics about nearly everything it does, surfaced as views in the pg_catalog schema. These views are essential for understanding what your database is doing.
pg_stat_activity
The most important view for real-time diagnostics. Shows all active connections and what they’re doing.
SELECT
pid,
usename,
application_name,
client_addr,
state,
wait_event_type,
wait_event,
left(query, 100) AS query_snippet,
now() - query_start AS duration,
now() - state_change AS state_duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC NULLS LAST;
The state column:
- active: Actively executing a query
- idle: Waiting for a query from the client
- idle in transaction: In a transaction, waiting between statements
- idle in transaction (aborted): In a failed transaction (must be rolled back)
- fastpath function call: Executing a fast-path function
idle in transaction is a red flag. A connection that’s idle in transaction holds locks for the duration. If a long transaction is idle (application code is doing slow work between database calls), those locks block other queries. Alert on connections that are idle in transaction for more than 30 seconds.
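That alert can be driven by a query like this (the 30-second threshold matches the suggestion above; tune to taste):

```sql
-- Connections idle inside a transaction for more than 30 seconds
SELECT pid,
       usename,
       application_name,
       now() - state_change AS idle_for,
       left(query, 100)     AS last_query   -- the last statement the session ran
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
  AND now() - state_change > interval '30 seconds'
ORDER BY idle_for DESC;
```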
The wait_event_type and wait_event columns tell you what the backend is waiting for:
Common wait events:
- Lock / relation: Waiting for a table lock
- Lock / tuple: Waiting for a row lock
- LWLock / buffer_content: Waiting for a shared buffer lock
- IO / DataFileRead: Waiting for a page read from disk (cache miss)
- Client / ClientRead: Waiting for the client to send the next query
- IPC / BgWorkerShutdown: Internal — background worker coordination
pg_stat_user_tables
Per-table statistics:
SELECT
relname,
seq_scan,
idx_scan,
n_live_tup,
n_dead_tup,
round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS bloat_pct,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
High seq_scan relative to idx_scan suggests missing or unused indexes. High n_dead_tup suggests autovacuum isn’t keeping up. last_autovacuum and last_autoanalyze tell you when maintenance last ran.
pg_stat_user_indexes
Per-index usage statistics:
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
idx_scan = 0 means the index has never been used since the last statistics reset. Unless it enforces a primary key, unique, or exclusion constraint, it is a candidate for removal: it adds write overhead and wastes space without helping any query.
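Before dropping anything, it is worth joining against pg_index to exclude indexes that enforce constraints, which must stay even if never scanned. A sketch:

```sql
-- Indexes never scanned since the last stats reset, largest first,
-- excluding primary-key and unique indexes (they enforce constraints)
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC;
```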
pg_stat_bgwriter
Background writer and checkpoint statistics:
SELECT
checkpoints_timed,
checkpoints_req,
round(checkpoints_req::numeric / nullif(checkpoints_timed + checkpoints_req, 0) * 100, 2) AS checkpoint_req_pct,
buffers_checkpoint,
buffers_clean,
maxwritten_clean,
buffers_alloc,
stats_reset
FROM pg_stat_bgwriter;
A checkpoint_req_pct above roughly 20% means checkpoints are being triggered by WAL volume rather than the timer; increase max_wal_size. A nonzero maxwritten_clean means the background writer stopped early because it hit bgwriter_lru_maxpages; raise that limit.
pg_stat_replication
Covered in Chapter 14. Essential for monitoring standby lag.
pg_stat_database
Database-level aggregates:
SELECT
datname,
numbackends AS active_connections,
xact_commit,
xact_rollback,
round(xact_rollback::numeric / nullif(xact_commit + xact_rollback, 0) * 100, 2) AS rollback_pct,
blks_read,
blks_hit,
round(blks_hit::numeric / nullif(blks_read + blks_hit, 0) * 100, 2) AS cache_hit_pct,
deadlocks,
temp_files,
pg_size_pretty(temp_bytes) AS temp_bytes_used
FROM pg_stat_database
WHERE datname = current_database();
cache_hit_pct below 95% suggests insufficient shared_buffers or a workload with poor cache locality. Nonzero temp_files and temp_bytes indicate sorts and hash joins spilling to disk — increase work_mem. The deadlocks counter should stay near zero; any growth warrants investigation.
pg_stat_statements: Your Query Analysis Workhorse
Already covered in Chapter 11, but it bears repeating: pg_stat_statements is the single most important observability tool in Postgres.
-- Most time-consuming queries total
SELECT
round(total_exec_time::numeric, 2) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round(rows::numeric / calls, 1) AS avg_rows,
left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 20;
-- High variance queries (sometimes fast, sometimes slow)
SELECT
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round(stddev_exec_time::numeric / nullif(mean_exec_time, 0) * 100, 2) AS cv_pct,
calls,
left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 100
ORDER BY cv_pct DESC
LIMIT 20;
-- Queries with most cache misses (lots of disk reads)
SELECT
round(shared_blks_read::numeric / calls, 1) AS avg_disk_reads,
round(shared_blks_hit::numeric / calls, 1) AS avg_cache_hits,
calls,
left(query, 120) AS query
FROM pg_stat_statements
WHERE calls > 10
ORDER BY avg_disk_reads DESC
LIMIT 20;
Reset statistics to get a fresh baseline:
SELECT pg_stat_reset(); -- Reset statistics counters for the current database
SELECT pg_stat_statements_reset(); -- Reset pg_stat_statements only
EXPLAIN ANALYZE: Query-Level Diagnostics
EXPLAIN (ANALYZE, BUFFERS) is the microscope for individual queries. Know how to read it.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT TEXT)
SELECT u.id, u.email, count(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > now() - interval '30 days'
GROUP BY u.id, u.email
ORDER BY order_count DESC
LIMIT 20;
Example output breakdown:
Sort (cost=14523.00..14523.05 rows=20 width=48) (actual time=234.1..234.1 rows=20 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=8234 read=126
-> HashAggregate (cost=14500.00..14520.00 rows=2000 width=48) (actual time=231.4..234.0 rows=1523 loops=1)
Group Key: u.id, u.email
Batches: 1 Memory Usage: 369kB
Buffers: shared hit=8234 read=126
-> Hash Left Join (cost=... rows=... (actual time=... loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=8234 read=126
-> Seq Scan on orders (cost=0.00..5280.00 rows=... (actual time=... loops=1)
Buffers: shared hit=2100 read=80
-> Hash (cost=... (actual time=... loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 121kB
Buffers: shared hit=6134 read=46
-> Index Scan using idx_users_created on users (cost=... (actual time=...)
Index Cond: (created_at > ...)
Buffers: shared hit=6134 read=46
Planning Time: 0.8 ms
Execution Time: 234.2 ms
Key patterns to recognize:
Seq Scan on a large table: The query is reading the whole table. Look for missing indexes.
Nested Loop with large outer:
Nested Loop (cost=... rows=100000)
-> Seq Scan on large_table (rows=100000)
-> Index Scan on other_table (rows=1)
This is 100000 × 1 index lookups. Fine if the outer is small; expensive if large.
Rows estimate vs. actual significantly different:
Seq Scan on orders (cost=... rows=100 ...) (actual ... rows=95000 ...)
The planner thought there’d be 100 rows but found 95,000 — bad statistics. Run ANALYZE orders.
Buffers: read is high: Pages being read from disk, not buffer cache. Either the data isn’t hot in the cache, or shared_buffers is too small for this workload.
Sort Method: external merge: Sort spilled to disk. Increase work_mem for this type of query.
Slow Query Logging
Configure Postgres to log queries that exceed a duration threshold:
log_min_duration_statement = 1000 # Log queries > 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
This generates log entries for slow queries, including the full query text, duration, and context. Feed these logs to pgBadger for analysis.
Be careful with log_statement = 'all': it logs every statement and generates enormous log volume on a busy system. Use log_min_duration_statement instead.
pgBadger: Log Analysis
pgBadger is a Perl script that analyzes Postgres log files and generates HTML reports showing:
- Slowest queries (sorted by count, total time, average time)
- Most frequent queries
- Queries per second over time
- Connections, errors, and lock waits over time
- Wait event statistics
pgbadger /var/log/postgresql/postgresql.log -o report.html
Run pgBadger daily on your log files and review the report. The “top 10 slowest queries by total time” section is where you start performance investigations.
Prometheus Integration: postgres_exporter
For infrastructure-level metrics and alerting, postgres_exporter (from Prometheus community) exposes Postgres metrics in Prometheus format:
# docker-compose.yml
services:
postgres_exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://exporter:password@postgres:5432/mydb?sslmode=disable"
ports:
- "9187:9187"
Key metrics exported:
- pg_stat_database_blks_hit / pg_stat_database_blks_read (cache hit ratio)
- pg_stat_user_tables_n_dead_tup (bloat)
- pg_stat_replication_* (replication lag)
- pg_stat_bgwriter_* (checkpoint and writer stats)
- pg_locks_count (lock contention)
- pg_stat_activity_count (active connections)
Essential Grafana dashboards:
- PostgreSQL Database (by Percona or the official Postgres exporter): Dashboard ID 9628 or 12273 on grafana.com
Set up alerts for:
- Cache hit ratio < 95%
- Replication lag > 60 seconds
- Dead tuple percentage > 20% on any table
- Number of idle in transaction connections > 5
- Checkpoint requested count growing (WAL pressure)
- Total connections > 80% of max_connections
Wait Event Analysis
When Postgres is slow and you’re not sure why, wait events tell you where time is being spent:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count DESC;
What the common wait events mean:
- Lock/relation: Table-level lock contention — migrations, VACUUM FULL, or high DDL rate
- Lock/tuple: Row-level lock contention — hot rows being updated by many connections
- LWLock/buffer_content: Shared buffer access contention — many connections reading/writing the same buffer
- IO/DataFileRead: Disk reads — cache misses, large sequential scans
- IO/WALWrite: WAL write overhead — heavy write workload
- Client/ClientRead: Waiting for application — the application is slow sending the next query
Wait events give you a diagnostic shortcut: instead of guessing why queries are slow, you can observe what they’re blocked on and fix that specifically.
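When those waits are Lock events, pg_blocking_pids() identifies the blocker directly, so you can see both sides of the contention:

```sql
-- For each blocked backend, show who is blocking it
SELECT blocked.pid              AS blocked_pid,
       left(blocked.query, 80)  AS blocked_query,
       blocker.pid              AS blocking_pid,
       blocker.state            AS blocking_state,
       left(blocker.query, 80)  AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocker ON blocker.pid = b.pid;
```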
The Observability Stack
A complete observability setup for a production Postgres instance:
Postgres logs → pgBadger → daily HTML report
Postgres metrics → postgres_exporter → Prometheus → Grafana dashboards
pg_stat_statements → periodic snapshot → trend analysis
pg_stat_activity → alerting → PagerDuty/Slack (on blocked queries, lock waits)
EXPLAIN ANALYZE → when investigating specific slow queries
This is not complex to set up, and the payoff is enormous. The difference between a team that’s reactive to database performance issues and one that catches them early is almost always whether they have this observability stack in place.
The best observability investment: add pg_stat_statements to shared_preload_libraries, set up postgres_exporter with Prometheus and Grafana, and configure log_min_duration_statement = 1000. These three steps give you 80% of the visibility you need, and they can be done in an afternoon.
Operability Wins
The case for Postgres isn’t only about what it can do. It’s about what you don’t have to do.
Operational simplicity is underrated in software engineering. The glamour is in new capabilities — “it can do vectors!” “it can queue jobs!” “it can search!” The cost is in maintenance — the 3am page when the Redis cluster gets into a split-brain, the week spent diagnosing why Elasticsearch’s JVM heap is growing, the Monday spent carefully migrating the MongoDB collection while keeping the app running. Operational work doesn’t make the product better. It just keeps it from getting worse.
When you commit to Postgres as your single database, you get an operational dividend that compounds over time. This chapter is about that dividend.
One Thing to Monitor
Every database you add to your stack is a monitoring surface. Redis has its own metrics — memory usage, eviction rate, connected clients, command rate, keyspace hits. Elasticsearch has JVM heap, shard allocation, index size, query latency, and node health. Kafka has consumer lag, partition leadership, and throughput per topic. MongoDB has its own oplog, index build progress, and lock metrics.
Learning these metrics, setting up the dashboards, calibrating the thresholds for alerts, and knowing what actions to take when things look wrong — this is non-trivial knowledge. It takes time to build. It requires someone on your team to carry it.
With Postgres as your single database, you have one monitoring surface. One set of metrics to understand deeply. One dashboard. One alert runbook. The pg_stat_* views, EXPLAIN ANALYZE, pg_stat_statements — these are the instruments, and there are books (this one) and careers built around understanding them.
Depth of understanding is not equally valuable on every system. Understanding Postgres deeply is worth more than understanding six different databases shallowly. The team with one database and deep expertise in it is more reliable than the team with six databases and shallow expertise in all of them.
One Thing to Back Up
Every database has a backup story. That story includes:
- Which tool to use (pg_dump? mongodump? elasticdump?)
- Where backups go (S3? local disk? backup service?)
- How to test that a backup is good (by trying to restore it)
- What the recovery procedure is (documented somewhere?)
- Who knows how to do it under pressure at 3am
Multiply this by the number of databases in your stack. In practice, most teams have a reasonable backup story for their primary relational database, and a confused or absent backup story for everything else. “We can restore the Postgres database” is reassuring; “we can restore the Postgres database, but we’re not sure about the Elasticsearch index, and the Redis cluster has no WAL archiving” is not.
A Postgres-only stack means one backup story, executed well. WAL archiving to S3, base backups via pgBackRest, retention policies, and tested restores. This is achievable for any team. Doing it for five different databases is not.
One Thing to Understand
The complexity of a system is roughly proportional to the number of concepts it asks you to hold simultaneously. Each database brings its own concepts:
- Redis: RESP protocol, eviction policies, persistence modes (RDB, AOF), cluster mode, Sentinel
- Elasticsearch: Lucene segments, shard allocation, index lifecycle management, mapping types, cluster state
- MongoDB: replica set oplog, sharding, document validation, aggregation pipeline, read preferences
- Kafka: partitions, consumer groups, offsets, compaction, ISR, ZooKeeper/KRaft
A new engineer joining a team with all of these in the stack faces a months-long onboarding process before they can contribute confidently. Not because any one database is impenetrable, but because the aggregate is a lot. And the more senior engineers spend time mentoring that onboarding, the less time they spend building.
With Postgres, the learning curve is real but finite. SQL is universal. The query planner, MVCC, EXPLAIN ANALYZE, WAL, autovacuum — these are learnable, well-documented, and transferable. An engineer who understands Postgres deeply can work effectively on any Postgres-backed system. That’s not true for a polyglot stack.
One Deployment to Update
Every database version in your stack is a version to track. When a critical security vulnerability is announced in Redis, you have to update Redis. When Elasticsearch releases a major version, you have to evaluate and execute the migration. When MongoDB changes its licensing, you have to evaluate alternatives. When Kafka’s coordinator migration is required for a new version, you have to plan the maintenance.
These are not optional. Security vulnerabilities get exploited. Outdated software accrues technical debt. Vendor support ends. The more databases you run, the more version management overhead you carry, indefinitely.
Postgres has a major version release roughly once a year, with a 5-year support window. There is one upstream project (the PostgreSQL Global Development Group), one community, and an exceptionally stable upgrade path. pg_upgrade handles major version upgrades with minimal downtime. The upgrade discipline for one database is manageable.
One Transaction Boundary
The most underappreciated benefit of a single database: transactional consistency across all your data.
Consider a common operation: a user purchases a product, which should:
- Decrement the product inventory
- Create an order record
- Enqueue an order confirmation email
- Update the user’s last_purchase_at timestamp
- Record an audit event
- Award loyalty points
In a polyglot stack:
- Steps 1, 2, 4, 5 are in Postgres
- Step 3 is a Redis queue operation
- Step 6 might be in Redis or Mongo
There is no transaction that spans Postgres and Redis. You can use the Saga pattern, the outbox pattern, or careful application logic to maintain consistency, but you’ve accepted that these operations can be partially applied. A process crash after step 2 and before step 3 leaves you with an order that never sends a confirmation email. A Redis timeout leaves you with an order that has no confirmation job.
With Postgres handling all of these (jobs in a jobs table, events in an audit table), steps 1-6 are all in a single transaction. Either all happen or none do. No partial state. No compensating transactions. No Saga coordinator. No idempotency keys. This eliminates entire categories of bugs that distributed consistency requires application code to handle.
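A sketch of the purchase flow as one transaction (table and column names are illustrative, and the jobs table is assumed to be the Postgres-backed queue from Chapter 7):

```sql
BEGIN;

-- 1. Decrement inventory (the application checks the affected row count)
UPDATE products SET inventory = inventory - 1
WHERE id = 42 AND inventory > 0;

-- 2. Create the order record
INSERT INTO orders (user_id, product_id) VALUES (7, 42);

-- 3. Enqueue the confirmation email as a row in the jobs table
INSERT INTO jobs (kind, payload)
VALUES ('order_confirmation', jsonb_build_object('user_id', 7, 'product_id', 42));

-- 4. Update the user's last purchase timestamp
UPDATE users SET last_purchase_at = now() WHERE id = 7;

-- 5. Record an audit event
INSERT INTO audit_events (kind, actor_id) VALUES ('purchase', 7);

-- 6. Award loyalty points
UPDATE users SET loyalty_points = loyalty_points + 10 WHERE id = 7;

COMMIT;  -- all six steps become visible atomically, or none do
```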
The Staffing Argument
The argument for operational simplicity becomes most tangible when you think about staffing.
A startup with five databases needs:
- A backend engineer who understands Postgres
- Someone who understands Redis well enough to tune it when it misbehaves
- Someone who understands Elasticsearch well enough to rebalance shards
- Someone who can handle Kafka consumer lag incidents
In a small team, these roles overlap. The backend engineer is also the Redis person is also the Elasticsearch person. That person is necessarily shallow on all of them. Or you hire specialists — which is expensive, and you have a team full of people who don’t fully understand each other’s domains.
A startup with one database needs people who understand Postgres. That’s a smaller requirement, easier to hire for, and produces a team that can cover for each other.
As you scale, the calculus changes — large organizations can afford deep specialization. But the “move fast” phase of a startup is exactly when you should resist the urge to build a complex data infrastructure. The complexity you introduce early becomes the technical debt you manage forever.
The Knowledge Half-Life Problem
Technology knowledge has a half-life. The Redis you learned in 2018 is not quite the same as the Redis you need to operate in 2024. The Elasticsearch you understood in 2020 has changed substantially. MongoDB Atlas has different operational characteristics than self-hosted MongoDB.
As the databases in your stack change, the knowledge of how to operate them depreciates. Your team’s operational expertise becomes stale, and the gaps become risks.
Postgres’s architecture is remarkably stable. The concepts you learn today — MVCC, WAL, VACUUM, the query planner, the extension system — will be relevant in ten years. The investment in Postgres expertise has a longer half-life than most. That makes it more valuable per hour of learning invested.
What Operational Simplicity Actually Looks Like
Here is what a team running Postgres-only actually experiences differently:
Incident response: An alert fires. The on-call engineer opens pg_stat_activity, pg_stat_statements, and the Grafana dashboard. All the relevant data is in one place. The mental model they need is one they’ve built over time.
Onboarding: A new engineer joins. You point them to the database. They can run \dt and see everything. The schema is visible and queryable. There’s no “but the search index is over here and the job queue is over here” orientation session.
Capacity planning: You need to know if you can handle twice the traffic. You look at query times, index sizes, and cache hit rates — all in one system. You can project with reasonable confidence.
Disaster recovery: Something goes catastrophically wrong. You restore from backup. One backup, one restore procedure, one system to verify. You’ve tested it. You’re confident.
Debugging a production bug: A record is in an unexpected state. You can query the audit log, the job history, and the data — all with SQL, all in one transaction boundary, with consistent timestamps and IDs.
When the Simplicity Argument Breaks Down
Operational simplicity is compelling, but it’s not infinite. There’s a scale at which the “one database” model genuinely creates problems:
- When your database is so large that read replicas can’t keep up with reporting queries and you need a data warehouse
- When your write throughput exceeds what a single Postgres instance can handle
- When your team is large enough that multiple teams need isolation from each other’s schemas
- When you have regulatory requirements that mandate different storage for different data types
These thresholds are real, but they’re typically reached at a scale larger than most teams realize. The team that adds a data warehouse when they have 10 million rows is over-engineering. The team that adds Redis for rate limiting before profiling whether Postgres can handle it is over-engineering.
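That profiling is cheap to do, because a fixed-window rate limiter is one upsert per request (a sketch; the table and key names are illustrative):

```sql
CREATE TABLE rate_limits (
    key          TEXT NOT NULL,        -- e.g. 'user:7' or an IP address
    window_start TIMESTAMPTZ NOT NULL, -- truncated to the window size
    count        INT NOT NULL DEFAULT 1,
    PRIMARY KEY (key, window_start)
);

-- One round trip per request: bump the current minute's counter and
-- return it; the application rejects the request if count > limit.
INSERT INTO rate_limits (key, window_start)
VALUES ('user:7', date_trunc('minute', now()))
ON CONFLICT (key, window_start)
DO UPDATE SET count = rate_limits.count + 1
RETURNING count;
```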
The principle is: default to operational simplicity, add complexity only when the evidence demands it, and be honest about what the evidence actually says.
The Compounding Effect
The benefits of operational simplicity compound. The engineer who avoids three database-specific rabbit holes in their first year spends that time building product. The team that doesn’t debug a Kafka consumer lag incident doesn’t get paged at 2am. The startup that ships with one database has a simpler stack to migrate when requirements change.
These are not dramatic wins in any individual moment. They’re small, consistent advantages that accumulate into a significant competitive edge over time. Operational simplicity is not glamorous, but it is one of the most durable engineering advantages a team can have.
Postgres is enough. The operability dividend is a large part of why.
Where Postgres Isn’t Enough
This book has argued, chapter by chapter, that Postgres can do more than most engineers realize — and that the operational cost of reaching for specialized databases is higher than it looks. That argument is true, and it’s important.
But the argument for Postgres has limits. There are workloads and requirements where Postgres genuinely cannot compete, and where adding a specialized tool is the right call. Knowing where that line is — and being honest about it — is what separates principled engineering from dogmatism.
This chapter is about those limits. It will not hedge or equivocate. If you’re in one of these situations, the right answer is not Postgres.
Global Geographic Distribution
The situation: Your users are on five continents and you need reads and writes to complete with low latency for all of them. Network round-trip time from Tokyo to a Postgres primary in Virginia is roughly 150ms before any query processing begins; a write that needs two round trips is already past 300ms.
Why Postgres can’t solve this: Postgres’s primary-standby model means writes must go to the primary. You can have read replicas in every region, but writes still go to one place. For geographically distributed write workloads, you need multi-primary or multi-region write capabilities.
What to reach for: Google Spanner (or CockroachDB for open-source) provide globally-distributed SQL with multi-region writes and serializable transactions. YugabyteDB is another option.
The honest threshold: Most applications have users globally but write patterns that tolerate the round-trip to a single primary. If a p99 write latency of 300ms for Japanese users writing to a US primary is acceptable (it often is), Postgres with regional read replicas is fine. This limit applies only when sub-50ms global write latency is a hard requirement.
Truly Extreme Write Throughput
The situation: You’re ingesting millions of events per second — IoT sensors, financial tick data, high-frequency telemetry. Sustained write throughput of 100,000+ rows per second, around the clock.
Why Postgres can’t solve this: Postgres is a disk-based, WAL-journaled database with MVCC overhead. Each write must go through WAL sync, tuple versioning, and index maintenance. On excellent hardware (NVMe RAID), you might sustain 50,000-100,000 simple INSERTs per second with synchronous_commit = off. This isn’t slow — but it’s a ceiling.
What to reach for: Apache Cassandra or ScyllaDB for high-ingest append workloads (no transactions, eventual consistency). ClickHouse for high-ingest analytical workloads (columnar, batch-optimized). InfluxDB (or TimescaleDB on modern hardware) for time-series ingestion. Apache Kafka as a durability layer before writes reach any database.
The honest threshold: TimescaleDB + aggressive batching (COPY instead of INSERT) can handle millions of data points per minute. This covers most IoT and telemetry workloads. The hard limit is when you need millions of transactional rows per second with ACID guarantees — that’s genuinely beyond what Postgres delivers.
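The batching that threshold depends on looks like this: accumulate rows in the application and flush each batch with a single COPY, optionally relaxing per-transaction durability (the assumption being that losing the last few transactions on a crash is acceptable for telemetry):

```sql
-- Relax WAL flushing for this session only: commits return before fsync.
-- A crash can lose the most recent commits, but never corrupts data.
SET synchronous_commit = off;

-- One COPY per batch instead of one INSERT per row
COPY telemetry (device_id, recorded_at, value) FROM STDIN WITH (FORMAT csv);
-- (the client library streams the batch rows here, e.g. via psycopg's COPY support)
```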
Complex Graph Traversal
The situation: Your data is fundamentally a graph — social networks, recommendation engines, fraud detection networks, knowledge graphs. You need to traverse relationships of arbitrary depth efficiently. “Find all users within 4 hops of user X” or “find the shortest path between nodes A and B.”
Why Postgres can’t solve this: Postgres can represent graphs (a nodes table, an edges table). Recursive CTEs let you traverse them. But recursive CTEs in Postgres are not optimized for graph traversal — they do level-by-level BFS/DFS and have no specialized graph index structures. For shallow traversals (2-3 hops) on reasonably sized graphs, Postgres works. For deep traversals on large graphs, query time grows exponentially.
What to reach for: Neo4j (or ArangoDB, TigerGraph) if graph traversal is a core product capability. The Cypher query language is genuinely expressive for graph queries in a way SQL recursive CTEs are not.
The honest threshold: If you need “find friends of friends” (depth 2) on a million-user social network, Postgres with proper indexes works fine. If you need “find all nodes reachable within 6 hops” on a billion-edge knowledge graph with sub-second latency, you need a dedicated graph database. Most applications don’t have the latter requirement.
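The depth-2 case that works fine can be written as a recursive CTE over a plain edges table (the schema is illustrative):

```sql
-- Friends-of-friends for user 42, capped at depth 2
WITH RECURSIVE reachable AS (
    SELECT friend_id, 1 AS depth
    FROM friendships
    WHERE user_id = 42
  UNION                                -- UNION (not UNION ALL) prunes duplicate rows
    SELECT f.friend_id, r.depth + 1
    FROM friendships f
    JOIN reachable r ON f.user_id = r.friend_id
    WHERE r.depth < 2
)
SELECT DISTINCT friend_id
FROM reachable
WHERE friend_id <> 42;
```

With an index on friendships (user_id), each level of the traversal is an index scan; it is the exponential growth in frontier size at higher depths, not the CTE itself, that makes deep traversals expensive.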
Real-Time Streaming and Event Processing
The situation: You need to consume a stream of events, perform real-time aggregations and joins across the stream, and emit derived events — all with sub-second latency at high throughput.
Why Postgres can’t solve this: Postgres is a request-response database, not a stream processor. The LISTEN/NOTIFY mechanism provides basic pub/sub, but it’s not durable (notifications are lost if no listener is connected) and doesn’t support stream processing patterns (windowed aggregations, stream joins).
What to reach for: Apache Kafka for durable, high-throughput event streaming. Apache Flink or Kafka Streams for real-time stream processing. For event-driven patterns that do fit within Postgres’s capabilities, a Postgres-backed queue such as River works well.
The honest threshold: If your use case is “notify workers that there’s new data” or “process events asynchronously,” Postgres’s job queue (Chapter 7) and LISTEN/NOTIFY are sufficient. If you need “join two high-throughput streams in real time and emit aggregated results within 1 second,” that’s a stream processor, not a database.
Search at Scale With Complex Relevance Requirements
The situation: Search is your core product feature. You have hundreds of millions of documents, complex BM25 relevance tuning requirements, learning-to-rank with ML models, faceted navigation across many dimensions, and a team dedicated to improving search quality.
Why Postgres can’t solve this: Postgres full-text search (Chapter 6) handles most search use cases well, but it’s not designed for search-as-a-product at scale. The built-in ranking functions are less sophisticated than Elasticsearch’s BM25 implementation. Faceted aggregations across hundreds of millions of documents under load require Elasticsearch’s aggregation framework. Learning-to-rank is not a first-class concept.
What to reach for: Elasticsearch (or OpenSearch). Typesense or Meilisearch for simpler requirements with better operational simplicity.
The honest threshold: This limit applies to teams where search quality is a core competitive differentiator and a team is dedicated to optimizing it. For “good enough” search on a modest content corpus (millions of documents, not hundreds of millions), Postgres full-text search with websearch_to_tsquery and GIN indexes is excellent.
Vector Search at Hundreds of Millions of Vectors
The situation: You have 500 million+ embeddings and need sub-10ms ANN search at thousands of queries per second with high recall.
Why Postgres can’t solve this: pgvector (Chapter 9) handles tens to low hundreds of millions of vectors well on capable hardware. At hundreds of millions of vectors, HNSW indexes become very large (potentially hundreds of GB), index build time is significant, and query throughput at sub-10ms latency becomes difficult to guarantee on a single server.
What to reach for: Pinecone, Weaviate, Qdrant, or Milvus for dedicated high-scale vector search. These systems are purpose-built for horizontal scaling of vector indexes.
The honest threshold: For most RAG pipelines and semantic search features (tens of millions of document chunks), pgvector is enough. The teams that need dedicated vector databases are building search products where vector retrieval is the primary workload, not a supporting feature.
Multi-Tenancy at Extreme Scale With Strong Isolation
The situation: You’re building a SaaS product with thousands of tenants, each with millions of rows, and tenants require strict resource isolation — one tenant’s heavy query load should not affect another tenant’s latency.
Why Postgres can’t solve this: Row-level security in a shared table (Chapter 16) provides data isolation but not resource isolation. If tenant A runs a massive analytical query, it competes for the same buffer cache and query threads as tenant B’s transactional queries. Schema-per-tenant (separate schemas in one Postgres instance) provides better isolation but still shares resources. Database-per-tenant provides full isolation but operational overhead scales with tenant count.
What to reach for: For full resource isolation with large numbers of tenants: Citus (Postgres extension) for horizontal sharding, or dedicated Postgres instances per tenant (feasible with managed services like Supabase or Neon’s serverless branching).
The honest threshold: For SaaS products with hundreds of tenants, shared-table RLS or schema-per-tenant in a single Postgres instance typically works fine. Resource isolation becomes a real issue when you have thousands of tenants with highly variable and potentially adversarial query loads.
Analytics at Petabyte Scale
The situation: Your data warehouse has terabytes to petabytes of data. You need to scan it for analytics, run arbitrary aggregations, and support dozens of analysts simultaneously.
Why Postgres can’t solve this: Postgres is a row-oriented database — data is stored in pages with all columns of a row together. Analytical queries that scan one or two columns from a billion-row table must still read all the row data. Columnar databases store data column-by-column, making analytical scans 10-100x more efficient.
What to reach for: ClickHouse (open-source, self-hosted) or BigQuery/Redshift/Snowflake (managed) for petabyte-scale analytical workloads. These are columnar databases optimized specifically for analytics.
The honest threshold: For reporting and analytics on operational data up to a few hundred million rows, Postgres with read replicas, partitioning, and materialized views is excellent. The pivot to a columnar warehouse makes sense when your analytics team needs to scan the entire fact table regularly, your data is in the tens of billions of rows, and query times become unacceptably long.
Recognizing the Pattern
Notice what these limits have in common: they’re all about scale or streaming, not capability. Postgres can do graphs, search, analytics, vector search, and time series — but at a scale threshold that most teams don’t reach in the first several years of operation.
The failure mode isn’t “Postgres can’t do X.” The failure mode is “we added Y database because we read that Postgres couldn’t do X at 100x our actual scale.” The requirement doesn’t exist yet; the complexity does.
Before adding a specialized database, make the case based on evidence:
- What is the actual current scale of this workload?
- Have you measured Postgres’s performance at this scale?
- Is the bottleneck actually Postgres, or is it something else (network, application, query design)?
- What is the operational cost of the specialized database, and is the performance gain worth it?
If you’ve done this analysis and the evidence is clear, add the specialized database. The teams that add databases preemptively, based on theoretical scale concerns, accumulate operational debt without evidence that the complexity is warranted.
Postgres is enough — until it isn’t. Know the difference, and act on evidence rather than assumptions.
The Stack — Putting It All Together
Twenty chapters later, here’s what we know: Postgres is a relational database that can also store documents, search text, queue jobs, serve as a key-value store, index vectors, handle time-series data, and do most of what five specialized databases would otherwise do. It’s ACID-compliant, well-understood, battle-tested, and backed by thirty years of engineering. And it doesn’t require you to run six separate systems to do it.
This final chapter ties everything together into a reference architecture — the stack that a well-run, Postgres-centric team actually runs.
The Core Principle
Design your stack around your actual requirements, not theoretical scale. Start simple. Add components only when you have evidence that the simpler approach is insufficient. Every component you add is a component you operate, monitor, debug, and eventually upgrade forever.
The reference architecture in this chapter is for a typical production web application or API — something serving thousands to millions of users. The vast majority of products fall into this category. Scale up from here when evidence demands it.
The Reference Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
│ (your web app / API servers) │
└──────────────────────────┬──────────────────────────────────┘
│ connection pool
▼
┌─────────────────────────────────────────────────────────────┐
│ PgBouncer │
│ (transaction mode pooler) │
└──────────────────────────┬──────────────────────────────────┘
│ 20–50 Postgres connections
┌──────────────┴──────────────┐
▼ ▼
┌───────────────────┐ ┌───────────────────┐
│ Postgres Primary │◄────────│ Postgres Standby │
│ (read/write) │ stream │ (read-only) │
└────────────┬──────┘ repl. └───────────────────┘
│
│ WAL archiving
▼
┌───────────────────┐
│ S3 / GCS │
│ (WAL archive + │
│ base backups) │
└───────────────────┘
That’s the core. One primary, one standby, PgBouncer in front, WAL-archived to object storage.
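A minimal PgBouncer configuration matching this topology might look like the following sketch; the host, database name, and paths are placeholders:

```ini
; pgbouncer.ini -- transaction-mode pooling in front of the primary
[databases]
myapp = host=10.0.0.5 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connection released at COMMIT/ROLLBACK
default_pool_size = 25       ; matches the 20-50 connection budget above
max_client_conn = 2000       ; application-side fan-out
```

Transaction mode is what makes the fan-out work: thousands of application connections share a few dozen server connections, at the cost of losing session-level state (prepared statements, session GUCs) across transactions.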
What Postgres Handles
In this architecture, Postgres handles everything by default:
Core relational data: Users, orders, products, accounts — your domain model, normalized, with proper foreign keys and constraints.
Document storage: Product attributes, configuration, event payloads — JSONB columns alongside relational columns. GIN indexes for queries.
Full-text search: Article content, product descriptions, support tickets — tsvector columns with GIN indexes. websearch_to_tsquery for user-facing search.
Background jobs: Order processing, email sending, report generation — a jobs table with SKIP LOCKED, or River/pg-boss. Transactional enqueue: jobs created atomically with the business data they operate on.
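The SKIP LOCKED pattern is worth seeing concretely. A minimal sketch, assuming a hypothetical jobs table:

```sql
-- Enqueue happens in the same transaction as the business write,
-- so a job can never reference data that was rolled back
CREATE TABLE jobs (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    kind TEXT NOT NULL,
    payload JSONB NOT NULL DEFAULT '{}',
    run_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    attempts INT NOT NULL DEFAULT 0
);

-- Worker loop: claim one due job without blocking other workers.
-- SKIP LOCKED makes concurrent workers skip rows already claimed.
WITH next_job AS (
    SELECT id FROM jobs
    WHERE run_at <= now()
    ORDER BY run_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
DELETE FROM jobs
WHERE id IN (SELECT id FROM next_job)
RETURNING id, kind, payload;
```

Deleting on claim is the simplest variant; production queues (and libraries like River) typically update a status column instead, so failed jobs can be retried.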
Sessions and ephemeral data: User sessions, rate limiting counters, temporary tokens — unlogged tables for performance with acceptable durability trade-offs.
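As an illustrative sketch of the unlogged-table trade-off:

```sql
-- Unlogged: writes skip WAL, so they are cheap and don't bloat the
-- archive; the table is truncated after a crash, which is acceptable
-- for sessions that users can simply re-create by logging in again
CREATE UNLOGGED TABLE sessions (
    token TEXT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    data JSONB NOT NULL DEFAULT '{}',
    expires_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX sessions_expires_idx ON sessions (expires_at);

-- Periodic cleanup (schedulable with pg_cron)
DELETE FROM sessions WHERE expires_at < now();
```

Note that unlogged tables are not replicated to standbys, so session reads must go to the primary.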
Caching: Computed summaries, denormalized read models — materialized views or application-managed cache tables. Refreshed on schedule via pg_cron.
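For example, a cached summary refreshed on a schedule might look like this (the orders columns are hypothetical):

```sql
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT date_trunc('day', created_at) AS day,
       count(*) AS orders,
       sum(total_cents) AS revenue_cents
FROM orders
GROUP BY 1;

-- A unique index is required for REFRESH ... CONCURRENTLY,
-- which rebuilds the view without blocking readers
CREATE UNIQUE INDEX ON daily_sales_summary (day);

-- Refresh at five past every hour via pg_cron
SELECT cron.schedule('refresh-sales-summary', '5 * * * *',
    $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary$$);
```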
Feature flags: Configuration, A/B test assignments — a feature_flags table, read at application startup or cached with a short TTL.
Vector search: Embeddings for semantic search, RAG pipelines, recommendation systems — pgvector with HNSW indexes. Hybrid search combining vectors with full-text and relational filters.
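A sketch of the pgvector side, assuming a hypothetical chunks table for a RAG pipeline and a 1536-dimension embedding model:

```sql
CREATE TABLE chunks (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id BIGINT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536)
);

CREATE INDEX chunks_embedding_idx ON chunks
    USING hnsw (embedding vector_cosine_ops);

-- ANN search combined with an ordinary relational filter;
-- <=> is cosine distance, $1 is the query embedding bound
-- by the application
SELECT id, content
FROM chunks
WHERE tenant_id = 42
ORDER BY embedding <=> $1
LIMIT 10;
```

Mixing the vector ORDER BY with WHERE clauses and joins in a single query is exactly what a dedicated vector store cannot do without duplicating your relational data.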
Time-series data: Application metrics, user events, audit logs — partitioned tables or TimescaleDB, with BRIN indexes and continuous aggregates.
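The plain-Postgres version of this can be sketched as follows, using a hypothetical events table:

```sql
-- Declaratively partitioned by month; old partitions can be
-- detached and dropped instead of deleted row-by-row
CREATE TABLE events (
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    kind TEXT NOT NULL,
    payload JSONB NOT NULL DEFAULT '{}'
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- BRIN stays tiny because occurred_at correlates with physical
-- insert order in an append-only table
CREATE INDEX events_occurred_brin ON events USING BRIN (occurred_at);
```

pg_partman automates creating the monthly partitions ahead of time and detaching expired ones.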
Audit trails: Every data change recorded with before/after values — trigger-based audit tables, partitioned by month.
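One common shape for this, sketched here with an orders table as the audited example:

```sql
CREATE TABLE audit_log (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    table_name TEXT NOT NULL,
    operation TEXT NOT NULL,
    old_row JSONB,
    new_row JSONB,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- OLD is unset on INSERT and NEW on DELETE, so branch on TG_OP
CREATE OR REPLACE FUNCTION audit_changes() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit_log (table_name, operation, old_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO audit_log (table_name, operation, old_row, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
    ELSE
        INSERT INTO audit_log (table_name, operation, new_row)
        VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION audit_changes();
```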
Scheduled tasks: Partition maintenance, expired session cleanup, materialized view refresh — pg_cron, all visible and manageable from SQL.
The Configuration Baseline
A Postgres configuration for a 32GB production server:
# Memory
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 2GB
# Checkpoints
checkpoint_completion_target = 0.9
max_wal_size = 4GB
checkpoint_timeout = 15min
# WAL
wal_level = replica
wal_compression = on
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
# Connections
max_connections = 200 # PgBouncer handles fan-out
# Autovacuum
autovacuum_max_workers = 6
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 2000
# Logging
log_min_duration_statement = 1000
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
# Stats
shared_preload_libraries = 'pg_stat_statements, pg_cron'
pg_stat_statements.track = all
Tune from this baseline based on your workload profile and actual performance data.
The Extension Set
The extensions to install on day one:
-- Observability (MUST HAVE)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Fuzzy search and LIKE acceleration
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Partition management
CREATE EXTENSION IF NOT EXISTS pg_partman;
-- Scheduled tasks
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Cryptography
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- Vector search (when you need it)
CREATE EXTENSION IF NOT EXISTS vector;
-- Geospatial (when you need it)
CREATE EXTENSION IF NOT EXISTS postgis;
-- TimescaleDB (when time-series volume warrants it)
CREATE EXTENSION IF NOT EXISTS timescaledb;
These extensions are mature, widely deployed, and add significant capability with minimal operational complexity.

The Schema Foundation
Every application table should have:
-- The template:
CREATE TABLE <table_name> (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
-- ... business columns ...
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- The trigger function (create once per schema, before any triggers
-- that reference it):
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Auto-update updated_at:
CREATE TRIGGER trigger_<table_name>_updated_at
BEFORE UPDATE ON <table_name>
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
For multi-tenant applications, every table additionally has:
tenant_id BIGINT NOT NULL REFERENCES tenants(id),
with RLS policies enforcing tenant isolation and indexes on tenant_id to keep those policies fast.
The Migration Workflow
Every schema change goes through:
- Write SQL migrations with CONCURRENTLY for indexes and NOT VALID + VALIDATE for constraints
- Test against production-sized data on a clone
- Deploy in phases if old and new code can't coexist (expand, then contract)
- SET lock_timeout = '5s' before any DDL that takes ACCESS EXCLUSIVE
- Monitor during the migration: pg_stat_activity, pg_locks, the slow query log
Use the expand/contract pattern for column renames and type changes. Never drop a column in the same deployment cycle as the one that stops using it.
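The first two rules above look like this in practice, with a hypothetical orders table and constraint:

```sql
-- Safe index build: no long ACCESS EXCLUSIVE lock, and lock_timeout
-- makes the brief locks CONCURRENTLY does take fail fast instead of
-- queueing behind long-running transactions
SET lock_timeout = '5s';
CREATE INDEX CONCURRENTLY orders_customer_idx ON orders (customer_id);

-- Safe constraint: NOT VALID skips the full-table scan at ADD time;
-- VALIDATE later checks existing rows while still allowing writes
ALTER TABLE orders
    ADD CONSTRAINT orders_total_positive
    CHECK (total_cents >= 0) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_positive;
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so migration tools must execute it outside their usual wrapping transaction.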
The Backup Setup
Daily: Full backup via pgBackRest → S3
Hourly: Differential backup via pgBackRest → S3
Continuous: WAL archiving via pgBackRest → S3
Retention:
Full backups: 4 weeks
Differential: 7 days
WAL: 7 days (enables PITR within that window)
Testing:
Monthly: Full restore test to a staging environment
After config changes: Verify archiving still works
The goal: restore any database state from the past 7 days in under 2 hours (RTO), with data loss no greater than the last WAL segment (RPO measured in seconds).
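A point-in-time restore with pgBackRest looks roughly like the following; the stanza name matches the archive_command above, and the target timestamp is a placeholder:

```shell
# Stop Postgres, then restore the cluster to a specific moment.
# --delta reuses unchanged files already on disk to speed up restore.
pgbackrest --stanza=main --delta \
    --type=time --target="2025-06-01 12:00:00+00" \
    restore

# On startup, Postgres replays archived WAL up to the target time.
```

Rehearsing exactly this command in the monthly restore test is what makes the 2-hour RTO credible.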
The Observability Stack
pg_stat_statements → query performance dashboard
postgres_exporter → Prometheus → Grafana
pgBadger → daily slow query report (from logs)
pg_stat_activity → real-time connection monitoring
Alerts on:
- Cache hit ratio < 95%
- Replication lag > 60 seconds
- Dead tuple percentage > 20%
- idle in transaction connections > 5 for more than 30 seconds
- Connections > 150 (80% of max_connections = 200)
- checkpoints_req > 20% of total checkpoints (WAL pressure)
The Security Baseline
-- Application credentials (not superuser):
CREATE ROLE myapp WITH LOGIN PASSWORD '...';
GRANT readwrite TO myapp;
-- Separate migration credentials:
CREATE ROLE migrations WITH LOGIN PASSWORD '...';
GRANT readwrite TO migrations;
GRANT CREATE ON SCHEMA public TO migrations;
-- Read replica credentials:
CREATE ROLE myapp_reader WITH LOGIN PASSWORD '...';
GRANT readonly TO myapp_reader;
-- RLS on multi-tenant tables:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id')::bigint);
-- Force SSL:
-- pg_hba.conf: hostssl all all 0.0.0.0/0 scram-sha-256
-- hostnossl all all 0.0.0.0/0 reject
The High-Availability Topology
For production workloads:
Primary (read/write)
↓ streaming replication (async)
Standby (read-only, hot standby)
↓ HAProxy or DNS
Application (routes writes to primary, reads to standby or primary)
For critical workloads where data loss is unacceptable:
- Synchronous replication to one standby
- The other standbys async (for read scaling without commit latency impact)
For managed environments: use the cloud provider’s multi-AZ HA. AWS RDS Multi-AZ, Cloud SQL HA, Aurora — these handle failover automatically.
When to Add Something
The reference architecture above has no Redis, no Elasticsearch, no Kafka. When does that change?
Add a caching layer (Redis or Memcached) when:
- You’ve profiled and found that Postgres is the actual bottleneck for read-heavy workloads
- Your application has data that is fetched very frequently but changes rarely, and the cost of cache invalidation logic is less than the cost of the database load
- You need sub-millisecond latency at a scale that Postgres unlogged tables can’t provide
Add Elasticsearch when:
- You have hundreds of millions of documents and search relevance is a core product feature that requires sophisticated tuning
- Your analytics team needs faceted navigation with complex aggregations at a scale that Postgres can’t support
Add Kafka when:
- You need durable, ordered event streaming between multiple independent systems
- Your outbox table pattern is generating replication lag that’s affecting your primary’s performance
- You’re integrating with external systems that speak Kafka
Add a data warehouse (ClickHouse, BigQuery, Snowflake) when:
- Your analytics queries are scanning billions of rows and impacting primary/replica performance
- Your data team needs complex ad-hoc analytical queries over your full dataset
Notice the pattern: these additions are made in response to evidence, not in anticipation of theoretical scale. The starting point is Postgres. The additions happen when you’ve hit the actual limit.
A Final Word
The engineers who run sophisticated systems on a single Postgres instance are not lazy. They’re not unaware of the alternatives. They’ve made a deliberate choice: to invest deeply in one tool and use it well, rather than spread thin knowledge across many.
Postgres rewards that investment. The more you know about it — MVCC, the query planner, VACUUM, WAL, extensions — the more valuable it becomes. The knowledge compounds. The expertise transfers to every team and system you work with. The operational habits you build (proper indexing, safe migrations, autovacuum monitoring, backup testing) make you a better engineer on any system.
The thesis of this book is simple: most teams reach for specialized databases before they need to, and pay a complexity tax they don’t have to pay. Postgres, properly understood and configured, handles a remarkable range of workloads — relational, document, search, queuing, key-value, vector, time-series — with a level of reliability and operational simplicity that no polyglot stack can match.
Use it well. Use all of it. And when you eventually do hit its limits — and you might, if you build something large enough — you’ll know exactly where the limits are, and you’ll have the expertise to make the right call about what to add next.
Postgres is enough. Now go build something.
Quick Reference: The Postgres-First Decision Tree
Is this about storing and querying data?
├── Yes
│ ├── Does it fit on one machine at current scale? → Postgres
│ ├── Is it time-series with high ingest? → TimescaleDB (it's still Postgres)
│ ├── Is it geospatial? → PostGIS (it's still Postgres)
│ ├── Is it vectors? → pgvector (it's still Postgres)
│ └── Is it truly global with write latency requirements? → Evaluate Spanner/CockroachDB
│
├── Do you need a job queue?
│ └── Is it < 100k jobs/sec? → River or pg-boss (Postgres)
│
├── Do you need full-text search?
│ └── Is it < 100M docs, not a search product? → Postgres FTS
│
├── Do you need caching?
│ └── Have you measured and proven Postgres is too slow? → Maybe Redis
│
└── Do you need pub/sub?
├── For worker notification? → LISTEN/NOTIFY (Postgres)
└── For distributed event streaming? → Kafka
The answer to “do you need X?” is almost always “try Postgres first.”