Performance Tuning

A default Postgres installation is configured conservatively — appropriate for a development machine or a small VM, but leaving significant performance on the table for a production server. This chapter covers the configuration parameters that matter, the autovacuum behavior you need to understand, connection pooling as a force multiplier, and the query optimization habits that prevent most performance problems.

Postgres performance tuning is iterative. There is no single configuration change that makes everything fast. The process is: measure, identify bottlenecks, tune, measure again.

Memory Configuration

shared_buffers

The most important memory setting. Controls the size of Postgres’s shared buffer cache — the pool of memory that all backends share for caching data pages.

Default: 128MB (egregiously low for production)

Recommendation: ~25% of available RAM

shared_buffers = 8GB  # on a 32GB server

Setting shared_buffers higher than 25% of RAM has diminishing returns because the OS page cache also caches frequently-read data. Going above 40% of RAM can actually hurt performance by reducing the OS page cache.

effective_cache_size

Not a memory allocation — it’s a hint to the query planner about how much total memory (shared_buffers + OS cache) is available for caching. The planner uses this to decide between index scans and sequential scans.

Recommendation: 50–75% of total RAM

effective_cache_size = 24GB  # on a 32GB server

If this is set too low, the planner incorrectly thinks the disk is slow and prefers sequential scans when it should use indexes.

work_mem

Memory available per sort or hash operation. Each sort, hash join, and hash aggregate can use up to work_mem. A query with multiple operations can use work_mem multiple times. With many concurrent connections, total memory usage can be work_mem × connections × operations_per_query.

Default: 4MB (too low for complex queries)

Recommendation: Balance between query complexity and connection count. A formula:

work_mem = (RAM - shared_buffers) / (max_connections × 2)

For a 32GB server with 8GB shared_buffers and 100 max_connections:

work_mem = (32 - 8) / (100 × 2) ≈ 120MB
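As a sanity check, the formula can be scripted. A minimal sketch, assuming RAM and shared_buffers are given in GB and the result is wanted in MB:

```python
def work_mem_mb(ram_gb: float, shared_buffers_gb: float, max_connections: int) -> float:
    """Conservative starting point for work_mem, in MB.

    Divides the memory left after shared_buffers across connections,
    assuming each connection may run ~2 memory-hungry operations at once.
    """
    remaining_mb = (ram_gb - shared_buffers_gb) * 1024
    return remaining_mb / (max_connections * 2)

# 32GB server, 8GB shared_buffers, 100 connections:
print(round(work_mem_mb(32, 8, 100)))  # 123 -> round down to ~120MB
```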

Setting work_mem too high with many connections causes OOM. Set conservatively globally and increase per-session for complex analytical queries:

SET work_mem = '256MB';
SELECT ... ORDER BY ... LIMIT ...;  -- benefits from higher work_mem
RESET work_mem;

Watch for “Sort Method: external merge” in EXPLAIN ANALYZE output; it means the sort spilled to disk because work_mem was too low.

maintenance_work_mem

Memory for maintenance operations: VACUUM, CREATE INDEX, ALTER TABLE, etc.

Recommendation: 1GB or more for systems with large tables. This speeds up index creation dramatically.

maintenance_work_mem = 2GB

max_wal_size

Sets how much WAL can accumulate before a checkpoint is forced. Too small causes frequent checkpoints, generating I/O spikes. Too large increases recovery time after a crash.

Recommendation: 2–4GB for most systems, higher for write-heavy workloads.

max_wal_size = 4GB

Watch pg_stat_bgwriter.checkpoints_req (pg_stat_checkpointer.num_requested on Postgres 17+): if it grows faster than the timed counter, checkpoints are being triggered by WAL volume rather than by the timer. Increase max_wal_size.

Checkpoint Configuration

Checkpoints flush dirty pages from the buffer cache to disk. A checkpoint whose writes land in a short burst causes I/O spikes (all the dirty pages written at once). Checkpoints spaced too far apart leave more WAL to replay, lengthening recovery after a crash.

checkpoint_completion_target = 0.9  # Spread I/O over 90% of checkpoint interval
checkpoint_timeout = 15min           # Maximum time between checkpoints

checkpoint_completion_target = 0.9 (the default in recent Postgres versions) is good. It tells Postgres to spread checkpoint writes over 90% of the interval between checkpoints, smoothing I/O.
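To see what the spreading buys, a back-of-the-envelope sketch (the 4GB of dirty data is an illustrative assumption, not something Postgres reports here):

```python
def checkpoint_write_rate_mb_s(dirty_mb: float, timeout_s: float,
                               completion_target: float) -> float:
    """Average write rate when checkpoint I/O is spread over
    completion_target * checkpoint_timeout instead of a burst."""
    window_s = timeout_s * completion_target
    return dirty_mb / window_s

# 4GB of dirty pages, checkpoint_timeout = 15min, target = 0.9:
# writes are spread over 810 seconds at roughly 5 MB/s instead of a burst.
print(round(checkpoint_write_rate_mb_s(4096, 15 * 60, 0.9), 1))  # 5.1
```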

WAL Settings

wal_level = replica         # Minimum for streaming replication
wal_compression = on        # Compress WAL records (reduces WAL volume)
wal_buffers = 16MB          # WAL write buffer (usually auto-configured from shared_buffers)
synchronous_commit = on     # Don't change this unless you understand the trade-off

synchronous_commit = off gives a performance boost (no fsync on commit) at the cost of potentially losing the last few seconds of committed transactions on crash. This is acceptable for non-critical data (analytics events, logs, rate limit counters). Never set it off globally for transactional data.

Autovacuum: The Most Misunderstood Setting

Autovacuum is the background process that reclaims dead tuple space, updates table statistics, and prevents XID wraparound. It is not optional. Disabling autovacuum is not a performance optimization — it’s setting up a future disaster.

The most important autovacuum settings:

autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold

These control when autovacuum triggers on a table. The formula:

threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

Defaults:

  • autovacuum_vacuum_threshold = 50 (absolute minimum dead tuples)
  • autovacuum_vacuum_scale_factor = 0.2 (20% of table)

For a table with 10 million rows, autovacuum triggers when there are 2,000,050 dead tuples. For a frequently-updated table, this means autovacuum runs rarely, and dead tuples accumulate heavily before cleanup.
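The trigger condition is easy to compute directly; a sketch using the formula above with the default settings:

```python
def autovacuum_trigger(reltuples: int, threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    """Dead tuples needed before autovacuum touches the table."""
    return threshold + scale_factor * reltuples

# 10M-row table with the defaults vs. the per-table tuning shown below:
print(int(autovacuum_trigger(10_000_000)))                                     # 2000050
print(int(autovacuum_trigger(10_000_000, threshold=1000, scale_factor=0.01)))  # 101000
```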

Recommendation for large tables: Lower the scale factor significantly, or set per-table thresholds:

-- Per-table: vacuum after 1% dead tuples (instead of 20%)
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_analyze_threshold = 500
);

autovacuum_max_workers

Default: 3. The number of autovacuum workers that can run simultaneously. For a system with many large, frequently-updated tables, the default is often insufficient.

autovacuum_max_workers = 6

More workers means more CPU and I/O, but tables stay cleaner and queries stay fast.

autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit

Autovacuum throttles itself using a cost-based mechanism to avoid overwhelming I/O. Each page hit, page read, and page dirtied accrues a cost (vacuum_cost_page_hit, vacuum_cost_page_miss, vacuum_cost_page_dirty). When the accumulated cost reaches autovacuum_vacuum_cost_limit, autovacuum sleeps for autovacuum_vacuum_cost_delay.

Defaults:

  • autovacuum_vacuum_cost_delay = 2ms (Postgres 12+; 20ms before that)
  • autovacuum_vacuum_cost_limit = -1, which falls back to vacuum_cost_limit (200)

For SSDs, autovacuum can be much more aggressive. The default throttling is designed for spinning disks:

# For NVMe SSDs:
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 2000  # 10x more aggressive
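The effect of the cost limit on vacuum duration can be estimated with a toy model; a sketch assuming a uniform, illustrative cost of 2 units per page and ignoring the actual I/O time:

```python
def vacuum_sleep_seconds(pages: int, cost_per_page: int,
                         cost_limit: int, delay_ms: float) -> float:
    """Total time autovacuum spends sleeping: one delay each time the
    accumulated cost crosses the limit."""
    total_cost = pages * cost_per_page
    sleeps = total_cost // cost_limit
    return sleeps * delay_ms / 1000

# Vacuuming 1M pages at the default limit vs. the 10x limit:
print(vacuum_sleep_seconds(1_000_000, 2, 200, 2))   # 20.0 seconds of sleep
print(vacuum_sleep_seconds(1_000_000, 2, 2000, 2))  # 2.0 seconds of sleep
```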

Monitoring Autovacuum

-- See when tables were last vacuumed and their bloat
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Tables approaching their autovacuum threshold, honoring per-table overrides
SELECT
    s.schemaname,
    s.relname,
    s.n_dead_tup,
    s.n_live_tup,
    coalesce(opts.threshold,
             current_setting('autovacuum_vacuum_threshold')::bigint)
      + coalesce(opts.scale_factor,
                 current_setting('autovacuum_vacuum_scale_factor')::numeric)
        * s.n_live_tup AS vacuum_threshold
FROM pg_stat_user_tables s
LEFT JOIN LATERAL (
    SELECT
        (regexp_match(array_to_string(c.reloptions, ','),
                      'autovacuum_vacuum_threshold=(\d+)'))[1]::bigint AS threshold,
        (regexp_match(array_to_string(c.reloptions, ','),
                      'autovacuum_vacuum_scale_factor=(\d+\.?\d*)'))[1]::numeric AS scale_factor
    FROM pg_class c
    WHERE c.oid = s.relid
) opts ON true
ORDER BY s.n_dead_tup DESC;

Connection Pooling with PgBouncer

Postgres connections are expensive: each backend process uses ~5-10MB of RAM and involves OS process creation overhead. Applications that open many short-lived connections — serverless functions, high-concurrency APIs — can overwhelm Postgres’s connection capacity.

PgBouncer is a lightweight connection pooler that sits between your application and Postgres, multiplexing many application connections onto a smaller number of Postgres connections.

Pool Modes

Session mode: A server connection is assigned to a client for the duration of the client session. No query-level multiplexing. Only useful for reducing connection overhead (not connection count).

Transaction mode: A server connection is assigned for the duration of each transaction. After the transaction commits or rolls back, the connection returns to the pool. This is the most useful mode — a pool of 50 server connections can handle thousands of concurrent application connections.

Statement mode: A server connection is assigned for a single statement, then released. Most restrictive — prepared statements, SET commands, and transactions spanning multiple statements don’t work.

For most applications, transaction mode is the right choice.
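Little's-law arithmetic shows why transaction mode stretches a small pool so far; a sketch with illustrative numbers (5ms average transactions):

```python
def pool_throughput_tps(pool_size: int, avg_txn_seconds: float) -> float:
    """Transactions/second a fully utilized pool can sustain: each server
    connection completes 1/avg_txn_seconds transactions per second."""
    return pool_size / avg_txn_seconds

# 50 server connections, 5ms transactions: ~10,000 txns/sec of capacity,
# enough to serve thousands of mostly-idle client connections.
print(round(pool_throughput_tps(50, 0.005)))  # 10000
```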

PgBouncer Configuration

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432

pool_mode = transaction

# Server connections
max_client_conn = 10000    # Allow many application connections
default_pool_size = 25     # Postgres sees max 25 connections from this app
min_pool_size = 5
reserve_pool_size = 5

# Authentication
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0

# Logging
log_connections = 0
log_disconnections = 0

What doesn’t work in transaction mode: session-level SET (SET LOCAL inside a transaction is fine), advisory locks held across transactions, LISTEN/NOTIFY, cursors held open across transactions, and protocol-level prepared statements (unless you enable PgBouncer’s prepared-statement tracking via max_prepared_statements, available since 1.21). Know these limitations before adopting it.

How Many Postgres Connections?

The optimal number of Postgres server connections for throughput:

optimal_connections ≈ CPU_count * 2 + effective_spindle_count

For a 4-core server with an NVMe SSD:

optimal_connections ≈ 4 * 2 + 1 = 9

This seems shockingly low but is supported by benchmarks. More connections than CPUs means context-switching overhead and lock contention outweigh concurrency benefits. For most servers, 20-50 Postgres connections is sufficient for high throughput. PgBouncer’s job is to make 1000 application clients share those efficiently.
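The sizing rule as code; a sketch, where effective_spindle_count is 1 for a single SSD:

```python
def optimal_connections(cpu_count: int, effective_spindle_count: int) -> int:
    """Starting point for the Postgres server-side connection count."""
    return cpu_count * 2 + effective_spindle_count

print(optimal_connections(4, 1))   # 9  (4-core server, one NVMe SSD)
print(optimal_connections(16, 1))  # 33 (16-core server)
```

Treat the result as a starting point for benchmarking, not a hard rule.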

max_connections

Default: 100. Set this based on what you can afford in terms of memory, not what you hope to use.

Each connection uses about 5-10MB of RAM. For a server with 32GB RAM and 8GB shared_buffers, the remaining 24GB can support roughly 2,400-4,800 connections. But you don’t want that many — use PgBouncer instead and keep max_connections low.

max_connections = 200  # With PgBouncer handling the fan-out

Query Performance: The EXPLAIN Habit

No amount of server configuration compensates for missing indexes or bad queries. The most impactful performance work is query-level.

The diagnostic workflow:

  1. Find slow queries via pg_stat_statements:

SELECT query, calls, total_exec_time / calls AS avg_ms, rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY avg_ms DESC
LIMIT 20;

  2. Explain the slow query:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > now() - interval '7 days'
  AND o.status = 'pending';

  3. Look for:

    • Sequential scans on large tables → need an index
    • Nested loop with a large outer relation → bad join choice, likely stale statistics
    • High Buffers: read → cache misses, data not in shared buffers
    • rows=X vs. actual rows=Y with a large discrepancy → stale statistics, run ANALYZE
    • Sort Method: external merge → the sort spilled to disk; increase work_mem
    • Rows Removed by Filter: N with large N → index missing or not selective enough

  4. Fix the issue (add an index, update statistics, rewrite the query).

  5. Verify the improvement (run EXPLAIN ANALYZE again).

Statistics and ANALYZE

Run ANALYZE after large bulk loads to update table statistics before the planner has to guess:

ANALYZE orders;  -- Update statistics for one table
ANALYZE;         -- Update statistics for all tables in current database

For columns with high cardinality or very uneven distributions, increase per-column statistics:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

Parallel Query

Postgres can parallelize query execution across multiple CPU cores for sequential scans and some aggregations. This is controlled by max_parallel_workers_per_gather.

max_parallel_workers_per_gather = 4   # Up to 4 workers per parallel query
max_parallel_workers = 8              # Total parallel workers (across all queries)
max_worker_processes = 16             # Total background workers

Parallel query is automatic — the planner decides when to use it. It helps large analytical queries; it doesn’t help indexed lookups.

Partitioning for Performance

As covered in Chapter 3, time-partitioned tables with recent-data access patterns benefit enormously from partition pruning:

-- This only scans the relevant monthly partition:
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';

For queries that always filter on the partition key, partitioning is a very effective performance strategy for large tables.

Identifying Bloat

Index and table bloat degrades performance and wastes disk space. A quick check:

-- Find tables with significant dead tuple bloat
SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || relname)) AS total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- pgstattuple for precise bloat measurement (requires extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');

For severe bloat that autovacuum can’t reclaim (e.g., after a massive delete), consider VACUUM FULL (locks the table) or pg_repack (online, no locking).

Configuration Validation Tools

pgtune (pgtune.leopard.in.ua) generates a postgresql.conf based on your hardware profile. It’s a good starting point.

pgBadger analyzes Postgres log files and produces detailed reports on slow queries, wait events, and error patterns.

check_postgres is a Nagios/Icinga monitoring plugin that checks connection counts, bloat, vacuum age, and many other indicators.

The key principle: Postgres’s default configuration is a safe minimum, not a target. Every production Postgres instance should be tuned for its workload and hardware. The changes in this chapter — higher shared_buffers, lower autovacuum scale factors, PgBouncer in transaction mode — produce immediate, measurable improvements on almost every system.