Chapter 12: Buffer Pools and Caching

“The fastest disk read is the one that never happens.”

Memory is fast; disk is slow. The buffer pool is the database’s primary defense against I/O latency—a sophisticated cache that keeps frequently-accessed pages in memory. Understanding how it works is essential for tuning database performance.


12.1 Why Buffer Pools Exist

                    THE MEMORY-DISK GAP

    RAM Access:    ~100 nanoseconds
    SSD Access:    ~100,000 nanoseconds (100 microseconds)
    HDD Access:    ~10,000,000 nanoseconds (10 milliseconds)

    RAM is 1,000x faster than SSD
    RAM is 100,000x faster than HDD

    If we can serve requests from memory, huge win!

The buffer pool bridges this gap by:

  1. Caching frequently-used pages in memory
  2. Batching writes to reduce I/O
  3. Managing memory pressure intelligently

12.2 Buffer Pool Architecture

                    BUFFER POOL STRUCTURE

    ┌─────────────────────────────────────────────────────────────┐
    │                     BUFFER POOL                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │                   PAGE TABLE                             ││
    │  │  PageID → Buffer Frame mapping                          ││
    │  │  ┌──────────────┬─────────┐                             ││
    │  │  │ (file, page) │ frame # │                             ││
    │  │  ├──────────────┼─────────┤                             ││
    │  │  │ (1, 42)      │   17    │                             ││
    │  │  │ (1, 100)     │    3    │                             ││
    │  │  │ (2, 5)       │   42    │                             ││
    │  │  └──────────────┴─────────┘                             ││
    │  └─────────────────────────────────────────────────────────┘│
    │                                                              │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │                  BUFFER FRAMES                           ││
    │  │  ┌───────┬───────┬───────┬───────┬───────┬───────┐      ││
    │  │  │ Frame │ Frame │ Frame │ Frame │ Frame │ Frame │ ...  ││
    │  │  │   0   │   1   │   2   │   3   │   4   │   5   │      ││
    │  │  │ 8KB   │ 8KB   │ 8KB   │ 8KB   │ 8KB   │ 8KB   │      ││
    │  │  │       │       │       │(1,100)│       │       │      ││
    │  │  └───────┴───────┴───────┴───────┴───────┴───────┘      ││
    │  │                                                          ││
    │  │  Each frame holds one page and metadata:                 ││
    │  │  - Page content (8KB)                                    ││
    │  │  - Dirty flag (modified since read?)                     ││
    │  │  - Pin count (currently in use?)                         ││
    │  │  - Reference count (for eviction decisions)              ││
    │  └─────────────────────────────────────────────────────────┘│
    └─────────────────────────────────────────────────────────────┘

12.3 Page Table and Hash Index

The page table maps (file_id, page_number) to buffer frames:

    Request: Read page 42 of file 1

    1. Compute hash: hash(1, 42) = bucket 17
    2. Search bucket 17 for (1, 42)
    3. If found: Return frame pointer (HIT!)
    4. If not found: Load from disk (MISS)

Hash Table Considerations

  • Must handle concurrent access (latches)
  • Bucket chains for collisions
  • Resize when too full
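The lookup path above can be sketched in a few lines of Python. This is illustrative only: a plain dict stands in for the latched hash table, eviction and pinning are omitted, and names like `BufferPool` don't come from any particular engine.

```python
class BufferPool:
    """Minimal sketch: page table maps (file_id, page_no) -> frame index."""

    PAGE_SIZE = 8192

    def __init__(self, num_frames):
        self.frames = [None] * num_frames      # frame index -> page content
        self.page_table = {}                   # (file_id, page_no) -> frame index
        self.free = list(range(num_frames))    # free frame list

    def get_page(self, file_id, page_no):
        key = (file_id, page_no)
        frame = self.page_table.get(key)
        if frame is not None:
            return frame, "HIT"                # found in page table
        frame = self.free.pop()                # MISS: take a free frame
        self.frames[frame] = self._read_from_disk(key)
        self.page_table[key] = frame           # register the new mapping
        return frame, "MISS"

    def _read_from_disk(self, key):
        return b"\x00" * self.PAGE_SIZE        # placeholder 8KB page

pool = BufferPool(num_frames=4)
print(pool.get_page(1, 42)[1])   # MISS - first access loads from "disk"
print(pool.get_page(1, 42)[1])   # HIT  - now cached
```

A real implementation would also pin the frame before returning it and take a latch around the hash lookup.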

12.4 The Page Request Flow

                    PAGE REQUEST FLOW

    Request page (1, 42)
           │
           ▼
    ┌─────────────────┐
    │ Check page table│
    └────────┬────────┘
             │
     ┌───────┴───────┐
     │               │
    HIT            MISS
     │               │
     ▼               ▼
    Pin page    ┌─────────────────┐
    in frame    │ Find free frame │
     │          │ (or evict one)  │
     │          └────────┬────────┘
     │                   │
     │                   ▼
     │          ┌─────────────────┐
     │          │ Read page from  │
     │          │ disk into frame │
     │          └────────┬────────┘
     │                   │
     │                   ▼
     │          ┌─────────────────┐
     │          │ Add to page     │
     │          │ table           │
     │          └────────┬────────┘
     │                   │
     └──────────┬────────┘
                │
                ▼
         Return frame pointer

12.5 Page Replacement Policies

When the buffer pool is full and we need to load a new page, which page do we evict?

LRU (Least Recently Used)

Evict the page that hasn’t been accessed for the longest time.

    Access sequence: A, B, C, D, A, B, E (pool size = 4)

    A: [A, _, _, _]
    B: [A, B, _, _]
    C: [A, B, C, _]
    D: [A, B, C, D]
    A: [B, C, D, A] (A becomes most recently used)
    B: [C, D, A, B] (B becomes most recently used)
    E: [D, A, B, E] (C evicted - least recently used)
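The trace above can be reproduced with an `OrderedDict`, where the last entry is the most recently used (a sketch, not production eviction code):

```python
from collections import OrderedDict

def lru_trace(accesses, capacity):
    """Replay an access sequence against an LRU pool of the given size."""
    pool = OrderedDict()                          # oldest first, newest last
    evicted = []
    for page in accesses:
        if page in pool:
            pool.move_to_end(page)                # refresh recency on a hit
        else:
            if len(pool) >= capacity:
                victim, _ = pool.popitem(last=False)  # least recently used
                evicted.append(victim)
            pool[page] = True
    return list(pool), evicted

pages, evicted = lru_trace("ABCDABE", 4)
print(pages)    # ['D', 'A', 'B', 'E']
print(evicted)  # ['C']
```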

Problem: Sequential scan evicts everything!

    Sequential scan of 1M pages:
    Each page accessed once, never again
    But they fill the entire buffer pool
    Evicting actually useful pages!

LRU-K

Track last K accesses, not just most recent. Evict page with oldest K-th access.

Clock (Second Chance)

Approximate LRU with less overhead:

                    CLOCK ALGORITHM

    Frames arranged in circle with a clock hand

              ┌─────┐
              │  A  │ ref=1
              └─────┘
     ┌─────┐           ┌─────┐
     │  D  │     ▲     │  B  │
     │ref=0│     │     │ref=1│
     └─────┘     │     └─────┘
              ┌──┴──┐
              │  C  │ ref=0    (▲ = clock hand)
              └─────┘

    To find victim:
    1. Check current frame
    2. If ref=0: Evict it
    3. If ref=1: Set ref=0, advance hand, repeat

    Accessing a page sets its ref=1
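The victim search above can be sketched directly; the `Clock` class and its setup are illustrative, with the hand positioned at C to match the diagram:

```python
class Clock:
    """Sketch of clock replacement: one reference bit per frame, a sweeping hand."""

    def __init__(self, pages):
        self.pages = list(pages)        # frames arranged in a circle
        self.ref = [1] * len(pages)     # reference bits, set on access
        self.hand = 0

    def access(self, page):
        self.ref[self.pages.index(page)] = 1   # touching a page sets ref=1

    def evict(self):
        while True:
            if self.ref[self.hand] == 0:
                return self.pages[self.hand], self.hand  # victim found
            self.ref[self.hand] = 0            # second chance: clear the bit
            self.hand = (self.hand + 1) % len(self.pages)

clock = Clock(["A", "B", "C", "D"])
clock.ref = [1, 1, 0, 0]    # matches the diagram: C and D unreferenced
clock.hand = 2              # hand currently at C
print(clock.evict())        # ('C', 2) - first frame the hand sees with ref=0
```

If every frame has ref=1, the hand clears bits all the way around and evicts the frame it started from on the second pass.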

2Q (Two Queues)

Protect hot pages from sequential scans:

    Queue 1 (A1): Recently added pages (FIFO)
    Queue 2 (Am): Frequently accessed pages (LRU)

    New page → A1
    Accessed again while in A1 → Promote to Am
    Evicted from A1 without reaccess → Gone

    Sequential scan pages never make it to Am!
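A toy version of the two-queue idea shows why scans stay cold. Queue sizes and names here are illustrative; real 2Q also tracks recently evicted pages, which this sketch omits:

```python
from collections import OrderedDict, deque

def two_q(accesses, a1_size, am_size):
    """Replay accesses through a FIFO (A1) and a hot LRU (Am)."""
    a1 = deque()                 # recently added pages, FIFO order
    am = OrderedDict()           # frequently accessed pages, LRU order
    for page in accesses:
        if page in am:
            am.move_to_end(page)                 # hot page touched again
        elif page in a1:
            a1.remove(page)                      # second access: promote A1 -> Am
            if len(am) >= am_size:
                am.popitem(last=False)
            am[page] = True
        else:
            if len(a1) >= a1_size:
                a1.popleft()                     # scanned once, silently dropped
            a1.append(page)
    return list(a1), list(am)

# A hot page "H" plus a sequential scan S1..S4 passing through:
a1, am = two_q(["H", "H", "S1", "S2", "S3", "S4"], a1_size=2, am_size=2)
print(am)   # ['H'] - only the re-accessed page became hot
print(a1)   # ['S3', 'S4'] - scan pages just cycle through A1
```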

12.6 Dirty Page Management

A dirty page has been modified but not yet written to disk.

    Page state transitions:

    CLEAN ───modify───► DIRTY
      ▲                   │
      │                   │ write to disk
      └───────────────────┘

Why Not Write Immediately?

  1. Batching: Accumulate multiple changes, write once
  2. Write ordering: WAL must be written first
  3. Reduced I/O: If same page modified multiple times, only write final state

Background Writer

A background process periodically writes dirty pages:

    Background Writer Loop:
    1. Sleep for configured interval
    2. Find dirty pages not recently accessed
    3. Write some to disk
    4. Mark as clean
    5. Repeat

    Benefits:
    - Spreads I/O over time (no spike at checkpoint)
    - Keeps some clean frames available for eviction

-- PostgreSQL background writer configuration
bgwriter_delay = 200ms          -- Sleep between rounds
bgwriter_lru_maxpages = 100     -- Max pages per round
bgwriter_lru_multiplier = 2.0   -- Aggressiveness
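One round of the loop above can be sketched as a function. The `dirty` and `recently_used` flags and the `write_page` callback are stand-ins for real frame metadata and disk I/O:

```python
def bgwriter_round(frames, max_pages, write_page):
    """Write some cold dirty pages to disk and mark them clean."""
    written = 0
    for frame in frames:
        if written >= max_pages:                      # bgwriter_lru_maxpages analogue
            break
        if frame["dirty"] and not frame["recently_used"]:
            write_page(frame)                         # flush page content to disk
            frame["dirty"] = False                    # now safe to evict cheaply
            written += 1
    return written

frames = [
    {"id": 1, "dirty": True,  "recently_used": False},  # written this round
    {"id": 2, "dirty": True,  "recently_used": True},   # skipped: still hot
    {"id": 3, "dirty": False, "recently_used": False},  # skipped: already clean
]
n = bgwriter_round(frames, max_pages=100, write_page=lambda f: None)
print(n)                    # 1
print(frames[0]["dirty"])   # False
```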

12.7 Buffer Pool Sizing

How much memory should the buffer pool get?

General Guidelines

    Dedicated database server:
    Buffer pool = 70-80% of total RAM

    Shared server:
    Buffer pool = 25-40% of total RAM

    Leave room for:
    - OS file cache
    - Connection memory
    - Sort/hash operations
    - Other processes

PostgreSQL

-- Shared buffers (main buffer pool)
shared_buffers = 8GB

-- Typical sizing (dedicated server with >1GB RAM):
-- start at 25% of RAM; values beyond ~40% rarely help

-- Check current usage
SELECT
    pg_size_pretty(count(*) * 8192) as buffer_usage
FROM pg_buffercache;

MySQL InnoDB

-- InnoDB buffer pool
innodb_buffer_pool_size = 12G

-- Typical sizing:
-- 70-80% of available RAM on dedicated server

-- Multiple instances for parallelism
innodb_buffer_pool_instances = 8

12.8 Buffer Pool Hit Ratio

The hit ratio measures cache effectiveness:

    Hit Ratio = Buffer Hits / (Buffer Hits + Disk Reads)

    Target: > 99% for OLTP workloads
    Acceptable: > 95% for mixed workloads
    Concerning: < 90%
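The formula as a function, with made-up numbers for illustration:

```python
def hit_ratio(hits, reads):
    """Hit ratio as a percentage; `reads` counts pages fetched from disk."""
    total = hits + reads
    return 100.0 * hits / total if total else 0.0

print(hit_ratio(990_000, 10_000))   # 99.0 - healthy OLTP cache
print(hit_ratio(850_000, 150_000))  # 85.0 - concerning, investigate
```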

PostgreSQL

SELECT
    sum(blks_hit) as hits,
    sum(blks_read) as reads,
    sum(blks_hit) * 100.0 / nullif(sum(blks_hit) + sum(blks_read), 0) as ratio
FROM pg_stat_database;

MySQL InnoDB

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';

-- Calculate hit ratio
-- Innodb_buffer_pool_read_requests /
-- (Innodb_buffer_pool_read_requests + Innodb_buffer_pool_reads)

When Hit Ratio Is Low

  1. Buffer pool too small: Increase memory
  2. Working set too large: Data doesn’t fit
  3. Sequential scans: Full table scans evict useful pages
  4. Cold start: Cache is still warming up

12.9 Multiple Buffer Pools

Large databases use multiple buffer pools for scalability:

                    MULTIPLE BUFFER POOLS

    ┌─────────────────────────────────────────────────────────────┐
    │ Pages are assigned to pools by hashing the page ID         │
    │                                                              │
    │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
    │   │ Buffer Pool │  │ Buffer Pool │  │ Buffer Pool │        │
    │   │     #1      │  │     #2      │  │     #3      │        │
    │   │             │  │             │  │             │        │
    │   │  Latch #1   │  │  Latch #2   │  │  Latch #3   │        │
    │   └─────────────┘  └─────────────┘  └─────────────┘        │
    │                                                              │
    │   Reduces contention on page table latch                    │
    └─────────────────────────────────────────────────────────────┘

-- MySQL: Multiple InnoDB buffer pool instances
innodb_buffer_pool_instances = 8

-- Pages hash to instances, reducing mutex contention

12.10 Page Prefetching

Prefetching loads pages before they’re requested:

Sequential Prefetch

Detect sequential access, load next pages:

    Read pages: 1, 2, 3, 4, ...

    System detects sequential pattern
    Prefetch pages: 5, 6, 7, 8, ... in background

    When query needs page 5: Already in memory!

Index Prefetch

When traversing an index, prefetch heap pages:

    Index scan returns row pointers:
    (page 5, slot 1), (page 12, slot 3), (page 5, slot 7), (page 20, slot 2)

    Sort by page: 5, 5, 12, 20
    Prefetch pages 5, 12, 20 in sorted order

    Reduces random I/O
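The sort-by-page step above is simple to sketch; `prefetch_order` is an illustrative name, and deduplicating pages is a natural extra step since page 5 appears twice:

```python
def prefetch_order(row_pointers):
    """Given (page, slot) row pointers, return unique heap pages in sorted order."""
    return sorted({page for page, _slot in row_pointers})

ptrs = [(5, 1), (12, 3), (5, 7), (20, 2)]
print(prefetch_order(ptrs))   # [5, 12, 20] - one ordered prefetch pass
```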

PostgreSQL

-- effective_io_concurrency: prefetch depth
SET effective_io_concurrency = 200;  -- Good for SSDs

-- Controls bitmap heap scan prefetching

12.11 Buffer Pool and Large Operations

Bulk Loads

Loading large amounts of data shouldn’t evict the entire working set:

-- PostgreSQL: Use COPY, which is optimized
COPY users FROM '/data/users.csv' WITH CSV;

-- Or use unlogged tables for intermediate data
CREATE UNLOGGED TABLE staging (...);

Large Sequential Scans

PostgreSQL uses a small ring buffer for large sequential scans:

    Normal query: Uses main buffer pool
    Large sequential scan: Uses 256KB ring buffer

    Prevents one big scan from evicting everything

Sort Operations

When sorting exceeds work_mem, spill to disk:

-- PostgreSQL work_mem
SET work_mem = '256MB';  -- Per-operation sort memory

-- MySQL sort buffer
SET sort_buffer_size = 256 * 1024 * 1024;

12.12 Observing Buffer Pool Behavior

PostgreSQL pg_buffercache

-- Enable extension
CREATE EXTENSION pg_buffercache;

-- See what's in the buffer pool
SELECT
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  AND c.relname NOT LIKE 'pg_%'
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;

MySQL InnoDB Buffer Pool Stats

-- Buffer pool status
SHOW ENGINE INNODB STATUS\G

-- Look for "BUFFER POOL AND MEMORY" section:
-- Buffer pool size
-- Free buffers
-- Database pages
-- Modified db pages
-- Pages read, created, written

12.13 Double Buffering Problem

When using buffered I/O, data may be cached twice:

    Database Buffer Pool: Page 42
    OS Page Cache: Page 42 (same data!)

    Memory wasted!

Solutions

Direct I/O: Bypass OS cache entirely

-- MySQL InnoDB (common recommendation; default in MySQL 8.4+)
innodb_flush_method = O_DIRECT

Coordinated caching: Tell OS not to cache

posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);

Accept double buffering: For small databases, not a big deal


12.14 Summary

The buffer pool is crucial for database performance:

  • Purpose: Cache pages in memory to avoid disk I/O
  • Page table: Maps page IDs to buffer frames
  • Replacement policies: LRU, Clock, 2Q balance hit rate and overhead
  • Dirty pages: Modified pages written by background writer
  • Sizing: Typically 25-80% of RAM depending on workload
  • Hit ratio: Target >99% for OLTP
  • Prefetching: Load pages before they’re needed

Understanding the buffer pool helps you:

  • Size memory appropriately
  • Diagnose cache-related performance issues
  • Understand why queries are slow after restart
  • Tune for your specific workload

What’s Next

In Chapter 13, we’ll explore recovery and crash safety—how databases ensure durability and recover from failures.


“A well-tuned buffer pool makes your database feel like everything is in memory. A poorly-tuned one makes everything feel like tape.”