Chapter 10: Query Parsing and Planning

“SQL is declarative: you say what you want, not how to get it. The planner figures out the how.”

When you write SQL, you express intent—“give me users older than 30 sorted by name.” The database must figure out how to actually retrieve that data. This transformation from SQL text to executable plan happens through parsing, analysis, and planning.

10.1 The Query Processing Pipeline

                    QUERY PROCESSING STAGES

    "SELECT name FROM users WHERE age > 30 ORDER BY name"
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 1. LEXER/TOKENIZER                                          │
    │    Break into tokens: SELECT, name, FROM, users, WHERE, ... │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 2. PARSER                                                    │
    │    Build Abstract Syntax Tree (AST) from tokens              │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 3. ANALYZER/BINDER                                           │
    │    Resolve names, check types, validate semantics            │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 4. REWRITER                                                  │
    │    Apply rules: views, RLS policies, query rewriting         │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 5. PLANNER/OPTIMIZER                                         │
    │    Generate and evaluate execution plans, choose best        │
    └─────────────────────────────────────────────────────────────┘
                               │
                               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │ 6. EXECUTOR                                                  │
    │    Run the chosen plan, return results                       │
    └─────────────────────────────────────────────────────────────┘

10.2 Lexical Analysis (Tokenization)

The lexer breaks SQL text into tokens:

SELECT name, age FROM users WHERE age > 30

    Tokens:
    ┌──────────┬─────────────┐
    │ Type     │ Value       │
    ├──────────┼─────────────┤
    │ KEYWORD  │ SELECT      │
    │ IDENT    │ name        │
    │ COMMA    │ ,           │
    │ IDENT    │ age         │
    │ KEYWORD  │ FROM        │
    │ IDENT    │ users       │
    │ KEYWORD  │ WHERE       │
    │ IDENT    │ age         │
    │ OPERATOR │ >           │
    │ NUMBER   │ 30          │
    └──────────┴─────────────┘

Handling Strings and Identifiers

SELECT "column-with-dash" FROM 'my_table'
WHERE name = 'O''Brien'  -- Escaped quote

The lexer handles:

Quoted identifiers ("column-with-dash")
String literals ('O''Brien')
Escape sequences
Comments (-- comment, /* block */)

10.3 Parsing: Building the AST

The parser uses grammar rules to build an Abstract Syntax Tree:

                    ABSTRACT SYNTAX TREE

                        SelectStmt
                       /    |    \
                      /     |     \
                 target   from   where
                  list    clause  clause
                   |        |        |
              ┌────┴────┐   │    BinaryOp
              │         │   │      (>)
           ColRef   ColRef │     /   \
           (name)   (age)  │  ColRef  Const
                           │  (age)   (30)
                       TableRef
                       (users)

Parser Implementation

Most databases use parser generators (Bison/Yacc) with grammar rules:

select_stmt:
    SELECT target_list FROM from_clause where_clause
    ;

target_list:
    target_entry
    | target_list ',' target_entry
    ;

where_clause:
    /* empty */
    | WHERE expr
    ;

Syntax Errors

The parser catches syntax errors:

SELCT name FROM users;
-- Error: syntax error at or near "SELCT"

SELECT FROM users;
-- Error: syntax error at or near "FROM" (missing column list)

10.4 Semantic Analysis (Binding)

The analyzer validates that the AST makes sense:

Name Resolution

SELECT name FROM users WHERE age > 30;

Does table users exist?
Does users have column name?
Does users have column age?
Is comparing age to 30 valid (types compatible)?

Type Checking

SELECT name FROM users WHERE name > 30;
-- PostgreSQL: Actually allowed (compares text to integer after cast)

SELECT name FROM users WHERE name > ARRAY[1,2,3];
-- Error: operator does not exist: text > integer[]

Ambiguity Resolution

SELECT name FROM users, orders WHERE id = order_id;
-- Which table's id? Must specify: users.id or orders.id

Output Schema

The analyzer determines the result’s schema:

SELECT name, age + 1 AS next_age FROM users;
-- Result columns:
--   name: text
--   next_age: integer (expression type)

10.5 Query Rewriting

Before optimization, databases apply rewriting rules:

View Expansion

-- View definition
CREATE VIEW active_users AS
    SELECT * FROM users WHERE status = 'active';

-- Original query
SELECT name FROM active_users WHERE age > 30;

-- After expansion
SELECT name FROM users WHERE status = 'active' AND age > 30;

Subquery Flattening

-- Original
SELECT * FROM orders WHERE customer_id IN (
    SELECT id FROM customers WHERE country = 'USA'
);

-- Rewritten as join
SELECT orders.* FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'USA';

Predicate Pushdown

-- Original
SELECT * FROM (
    SELECT * FROM orders WHERE year = 2024
) sub
WHERE amount > 100;

-- Rewritten
SELECT * FROM orders WHERE year = 2024 AND amount > 100;

Row-Level Security (RLS)

-- Policy: users can only see their own data
CREATE POLICY user_isolation ON accounts
    USING (user_id = current_user_id());

-- Original query
SELECT * FROM accounts;

-- After RLS expansion
SELECT * FROM accounts WHERE user_id = current_user_id();

10.6 Logical vs Physical Plans

Logical Plan

Describes what operations are needed, abstractly:

    Logical Plan:
    ┌─────────────────────────────────────────────────────────────┐
    │ Project [name]                                               │
    │   └── Filter [age > 30]                                     │
    │         └── Scan [users]                                    │
    └─────────────────────────────────────────────────────────────┘

Physical Plan

Describes how to execute, concretely:

    Physical Plan:
    ┌─────────────────────────────────────────────────────────────┐
    │ Project [name]                                               │
    │   └── Filter [age > 30]                                     │
    │         └── Index Scan [idx_users_age]                      │
    │               range: age > 30                                │
    └─────────────────────────────────────────────────────────────┘

The same logical plan might have many possible physical plans.

10.7 Plan Operators

Scan Operators

    Sequential Scan:    Read all rows in table
    Index Scan:         Use index to find rows
    Index Only Scan:    Get data from index alone
    Bitmap Scan:        Build bitmap from index, then scan heap

Join Operators

    Nested Loop:        For each row in A, scan B
    Hash Join:          Build hash table from A, probe with B
    Merge Join:         Merge two sorted inputs

Other Operators

    Sort:               Order rows by key
    Aggregate:          Compute SUM, COUNT, etc.
    Limit:              Return only N rows
    Unique:             Remove duplicates
    Append:             Concatenate results (UNION)

Example Plan Tree

SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
ORDER BY COUNT(*) DESC
LIMIT 10;

    Plan:
    Limit (10)
      └── Sort (count DESC)
            └── Aggregate (GROUP BY d.name)
                  └── Hash Join (e.dept_id = d.id)
                        ├── Seq Scan (employees)
                        └── Hash
                              └── Seq Scan (departments)

10.8 Plan Representation in Databases

PostgreSQL EXPLAIN

EXPLAIN SELECT * FROM users WHERE age > 30;

                        QUERY PLAN
─────────────────────────────────────────────────────────
 Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
   Filter: (age > 30)

EXPLAIN ANALYZE (with execution)

EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;

                        QUERY PLAN
─────────────────────────────────────────────────────────
 Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                    (actual time=0.012..0.015 rows=3 loops=1)
   Filter: (age > 30)
   Rows Removed by Filter: 2
 Planning Time: 0.045 ms
 Execution Time: 0.029 ms

MySQL EXPLAIN

EXPLAIN SELECT * FROM users WHERE age > 30;

+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | users | ALL  | idx_age       | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

10.9 Understanding EXPLAIN Output

Cost Estimates

Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                         │      │    │      │
                         │      │    │      └── Average row width (bytes)
                         │      │    └── Estimated rows returned
                         │      └── Total cost
                         └── Startup cost

Startup cost: Cost before first row can be returned Total cost: Cost to return all rows

Node Types

    Seq Scan         - Full table scan
    Index Scan       - Use index, then fetch rows
    Index Only Scan  - Use index, don't fetch rows
    Bitmap Index Scan - Build bitmap of matching rows
    Bitmap Heap Scan - Fetch rows indicated by bitmap

    Nested Loop      - For each outer row, scan inner
    Hash Join        - Build hash table, probe
    Merge Join       - Merge sorted inputs

    Sort             - Sort rows
    HashAggregate    - Aggregate via hash table
    GroupAggregate   - Aggregate pre-sorted groups

Reading Nested Plans

Plans are trees, read bottom-up (leaves execute first):

    Limit  (cost=1.15..1.15 rows=1 width=68)
      ->  Sort  (cost=1.15..1.15 rows=2 width=68)
            Sort Key: age DESC
            ->  Seq Scan on users  (cost=0.00..1.05 rows=2 width=68)
                  Filter: (age > 30)

    Execution order:
    1. Seq Scan users (filter age > 30)
    2. Sort results by age DESC
    3. Limit to 1 row

10.10 Prepared Statements

Parsing and planning can be cached:

-- Prepare once
PREPARE find_user (int) AS
    SELECT * FROM users WHERE id = $1;

-- Execute many times (skip parse/plan)
EXECUTE find_user(42);
EXECUTE find_user(100);
EXECUTE find_user(7);

Generic vs Custom Plans

For prepared statements with parameters, databases choose:

Custom plan: Re-plan with actual parameter values. Better estimates. Generic plan: Cache plan, use for all parameter values. Less planning overhead.

    First 5 executions: Custom plans (learning phase)
    After 5 executions: Compare costs
    If generic plan is close enough: Switch to generic

-- PostgreSQL: Control plan caching
SET plan_cache_mode = auto;           -- Default: decide automatically
SET plan_cache_mode = force_generic;  -- Always generic
SET plan_cache_mode = force_custom;   -- Always custom

10.11 Plan Caching and Invalidation

What Gets Cached

Prepared statement plans
PL/pgSQL function plans
View definitions

Invalidation Triggers

Plans are invalidated when:

Table structure changes (ALTER TABLE)
Indexes are created or dropped
Statistics are updated (ANALYZE)
Related objects are modified

-- This invalidates plans using the users table
ALTER TABLE users ADD COLUMN nickname TEXT;
CREATE INDEX idx_users_email ON users(email);
ANALYZE users;

10.12 Parser Limitations and Quirks

SQL Injection

The parser doesn’t prevent injection—you must use parameters:

-- DANGEROUS: String interpolation
"SELECT * FROM users WHERE name = '" + user_input + "'"

-- If user_input = "'; DROP TABLE users; --"
-- Parsed as:
SELECT * FROM users WHERE name = ''; DROP TABLE users; --'

-- SAFE: Parameterized query
PREPARE stmt AS SELECT * FROM users WHERE name = $1;
EXECUTE stmt('Alice');
-- $1 is always treated as a value, never as SQL

Case Sensitivity

-- These are the same column
SELECT Name FROM users;
SELECT name FROM users;
SELECT NAME FROM users;

-- Unless quoted (PostgreSQL)
SELECT "Name" FROM users;  -- Different from "name"

Operator Precedence

-- What does this mean?
SELECT * FROM t WHERE a = 1 OR b = 2 AND c = 3;

-- Parsed as (AND has higher precedence):
SELECT * FROM t WHERE a = 1 OR (b = 2 AND c = 3);

-- Use explicit parentheses!
SELECT * FROM t WHERE (a = 1 OR b = 2) AND c = 3;

10.13 Summary

Query parsing and planning transforms SQL into executable operations:

Lexer breaks SQL into tokens
Parser builds Abstract Syntax Tree
Analyzer validates semantics and resolves names
Rewriter expands views and applies rules
Planner generates execution plans

Understanding this pipeline helps you:

Debug SQL syntax errors
Interpret EXPLAIN output
Understand why plans are chosen
Use prepared statements effectively

What’s Next

In Chapter 11, we’ll dive deep into query optimization—how the planner evaluates alternatives and chooses the best execution plan.

“The parser is the bouncer at the SQL club: strict about syntax, indifferent to whether your query makes business sense.”

Keyboard shortcuts

Database Internals: Where Your Data Actually Lives