Chapter 10: Query Parsing and Planning
“SQL is declarative: you say what you want, not how to get it. The planner figures out the how.”
When you write SQL, you express intent—“give me users older than 30 sorted by name.” The database must figure out how to actually retrieve that data. This transformation from SQL text to executable plan happens through parsing, analysis, and planning.
10.1 The Query Processing Pipeline
QUERY PROCESSING STAGES
"SELECT name FROM users WHERE age > 30 ORDER BY name"
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. LEXER/TOKENIZER │
│ Break into tokens: SELECT, name, FROM, users, WHERE, ... │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. PARSER │
│ Build Abstract Syntax Tree (AST) from tokens │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. ANALYZER/BINDER │
│ Resolve names, check types, validate semantics │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. REWRITER │
│ Apply rules: views, RLS policies, query rewriting │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. PLANNER/OPTIMIZER │
│ Generate and evaluate execution plans, choose best │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. EXECUTOR │
│ Run the chosen plan, return results │
└─────────────────────────────────────────────────────────────┘
10.2 Lexical Analysis (Tokenization)
The lexer breaks SQL text into tokens:
SELECT name, age FROM users WHERE age > 30
Tokens:
┌──────────┬─────────────┐
│ Type │ Value │
├──────────┼─────────────┤
│ KEYWORD │ SELECT │
│ IDENT │ name │
│ COMMA │ , │
│ IDENT │ age │
│ KEYWORD │ FROM │
│ IDENT │ users │
│ KEYWORD │ WHERE │
│ IDENT │ age │
│ OPERATOR │ > │
│ NUMBER │ 30 │
└──────────┴─────────────┘
Handling Strings and Identifiers
SELECT "column-with-dash" FROM 'my_table'
WHERE name = 'O''Brien' -- Escaped quote
The lexer handles:
- Quoted identifiers (
"column-with-dash") - String literals (
'O''Brien') - Escape sequences
- Comments (
-- comment,/* block */)
10.3 Parsing: Building the AST
The parser uses grammar rules to build an Abstract Syntax Tree:
ABSTRACT SYNTAX TREE
SelectStmt
/ | \
/ | \
target from where
list clause clause
| | |
┌────┴────┐ │ BinaryOp
│ │ │ (>)
ColRef ColRef │ / \
(name) (age) │ ColRef Const
│ (age) (30)
TableRef
(users)
Parser Implementation
Most databases use parser generators (Bison/Yacc) with grammar rules:
select_stmt:
SELECT target_list FROM from_clause where_clause
;
target_list:
target_entry
| target_list ',' target_entry
;
where_clause:
/* empty */
| WHERE expr
;
Syntax Errors
The parser catches syntax errors:
SELCT name FROM users;
-- Error: syntax error at or near "SELCT"
SELECT FROM users;
-- Error: syntax error at or near "FROM" (missing column list)
10.4 Semantic Analysis (Binding)
The analyzer validates that the AST makes sense:
Name Resolution
SELECT name FROM users WHERE age > 30;
- Does table
usersexist? - Does
usershave columnname? - Does
usershave columnage? - Is comparing
ageto30valid (types compatible)?
Type Checking
SELECT name FROM users WHERE name > 30;
-- PostgreSQL: Actually allowed (compares text to integer after cast)
SELECT name FROM users WHERE name > ARRAY[1,2,3];
-- Error: operator does not exist: text > integer[]
Ambiguity Resolution
SELECT name FROM users, orders WHERE id = order_id;
-- Which table's id? Must specify: users.id or orders.id
Output Schema
The analyzer determines the result’s schema:
SELECT name, age + 1 AS next_age FROM users;
-- Result columns:
-- name: text
-- next_age: integer (expression type)
10.5 Query Rewriting
Before optimization, databases apply rewriting rules:
View Expansion
-- View definition
CREATE VIEW active_users AS
SELECT * FROM users WHERE status = 'active';
-- Original query
SELECT name FROM active_users WHERE age > 30;
-- After expansion
SELECT name FROM users WHERE status = 'active' AND age > 30;
Subquery Flattening
-- Original
SELECT * FROM orders WHERE customer_id IN (
SELECT id FROM customers WHERE country = 'USA'
);
-- Rewritten as join
SELECT orders.* FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE customers.country = 'USA';
Predicate Pushdown
-- Original
SELECT * FROM (
SELECT * FROM orders WHERE year = 2024
) sub
WHERE amount > 100;
-- Rewritten
SELECT * FROM orders WHERE year = 2024 AND amount > 100;
Row-Level Security (RLS)
-- Policy: users can only see their own data
CREATE POLICY user_isolation ON accounts
USING (user_id = current_user_id());
-- Original query
SELECT * FROM accounts;
-- After RLS expansion
SELECT * FROM accounts WHERE user_id = current_user_id();
10.6 Logical vs Physical Plans
Logical Plan
Describes what operations are needed, abstractly:
Logical Plan:
┌─────────────────────────────────────────────────────────────┐
│ Project [name] │
│ └── Filter [age > 30] │
│ └── Scan [users] │
└─────────────────────────────────────────────────────────────┘
Physical Plan
Describes how to execute, concretely:
Physical Plan:
┌─────────────────────────────────────────────────────────────┐
│ Project [name] │
│ └── Filter [age > 30] │
│ └── Index Scan [idx_users_age] │
│ range: age > 30 │
└─────────────────────────────────────────────────────────────┘
The same logical plan might have many possible physical plans.
10.7 Plan Operators
Scan Operators
Sequential Scan: Read all rows in table
Index Scan: Use index to find rows
Index Only Scan: Get data from index alone
Bitmap Scan: Build bitmap from index, then scan heap
Join Operators
Nested Loop: For each row in A, scan B
Hash Join: Build hash table from A, probe with B
Merge Join: Merge two sorted inputs
Other Operators
Sort: Order rows by key
Aggregate: Compute SUM, COUNT, etc.
Limit: Return only N rows
Unique: Remove duplicates
Append: Concatenate results (UNION)
Example Plan Tree
SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
ORDER BY COUNT(*) DESC
LIMIT 10;
Plan:
Limit (10)
└── Sort (count DESC)
└── Aggregate (GROUP BY d.name)
└── Hash Join (e.dept_id = d.id)
├── Seq Scan (employees)
└── Hash
└── Seq Scan (departments)
10.8 Plan Representation in Databases
PostgreSQL EXPLAIN
EXPLAIN SELECT * FROM users WHERE age > 30;
QUERY PLAN
─────────────────────────────────────────────────────────
Seq Scan on users (cost=0.00..1.05 rows=2 width=68)
Filter: (age > 30)
EXPLAIN ANALYZE (with execution)
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30;
QUERY PLAN
─────────────────────────────────────────────────────────
Seq Scan on users (cost=0.00..1.05 rows=2 width=68)
(actual time=0.012..0.015 rows=3 loops=1)
Filter: (age > 30)
Rows Removed by Filter: 2
Planning Time: 0.045 ms
Execution Time: 0.029 ms
MySQL EXPLAIN
EXPLAIN SELECT * FROM users WHERE age > 30;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | users | ALL | idx_age | NULL | NULL | NULL | 1000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
10.9 Understanding EXPLAIN Output
Cost Estimates
Seq Scan on users (cost=0.00..1.05 rows=2 width=68)
│ │ │ │
│ │ │ └── Average row width (bytes)
│ │ └── Estimated rows returned
│ └── Total cost
└── Startup cost
Startup cost: Cost before first row can be returned Total cost: Cost to return all rows
Node Types
Seq Scan - Full table scan
Index Scan - Use index, then fetch rows
Index Only Scan - Use index, don't fetch rows
Bitmap Index Scan - Build bitmap of matching rows
Bitmap Heap Scan - Fetch rows indicated by bitmap
Nested Loop - For each outer row, scan inner
Hash Join - Build hash table, probe
Merge Join - Merge sorted inputs
Sort - Sort rows
HashAggregate - Aggregate via hash table
GroupAggregate - Aggregate pre-sorted groups
Reading Nested Plans
Plans are trees, read bottom-up (leaves execute first):
Limit (cost=1.15..1.15 rows=1 width=68)
-> Sort (cost=1.15..1.15 rows=2 width=68)
Sort Key: age DESC
-> Seq Scan on users (cost=0.00..1.05 rows=2 width=68)
Filter: (age > 30)
Execution order:
1. Seq Scan users (filter age > 30)
2. Sort results by age DESC
3. Limit to 1 row
10.10 Prepared Statements
Parsing and planning can be cached:
-- Prepare once
PREPARE find_user (int) AS
SELECT * FROM users WHERE id = $1;
-- Execute many times (skip parse/plan)
EXECUTE find_user(42);
EXECUTE find_user(100);
EXECUTE find_user(7);
Generic vs Custom Plans
For prepared statements with parameters, databases choose:
Custom plan: Re-plan with actual parameter values. Better estimates. Generic plan: Cache plan, use for all parameter values. Less planning overhead.
First 5 executions: Custom plans (learning phase)
After 5 executions: Compare costs
If generic plan is close enough: Switch to generic
-- PostgreSQL: Control plan caching
SET plan_cache_mode = auto; -- Default: decide automatically
SET plan_cache_mode = force_generic; -- Always generic
SET plan_cache_mode = force_custom; -- Always custom
10.11 Plan Caching and Invalidation
What Gets Cached
- Prepared statement plans
- PL/pgSQL function plans
- View definitions
Invalidation Triggers
Plans are invalidated when:
- Table structure changes (ALTER TABLE)
- Indexes are created or dropped
- Statistics are updated (ANALYZE)
- Related objects are modified
-- This invalidates plans using the users table
ALTER TABLE users ADD COLUMN nickname TEXT;
CREATE INDEX idx_users_email ON users(email);
ANALYZE users;
10.12 Parser Limitations and Quirks
SQL Injection
The parser doesn’t prevent injection—you must use parameters:
-- DANGEROUS: String interpolation
"SELECT * FROM users WHERE name = '" + user_input + "'"
-- If user_input = "'; DROP TABLE users; --"
-- Parsed as:
SELECT * FROM users WHERE name = ''; DROP TABLE users; --'
-- SAFE: Parameterized query
PREPARE stmt AS SELECT * FROM users WHERE name = $1;
EXECUTE stmt('Alice');
-- $1 is always treated as a value, never as SQL
Case Sensitivity
-- These are the same column
SELECT Name FROM users;
SELECT name FROM users;
SELECT NAME FROM users;
-- Unless quoted (PostgreSQL)
SELECT "Name" FROM users; -- Different from "name"
Operator Precedence
-- What does this mean?
SELECT * FROM t WHERE a = 1 OR b = 2 AND c = 3;
-- Parsed as (AND has higher precedence):
SELECT * FROM t WHERE a = 1 OR (b = 2 AND c = 3);
-- Use explicit parentheses!
SELECT * FROM t WHERE (a = 1 OR b = 2) AND c = 3;
10.13 Summary
Query parsing and planning transforms SQL into executable operations:
- Lexer breaks SQL into tokens
- Parser builds Abstract Syntax Tree
- Analyzer validates semantics and resolves names
- Rewriter expands views and applies rules
- Planner generates execution plans
Understanding this pipeline helps you:
- Debug SQL syntax errors
- Interpret EXPLAIN output
- Understand why plans are chosen
- Use prepared statements effectively
What’s Next
In Chapter 11, we’ll dive deep into query optimization—how the planner evaluates alternatives and chooses the best execution plan.
“The parser is the bouncer at the SQL club: strict about syntax, indifferent to whether your query makes business sense.”