
Why Reinforcement Learning in Q? Why Not.

There are two kinds of kdb+/q developers. Those who have looked at the existing RL literature and thought “this would map beautifully to an array language”, and those who haven’t looked at it yet. You’re reading this, so you’re either in the first group or you’re about to join it.

Reinforcement learning is the branch of machine learning concerned with agents that learn by doing. An agent observes a state, takes an action, receives a reward, and updates its beliefs about what actions are good. Repeat until convergence, publication, or deadline—whichever comes first.

The canonical reference is Sutton and Barto’s Reinforcement Learning: An Introduction (second edition, 2018, freely available at incompleteideas.net). This book follows that text chapter by chapter, implementing every algorithm in kdb+/q. If you have the S&B book open in one window, you should be able to read this and understand exactly which equation we’re implementing and why.

Why Q Is Actually Good at This

The standard argument for Python is its ecosystem. NumPy, PyTorch, Gymnasium, stable-baselines3—it’s all there. The counterargument is that you are a kdb+/q developer and you don’t live in Python. More substantively:

Arrays are first-class citizens. RL algorithms operate on tables of state-action-value estimates, batch updates to value functions, and vectorised probability distributions. In q, this is not boilerplate. It is syntax.

The tabular methods that dominate the early S&B chapters are literally table operations. A Q-table is a dictionary. A value function is a list. Policy evaluation is a matrix operation. Q makes these feel natural rather than imported.

q’s terseness matches the mathematical notation. When Sutton & Barto write \(Q(s,a) \leftarrow Q(s,a) + \alpha[R + \gamma \max_{a'} Q(s',a') - Q(s,a)]\), the q implementation is shorter than the LaTeX. This is not a coincidence.
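To make that concrete, here is a one-line tabular Q-learning step. This is an illustrative sketch only: `qlUpdate` is a hypothetical name, not code used later in the book, and Q is assumed to be a states × actions float matrix.

```q
/ Illustrative sketch: one tabular Q-learning update on a states x actions matrix
/ (qlUpdate is a hypothetical helper, not part of the book's toolkit)
qlUpdate:{[Q;s;a;r;alpha;gamma;s2]
  .[Q;(s;a);+;alpha*(r+gamma*max Q s2)-Q[s;a]]
  }
```

Compare the body with the LaTeX above: the structure is term-for-term identical.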

Performance. The tabular methods in this book run in microseconds per step. The q implementations are fast enough to run thousands of episodes in interactive development time.

What This Book Is Not

It is not a gentle introduction to reinforcement learning. Read S&B for that—it’s excellent. It is not an introduction to kdb+/q. If imax, {x+y}/, and .[t;(i;col);:;v] require explanation before you can continue, spend some time with the q language reference first.

It is also not a comprehensive treatment of deep RL. Deep Q-Networks, policy gradient with neural networks, and transformer-based agents are beyond scope. This book covers the classical, tabular, and linear-approximation methods from S&B chapters 2–13. These are the foundations. They also happen to be complete and correct in a way that deep methods still aspire to be.

How to Use This Book

Each chapter corresponds roughly to one or two chapters of S&B and implements the algorithms from that section. The code is meant to run. Load it into a q session and experiment. Change the learning rate. Break the algorithm. Observe what happens.

The environments are deliberately simple—GridWorlds, Blackjack, Cliff Walking. These are the same environments S&B use. If your output matches the figures in that book, you’ve understood the algorithm. If it doesn’t, you’ve found something more interesting: a bug, a subtlety, or an insight.

Setup

You need kdb+/q. The personal (32-bit) edition from kx.com is sufficient for every example in this book. Each chapter is self-contained: copy the code into a .q file, load it with \l filename.q, or paste it directly into a q session.

We define one utility at the start of almost every chapter:

/ Box-Muller transform: standard normal sample N(0,1)
pi:acos -1f
normal:{[] sqrt[-2f*log rand 1.0f]*cos[2f*pi*rand 1.0f]}

This generates normally-distributed random numbers, which appear everywhere in RL experiments. q lacks a built-in normal sampler, so we bring our own. Box-Muller is exact, not approximate, and fast enough for our purposes.
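As a quick sanity check (a throwaway snippet, not part of the book’s toolkit), the sample mean and standard deviation of a large batch should land close to 0 and 1:

```q
/ Box-Muller normal sampler, as defined above
pi:acos -1f
normal:{[] sqrt[-2f*log rand 1.0f]*cos[2f*pi*rand 1.0f]}

/ Sample statistics of 100k draws: close to N(0,1), varying run to run
xs:{normal[]} each til 100000;
(avg xs; dev xs)    / both within a few hundredths of 0 and 1
```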

The rest is just algorithms.

Multi-Armed Bandits: Exploration, Exploitation, and Regret

S&B Chapter 2

The multi-armed bandit problem is RL’s hello world. You have \(k\) slot machines (arms). Each arm \(a\) has a true expected reward \(q^*(a)\) that you don’t know. Every time step, you pull an arm and observe a noisy reward. Your goal is to maximise total reward over time.

The fundamental tension: to know which arm is best, you must explore. But exploration takes time you could spend exploiting the arm you currently believe is best. Every moment spent exploring is potential reward foregone—this is regret. Every moment spent exploiting a suboptimal arm is also regret. You cannot win, only manage.

This tension—exploration vs. exploitation—is the central problem of reinforcement learning. It never goes away. Later chapters add states, transitions, and discount factors, but the bandit dilemma remains embedded in every algorithm.
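Regret is also easy to make concrete. The sketch below assumes we know the true arm values, which we do in simulation and never in production; `regret` is a hypothetical helper, not part of the testbed code that follows.

```q
/ Cumulative pseudo-regret: shortfall of each pulled arm's true value
/ against the best arm's true value (regret is a hypothetical helper)
regret:{[qStar;arms] sums (max qStar)-qStar arms}

/ Example: arm 1 is best; regret accumulates only on suboptimal pulls
regret[0.1 0.9 0.5; 1 0 1 2 1]    / 0 0.8 0.8 1.2 1.2
```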

The 10-Armed Testbed

S&B’s canonical benchmark (Figure 2.2): \(k = 10\) arms, true values \(q^*(a) \sim \mathcal{N}(0, 1)\), rewards \(R \sim \mathcal{N}(q^*(a), 1)\). Run for 1000 steps. Average over 2000 independent bandit problems to smooth the noise.

/ ============================================================
/ S&B Chapter 2 — Multi-Armed Bandit Testbed
/ ============================================================

pi:acos -1f
normal:{[] sqrt[-2f*log rand 1.0f]*cos[2f*pi*rand 1.0f]}

/ Create a k-armed bandit: sample true values from N(0,1)
initBandit:{[k]
  qStar:{normal[]} each til k;
  `k`qStar`optimal!(k; qStar; imax qStar)
  }

/ Pull arm a: reward ~ N(q*(a), 1)
pullArm:{[bandit;a]
  bandit[`qStar][a] + normal[]
  }

Epsilon-Greedy

The simplest strategy: with probability \(\varepsilon\), choose a random arm (explore); otherwise, choose the arm with the highest estimated value (exploit).

We track \(Q(a)\)—the sample mean of rewards from arm \(a\)—using the incremental update formula:

\[Q_{n+1}(a) = Q_n(a) + \frac{1}{N_n(a)}\left[R_n - Q_n(a)\right]\]

This is algebraically equivalent to computing a running average but requires only \(O(1)\) memory. No stored history, no growing lists.
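A quick check of that equivalence, as scratch code assuming nothing beyond the formula above:

```q
/ Fold the incremental update over a reward stream, carrying (Q;N)
rs:1.2 0.4 -0.3 0.8f;
inc:first (0f;0) {[qn;r] n:qn[1]+1; (qn[0]+(r-qn[0])%n; n)}/ rs;

/ Agrees with the direct sample mean up to floating-point error
1e-9 > abs inc - avg rs    / 1b
```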

/ Agent state: estimated Q values, action counts, step counter
initAgent:{[k;eps]
  `k`eps`Q`N`t!(k; eps; k#0f; k#0i; 0i)
  }

/ Epsilon-greedy action selection
selectAction:{[agent]
  $[rand[1.0f] < agent`eps;
    first 1?agent`k;    / explore: uniform random arm
    imax agent`Q        / exploit: greedy arm
  ]}

/ Incremental mean update: Q[a] <- Q[a] + (R - Q[a]) / N[a]
updateQ:{[agent;a;r]
  n1:1i + agent[`N][a];
  dq:(r - agent[`Q][a]) % n1;
  agent,`N`Q`t!(@[agent`N;a;:;n1]; @[agent`Q;a;+;dq]; agent[`t]+1i)
  }

/ One step: act, observe, update; return augmented state
stepAgent:{[bandit;state]
  a:selectAction state;
  r:pullArm[bandit;a];
  upd:updateQ[state;a;r];
  upd,`_r`_opt!(r; a=bandit`optimal)
  }

The incremental update equation is the algebraic heart of everything that follows. The update rule \(Q \leftarrow Q + \alpha(R - Q)\) says: move your estimate toward the new observation by a fraction \(\alpha\) of the error \((R - Q)\). When \(\alpha = 1/N\), this recovers the sample mean exactly. In later chapters, we’ll use a fixed \(\alpha\) to weight recent rewards more heavily—useful when the environment changes over time.

/ Run n steps using q's scan operator (\)
/ scan applies stepAgent repeatedly, returning all intermediate states
runExperiment:{[bandit;eps;nSteps]
  s0:initAgent[bandit`k; eps],`_r`_opt!(0f;0b);
  steps:1_ (nSteps (stepAgent[bandit;])\ s0);   / drop initial dummy state
  `rewards`optimal!({x`_r} each steps; {x`_opt} each steps)
  }

/ Average over nRuns independent bandit problems
runAveraged:{[k;eps;nSteps;nRuns]
  results:{runExperiment[initBandit[x]; y; z]}[k;;nSteps] each nRuns#eps;
  avgRewards: avg {x`rewards} each results;
  pctOptimal: avg {`float$x`optimal} each results;
  `avgRewards`pctOptimal!(avgRewards;pctOptimal)
  }

Running the testbed:

/ Reproduce S&B Figure 2.2
k:10; nSteps:1000; nRuns:2000

r000:runAveraged[k; 0.00f; nSteps; nRuns];   / greedy
r010:runAveraged[k; 0.01f; nSteps; nRuns];   / epsilon = 0.01
r100:runAveraged[k; 0.10f; nSteps; nRuns];   / epsilon = 0.10

/ Average reward at final step
r000[`avgRewards][nSteps-1]   / ~1.0  (gets stuck on suboptimal arm)
r010[`avgRewards][nSteps-1]   / ~1.3
r100[`avgRewards][nSteps-1]   / ~1.4  (best short-term, but keeps exploring)

The greedy agent (\(\varepsilon=0\)) converges fastest initially but gets trapped. It commits to the first arm that looks good and never learns it might have been wrong. The \(\varepsilon=0.1\) agent explores more, learns better, and eventually outperforms—but wastes 10% of its time forever on random pulls.

Upper Confidence Bound (UCB)

Epsilon-greedy explores randomly. UCB explores intelligently: prefer arms that either look good or haven’t been tried much. The action selection rule (S&B Equation 2.10):

\[A_t = \operatorname{argmax}_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]\]

The second term is the uncertainty bonus. An arm pulled rarely has high \(\sqrt{\ln t / N_t(a)}\), so it gets selected until we know enough about it. As \(N_t(a)\) grows, the bonus shrinks. \(c\) controls how aggressively we explore.

/ UCB action selection
/ Handle division by zero: never-pulled arms get priority (bonus = +inf)
selectUCB:{[c;agent]
  t:agent`t;
  N:agent`N;
  Q:agent`Q;
  / Arms with N=0 must be tried first (set bonus to infinity)
  bonus:$[any N=0i;
    {$[x=0i;0w;0f]} each N;                  / 0w = positive infinity
    c*sqrt[log[`float$t] % `float$N]
    ];
  imax Q+bonus
  }

/ UCB agent step
stepUCB:{[bandit;c;state]
  a:selectUCB[c;state];
  r:pullArm[bandit;a];
  upd:updateQ[state;a;r];
  upd,`_r`_opt!(r; a=bandit`optimal)
  }

/ Run UCB experiment
runUCB:{[k;c;nSteps;nRuns]
  / q lambdas don't close over outer locals, so pass c and nSteps explicitly
  results:{[c;nSteps;bandit]
    s0:initAgent[bandit`k; 0f],`_r`_opt!(0f;0b);
    steps:1_(nSteps (stepUCB[bandit;c;])\ s0);
    `rewards`optimal!({x`_r}each steps; {x`_opt}each steps)
    }[c;nSteps;] each initBandit each nRuns#k;
  avgRewards:avg{x`rewards}each results;
  pctOptimal:avg{`float$x`optimal}each results;
  `avgRewards`pctOptimal!(avgRewards;pctOptimal)
  }
/ Compare UCB c=2 with epsilon-greedy on the testbed
rUCB:runUCB[10; 2f; 1000; 2000];

/ UCB typically outperforms epsilon-greedy on stationary problems
/ by pulling suboptimal arms less often once they're understood
avg rUCB`avgRewards        / typically ~1.5
avg r100`avgRewards        / epsilon=0.1: ~1.35

UCB is theoretically attractive because it has logarithmic regret—provably near-optimal for stationary problems. The catch: it requires knowing \(t\) and assumes stationarity. When the environment shifts (arm values change over time), UCB’s optimism about arms it once understood becomes a liability.

Gradient Bandit

Both methods above learn action values and derive a policy from them. Gradient bandits do it differently: learn a preference \(H_t(a)\) for each arm and convert preferences to probabilities via softmax. Update preferences by gradient ascent on expected reward.

The update rule (S&B Equation 2.12):

\[H_{t+1}(a) = H_t(a) + \alpha(R_t - \bar{R}_t)\left(\mathbf{1}_{a=A_t} - \pi_t(a)\right)\]

where \(\pi_t(a)\) is the softmax probability of arm \(a\) and \(\bar{R}_t\) is a running reward baseline.

/ Softmax of a real-valued vector
softmax:{[H]
  eH:exp H-max H;    / subtract max for numerical stability
  eH%sum eH
  }

/ Gradient bandit agent state: preferences H and reward baseline Rbar
initGradient:{[k;alpha]
  `k`alpha`H`Rbar`t!(k; alpha; k#0f; 0f; 0i)
  }

/ Select action by sampling from softmax distribution
selectGradient:{[agent]
  pi:softmax agent`H;
  / Sample from the categorical distribution via the inverse CDF
  cumPi:sums pi;
  r:rand 1.0f;
  first where cumPi >= r
  }

/ Update preferences using gradient ascent
updateGradient:{[agent;a;r]
  pi:softmax agent`H;
  alpha:agent`alpha;
  Rbar:agent`Rbar;
  delta:r-Rbar;
  t1:agent[`t]+1i;
  / One-hot indicator for chosen action
  oneHot:`float$(til agent`k)=a;
  / H[a] += alpha * (R - Rbar) * (1 - pi[a])
  / H[b] -= alpha * (R - Rbar) * pi[b]  for b != a
  newH:agent[`H]+alpha*delta*(oneHot-pi);
  / Update baseline: running mean of all rewards
  newRbar:Rbar+(r-Rbar)%t1;
  agent,`H`Rbar`t!(newH;newRbar;t1)
  }

/ Full gradient bandit step
stepGradient:{[bandit;state]
  a:selectGradient state;
  r:pullArm[bandit;a];
  upd:updateGradient[state;a;r];
  upd,`_r`_opt!(r;a=bandit`optimal)
  }

/ Run gradient bandit experiment
runGradient:{[k;alpha;nSteps;nRuns]
  / pass alpha and nSteps explicitly: q lambdas don't see outer locals
  results:{[alpha;nSteps;bandit]
    s0:initGradient[bandit`k;alpha],`_r`_opt!(0f;0b);
    steps:1_(nSteps (stepGradient[bandit;])\ s0);
    `rewards`optimal!({x`_r}each steps;{x`_opt}each steps)
    }[alpha;nSteps;] each initBandit each nRuns#k;
  `avgRewards`pctOptimal!(avg{x`rewards}each results; avg{`float$x`optimal}each results)
  }
/ alpha=0.1 typically works well; try 0.4 to see instability
rGrad:runGradient[10; 0.1f; 1000; 2000];
avg rGrad`pctOptimal    / ~75-80% optimal

Which Algorithm Wins?

None of them, universally. Epsilon-greedy is robust and simple. UCB is theoretically principled for stationary problems. Gradient bandits don’t require interpreting rewards as values—useful when the scale of rewards is unknown or varies. Parameter sensitivity differs across all three.

The bandit problem is elegantly tractable—we can analyse these algorithms mathematically and prove bounds on their regret. The moment we add states (i.e., the environment changes based on actions taken), the analysis becomes much harder. That’s the subject of the rest of this book.

For now, observe that every algorithm encodes the same intuition: reward-seeking with uncertainty. The difference is in how precisely they quantify uncertainty and how aggressively they resolve it.

/ Quick comparison summary
show ([]
  method:  `greedy`eps001`eps010`UCB`gradient;
  avgFinalReward: (
    last r000`avgRewards;
    last r010`avgRewards;
    last r100`avgRewards;
    last rUCB`avgRewards;
    last rGrad`avgRewards
  ))

The numbers won’t surprise you, but running the code and watching them emerge will. That’s the experiment.

Markov Decision Processes: States, Actions, and Rewards in Q

S&B Chapter 3

Bandits were stateless: each pull was independent, and the world didn’t change based on what you did. Real decision problems have state—the world is different after you act, and what you do now affects what’s available later.

A Markov Decision Process (MDP) formalises this structure. At each time step \(t\):

  • The agent observes state \(S_t \in \mathcal{S}\)
  • Selects action \(A_t \in \mathcal{A}(S_t)\)
  • Receives reward \(R_{t+1} \in \mathbb{R}\)
  • Transitions to \(S_{t+1}\)

The Markov property: \(P(S_{t+1}, R_{t+1} \mid S_t, A_t) = P(S_{t+1}, R_{t+1} \mid S_0, A_0, \ldots, S_t, A_t)\). The future depends on the present, not the history. This is simultaneously a strong assumption and the thing that makes MDPs tractable.

Representing an MDP in Q

An MDP needs states, actions, a transition function, and a reward function. In q, these map naturally:

  • States: integers (indices into a state space)
  • Actions: integers (indices into action sets)
  • Transition function: a q function (state; action) -> next_state
  • Reward function: a q function (state; action; next_state) -> reward
  • Terminal check: a q function state -> bool

This is not the only representation, but it’s clean and fast.

The GridWorld Environment

S&B uses a 4×4 GridWorld (Figure 3.2, Chapter 4) throughout the early chapters. Let’s build it once and use it everywhere.

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

States 0 and 15 are terminal. Actions: 0=up, 1=right, 2=down, 3=left. Moving into a wall returns to the current state. Reward is -1 for every non-terminal transition.

/ ============================================================
/ GridWorld MDP — S&B Figure 3.2 / 4.1
/ ============================================================

nRows:4; nCols:4
nStates:nRows*nCols
nActions:4  / 0=up 1=right 2=down 3=left

/ Terminal states
terminals:0 15

/ Action deltas: (row_delta; col_delta)
actionDeltas:((-1;0);(0;1);(1;0);(0;-1))  / up right down left

/ State -> (row; col)
stateToRC:{[s] (s div nCols; s mod nCols)}

/ (row; col) -> state (clamp to grid boundary)
rcToState:{[r;c]
  r2:0|r&nRows-1;
  c2:0|c&nCols-1;
  (nCols*r2)+c2
  }

/ Transition: (state; action) -> next_state
/ Terminal states absorb (stay put)
transition:{[s;a]
  if[s in terminals; :s];
  rc:stateToRC s;
  d:actionDeltas a;
  rcToState[rc[0]+d[0]; rc[1]+d[1]]
  }

/ Reward: -1 for all non-terminal transitions
reward:{[s;a;s2] -1}

/ Is state terminal?
isTerminal:{[s] s in terminals}

The Value Function

The state-value function \(v_\pi(s)\) is the expected cumulative discounted return starting from state \(s\) and following policy \(\pi\):

\[v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]\]

The discount factor \(\gamma \in [0,1]\) makes future rewards worth less than immediate rewards. \(\gamma = 0\) means the agent is completely myopic; \(\gamma = 1\) weights all future rewards equally (fine for episodic tasks, where episodes are guaranteed to terminate).

The Bellman equation rewrites this recursive definition as a consistency constraint: \[v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right]\] This says: the value of a state is the expected immediate reward plus the discounted value of wherever you end up. It’s not a formula for computing \(v_\pi\)—it’s a constraint that \(v_\pi\) must satisfy. The algorithms in Chapter 4 find the \(v_\pi\) that satisfies it.

Generating Episodes

Before we get to dynamic programming, we need to be able to interact with the environment—generate episodes by following a policy.

/ Random equiprobable policy: choose uniformly among all actions
randomPolicy:{[s] first 1?nActions}

/ One environment step: (state; action) -> (next_state; reward; done)
envStep:{[s;a]
  s2:transition[s;a];
  r:reward[s;a;s2];
  done:isTerminal s2;
  `s`r`done!(s2;r;done)
  }

/ Generate one episode following policy pi from start state s0
/ Returns table of (s, a, r, s') transitions
generateEpisode:{[policy;s0;maxSteps]
  / State for scan: (current_state; done_flag; step_count)
  / We'll collect transitions manually using while
  s:s0;
  transitions:();
  step:0;
  while[(not isTerminal s) and step<maxSteps;
    a:policy[s];
    result:envStep[s;a];
    transitions,:enlist `s`a`r`s2!(s;a;result`r;result`s);
    s:result`s;
    step+:1
  ];
  transitions        / a list of conforming dicts is already a table
  }

/ Example: generate a random episode from corner state 1
ep:generateEpisode[randomPolicy;1;1000];
count ep               / episode length
sum ep`r               / total reward (all -1s, so negative of length)

Representing Policies

A deterministic policy is a function from states to actions. In q, this is just a list indexed by state:

/ Deterministic policy: list where policy[s] = action
/ Equiprobable random policy as a function
makeRandomPolicy:{[] {first 1?nActions} each til nStates}

/ Deterministic policy from a value table: greedy w.r.t. V
/ For each state, choose the action that maximises r + gamma*V(next_state)
greedyPolicy:{[V;gamma]
  / q lambdas don't close over outer locals, so pass V, gamma, s explicitly
  {[V;gamma;s]
    $[isTerminal s; 0;
      imax {[V;gamma;s;a] reward[s;a;transition[s;a]] + gamma*V transition[s;a]}
           [V;gamma;s;] each til nActions
    ]
  }[V;gamma;] each til nStates
  }

Action Values

The action-value function \(q_\pi(s,a)\) is the expected return from taking action \(a\) in state \(s\), then following policy \(\pi\):

\[q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \,\middle|\, S_t=s, A_t=a\right]\]

In tabular methods, we often represent \(Q\) as a 2D structure: states × actions. In q:

/ Q-table: nStates x nActions matrix, initialised to 0
initQTable:{[] nStates#enlist nActions#0f}

/ Access: Q[s][a]
/ Update Q[s][a] by delta
updateQTable:{[Q;s;a;delta]
  row:@[Q[s];a;+;delta];
  @[Q;s;:;row]
  }

The Optimal Policy

The optimal value function \(v^*(s) = \max_\pi v_\pi(s)\) is the best achievable value from each state. The Bellman optimality equation:

\[v^*(s) = \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma v^*(s')\right]\]

This is nonlinear (the \(\max\) breaks linearity), so we can’t solve it as a system of equations directly. The algorithms in the next chapter—dynamic programming—solve it iteratively. Monte Carlo and TD methods approximate it from sampled experience. The rest of this book is various approaches to finding \(v^*\) without knowing the full model \(p(s’,r|s,a)\).

/ Quick demonstration: run episodes under the random policy
/ and observe the average return from each state
nEpisodes:1000
gamma:1f
returns:nStates#enlist 0#0f;       / one empty float list per state

do[nEpisodes;
  ep:generateEpisode[randomPolicy; 1+rand nStates-2; 500];  / random non-terminal start
  g::0f;                           / global return accumulator
  / Accumulate returns backwards from episode end: G <- r + gamma*G
  {[gamma;row]
    g::row[`r]+gamma*g;            / :: updates the global inside the lambda
    @[`returns;row`s;,;g];         / append this return to the state's list
    }[gamma;] each reverse ep;
  ];

/ Average return by state: should be negative (cost to reach terminal)
avgReturn:avg each returns
show nCols cut avgReturn    / display as 4x4 grid

The grid display won’t win any visualisation awards, but you’ll see the pattern: states near the terminals (0 and 15) have higher (less negative) average returns. States far from them, like states 5 and 10 towards the centre, take longer to reach a terminal under the random policy, so they accumulate more −1 rewards.

This is what dynamic programming computes exactly, without sampling. That’s next.

Dynamic Programming: Policy Evaluation and Iteration

S&B Chapter 4

Dynamic programming (DP) is the class of algorithms that compute optimal policies given a perfect model of the environment—the full transition and reward functions \(p(s’,r|s,a)\). In practice, you rarely have this. DP is nonetheless essential because it defines what every other algorithm is trying to approximate.

The “programming” in dynamic programming is not computer programming. It refers to Bellman’s 1950s usage meaning “planning” or “scheduling.” The “dynamic” refers to the time dimension. Bellman was famously good at naming things.

We’ll work entirely with the 4×4 GridWorld from the MDP chapter. Load that code first, or paste it at the top of your file.

Iterative Policy Evaluation

Given a policy \(\pi\), compute \(v_\pi\). The algorithm applies the Bellman equation as an update rule, sweeping through all states repeatedly until convergence.

Algorithm (S&B p. 75):

  1. Initialise \(V(s) = 0\) for all \(s\)
  2. For each state \(s\), update: \[V(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\]
  3. Repeat until \(\max_s |V_{\text{new}}(s) - V_{\text{old}}(s)| < \theta\)

For the equiprobable random policy (\(\pi(a|s) = 1/4\) for all \(a\)), each state’s value is the average over all actions of the reward plus discounted successor value.

/ ============================================================
/ Dynamic Programming — GridWorld (S&B Chapter 4)
/ Paste GridWorld definitions from mdp.md, or \l gridworld.q
/ ============================================================

/ Equiprobable random policy: uniform over all actions
equiprobable:nStates#enlist nActions#(1f%nActions)

/ Policy evaluation: compute V_pi for policy pi
/ pi: list of (nStates) probability distributions over actions
/ gamma: discount factor
/ theta: convergence threshold
policyEval:{[pi;gamma;theta]
  V:nStates#0f;
  delta:theta+1;   / ensure at least one sweep
  while[delta>theta;
    delta:0f;
    newV:{[V;gamma;pi;s]
      $[isTerminal s; 0f;
        sum pi[s] * {[V;gamma;s;a]
          s2:transition[s;a];
          (reward[s;a;s2]) + gamma*V[s2]
          }[V;gamma;s;] each til nActions
      ]
      }[V;gamma;pi;] each til nStates;
    delta:max abs newV-V;
    V:newV
  ];
  V
  }

Let’s replicate S&B Figure 4.1—the value function for the random policy with \(\gamma=1\):

/ Evaluate random policy with gamma=1
V_rand:policyEval[equiprobable; 1f; 0.001f]

/ Display as 4x4 grid (S&B Figure 4.1)
show nCols cut `int$V_rand
/ Expected output (approximately):
/  0 -14 -20 -22
/ -14 -18 -20 -20
/ -20 -20 -18 -14
/ -22 -20 -14   0

If your numbers match S&B’s figure, the algorithm is correct. The states adjacent to a terminal (1 and 14) take 14 steps on average to reach one under the random policy; states 5 and 10 are worse at 18 steps. State 6 in the centre requires 20. These are exact results (up to the convergence threshold), not estimates.

The aha moment: The Bellman equation isn’t a formula for computing \(v_\pi\) — it’s a fixed-point equation. The value function that satisfies it is \(v_\pi\). Iterative policy evaluation just repeatedly applies the Bellman update until the value function stops changing. The fact that this converges to the unique fixed point \(v_\pi\) is the contraction mapping theorem. You don’t need to care about the theorem; you just need to run the loop.
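You can verify the fixed-point claim numerically: apply one more Bellman backup to the converged V_rand and confirm it barely moves. This scratch check reuses the GridWorld and policy definitions above; `bellmanBackup` is a throwaway name.

```q
/ One synchronous Bellman backup for a stochastic policy pi
bellmanBackup:{[pi;gamma;V]
  {[V;gamma;pi;s]
    $[isTerminal s; 0f;
      sum pi[s]*{[V;gamma;s;a]
        s2:transition[s;a];
        reward[s;a;s2]+gamma*V s2
        }[V;gamma;s;] each til nActions
    ]
    }[V;gamma;pi;] each til nStates
  }

/ The converged value function is (nearly) a fixed point
max abs V_rand - bellmanBackup[equiprobable;1f;V_rand]    / below theta
```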

Policy Improvement

Given \(v_\pi\), we can improve the policy by acting greedily with respect to \(v_\pi\):

\[\pi'(s) = \operatorname{argmax}_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right]\]

The Policy Improvement Theorem guarantees that \(\pi’ \geq \pi\) (weakly better everywhere). If \(\pi’ = \pi\), both are optimal.

/ Policy improvement: given V, return deterministic greedy policy
/ Returns: list of actions (one per state)
policyImprove:{[V;gamma]
  / pass V, gamma, s explicitly: q lambdas don't see outer locals
  {[V;gamma;s]
    $[isTerminal s;
      0;     / terminal: action doesn't matter
      imax {[V;gamma;s;a] reward[s;a;transition[s;a]] + gamma*V transition[s;a]}
           [V;gamma;s;] each til nActions
    ]
  }[V;gamma;] each til nStates
  }

/ Convert deterministic policy to probability distribution format
detToProbPi:{[detPi]
  {[a] @[nActions#0f;a;:;1f]} each detPi
  }

Policy Iteration

Alternate between policy evaluation and policy improvement until the policy stabilises:

/ Policy iteration
/ Returns: (optimal_V; optimal_policy)
/ Bootstrap from the equiprobable random policy: with gamma=1, a random
/ deterministic policy can self-loop forever and evaluation never converges
policyIteration:{[gamma;theta]
  V:policyEval[equiprobable;gamma;theta];
  pi:policyImprove[V;gamma];            / nStates list of actions
  stable:0b;
  while[not stable;
    / Evaluate current policy
    probPi:detToProbPi pi;
    V:policyEval[probPi;gamma;theta];
    / Improve policy
    newPi:policyImprove[V;gamma];
    / Check stability
    stable:newPi~pi;
    pi:newPi
  ];
  (V;pi)
  }
/ Run policy iteration on GridWorld
result:policyIteration[1f; 0.001f];
V_opt:result[0]; pi_opt:result[1];

/ Optimal values (S&B Figure 4.2 left)
show nCols cut `int$V_opt
/  0  -1  -2  -3
/ -1  -2  -3  -2
/ -2  -3  -2  -1
/ -3  -2  -1   0

/ Optimal policy (0=up 1=right 2=down 3=left)
show nCols cut pi_opt
/ 0 3 3 2
/ 0 0 0 2
/ 0 0 1 2
/ 0 1 1 0  (multiple optimal actions are valid; S&B shows arrows)

The optimal value function is beautiful: each state’s value is exactly the negative of its Manhattan distance to the nearest terminal state. This makes sense—the optimal policy takes the shortest path, so the value is -(steps to terminal).
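That claim is checkable in a couple of lines. This scratch snippet reuses stateToRC and terminals from the GridWorld code; `manhattan` and `dist` are throwaway names.

```q
/ Negative Manhattan distance to the nearest terminal, per state
manhattan:{[s;t] sum abs stateToRC[s]-stateToRC t}
dist:{[s] min manhattan[s;] each terminals} each til nStates;

/ Should match the optimal values to within the convergence threshold
max abs V_opt - `float$neg dist    / ~0
```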

Policy iteration converges in a small number of iterations for GridWorld. The first iteration turns the random policy into something reasonable; subsequent iterations refine the edges. Track convergence:

/ Policy iteration with iteration tracking
policyIterationVerbose:{[gamma;theta]
  / bootstrap from the equiprobable policy so evaluation converges at gamma=1
  V:policyEval[equiprobable;gamma;theta];
  pi:policyImprove[V;gamma];
  iter:0;
  stable:0b;
  while[not stable;
    probPi:detToProbPi pi;
    V:policyEval[probPi;gamma;theta];
    newPi:policyImprove[V;gamma];
    stable:newPi~pi;
    pi:newPi;
    iter+:1;
    -1 "Iteration ",string[iter],": policy changed=",string not stable;
  ];
  (V;pi;iter)
  }

Value Iteration

Policy iteration evaluates each intermediate policy to full convergence. Value iteration is more aggressive: take just one Bellman update per state before improving. This converges to \(v^*\) directly without ever computing an intermediate policy:

Update rule (S&B Equation 4.10): \[V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma V(s')\right]\]

/ Value iteration: directly find v*
valueIteration:{[gamma;theta]
  V:nStates#0f;
  delta:theta+1;
  iters:0;
  while[delta>theta;
    / pass V, gamma, s explicitly: q lambdas don't see outer locals
    newV:{[V;gamma;s]
      $[isTerminal s; 0f;
        max {[V;gamma;s;a] reward[s;a;transition[s;a]] + gamma*V transition[s;a]}
            [V;gamma;s;] each til nActions
      ]
      }[V;gamma;] each til nStates;
    delta:max abs newV-V;
    V:newV;
    iters+:1
  ];
  / Extract greedy policy
  pi:policyImprove[V;gamma];
  (V;pi;iters)
  }
/ Value iteration converges in fewer total Bellman updates than policy iteration
result_vi:valueIteration[1f; 0.001f]
result_vi[2]   / number of sweeps to convergence: ~4

/ Compare against policy iteration result
max abs result_vi[0]-V_opt   / should be < theta (0.001)

Value iteration converges in far fewer sweeps because it doesn’t wait for full policy evaluation between improvements. Each sweep does comparable work per state (a \(\max\) over actions instead of a probability-weighted average), but the total computation is typically much less.

Asynchronous DP

The algorithms above sweep through all states in each iteration. Asynchronous DP updates states in any order—even one state at a time—and still converges given sufficient updates. This matters for large state spaces where full sweeps are expensive.

/ Asynchronous value iteration: update one random state per step
asyncVI:{[gamma;nUpdates]
  V:nStates#0f;
  do[nUpdates;
    s:rand nStates;          / random state selection
    if[not isTerminal s;     / skip terminals
      V[s]:max {[V;gamma;s;a] reward[s;a;transition[s;a]] + gamma*V transition[s;a]}
               [V;gamma;s;] each til nActions
    ]
  ];
  V
  }

/ With enough updates, async converges to same solution
V_async:asyncVI[1f; 50000];
max abs V_async-V_opt   / small residual error

The ordering of updates matters for convergence speed. Updating states that are frequently visited, or states whose successors are already well-estimated, is more efficient. This idea generalises to prioritised sweeping, which uses a priority queue to always update the state with the largest estimated update. We won’t implement it here, but the principle is simple: update where learning happens fastest.

Complexity

For GridWorld (16 states, 4 actions), DP is instantaneous. For larger problems:

  • Policy evaluation: \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) per sweep
  • Value iteration: same per sweep, fewer sweeps
  • Real-world state spaces: millions to billions of states—DP is infeasible

This is why the remaining chapters exist. When you can’t enumerate all states, you need to estimate values from sampled experience (Monte Carlo, TD) or approximate them with parameterised functions (Chapter 8). Everything builds on the DP foundation: we’re always trying to compute or approximate \(v^*\) or \(q^*\); we just have different amounts of information and computation available.

Monte Carlo Methods: Learning from Complete Episodes

S&B Chapter 5

Dynamic programming requires a model. Most interesting problems don’t give you one. You don’t know the transition probabilities, the reward function, or the full structure of the MDP. You only know what happened: you took an action, something occurred, you received a reward.

Monte Carlo methods learn value functions by averaging the actual returns from complete episodes. No model needed. No Bellman backup. Just sample, observe, accumulate, and average. The law of large numbers does the rest.

The cost: you must wait for an episode to finish before updating anything. RL problems with very long or infinite episodes can’t use Monte Carlo directly. For episodic tasks—games, trading sessions, discrete decision problems—MC is often the right starting point.

We’ll use Blackjack, S&B’s canonical MC environment (Section 5.1). It has a natural episode structure (each hand is one episode), a known true value function for comparison, and just enough complexity to be interesting.

Blackjack Environment

The S&B version of Blackjack:

  • Player draws cards to get as close to 21 as possible without going over
  • Dealer shows one card face up
  • Player can Hit (draw) or Stick (stop)
  • Aces count as 11 unless that would bust, in which case they count as 1
  • Natural blackjack (21 on first two cards) beats a dealer non-natural 21

State: (player_sum, dealer_showing, usable_ace) — (int 12–21, int 1–10, bool). We only care about player sums from 12 upward; below 12, always hit.

/ ============================================================
/ Blackjack Environment — S&B Section 5.1
/ ============================================================

/ Draw a card: uniform over 1-13, face cards count as 10; an ace is
/ drawn as 1 and only later promoted to 11 by the usable-ace logic
drawCard:{[] 10&1+rand 13}

/ Draw a hand of 2 cards; return (sum; has_usable_ace)
initHand:{[]
  cards:{drawCard[]} each 0 1;
  s:sum cards;
  hasAce:any cards=1;
  / Ace can be used as 11 if it doesn't bust
  $[hasAce and s+10<=21;
    (s+10;1b);       / usable ace: count it as 11
    (s;0b)
  ]}

/ Hit a hand: add one card, adjusting usable-ace status
hitHand:{[s;usableAce]
  card:drawCard[];
  s2:s+card;
  ua:usableAce;
  / A newly drawn ace counts as 11 if that doesn't bust
  if[(card=1) and s2+10<=21; s2+:10; ua:1b];
  / If bust while holding a usable ace, downgrade it: 11 -> 1
  if[(s2>21) and ua; s2-:10; ua:0b];
  (s2;ua)
  }

/ Dealer plays: hit until sum >= 17 (S&B rules)
dealerPlay:{[showing]
  / Dealer's hidden card
  hidden:drawCard[];
  s:showing+hidden;
  ua:any (showing;hidden)=1;
  / Use the ace as 11 if it doesn't bust; otherwise it isn't usable
  $[ua and s+10<=21; s+:10; ua:0b];
  while[s<17;
    res:hitHand[s;ua];
    s:res[0]; ua:res[1]
  ];
  s
  }

/ Play one episode given a policy
/ Policy: function (player_sum; dealer_showing; usable_ace) -> action
/ Action: 0=stick, 1=hit
playEpisode:{[policy]
  dealer:drawCard[];
  ph:initHand[];
  ps:ph[0]; pua:ph[1];
  / Ensure player starts with at least 12
  while[ps<12;
    res:hitHand[ps;pua];
    ps:res[0]; pua:res[1]
  ];
  / Player's turn
  trajectory:();
  done:0b;
  while[not done;
    state:(ps;dealer;pua);
    a:policy[ps;dealer;pua];
    trajectory,:enlist state;
    $[a=1;   / hit
      [res:hitHand[ps;pua];
       ps:res[0]; pua:res[1];
       done:ps>21];   / bust
      done:1b         / stick
    ]
  ];
  / Dealer plays; determine reward
  reward:$[ps>21;
    -1f;            / player busted
    [ds:dealerPlay[dealer];
     $[ds>21; 1f;   / dealer busted
       ps>ds; 1f;   / player wins
       ps=ds; 0f;   / draw
       -1f]]        / dealer wins
    ];
  `trajectory`reward`finalSum!(trajectory;reward;ps)
  }

First-Visit Monte Carlo Prediction

Given a policy \(\pi\), estimate \(v_\pi\). We run many episodes, and for each state \(s\) visited, we record the return \(G\) that followed the first visit to \(s\) in that episode. Then \(v_\pi(s) \approx \text{average}(G_{\text{first visit}})\).

Why “first visit” matters: if you visit state \(s\) twice in one episode, the returns at those two visits are correlated (the second is a subset of the first). First-visit MC only counts the first occurrence, preserving the i.i.d. sampling assumption needed for convergence. Every-visit MC also converges, but more slowly in some cases. S&B proves convergence of both.
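A toy illustration of the rule, with made-up numbers: a trajectory of states 0 1 0 2, a single terminal reward of 1, gamma 0.9. The return following step \(i\) is \(0.9^{n-1-i}\), and state 0 is credited only with the return from its first occurrence:

traj:0 1 0 2; R:1f; g:0.9;
G:R*g xexp reverse til count traj;   / return following each step: 0.729 0.81 0.9 1
firstIdx:traj?distinct traj;         / index of each state's first visit: 0 1 3
(distinct traj)!G firstIdx           / states 0 1 2 map to 0.729 0.81 1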

/ Encode Blackjack state as an integer index
/ State space: player_sum (12-21) x dealer (1-10) x usable_ace (0-1)
/ Total: 10 x 10 x 2 = 200 states
encodeState:{[ps;dealer;ua]
  (10*2*(ps-12)) + (2*(dealer-1)) + `int$ua
  }
nBJStates:200

/ First-Visit MC Prediction
/ pi: policy function (ps;dealer;ua) -> action
/ nEpisodes: number of episodes to run
firstVisitMC:{[pi;nEpisodes;gamma]
  returns:nBJStates#enlist 0#0f;   / running list of returns per state
  do[nEpisodes;
    ep:playEpisode[pi];
    traj:ep`trajectory;
    R:ep`reward;
    / All intermediate rewards are 0, so the return following step i is
    / gamma^(n-1-i) * R; with gamma=1 it is simply the final reward
    n:count traj;
    visited:0#0;
    i:0;
    while[i<n;
      s:encodeState . traj[i];
      / First-visit: only credit the first occurrence in this episode
      if[not s in visited;
        visited,:s;
        returns[s],:R*gamma xexp (n-1)-i
      ];
      i+:1
    ]
  ];
  / Average the recorded returns (0f for states never visited)
  {$[0<count x; avg x; 0f]} each returns
  }
/ Policy: always stick on 20 or 21, hit otherwise
simplePolicy:{[ps;dealer;ua] $[ps>=20;0;1]}

/ Estimate values with 10k and 500k episodes
V_10k:firstVisitMC[simplePolicy;10000;1f];
V_500k:firstVisitMC[simplePolicy;500000;1f];

/ States with usable ace: states 1, 3, 5, ... (every other)
/ Display value for player sum 20, dealer showing 5, usable ace
s20_5_ua:encodeState[20;5;1b];
V_500k[s20_5_ua]    / should be ~0.65 (favorable position)

/ Player sum 12, dealer showing 2, no usable ace
s12_2_noua:encodeState[12;2;0b];
V_500k[s12_2_noua]  / should be ~-0.29 (weak position)

The 10k estimates are noisy; the 500k estimates should closely match S&B’s Figure 5.1. With a naive always-stick-at-20 policy, the player with a usable ace and high sum does well; low sums against a strong dealer do not.

Monte Carlo Control

Prediction estimates \(v_\pi\). Control finds a good policy. For MC control, we estimate \(q_\pi(s,a)\) (action values) because we don’t have a model—we can’t compute \(\max_a \sum_{s’} p(s’|s,a)[\ldots]\) without knowing \(p\). But we can estimate \(q^*(s,a)\) directly.

We’ll use epsilon-soft policies: with probability \(\varepsilon\) take a random action, otherwise take the greedy action. This ensures all state-action pairs are visited sufficiently often.
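To see exactly what distribution an epsilon-soft (here, epsilon-greedy) policy induces over actions, a small sketch (epsGreedyProbs is a hypothetical helper and the Q row is made up):

/ Greedy action gets 1-eps+eps/nA; every other action gets eps/nA
epsGreedyProbs:{[eps;qrow]
  p:count[qrow]#eps%count qrow;
  p[qrow?max qrow]+:1f-eps;
  p}
epsGreedyProbs[0.1;0.5 0.2 0.1 0.2]   / 0.925 0.025 0.025 0.025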

/ Q-table for Blackjack: (state; action) -> value
/ nActions = 2 (stick=0, hit=1)
nBJActions:2

/ imax: index of the maximum (argmax); not a q keyword, so define it once
imax:{x?max x}

On-policy MC control estimates \(Q\) with the same sample-average machinery as prediction, but there is a catch: playEpisode records only the states visited, not the actions taken, so there is nothing to attribute each return to. Rather than replaying episodes to recover the actions, we extend the episode generator to return (state;action) pairs:

/ Revised episode generator: records (state;action) pairs; reward returned separately
playEpisodeSA:{[policy]
  dealer:drawCard[];
  ph:initHand[];
  ps:ph[0]; pua:ph[1];
  while[ps<12; res:hitHand[ps;pua]; ps:res[0]; pua:res[1]];
  trajectory:();    / list of (state;action) pairs
  done:0b;
  while[not done;
    state:(ps;dealer;pua);
    a:policy[ps;dealer;pua];
    trajectory,:enlist (state;a);
    $[a=1;
      [res:hitHand[ps;pua]; ps:res[0]; pua:res[1]; done:ps>21];
      done:1b
    ]
  ];
  reward:$[ps>21;-1f;
    [ds:dealerPlay[dealer];
     $[ds>21;1f;ps>ds;1f;ps=ds;0f;-1f]]];
  `trajectory`reward!(trajectory;reward)
  }

/ Clean MC control
mcControlClean:{[nEpisodes;gamma;eps]
  Q:nBJStates#enlist nBJActions#0f;
  N:nBJStates#enlist nBJActions#0i;
  do[nEpisodes;
    pi:{[Q;eps;ps;dealer;ua]
      s:encodeState[ps;dealer;ua];
      $[rand[1.0f]<eps; first 1?nBJActions; imax Q[s]]
      }[Q;eps;;;];
    ep:playEpisodeSA[pi];
    traj:ep`trajectory;
    G:ep`reward;
    n:count traj;
    visited:0#0;
    k:0;
    while[k<n;
      sa:traj[k];
      state:sa[0]; a:sa[1];
      s:encodeState . state;
      key_sa:a+nBJActions*s;   / flat (state;action) index
      if[not key_sa in visited;
        visited,:key_sa;
        Gk:G*gamma xexp (n-1)-k;   / return following this step
        N[s;a]+:1i;
        / Incremental mean: Q[s;a] += (G - Q[s;a]) / N[s;a]
        Q[s;a]+:(Gk-Q[s;a])%N[s;a]
      ];
      k+:1
    ]
  ];
  Q
  }
/ Run MC control: 500k episodes, epsilon=0.1
Q_mc:mcControlClean[500000;1f;0.1f];

/ Derived greedy policy
policyMC:{[Q;ps;dealer;ua]
  s:encodeState[ps;dealer;ua];
  imax Q[s]
  }[Q_mc;;;]

/ Test: what does the optimal policy do with player=20, dealer=5, no ace?
policyMC[20;5;0b]   / 0 = stick (correct)
policyMC[12;2;0b]   / should hit
policyMC[18;9;0b]   / borderline case

The Key Insight

MC estimates are unbiased. First-visit MC estimates \(v_\pi(s)\) as the sample average of actual returns, which converges to the truth by the law of large numbers without any assumptions about the environment’s structure. But the estimates have high variance: returns depend on the entire trajectory, which can vary wildly.

Dynamic programming has no variance (it works with exact expectations) but requires a model. MC has no model requirement but high variance.

Temporal difference learning, the next chapter, sits between these poles: it bootstraps (like DP) but from sampled experience (like MC). Understanding why that’s both useful and a compromise is the next step.

Temporal Difference Learning: TD(0), SARSA, and Q-Learning

S&B Chapter 6

If Monte Carlo methods are one end of the spectrum and dynamic programming is the other, temporal difference learning occupies the middle ground—and it’s arguably the most important ground in all of reinforcement learning. TD methods learn from experience (like MC) but also bootstrap from estimates (like DP). They update on every step, not at episode end. They work online, in real time, without a model.

This is the chapter where the field’s two canonical algorithms live: SARSA and Q-Learning. One of them learns the policy you’re executing. The other learns the optimal policy regardless of what you’re doing. The difference is subtle, consequential, and completely obvious in hindsight.

TD(0): The One-Step Update

MC waits until the end of the episode to compute the return \(G_t\) and update \(V(S_t)\). TD(0) doesn’t wait. After one step, it knows \(R_{t+1}\) and \(V(S_{t+1})\), and uses their combination as an estimate of the true return:

\[V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]\]

The quantity in brackets is the TD error \(\delta_t\): the difference between what you expected (\(V(S_t)\)) and what you got (\(R_{t+1} + \gamma V(S_{t+1})\)). Update your estimate by \(\alpha\) times this error.

The key insight: \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target—your new estimate of what \(V(S_t)\) should be. You’re not using the actual return; you’re using a one-step lookahead combined with your current estimate of \(V(S_{t+1})\). This is bootstrapping: using your own estimates to update your own estimates. It introduces bias (because \(V(S_{t+1})\) may be wrong), but dramatically reduces variance. The tradeoff between bias and variance is one of the deepest topics in RL.
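One update, worked with made-up numbers: suppose \(V(S_t)=0.4\), \(V(S_{t+1})=0.6\), \(R_{t+1}=0\), \(\alpha=0.1\), \(\gamma=1\):

v:0.4; v2:0.6; r:0f; alpha:0.1; gamma:1f;
delta:(r+gamma*v2)-v;   / TD error: 0.2
v+alpha*delta           / new estimate: 0.42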

We’ll demonstrate TD(0) on the Random Walk (S&B Example 6.2): 5 states A–E with a terminal state at each end. The left terminal gives reward 0; the right terminal gives reward 1. All other transitions give reward 0.

/ ============================================================
/ TD(0) Prediction — Random Walk (S&B Example 6.2)
/ ============================================================

/ States: 0=left_terminal, 1=A, 2=B, 3=C, 4=D, 5=E, 6=right_terminal
nRWStates:7
leftTerm:0; rightTerm:6

/ Random walk: uniform left/right at each step
rwStep:{[s]
  direction:$[rand[1.0f]<0.5f; -1; 1];
  s2:s+direction;
  reward:$[s2=rightTerm;1f;0f];
  done:(s2=leftTerm) or s2=rightTerm;
  `s`r`done!(s2;reward;done)
  }

/ True values under random policy: v(s) = s/6
/ (probability of reaching right terminal from state s)
trueV:7#0f;
trueV[1]:1f%6; trueV[2]:2f%6; trueV[3]:3f%6;
trueV[4]:4f%6; trueV[5]:5f%6;

/ TD(0) prediction
/ Runs nEpisodes episodes, returns V estimate after each batch
td0:{[alpha;gamma;nEpisodes]
  V:nRWStates#0f;          / terminals stay at 0
  V[1 2 3 4 5]:0.5f;       / S&B initialises non-terminal states to 0.5
  rmse_hist:();
  do[nEpisodes;
    s:3;   / start in centre (state C)
    done:0b;
    while[not done;
      result:rwStep[s];
      s2:result`s; r:result`r; done:result`done;
      / TD update
      target:r + gamma*$[done;0f;V[s2]];
      V[s]+:alpha*(target - V[s]);
      s:s2
    ];
    / Track RMSE against true values (non-terminal states only)
    rmse:sqrt avg (V[1 2 3 4 5]-trueV[1 2 3 4 5])xexp 2;
    rmse_hist,:rmse
  ];
  `V`rmse!(V;rmse_hist)
  }
/ Compare alpha values (S&B Example 6.2)
r_a005:td0[0.05f;1f;100];
r_a010:td0[0.1f;1f;100];
r_a015:td0[0.15f;1f;100];

/ Final RMSE at episode 100
r_a005[`rmse][99]   / ~0.14
r_a010[`rmse][99]   / ~0.09  (typically best)
r_a015[`rmse][99]   / ~0.12  (converges fast, noisy)

/ Value estimates after 100 episodes
r_a010`V   / should be close to 1/6, 2/6, 3/6, 4/6, 5/6

SARSA: On-Policy TD Control

For control (finding a good policy), we estimate action values \(Q(s,a)\) instead of state values. The on-policy update uses the next (state, action) pair \((S’, A’)\) chosen by the current policy—hence the name SARSA: \(S, A, R, S’, A’\).

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]\]

We’ll implement SARSA on the Windy Gridworld (S&B Example 6.5): a 7×10 grid where the wind in certain columns pushes the agent upward. Start at (3,0), goal at (3,7).

/ ============================================================
/ SARSA — Windy Gridworld (S&B Example 6.5)
/ ============================================================

wgRows:7; wgCols:10
wgStart:30    / row 3, col 0: state = 3*10 + 0
wgGoal:37     / row 3, col 7

/ Wind strength per column (upward push)
windStrength:0 0 0 1 1 1 2 2 1 0

/ State encoding
wgStateToRC:{[s] (s div wgCols; s mod wgCols)}
wgRCToState:{[r;c]
  r2:0|(wgRows-1)&r;
  c2:0|(wgCols-1)&c;
  (wgCols*r2)+c2
  }

/ Action deltas: 0=up 1=right 2=down 3=left
wgDeltas:((-1;0);(0;1);(1;0);(0;-1))

/ Windy gridworld step
wgStep:{[s;a]
  rc:wgStateToRC s;
  r:rc[0]; c:rc[1];
  d:wgDeltas[a];
  wind:windStrength[c];
  / Apply action then wind (wind moves agent up = decreases row)
  r2:r+d[0]-wind;
  c2:c+d[1];
  s2:wgRCToState[r2;c2];
  done:s2=wgGoal;
  `s`r`done!(s2;-1f;done)   / reward -1 per step
  }

/ Epsilon-greedy action from Q-table
wgSelectAction:{[Q;eps;s]
  $[rand[1.0f]<eps;
    first 1?4;
    imax Q[s]
  ]}

/ SARSA on-policy control
sarsa:{[alpha;gamma;eps;nEpisodes]
  nWGStates:wgRows*wgCols;
  Q:nWGStates#enlist 4#0f;
  epLengths:();
  do[nEpisodes;
    s:wgStart;
    a:wgSelectAction[Q;eps;s];
    steps:0;
    done:0b;
    while[not done;
      result:wgStep[s;a];
      s2:result`s; r:result`r; done:result`done;
      a2:wgSelectAction[Q;eps;s2];
      / SARSA update: uses next action A' from current policy
      target:r + gamma*$[done;0f;Q[s2;a2]];
      Q[s;a]+:alpha*(target - Q[s;a]);
      s:s2; a:a2;
      steps+:1
    ];
    epLengths,:steps
  ];
  `Q`epLengths!(Q;epLengths)
  }
/ Run SARSA on windy gridworld
result_sarsa:sarsa[0.5f;1f;0.1f;170];

/ Cumulative time steps to reach goal (S&B Figure 6.3)
cumSteps:sums result_sarsa`epLengths;
/ After ~170 episodes the optimal path is ~15 steps
last result_sarsa`epLengths   / should converge toward 15-17 steps

Q-Learning: Off-Policy TD Control

Q-Learning is SARSA’s off-policy counterpart. Instead of using the action actually taken next (\(A’\)), the update uses the greedy action—the maximum over all possible next actions:

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a’} Q(S_{t+1}, a’) - Q(S_t, A_t)\right]\]

This makes Q-Learning off-policy: it learns the optimal Q-function \(q^*\) regardless of the behaviour policy used to generate data. The behaviour can be exploratory (epsilon-greedy), but Q-learning converges to the optimal, purely greedy policy.

We’ll compare SARSA and Q-Learning on the Cliff Walking task (S&B Example 6.6): a 4×12 grid with a cliff along the bottom edge. Falling off the cliff costs -100 and returns to start. Start at bottom-left; goal at bottom-right.

/ ============================================================
/ Cliff Walking — SARSA vs Q-Learning (S&B Example 6.6)
/ ============================================================

cwRows:4; cwCols:12
cwStart:(cwRows-1)*cwCols            / bottom-left (remember q evaluates right to left)
cwGoal:((cwRows-1)*cwCols)+cwCols-1  / bottom-right
/ Cliff: bottom row, columns 1-10 (not start or goal)
cwCliff:((cwRows-1)*cwCols)+1+til 10

cwStateToRC:{[s](s div cwCols;s mod cwCols)}
cwRCToState:{[r;c] c+r*cwCols}   / note: r*cwCols+c would parse as r*(cwCols+c)
cwDeltas:((-1;0);(0;1);(1;0);(0;-1))

cwStep:{[s;a]
  rc:cwStateToRC s;
  d:cwDeltas[a];
  r2:0|(cwRows-1)&rc[0]+d[0];
  c2:0|(cwCols-1)&rc[1]+d[1];
  s2:cwRCToState[r2;c2];
  $[s2 in cwCliff;
    `s`r`done!(cwStart;-100f;0b);    / cliff: reset to start
    s2=cwGoal;
    `s`r`done!(s2;-1f;1b);           / goal reached
    `s`r`done!(s2;-1f;0b)            / normal step
  ]}

cwSelectAction:{[Q;eps;s]
  $[rand[1.0f]<eps; first 1?4; imax Q[s]]}

/ Q-Learning
qLearning:{[alpha;gamma;eps;nEpisodes]
  nCWStates:cwRows*cwCols;
  Q:nCWStates#enlist 4#0f;
  epRewards:();
  do[nEpisodes;
    s:cwStart;
    totalR:0f;
    done:0b;
    while[not done;
      a:cwSelectAction[Q;eps;s];
      result:cwStep[s;a];
      s2:result`s; r:result`r; done:result`done;
      / Q-Learning: uses max Q(s',a') NOT the actual next action
      target:r + gamma*$[done;0f;max Q[s2]];
      Q[s;a]+:alpha*(target - Q[s;a]);
      s:s2;
      totalR+:r
    ];
    epRewards,:totalR
  ];
  `Q`epRewards!(Q;epRewards)
  }

/ SARSA on cliff walking
sarsaCW:{[alpha;gamma;eps;nEpisodes]
  nCWStates:cwRows*cwCols;
  Q:nCWStates#enlist 4#0f;
  epRewards:();
  do[nEpisodes;
    s:cwStart;
    a:cwSelectAction[Q;eps;s];
    totalR:0f; done:0b;
    while[not done;
      result:cwStep[s;a];
      s2:result`s; r:result`r; done:result`done;
      a2:cwSelectAction[Q;eps;s2];
      target:r + gamma*$[done;0f;Q[s2;a2]];
      Q[s;a]+:alpha*(target - Q[s;a]);
      s:s2; a:a2; totalR+:r
    ];
    epRewards,:totalR
  ];
  `Q`epRewards!(Q;epRewards)
  }
/ Run both on cliff walking (S&B Figure 6.4)
res_ql:qLearning[0.5f;1f;0.1f;500];
res_sarsa:sarsaCW[0.5f;1f;0.1f;500];

/ Average reward over last 100 episodes
avg neg[100]#res_ql`epRewards     / Q-Learning: lower online reward (exploration still walks it off the cliff)
avg neg[100]#res_sarsa`epRewards  / SARSA: higher online reward (it learned the safer path)

This result captures the SARSA vs Q-Learning distinction precisely. Q-Learning learns the optimal policy (hug the cliff edge—shortest path) but during training it keeps falling off because epsilon-greedy exploration takes it cliff-ward. SARSA learns to account for its own exploration and finds a safer path further from the cliff. Q-Learning is optimal in the limit; SARSA is safer in practice.

The “correct” algorithm depends on whether you care about what’s optimal given perfect execution, or what’s good accounting for your actual (exploratory) behaviour. In production systems—where exploration happens on live data—on-policy methods like SARSA are often preferred precisely because they don’t pretend they’ll always execute optimally.

Expected SARSA

A simple but effective improvement: instead of sampling \(A’\) from the policy (SARSA) or using \(\max_{a’}\) (Q-Learning), use the expected value under the policy:

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_{a’} \pi(a’|S_{t+1}) Q(S_{t+1}, a’) - Q(S_t, A_t)\right]\]

/ Expected SARSA: lower variance than SARSA, handles off-policy too
expectedSarsa:{[alpha;gamma;eps;nEpisodes]
  nCWStates:cwRows*cwCols;
  Q:nCWStates#enlist 4#0f;
  epRewards:();
  do[nEpisodes;
    s:cwStart; totalR:0f; done:0b;
    while[not done;
      a:cwSelectAction[Q;eps;s];
      result:cwStep[s;a];
      s2:result`s; r:result`r; done:result`done;
      / Expected value under epsilon-greedy policy
      / greedyA gets (1-eps + eps/nA), others get eps/nA
      nA:4;
      pi:nA#eps%nA;
      pi[imax Q[s2]]+:1f-eps;
      expQ:sum pi*Q[s2];
      target:r + gamma*$[done;0f;expQ];
      Q[s;a]+:alpha*(target - Q[s;a]);
      s:s2; totalR+:r
    ];
    epRewards,:totalR
  ];
  `Q`epRewards!(Q;epRewards)
  }
res_esarsa:expectedSarsa[0.5f;1f;0.1f;500];
avg neg[100]#res_esarsa`epRewards   / typically between SARSA and Q-Learning

Expected SARSA generalises both SARSA (if you use the current policy) and Q-Learning (if you use a greedy target policy). It has lower variance than SARSA because it averages out the randomness of \(A’\). In practice, it often outperforms both when computational cost per step isn’t a concern.

Why TD Methods Dominate

TD methods are the workhorses of practical RL for three reasons:

  1. Online learning: update at every step, not episode end. Essential for long or continuous tasks.
  2. No model needed: unlike DP, we sample transitions from the environment.
  3. Lower variance than MC: bootstrapping stabilises estimates, at the cost of some bias.

The bias from bootstrapping never fully goes away—Q-Learning converges to \(q^*\) only under appropriate conditions on learning rates and infinite exploration. But in practice, with reasonable hyperparameters, it works extraordinarily well. The rest of this book explores what happens when you need more: longer lookaheads (Chapter 7), approximate value functions (Chapter 8), or direct policy optimisation (Chapter 9).

N-Step Methods: Bridging Monte Carlo and TD

S&B Chapter 7

The TD(0) update uses one step of actual reward then bootstraps. Monte Carlo uses the complete return. Neither extreme is optimal for all problems. N-step methods generalise both: use \(n\) actual steps of reward then bootstrap from \(V(S_{t+n})\).

The \(n\)-step return:

\[G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

When \(n=1\): one-step TD. When \(n=T\) (episode length): Monte Carlo. Between these poles lies a continuum of algorithms, and the best value of \(n\) is problem-dependent—often somewhere in the middle.

Why this matters: TD(0) bootstraps immediately, so errors in \(V(S_{t+1})\) pollute the TD target for \(V(S_t)\). With \(n\) steps, you dilute that bootstrapping error with \(n\) actual reward observations before hitting an estimate. MC eliminates bootstrapping error entirely but has high variance—the return \(G_t\) depends on \(T - t\) random transitions. N-step methods let you trade these off explicitly.
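For concreteness, the 3-step return with made-up rewards 1 0 2, \(\gamma=0.9\), and a bootstrap value \(V(S_{t+3})=0.5\):

r:1 0 2f; g:0.9; vBoot:0.5;
(sum r*g xexp til 3)+vBoot*g xexp 3   / 1 + 0 + 0.81*2 + 0.729*0.5 = 2.9845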

N-Step TD Prediction

/ ============================================================
/ N-Step TD — Random Walk (S&B Figure 7.2)
/ Same 7-state random walk as Chapter 6
/ ============================================================

pi:acos -1f
normal:{[] sqrt[-2f*log rand 1.0f]*cos[2f*pi*rand 1.0f]}

nRWStates:7
leftTerm:0; rightTerm:6
trueV:0 1 2 3 4 5 6 % 6f   / true values

rwStep:{[s]
  s2:s+$[rand[1.0f]<0.5f;-1;1];
  r:$[s2=rightTerm;1f;0f];
  `s`r`done!(s2;r;(s2=leftTerm) or s2=rightTerm)
  }

/ N-step TD prediction (S&B Algorithm, Chapter 7)
/ Uses a circular buffer of size n to store recent (s,r) pairs
nStepTD:{[n;alpha;gamma;nEpisodes]
  V:nRWStates#0.5f;
  V[leftTerm]:0f; V[rightTerm]:0f;
  rmse_hist:();

  do[nEpisodes;
    / Circular buffers for states and rewards
    states:n+1#0i;
    rewards:n+1#0f;
    states[0]:3i;   / start in centre
    T:0W;           / terminal time (infinity initially)
    t:0i;           / current time
    step:0i;        / loop counter (t in S&B's notation)

    / Run until the last update (tau = T-1) has been made
    while[(step-n+1)<T;
      if[step<T;
        result:rwStep[states[step mod (n+1)]];
        states[(step+1) mod (n+1)]:result`s;
        rewards[(step+1) mod (n+1)]:result`r;
        if[result`done; T:step+1i]
      ];
      / Update time tau: the state whose estimate we're updating
      tau:step-n+1i;
      if[tau>=0;
        / Compute n-step return
        end:tau+n;
        G:0f;
        k:end&T;   / truncate at the terminal time
        while[k>tau;
          G:rewards[k mod (n+1)] + gamma*G;
          k-:1
        ];
        / Add bootstrap value if not at terminal (parenthesised: right-to-left evaluation)
        if[end<T; G+:(gamma xexp n)*V[states[end mod (n+1)]]];
        / Update V[tau]
        s_tau:states[tau mod (n+1)];
        if[(not s_tau=leftTerm) and not s_tau=rightTerm;
          V[s_tau]+:alpha*(G - V[s_tau])
        ]
      ];
      step+:1
    ];
    rmse:sqrt avg (V[1 2 3 4 5]-trueV[1 2 3 4 5]) xexp 2;
    rmse_hist,:rmse
  ];
  `V`rmse!(V;rmse_hist)
  }
/ Reproduce S&B Figure 7.2: RMS error vs n for various alpha
/ 10 runs, 10 episodes each, measured at episode 10
nVals:1 2 4 8 16 32 64 128 256 512
alphaVals:0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

/ For n=1 (TD) and n=512 (near-MC), compare
r_n1:nStepTD[1i;0.4f;1f;100];
r_n4:nStepTD[4i;0.4f;1f;100];
r_n16:nStepTD[16i;0.3f;1f;100];

last r_n1`rmse    / TD(0) final RMSE
last r_n4`rmse    / n=4 typically lowest error for appropriate alpha
last r_n16`rmse   / starts to behave like MC

The figure in S&B shows a U-shaped curve: very small \(n\) (pure TD) has moderate error because it bootstraps aggressively from poor estimates; very large \(n\) (near-MC) has moderate error from high variance. The sweet spot is often \(n \approx 4\) to \(16\), depending on the problem.

N-Step SARSA

Extending to control is straightforward: collect \(n\) steps of \((s, a, r)\) tuples, then compute the \(n\)-step return and update \(Q(s, a)\):

\[G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})\]

/ ============================================================
/ N-Step SARSA — Windy Gridworld
/ ============================================================

wgRows:7; wgCols:10
wgStart:30; wgGoal:37
windStrength:0 0 0 1 1 1 2 2 1 0
wgDeltas:((-1;0);(0;1);(1;0);(0;-1))

wgRCToState:{[r;c] ((0|(wgRows-1)&r)*wgCols)+(0|(wgCols-1)&c)}
wgStateToRC:{[s](s div wgCols;s mod wgCols)}

wgStep:{[s;a]
  rc:wgStateToRC s;
  d:wgDeltas[a];
  r2:rc[0]+d[0]-windStrength[rc[1]];
  c2:rc[1]+d[1];
  s2:wgRCToState[r2;c2];
  `s`r`done!(s2;-1f;s2=wgGoal)
  }

wgSelectAction:{[Q;eps;s]
  $[rand[1.0f]<eps; first 1?4; imax Q[s]]}

nStepSarsa:{[n;alpha;gamma;eps;nEpisodes]
  nWGStates:wgRows*wgCols;
  Q:nWGStates#enlist 4#0f;
  epLengths:();

  do[nEpisodes;
    bufSize:n+1;
    states:bufSize#wgStart;
    actions:bufSize#0i;
    rewards:bufSize#0f;
    actions[0]:wgSelectAction[Q;eps;wgStart];
    T:0W; step:0i;

    while[(step-n+1)<T;
      if[step<T;
        result:wgStep[states[step mod bufSize]; actions[step mod bufSize]];
        s2:result`s; r:result`r; done:result`done;
        states[(step+1) mod bufSize]:s2;
        rewards[(step+1) mod bufSize]:r;
        $[done;
          T:step+1i;
          actions[(step+1) mod bufSize]:wgSelectAction[Q;eps;s2]
        ]
      ];
      tau:step-n+1i;
      if[tau>=0;
        end:tau+n;
        G:0f;
        k:end&T;
        while[k>tau;
          G:rewards[k mod bufSize]+gamma*G;
          k-:1
        ];
        if[end<T;
          s_end:states[end mod bufSize];
          a_end:actions[end mod bufSize];
          G+:(gamma xexp n)*Q[s_end;a_end]
        ];
        s_tau:states[tau mod bufSize];
        a_tau:actions[tau mod bufSize];
        Q[s_tau;a_tau]+:alpha*(G - Q[s_tau;a_tau])
      ];
      step+:1
    ];
    epLengths,:step-n+1i  / approximate episode length
  ];
  `Q`epLengths!(Q;epLengths)
  }
/ Compare n=1 (SARSA), n=4, n=8 on windy gridworld
r_sarsa1:nStepSarsa[1i;0.5f;1f;0.1f;200];
r_sarsa4:nStepSarsa[4i;0.5f;1f;0.1f;200];
r_sarsa8:nStepSarsa[8i;0.3f;1f;0.1f;200];

/ Convergence speed: cumulative steps to reach goal
last sums r_sarsa1`epLengths
last sums r_sarsa4`epLengths    / often converges faster
last sums r_sarsa8`epLengths

TD(\(\lambda\)): The Elegance of Eligibility Traces

N-step methods choose a fixed \(n\). What if we used a weighted combination of all \(n\)-step returns, with weights decaying geometrically?

\[G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}\]

This is the \(\lambda\)-return. When \(\lambda=0\): pure TD(0). When \(\lambda=1\): pure Monte Carlo. The parameter \(\lambda\) interpolates the full spectrum in a single algorithm.
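The weights \((1-\lambda)\lambda^{n-1}\) form a proper probability distribution over the \(n\)-step returns, which a truncated sum confirms:

lam:0.8;
wts:(1-lam)*lam xexp til 100;   / weight on the 1-step, 2-step, ... returns
sum wts                         / 1 - lam^100, i.e. ~1.0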

The forward view (computing \(G_t^\lambda\) above) requires future rewards. The backward view, using eligibility traces, computes the same update online using only past information:

/ TD(lambda) with eligibility traces — forward view approximate
/ (True online TD(lambda) requires the full algorithm from S&B Ch 12)
tdLambda:{[lam;alpha;gamma;nEpisodes]
  V:nRWStates#0.5f;
  V[leftTerm]:0f; V[rightTerm]:0f;
  rmse_hist:();

  do[nEpisodes;
    / Eligibility traces: one per state, initialised to 0 each episode
    e:nRWStates#0f;
    s:3i;  / start centre
    done:0b;

    while[not done;
      result:rwStep[s];
      s2:result`s; r:result`r; done:result`done;

      / TD error (parenthesised: q evaluates right to left)
      delta:(r + gamma*$[done;0f;V[s2]])-V[s];

      / Update eligibility: accumulating traces
      e[s]+:1f;

      / Update all states proportional to their eligibility
      V+:alpha*delta*e;

      / Decay eligibility traces
      e*:gamma*lam;
      s:s2
    ];
    rmse:sqrt avg (V[1 2 3 4 5]-trueV[1 2 3 4 5]) xexp 2;
    rmse_hist,:rmse
  ];
  `V`rmse!(V;rmse_hist)
  }
/ lambda=0: should match TD(0)
r_lam0:tdLambda[0f;0.4f;1f;100];

/ lambda=0.9: more like MC, slower on short problems
r_lam9:tdLambda[0.9f;0.1f;1f;100];

last r_lam0`rmse   / ~0.10
last r_lam9`rmse   / varies more


The eligibility trace update is a perfect example of where q shines. The update V+:alpha*delta*e is a single vectorised operation that a scalar language would express as an explicit loop over all states. In q it’s one line, it’s fast, and it reads exactly like the mathematical expression \(V \leftarrow V + \alpha \delta \mathbf{e}\).

Eligibility traces generalise beautifully to function approximation (Chapter 8) and underpin many practical algorithms. For deep RL, they appear in advantage estimation (GAE) used by PPO. The underlying idea—that credit for a reward propagates backwards through recently visited states—never stops being useful.

Function Approximation: When Your State Space Won’t Fit in a Table

S&B Chapters 9–10

Every algorithm in this book so far has maintained a table: one entry per state, or one entry per state-action pair. GridWorld has 16 states. Blackjack has 200. Windy gridworld has 70. These fit in memory, run instantly, and converge to exact solutions.

Real problems don’t cooperate. A chess position has roughly \(10^{43}\) legal states. A continuous robot joint angle lives in \(\mathbb{R}^n\). kdb+/q tick data has state spaces that combine position, inventory, market regime, and a dozen other features—functionally infinite.

Function approximation replaces the table with a parameterised function \(\hat{v}(s; \mathbf{w}) \approx v_\pi(s)\), where \(\mathbf{w}\) is a weight vector of manageable size. Updating one weight can affect estimates for many states simultaneously—generalisation. The cost: we lose exact convergence guarantees (mostly) and gain approximation error. In practice, for anything interesting, it’s the only option.

Linear Function Approximation

The simplest and most tractable case: \(\hat{v}(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)\), where \(\mathbf{x}(s)\) is a feature vector for state \(s\). The update rule under gradient descent:

\[\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[v_\pi(s) - \hat{v}(s;\mathbf{w})\right] \mathbf{x}(s)\]

We don’t know \(v_\pi(s)\), so we substitute a target: for semi-gradient TD(0), the target is \(R + \gamma \hat{v}(S’;\mathbf{w})\).

\[\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[R + \gamma \hat{v}(S’;\mathbf{w}) - \hat{v}(S;\mathbf{w})\right] \mathbf{x}(S)\]

“Semi-gradient” because we treat the target \(R + \gamma \hat{v}(S’;\mathbf{w})\) as fixed (not differentiating through it). This is the same trick deep Q-networks use with a “target network.”

State Aggregation

The simplest feature representation: group states into clusters, and use a one-hot vector indicating which cluster \(s\) belongs to. States in the same group share a single weight. It’s a coarse approximation, but it scales.

We’ll use the 1000-state random walk (S&B Example 9.1): states 1–1000, uniform transitions \(\pm 100\), rewards \(-1\) at the left terminal and \(+1\) at the right. Aggregate into 10 groups of 100 states each.

/ ============================================================
/ State Aggregation + Semi-Gradient TD — 1000-State Random Walk
/ S&B Example 9.1
/ ============================================================

nStates1k:1002    / 0 and 1001 are terminals; 1-1000 are real states
leftTerm1k:0; rightTerm1k:1001

/ Random walk: jump uniform in [-100, 100], clamp to [0,1001]
rw1kStep:{[s]
  delta:first[1?201]-100;  / uniform in {-100,...,100}
  s2:0|1001&s+delta;
  r:$[s2=rightTerm1k;1f;$[s2=leftTerm1k;-1f;0f]];
  done:(s2=leftTerm1k) or s2=rightTerm1k;
  `s`r`done!(s2;r;done)
  }

/ State aggregation: 10 groups of 100 states
nGroups:10; groupSize:100

/ Feature vector: one-hot over groups (length nGroups)
stateFeature:{[s]
  $[(s=leftTerm1k) or s=rightTerm1k;
    nGroups#0f;     / terminal: zero features
    [g:(s-1) div groupSize;    / group index 0-9
     @[nGroups#0f;g;:;1f]]    / one-hot
  ]}

/ Value estimate: w . x(s)  (q has no `dot` keyword; sum of products)
vHat:{[w;s] sum w*stateFeature s}

/ Semi-gradient TD(0) with linear function approximation
semiGradTD:{[alpha;gamma;nEpisodes]
  w:nGroups#0f;   / weight vector
  rmse_hist:();
  do[nEpisodes;
    s:500i;     / start near centre
    done:0b;
    while[not done;
      x:stateFeature s;
      result:rw1kStep[s];
      s2:result`s; r:result`r; done:result`done;
      / TD target
      target:r + gamma*$[done;0f;vHat[w;s2]];
      / Semi-gradient update: w += alpha * delta * x(s)
      delta:target - vHat[w;s];
      w+:alpha*delta*x
    ];
    / RMSE against true values (approximately linear from -1 to 1)
    / Note: use a projection here; q lambdas cannot see the
    / enclosing function's locals, so {vHat[w;x]} would fail
    trueV_1k:-1f+2f*(1+til 1000)%1001f;
    estV_1k:vHat[w;] each 1+til 1000;
    rmse_hist,:sqrt avg (estV_1k-trueV_1k) xexp 2
  ];
  `w`rmse!(w;rmse_hist)
  }
/ Run semi-gradient TD with state aggregation
res_sgTD:semiGradTD[2e-4f;1f;100000];

/ Value estimates per group (S&B Figure 9.1)
/ Each weight corresponds to one group's value estimate
show res_sgTD`w
/ Should be approximately: -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
/ (staircase approximation to the linear true value)

The staircase is the tell: state aggregation can only represent piecewise constant value functions. Each group’s value is its weight; states within a group are treated identically. This is a strong assumption that’s often wrong in practice, but it’s mathematically clean and a useful first step.

Polynomial and Fourier Features

Beyond one-hot encoding, we can use richer feature functions. Polynomial features: \(\mathbf{x}(s) = (1, s, s^2, \ldots, s^k)\). Fourier features: \(x_i(s) = \cos(i\pi s)\). Both expand the state into a basis that linear function approximation can use.
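A polynomial version for the same walk is tiny; this is a sketch (the name `polyFeatures` and the 1001-point normalisation are ours, mirroring the Fourier code that follows):

```q
/ Polynomial basis for the 1000-state walk (sketch)
/ x(s) = (1, sn, sn^2, ..., sn^order), sn = state normalised to [0,1]
polyFeatures:{[order;s]
  sn:s%1001f;
  sn xexp til order+1   / sn^0, sn^1, ..., sn^order
  }
```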

/ Fourier basis features for the 1000-state walk
/ State normalised to [0,1]
fourierFeatures:{[order;s]
  sn:s%1001f;                       / normalise to [0,1]
  cos (acos -1f)*sn*til order+1     / cos(0), cos(pi*sn), cos(2*pi*sn), ...
  }
/ Note: q has no built-in `pi`, hence acos -1f; cos is atomic, so no `each`

/ Semi-gradient TD with Fourier features
semiGradTDFourier:{[order;alpha;gamma;nEpisodes]
  nFeats:order+1;
  w:nFeats#0f;
  do[nEpisodes;
    s:500i; done:0b;
    while[not done;
      x:fourierFeatures[order;s];
      result:rw1kStep[s];
      s2:result`s; r:result`r; done:result`done;
      target:r + gamma*$[done;0f;sum w*fourierFeatures[order;s2]];
      delta:target - sum w*x;
      w+:alpha*delta*x
    ]
  ];
  w
  }

/ Fourier order 5: 6 parameters to represent 1000 states
w_fourier:semiGradTDFourier[5i;5e-5f;1f;100000];

/ Plot estimated values (evaluate at each of 1000 states)
vEst:{[w;order;s] sum w*fourierFeatures[order;s]}
show 50 cut {vEst[w_fourier;5i;x]} each 1+til 1000
/ Should show a smooth increasing curve from -1 to 1

Fourier features have a useful property: the weight for each component has a direct interpretation as the magnitude of a frequency component in the value function. Low-frequency components (\(i\) small) capture broad trends; high-frequency components capture fine structure. This is analogous to spectral decomposition.
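One way to see this is to split the estimate into its per-frequency contributions; a small sketch (`fourierContrib` is our name):

```q
/ Contribution of each frequency component to the value estimate
/ at state s: component i contributes w[i]*cos(i*pi*sn)
fourierContrib:{[w;order;s]
  w*fourierFeatures[order;s]
  }
/ The estimate is the sum of its components:
/ vEst[w;order;s] = sum fourierContrib[w;order;s]
```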

Tile Coding

The most successful practical linear feature representation in RL (before neural networks took over) is tile coding: multiple overlapping tilings of the state space, each tiling a regular grid. For a 2D state space, imagine laying down multiple grids, each offset by a fraction of a tile width.

/ Tile coding for 1D state space [low, high]
/ nTilings: number of overlapping grids
/ nTiles: number of tiles per tiling
tileCoding:{[nTilings;nTiles;low;high;s]
  / Total features: nTilings * nTiles
  width:(high-low)%nTiles;
  / Each tiling is offset by width%nTilings from the previous
  offset:width%nTilings;
  / For each tiling, shift s and find which tile it falls into
  k:til nTilings;
  shifted:s - k*offset;
  tileIdx:floor (shifted-low)%width;   / floor, not `int$ (which rounds)
  tileIdx:0|(nTiles-1)&tileIdx;        / clamp to valid range (parenthesised)
  / Indices of the active tiles in the flattened feature vector
  (nTiles*k)+tileIdx
  }

/ Build full binary feature vector from tile indices
tilesToVector:{[nTilings;nTiles;indices]
  v:nTilings*nTiles#0f;
  v[indices]:1f;
  v
  }

/ Tile-coded linear approximation for Mountain Car
/ State: (position, velocity) in [-1.2,0.6] x [-0.07,0.07]
nMCTilings:8; nMCTiles:8

/ Tile code a (position, velocity) pair
mcFeatures:{[pos;vel]
  posIdx:tileCoding[nMCTilings;nMCTiles;-1.2f;0.6f;pos];
  velIdx:tileCoding[nMCTilings;nMCTiles;-0.07f;0.07f;vel];
  / Combine: offset velocity tiles by nTiles to avoid collision
  combined:posIdx,velIdx+nMCTilings*nMCTiles;
  tilesToVector[nMCTilings;nMCTiles*2;combined]
  }

nMCFeats:nMCTilings*nMCTiles*2

/ Demonstrate: two similar positions should have overlapping features
x1:mcFeatures[-0.5f;0.01f];
x2:mcFeatures[-0.5f;0.015f];
sum x1 and x2   / shared active features (up to 16: 8 position + 8 velocity tilings)

The Mountain Car Environment

Mountain Car (S&B Example 10.1): a car must drive up a steep hill, but the engine is too weak to climb directly. It must first drive left to build momentum, then accelerate right. Continuous state (position, velocity), 3 discrete actions (push left, neutral, push right), reward -1 per step.

/ ============================================================
/ Mountain Car Environment — S&B Section 10.1
/ ============================================================

mcMinPos:-1.2f; mcMaxPos:0.6f
mcMinVel:-0.07f; mcMaxVel:0.07f
mcGoalPos:0.5f

/ Actions: 0=reverse, 1=neutral, 2=forward
mcActionForce:-0.001 0 0.001f

/ Step function: returns (next_pos; next_vel; reward; done)
mcStep:{[pos;vel;a]
  force:mcActionForce[a];
  / Update velocity: add force and gravity component
  vel2:mcMinVel|(mcMaxVel&vel+force+(-0.0025f*cos[3f*pos]));
  / Update position
  pos2:mcMinPos|(mcMaxPos&pos+vel2);
  / If hit left wall, zero velocity
  vel2:$[pos2=mcMinPos;0f;vel2];
  done:pos2>=mcGoalPos;
  `pos`vel`r`done!(pos2;vel2;-1f;done)
  }

/ Approximate Q(s,a) = w_a . x(s)
/ Defined at top level: q lambdas cannot read the locals of an
/ enclosing function, so these cannot live inside mcSarsa
qHat:{[W;pos;vel;a]
  x:mcFeatures[pos;vel];
  sum x*W[(a*nMCFeats)+til nMCFeats]   / W[(a*nMCFeats)+i] = weight for action a, feature i
  }

/ Epsilon-greedy over 3 actions (argmax via idesc; q has no `imax`)
selectMC:{[W;eps;pos;vel]
  $[eps>rand 1f;
    first 1?3;
    first idesc qHat[W;pos;vel;] each 0 1 2
  ]}

/ Semi-gradient SARSA for Mountain Car
/ Uses tile-coded linear Q-function approximation
mcSarsa:{[alpha;gamma;eps;nEpisodes]
  / Weight vector: nActions * nFeats, flat
  W:(3*nMCFeats)#0f;

  epSteps:();
  do[nEpisodes;
    pos:-0.6f+rand 0.2f;                    / random start in [-0.6,-0.4) per S&B
    vel:0f;                                 / zero initial velocity
    a:selectMC[W;eps;pos;vel];
    steps:0;
    done:0b;
    while[(not done) and steps<5000;        / max 5000 steps per episode
      x:mcFeatures[pos;vel];
      result:mcStep[pos;vel;a];
      pos2:result`pos; vel2:result`vel;
      r:result`r; done:result`done;
      a2:selectMC[W;eps;pos2;vel2];
      / Semi-gradient SARSA update
      target:r + gamma*$[done;0f;qHat[W;pos2;vel2;a2]];
      delta:target - qHat[W;pos;vel;a];
      / Update weights for action a only
      base:a*nMCFeats;
      W[base+til nMCFeats]+:alpha*delta*x;
      pos:pos2; vel:vel2; a:a2; steps+:1
    ];
    epSteps,:steps
  ];
  `W`epSteps!(W;epSteps)
  }
/ Train: convergence is slow—this is a harder problem
/ alpha per-tiling: S&B uses alpha/8 where 8 = nTilings
res_mc:mcSarsa[0.5f%nMCTilings;1f;0.0f;500];  / eps=0 (greedy)

/ Episode lengths should decrease from ~5000 to ~100-200
res_mc[`epSteps][0]     / first episode: usually max steps (5000)
last res_mc`epSteps     / last episode: ~100-200 if converged

/ Show learning curve
show 50 cut res_mc`epSteps   / display as 10x50 grid for readability

Why Function Approximation Changes Everything

With tabular methods, every update is exact and local. Changing \(V(s)\) for state \(s\) affects only that state. With function approximation, updating weights \(\mathbf{w}\) changes the estimates for all states that share features—which is most of them.

This generalisation is the whole point: we’re making the algorithm smarter about states it hasn’t seen. It’s also what makes convergence analysis hard. Semi-gradient TD doesn’t necessarily converge to the global optimum of any loss function; it converges to a TD fixed point that trades off Bellman error and function approximation error.

The practical implications: start with the simplest feature representation that captures the structure you need. Tile coding is remarkably effective and well-understood. Neural networks are powerful but introduce non-convex optimisation, instability, and the need for replay buffers and target networks. If tile coding solves your problem, use tile coding. If it doesn’t—that’s what deep RL is for.

Policy Gradient Methods: Directly Optimising What You Care About

S&B Chapter 13

Every method so far has learned a value function and derived a policy from it. Policy gradient methods are different: they parameterise the policy directly and optimise it directly, by gradient ascent on expected return.

This changes what you’re learning. Instead of “what is the value of state \(s\)?”, you’re learning “what probability should I assign to action \(a\) in state \(s\)?”. You optimise the policy’s actual performance objective—no intermediary value function required.

Why would you want this?

  1. Stochastic policies: Value-based methods produce deterministic policies (always take the greedy action). Some problems genuinely require stochastic policies—bluffing in poker, mixed strategies in games, exploration in environments where deterministic exploitation fails.

  2. Continuous actions: Argmax over a continuous action space is expensive or impossible. Policy gradient methods can output a distribution over continuous actions directly (mean and variance of a Gaussian, for example).

  3. Convergence: Policy gradient methods converge to local optima by construction. Value-based methods have more complex convergence properties under function approximation.

The cost: policy gradient methods typically have higher variance and require more samples than value-based methods for the same problem.

The Policy Gradient Theorem

The performance objective is the expected return from the start state:

\[J(\boldsymbol{\theta}) = v_{\pi_\theta}(s_0)\]

We want \(\nabla_\theta J(\theta)\). The policy gradient theorem (S&B Equation 13.5) gives:

\[\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla \pi(a|s,\boldsymbol{\theta})\]

where \(\mu(s)\) is the on-policy state distribution. The practical form, using the log-derivative trick:

\[\nabla J(\boldsymbol{\theta}) = \mathbb{E}_\pi\left[G_t \nabla \ln \pi(A_t | S_t, \boldsymbol{\theta})\right]\]

This is the REINFORCE estimator: for each step in an episode, the return \(G_t\) weights the gradient of the log-policy. High-return trajectories push the policy toward the actions that generated them.

REINFORCE

/ ============================================================
/ REINFORCE — Short Corridor Gridworld (after S&B Example 13.1)
/ ============================================================
/ Non-terminal states 3, 2, 1; terminal at state 0; start at 3.
/ The features cannot distinguish the states, so the policy must
/ assign the same action probabilities everywhere.
/
/ The twist (S&B): in state 1 the actions are REVERSED; there,
/ right moves towards the terminal and left moves away. Progress
/ is otherwise leftward, so the optimal policy is stochastic,
/ with p(right) around 0.4. A deterministic policy cannot be
/ optimal. (S&B's original corridor is mirrored, with optimal
/ p(right) ~ 0.59.)
/ ============================================================

pi_const:acos -1f
normal:{[] sqrt[-2f*log rand 1.0f]*cos[2f*pi_const*rand 1.0f]}

/ Corridor: states 0(terminal), 1, 2, 3; terminal at 0
/ Actions: 0=left, 1=right
/ Rewards: -1 per step; done when reaching state 0
/ Transitions: right at s1 goes left, left at s1 goes right

corridorStep:{[s;a]
  s2:$[s=1;
    $[a=1;0;2];   / state 1: actions are REVERSED
    $[a=1;s+1;s-1]
    ];
  s2:0|3&s2;           / clamp
  done:s2=0;
  `s`r`done!(s2;-1f;done)
  }

/ Policy: softmax over two actions, parameterised by theta
/ pi(right|s) = sigmoid(theta[s]) for each state
/ We use one parameter per non-terminal, non-confused state pair
/ S&B uses a single parameter theta (scalar) with x(s,right)=[1] x(s,left)=[0]
/ Let's follow S&B: scalar theta, feature h(s,a) = x(s,a)'theta

/ Feature function: h(s,right) = 1, h(s,left) = 0 for all states
/ (simplest linear softmax)
h:{[theta;s;a] $[a=1;theta;0f]}

/ Softmax probability
piProb:{[theta;s;a]
  ha:h[theta;s;a];
  hb:h[theta;s;1-a];
  exp[ha] % exp[ha]+exp[hb]
  }

/ Sample action from policy
piSample:{[theta;s]
  p:piProb[theta;s;1];   / prob of action 1 (right)
  $[rand[1.0f]<p;1;0]
  }

/ REINFORCE: policy gradient with Monte Carlo returns
/ S&B Algorithm 13.1
reinforce:{[alpha;gamma;nEpisodes]
  theta:0f;       / scalar parameter
  epReturns:();

  do[nEpisodes;
    / Generate episode
    s:3i;    / start at s3
    trajectory:();
    done:0b;
    while[(not done) and 1000>count trajectory;
      a:piSample[theta;s];
      result:corridorStep[s;a];
      trajectory,:enlist `s`a`r!(s;a;result`r);
      s:result`s;
      done:result`done
    ];

    / Compute returns G_t for each step (backwards)
    T:count trajectory;
    G:0f;
    k:T-1;
    while[k>=0;
      step:trajectory[k];
      G:step[`r]+gamma*G;
      / Policy gradient update
      / grad log pi(a|s,theta) = x(s,a) - E[x(s,*)] under pi
      / For scalar theta: grad = a - pi(right|s)
      / (since h=theta for a=1, h=0 for a=0)
      pRight:piProb[theta;step`s;1];
      gradLogPi:(`float$step`a) - pRight;   / a=1 or 0 minus prob of 1
      theta+:alpha*(gamma xexp k)*G*gradLogPi;   / parenthesised: q evaluates right-to-left
      k-:1
    ];
    epReturns,:sum trajectory`r
  ];
  `theta`returns!(theta;epReturns)
  }
/ Run REINFORCE (cf. S&B Figure 13.1)
/ alpha=2e-3 works well; results vary due to high variance
res_rf:reinforce[2e-3f;0.99f;1000];

/ Final theta: progress in this corridor is mostly leftward, so the
/ learned preference should favour left: pi(right) ~ 0.4, theta < 0
show res_rf`theta
piProb[res_rf`theta;1;1]   / prob of going right (same in every state)

/ Learning is noisy but should trend upward
avg -100#res_rf`returns    / average episode return over the last 100 eps

REINFORCE with Baseline

REINFORCE has high variance because the return \(G_t\) can be large and volatile. Subtracting a baseline \(b(s)\) reduces variance without introducing bias (as long as the baseline doesn’t depend on the action):

\[\theta \leftarrow \theta + \alpha \left(G_t - b(S_t)\right) \nabla \ln \pi(A_t|S_t, \boldsymbol{\theta})\]

A common choice: \(b(s) = V(s)\), the value function. This makes \(G_t - b(S_t)\) an estimate of the advantage: how much better was this action than average from this state?

/ REINFORCE with baseline (learned state-value function)
/ Learns both theta (policy params) and w (baseline/value params)
reinforceBaseline:{[alphaTheta;alphaW;gamma;nEpisodes]
  theta:0f;          / policy parameter
  / Baseline: learned V(s) as function of state
  / Simple: one weight per state (tabular baseline)
  wBase:4#0f;        / one weight per corridor state
  epReturns:();

  do[nEpisodes;
    s:3i; trajectory:(); done:0b;
    while[(not done) and 1000>count trajectory;
      a:piSample[theta;s];
      result:corridorStep[s;a];
      trajectory,:enlist `s`a`r!(s;a;result`r);
      s:result`s;
      done:result`done
    ];

    / Compute returns and update both theta and baseline
    T:count trajectory;
    G:0f;
    k:T-1;
    while[k>=0;
      step:trajectory[k];
      G:step[`r]+gamma*G;
      st:step`s;

      / Baseline update: gradient descent on (G - wBase[s])^2
      delta:G - wBase[st];
      wBase[st]+:alphaW*delta;

      / Policy update: use advantage (G - baseline)
      pRight:piProb[theta;st;1];
      gradLogPi:(`float$step`a) - pRight;
      theta+:alphaTheta*(gamma xexp k)*delta*gradLogPi;   / parenthesised: right-to-left
      k-:1
    ];
    epReturns,:sum trajectory`r
  ];
  `theta`wBase`returns!(theta;wBase;epReturns)
  }
/ Compare REINFORCE vs REINFORCE with baseline (cf. S&B Figure 13.2)
res_rf_base:reinforceBaseline[2e-3f;1e-2f;0.99f;1000];

/ Baseline version should show lower variance (smoother learning curve)
avg -100#res_rf_base`returns                  / similar final performance
avg abs 1_deltas -100#res_rf`returns          / step-to-step variation, plain REINFORCE
avg abs 1_deltas -100#res_rf_base`returns     / should be lower with the baseline

/ The baseline estimates: state 3 is weakest (furthest from the terminal)
res_rf_base`wBase
/ Expected pattern: wBase[0]=0 (terminal), wBase[3] < wBase[2] < wBase[1]

Actor-Critic Methods

REINFORCE is a pure Monte Carlo method: it waits for the episode to end, computes full returns, then updates. Actor-Critic methods bootstrap: the “critic” estimates \(V(s)\) using TD updates; the “actor” updates policy parameters using the TD error as the advantage estimate.

/ One-step Actor-Critic (S&B Algorithm 13.5)
actorCritic:{[alphaTheta;alphaW;gamma;nEpisodes]
  theta:0f;
  w:4#0f;   / tabular value function
  I:1f;     / gamma discount accumulator
  epReturns:();

  do[nEpisodes;
    s:3i; totalR:0f; done:0b;
    I:1f;

    while[(not done) and 10000>abs totalR;  / reward is -1 per step, so this caps episode length
      a:piSample[theta;s];
      result:corridorStep[s;a];
      s2:result`s; r:result`r; done:result`done;

      / TD error (advantage estimate)
      delta:(r + gamma*$[done;0f;w[s2]]) - w[s];  / parenthesised: right-to-left

      / Critic update: one-step TD
      w[s]+:alphaW*delta;

      / Actor update: I * delta * grad_log_pi
      pRight:piProb[theta;s;1];
      gradLogPi:(`float$a) - pRight;
      theta+:alphaTheta*I*delta*gradLogPi;

      I*:gamma;     / discount accumulation
      totalR+:r;
      s:s2
    ];
    epReturns,:totalR
  ];
  `theta`w`returns!(theta;w;epReturns)
  }
/ Actor-critic: online updates, no need to wait for episode end
res_ac:actorCritic[2e-3f;1e-2f;0.99f;1000];

/ Convergence comparison (rough—all three have high variance)
{avg -100#x`returns} each (res_rf;res_rf_base;res_ac)

The Policy Gradient Landscape

The methods in this chapter are the classical foundations. The modern policy gradient landscape has built extensively on them:

  • PPO (Proximal Policy Optimisation): clips the policy update to prevent large destructive steps. As of 2024, the most widely used deep RL algorithm.
  • A3C/A2C: asynchronous actor-critic with parallel workers.
  • SAC (Soft Actor-Critic): adds entropy maximisation to the objective; naturally exploratory; very sample-efficient on continuous action spaces.
  • TRPO (Trust Region Policy Optimisation): uses a KL-divergence constraint to limit policy update size; theoretically cleaner than PPO but harder to implement.

All of these descend directly from REINFORCE and the policy gradient theorem. The variance problem is still there; the solutions are more sophisticated. The objective is the same: \(\nabla J(\theta)\). The rest is engineering.

The one thing to carry forward: the policy gradient theorem is remarkably general. It says you can compute the gradient of any performance objective with respect to policy parameters by collecting trajectories and computing \(G_t \nabla \ln \pi(A_t|S_t;\theta)\). No model, no value function strictly required. Just samples, log-derivatives, and patience—which is not entirely unlike how organisms learn.

Where to Go From Here

You’ve implemented reinforcement learning from scratch in kdb+/q. Epsilon-greedy bandits, dynamic programming on GridWorld, Monte Carlo Blackjack, Q-Learning on cliff edges, n-step returns with eligibility traces, tile-coded mountain car, and policy gradients on a corridor that deliberately lies about its own structure.

Each of these is directly from Sutton & Barto. Each runs in q. None of them are toy demonstrations—they’re complete implementations of real algorithms that appear in papers, production systems, and textbooks studied by ML researchers worldwide.

What You Actually Learned

The algorithms in this book aren’t just implementations. They encode a set of ideas that recur at every level of modern RL:

The value of bootstrapping. DP uses model-based bootstrapping; TD uses bootstrapping from estimates. Both reduce variance at the cost of some bias. Every practical RL system bootstraps somewhere.

Exploration is not optional. The bandit algorithms established this; every subsequent chapter reinforced it. Epsilon-greedy is the simplest possible approach. UCB is principled. Softmax policies are smooth. The underlying problem—you don’t know what you don’t know—never disappears.

On-policy vs off-policy matters. SARSA vs Q-Learning isn’t just an algorithmic choice; it’s a statement about what you want to learn (the policy you’re executing, or the optimal policy). In production RL—where your data comes from a deployed policy that may not be the one you’re learning—this distinction is critical.

Function approximation changes the guarantees but not the intuitions. Semi-gradient TD converges to a TD fixed point, not the true value function. But the TD fixed point is usually good enough, and the architecture choices (state aggregation, tile coding, neural networks) determine how well you generalise.

Policy gradient methods are general but noisy. REINFORCE converges to local optima. Baselines help. The modern variants (PPO, SAC) are engineering solutions to the variance problem, built on the same theoretical foundation.

Extending the Implementations

The code in this book is deliberately minimal. Extending it is the actual learning exercise. Some suggestions:

Add experience replay to Q-Learning. Store transitions in a table; sample random batches for updates. This breaks correlations between consecutive samples and dramatically stabilises learning. It’s the first ingredient of DQN.

/ Replay buffer: fixed-size table of (s, a, r, s2, done) transitions
initReplay:{[capacity]
  `capacity`size`idx`buffer!(capacity;0i;0i;
    ([]s:capacity#0i; a:capacity#0i; r:capacity#0f;
       s2:capacity#0i; done:capacity#0b))
  }

/ Add transition to circular buffer
addTransition:{[buf;s;a;r;s2;done]
  idx:buf`idx;
  buf[`buffer;idx]:(`s`a`r`s2`done!(s;a;r;s2;done));
  buf[`idx]:(idx+1) mod buf`capacity;
  buf[`size]:buf[`capacity]&buf[`size]+1;
  buf
  }

/ Sample random minibatch
sampleBatch:{[buf;n]
  n?buf[`buffer] til buf`size
  }

Implement Double Q-Learning. Q-Learning overestimates action values because it uses the same network to select and evaluate actions. Double Q-Learning uses two Q-tables: one selects the action, the other evaluates it.
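A tabular sketch of that update, assuming a surrounding loop that supplies (s; a; r; s2; done) and Q-tables as nStates x nActions float matrices (the function name is ours):

```q
/ Double Q-Learning update: flip a coin; one table selects the
/ greedy action, the other evaluates it (decoupling selection
/ from evaluation removes the maximisation bias)
doubleQUpdate:{[Q1;Q2;alpha;gamma;s;a;r;s2;done]
  $[0.5>rand 1f;
    [aStar:first idesc Q1[s2];               / Q1 selects...
     tgt:r+gamma*$[done;0f;Q2[s2;aStar]];    / ...Q2 evaluates
     Q1[s;a]+:alpha*tgt-Q1[s;a]];
    [aStar:first idesc Q2[s2];               / Q2 selects...
     tgt:r+gamma*$[done;0f;Q1[s2;aStar]];    / ...Q1 evaluates
     Q2[s;a]+:alpha*tgt-Q2[s;a]]
  ];
  (Q1;Q2)
  }
```

Initialise with something like `Q1:Q2:(nStates;nActions)#0f` and act epsilon-greedily on Q1+Q2.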

Add a target network. Freeze a copy of \(Q\) for computing targets; update it every \(C\) steps. This is the second key ingredient of DQN.
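A minimal sketch of the idea on a Q-table (names ours):

```q
/ Q-Learning step that bootstraps from a FROZEN copy Qtgt
/ rather than from the table being updated
tgtStep:{[Qmain;Qtgt;alpha;gamma;s;a;r;s2;done]
  tgt:r+gamma*$[done;0f;max Qtgt[s2]];   / target uses the frozen copy
  Qmain[s;a]+:alpha*tgt-Qmain[s;a];
  Qmain
  }
/ ...and every C steps resynchronise: Qtgt:Qmain
```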

Try continuous action spaces. Replace the softmax policy with a Gaussian: output mean \(\mu(s;\theta)\) and log-std \(\log\sigma(s;\theta)\); sample actions from \(\mathcal{N}(\mu, \sigma^2)\). This is the foundation of SAC, DDPG, and TD3.
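A sketch of the sampling and the score function for a linear-Gaussian policy, reusing the Box-Muller `normal` defined alongside the corridor code (the other names are ours):

```q
/ Gaussian policy: mean is linear in the features, log-std a free scalar
gaussSample:{[thMu;thLogSig;x]
  (sum thMu*x)+exp[thLogSig]*normal[]
  }

/ Score function (grad log pi) wrt the mean weights:
/ ((a-mu)%sigma^2)*x(s); wrt logSig it is ((a-mu)%sigma)^2 - 1
gaussGradMu:{[thMu;thLogSig;x;a]
  sig:exp thLogSig;
  x*(a-sum thMu*x)%sig*sig
  }
```

Plug `gaussGradMu` in wherever the corridor code uses `gradLogPi` and the REINFORCE updates carry over unchanged.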

The kdb+/q Angle

The implementations in this book use standard q idioms: scan for iteration, functional amend for updates, vectorised operations for batch processing. The language forced certain clarity—when the Bellman update is one line of q, it’s obvious what it means.

For production use in financial contexts, RL has specific applications that play to q/kdb+ strengths:

Optimal execution. Liquidating a position under market impact: state is (remaining inventory, time left, current price), action is trade size. The Almgren-Chriss model solves a simplified version in closed form; the reality is harder.

Market making. State is (spread, inventory, volatility), action is (bid, ask) offsets. The reward is P&L minus inventory risk. This is naturally a continuous-action MDP.

Feature selection and regime detection. Use RL to select which features to use in downstream models, conditioned on detected market regime. The state space is high-dimensional; function approximation is mandatory.

Portfolio rebalancing. State is (current weights, signals), action is trades to execute, reward is risk-adjusted return. Multi-asset, high-dimensional, with transaction costs.

All of these are harder than anything in this book. They share the same structure: state, action, reward, update. The algorithms are the same. The engineering is not.
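They also share the step-function shape used throughout this book. A toy sketch for execution (the quadratic impact cost and the names are illustrative assumptions, not a model to trade on):

```q
/ Toy execution MDP step: sell a shares from inventory inv
/ with t decision steps remaining
execStep:{[inv;t;a]
  a:inv&a;                       / cannot sell more than we hold
  r:neg 0.01f*a*a;               / toy quadratic impact cost (assumption)
  done:(t<=1) or inv=a;          / out of time or fully liquidated
  `inv`t`r`done!(inv-a;t-1;r;done)
  }
```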

Further Reading

Sutton & Barto, of course. Everything in this book is a translation of their text; the original has proofs, history, intuitions, and exercises we didn’t cover. It is freely available. There is no excuse not to read it.

Szepesvári, Algorithms for Reinforcement Learning. A shorter, more mathematical treatment. Covers convergence proofs for TD methods and linear function approximation. Worth reading after S&B.

Bertsekas, Reinforcement Learning and Optimal Control. Connects RL to classical optimal control theory. More rigorous, different emphasis. Essential if you work in continuous-time or continuous-control domains.

Spinning Up in Deep RL (OpenAI). Code-first introduction to modern deep RL: VPG, TRPO, PPO, DDPG, TD3, SAC. In Python (sorry), but the implementations are clear and the exposition is good.

One Last Thing

The 10-armed bandit in Chapter 2 and the short corridor in Chapter 9 are pedagogically similar: both are deceptively simple environments that reveal something real about learning under uncertainty. The bandit reveals the exploration-exploitation tradeoff. The corridor reveals that stochastic policies are sometimes strictly necessary.

The algorithms that work on these toy problems work—conceptually, modulo engineering—on everything else. Deep networks don’t change what Q-Learning is; they change how \(Q(s,a)\) is parameterised. Continuous action spaces don’t change what policy gradient is; they change how \(\pi(a|s;\theta)\) is represented.

The foundational ideas are stable. The implementations in q are fast and clear. The rest is scale and patience.

That, and a willingness to watch your agent fall off a cliff several thousand times before it learns to walk around.