Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Where to Go From Here

You made it. Let’s take stock of what you now know.

You understand that text becomes tokens — integer IDs indexing into a vocabulary built by iterative byte-pair merging. Those tokens become embedding vectors in a high-dimensional space where semantic relationships are geometric. Attention lets every position in the sequence look at every other position, computing relevance via query-key dot products and aggregating via weighted sum of values. Multi-head attention does this several ways in parallel. Transformer blocks chain attention and feedforward layers with residual connections. Stacking many blocks produces a full language model.

Training minimizes cross-entropy loss on next-token prediction, via gradient descent with Adam, on hundreds of billions of tokens. Fine-tuning adapts the model to specific tasks using a fraction of that compute, often using LoRA to update only a tiny fraction of parameters. Inference samples from the output distribution, using temperature and top-p to balance diversity and coherence, with KV caching to avoid redundant computation.

And none of this is magic. It’s matrix multiplication, softmax, and gradient descent, scaled to a level that produces emergent capabilities the architects didn’t explicitly design.

What You Should Actually Do Next

If you want to go deeper on the theory

Read the original papers, in order:

  1. “Attention Is All You Need” (Vaswani et al., 2017) — the transformer paper
  2. “Language Models are Unsupervised Multitask Learners” — GPT-2 paper (good writing, clear architecture)
  3. “Training Language Models to Follow Instructions with Human Feedback” — InstructGPT, the RLHF paper
  4. “LoRA: Low-Rank Adaptation of Large Language Models” — the LoRA paper

After those four, you have the foundation of everything that’s happened since.

Karpathy’s educational resources are exceptional:

  • nanoGPT on GitHub: the cleanest small GPT implementation available
  • His “Let’s build GPT” video on YouTube: 2 hours, builds a character-level GPT from scratch

If you want to get practical quickly

Run something real:

# Llama via Ollama — runs locally on a MacBook
brew install ollama
ollama pull llama3.2
ollama run llama3.2

# Or use the transformers library
pip install transformers accelerate
from transformers import pipeline

# A real model, running on your machine
generator = pipeline("text-generation", model="gpt2")
output = generator("The transformer architecture", max_new_tokens=50)
print(output[0]['generated_text'])

Fine-tune something:

pip install trl peft transformers datasets

The trl library (Transformer Reinforcement Learning) has clean examples for SFT (supervised fine-tuning), DPO, and PPO. Start with the SFT trainer on a small model like gpt2 or TinyLlama.

If you want to contribute to the field

Study mechanistic interpretability. This is the sub-field dedicated to understanding what’s actually happening inside trained models — which features specific neurons represent, how information flows through the network, how circuits implement behaviors.

Resources:

Study scaling laws. The Chinchilla paper (“Training Compute-Optimal Large Language Models,” Hoffmann et al. 2022) fundamentally changed how people think about the relationship between model size, data size, and compute. Understanding this will help you understand why models are the size they are.

Implement something from scratch. The best way to truly understand a system is to build it without looking at existing implementations.

# A starting challenge: implement this from scratch
# (not looking at the code in this book):

class MiniGPT(nn.Module):
    """
    Implement a working GPT from scratch.
    Requirements:
    - Multi-head causal self-attention
    - Position-wise feedforward network
    - Residual connections + layer norm
    - Token + positional embeddings
    - LM head with weight tying

    When it trains on Shakespeare and produces recognizable text,
    you've understood the architecture.
    """
    pass  # Your implementation here

The Things Worth Being Excited About

The field is genuinely exciting right now, and understanding the internals means you can read the excitement with more precision.

Context length is expanding rapidly. Models that can process hundreds of thousands of tokens — effectively entire codebases, books, or legal documents — are qualitatively different from models that can’t. The architectural work to make this efficient (sparse attention, state space models, various hybrid approaches) is active.

Multimodality is real. The same attention mechanism that works on text works on images (ViT, CLIP), audio, video. Models that process multiple modalities simultaneously are enabling genuinely new applications.

Agents are early but real. Models with tool use, persistent memory, and the ability to take actions in environments are being deployed in production. The interesting challenges are about reliability and trust, not capability.

Inference efficiency keeps improving. Speculative decoding, quantization, distillation, mixture of experts — the gap between what’s possible and what’s affordable is closing. A year ago, 70B parameter models required expensive hardware. Today they run on a laptop.

The Things Worth Being Sober About

Everything in the previous chapter on limits remains true. The capabilities are real; so are the failure modes. The two are not in tension — both can be true simultaneously, and good engineering requires holding both.

The field’s progress is genuine and the hype is also genuine, which means careful thinking is required to separate signal from noise. Papers that show impressive benchmark results often don’t transfer to production. Models that seem capable of reasoning often fail at basic tasks in ways that reveal the limits of pattern matching.

Don’t use an LLM where a lookup table will do. This sounds obvious but is not always applied. Language models are expensive, slow, and unreliable compared to deterministic systems for tasks with deterministic answers.

Verify everything in production. Structure outputs. Validate. Test edge cases. The confidence of the output is not a signal you can trust.

Think about the second-order effects. If your product makes it much easier to produce certain kinds of content at scale, what happens when everyone uses it? Document generation, customer service, code review, essay writing — these all look different at 100x scale.

A Final Note

You started this book as a developer who had used LLMs extensively but didn’t know what was happening inside them. That should no longer be true.

When you see a context window limit, you know why it exists: quadratic attention complexity and positional embedding constraints. When you see a pricing page that charges per token, you know what a token is and roughly how many appear in your prompts. When a model hallucinates, you know the mechanism: confident prediction from imperfect training data, with no internal truth-checker. When someone talks about fine-tuning, you know whether they mean LoRA or full fine-tuning, and what the trade-offs are.

That knowledge is worth having. It won’t prevent you from using these models incorrectly — humans are creative in how they misapply tools — but it gives you the right mental model for debugging when things go wrong.

The architecture is not magic. It’s specific mathematics with specific properties, trained in a specific way. Knowing that doesn’t make it less impressive; it makes it more. The fact that stacking matrix multiplications on enough data produces a system that writes coherent technical prose and debugs Python is genuinely remarkable.

It’s also just math.

Good luck out there.


# The last code block of the book.
# Run it if you like.

import torch
import torch.nn.functional as F

def what_you_now_know():
    topics = [
        ("Tokenization",      "BPE merges characters → subwords → vocabulary"),
        ("Embeddings",        "Token IDs → dense vectors in semantic space"),
        ("Self-Attention",    "Q·K^T / sqrt(d_k) → softmax → weighted V"),
        ("Multi-Head Attn",   "Run n_heads attention operations in parallel"),
        ("Transformer Block", "Attention + FFN, each with residual + LayerNorm"),
        ("Training",          "Cross-entropy loss → backprop → AdamW update"),
        ("Fine-Tuning",       "LoRA: freeze weights, train low-rank adapters"),
        ("Inference",         "Temperature, top-p, KV cache, autoregressive loop"),
        ("Limits",            "No ground truth, no calibrated confidence, no real-time knowledge"),
        ("The Meta Part",     "This book was generated by the thing it describes"),
    ]

    print("What you now know:")
    print()
    for topic, summary in topics:
        print(f"  {'✓':2s} {topic:20s}  {summary}")

    print()
    print("Next step: build something.")

what_you_now_know()