Chapter 14: LLMs as Lossy Compressors

This chapter is different. It's speculative, philosophical, and only partially serious. Take it as a frame for thinking about LLMs, not as a rigorous technical argument. The frame is useful even if imperfect.

Having spent thirteen chapters understanding how compressors work — what they keep, what they throw away, how they model data, and what the theoretical limits are — we're in a surprisingly good position to think about large language models. Because the argument that LLMs are compressors is not just a metaphor. It holds up under scrutiny, and it explains several things about LLM behavior that are otherwise puzzling.

The Compressor Framing

Here's the core claim: a language model's training process is the act of lossy compression of a large corpus of text into a set of floating-point weights.

Let's break this down piece by piece.

Training = Compression

A large language model is trained on something like the entire public internet — trillions of tokens of text. The model itself contains some number of parameters: GPT-3 had 175 billion parameters; Llama 3 405B has 405 billion; the trajectory is toward hundreds of billions, maybe trillions.

The raw training data is vastly larger than the model's weights. The pre-training corpus for a frontier model is on the order of 10–100 trillion tokens, each a 16-bit value (plus vocabulary etc.). The weights are a fixed-size array of floats.

LLM as a lossy compressor:

  Training corpus: ~10 trillion tokens × 2 bytes = ~20 TB

  GPT-4 weights (estimated ~1.7T params at fp8): ~1.7TB
  Llama 3 70B (fp16): ~140GB
  Llama 3 405B (fp8): ~400GB
  Gemma 2 9B (fp16): ~18GB
  GPT-4 estimated compression ratio: ~12:1 minimum

But the ratio understates the compression. The model doesn't just store the text — it captures something more like the structure of the text. It learns that code tends to have matching brackets, that English sentences tend to have subject-verb agreement, that numbers that appear in certain financial contexts tend to follow certain distributions.

This is better than the training data itself for a compressor's purposes — a good model of the structure enables encoding at near-entropy.

Actually, this exact idea was tested. In 2019, Jack Clark and Dario Amodei observed that language models could be used directly as lossless compressors: use the model's probability estimate for each token to arithmetic-code the next token. A model that assigns high probability to the actual next token → uses few bits to encode it. A perfect model → achieves Shannon entropy.

The result: large language models, used as arithmetic coders, beat every other general-purpose compressor on text.

LLM-as-compressor results (approximate, on enwik8 — 100MB of Wikipedia):

Compressor          Bits per byte
─────────────────────────────────
gzip-6                 3.20
bzip2                  2.43
xz (LZMA)             2.19
Brotli-11             2.11
PPM                   2.04
LSTM language model   1.77  ← early result (2019)
GPT-3 (large)        ~1.30  ← estimated
GPT-4 (estimated)    ~1.10  ← approaching Shannon limit for English
Shannon limit English ~1.0   ← theoretical minimum

An LLM beats every non-LLM compressor on English text,
by a wide margin, because it models the language so much better.

Inference = Decompression

If training is compression, then inference is decompression — roughly. The model "stored" the patterns from the training data in its weights. When you prompt it, you're asking it to "decompress" the relevant portion: generate text that conforms to those stored patterns.

Classical lossy compression:

  Input data        Lossy encoder        Compressed bits
  (photograph)  ─────────────────▶     (JPEG bytes)
                                              │
                                              │ Decoder
                                              ▼
                                        Approximate reconstruction
                                        (slight artifacts, close to original)

LLM "compression":

  Training corpus   Gradient descent     Model weights
  (internet text) ─────────────────▶   (fp16/fp8 params)
                                              │
                                              │ Inference
                                              ▼
                                        Approximate "reconstruction"
                                        (statistically similar to training)

This framing explains something curious: why language models produce text that sounds like their training data. They don't retrieve it — they generate text that's statistically consistent with what was stored. The compression preserved the structure, the patterns, the correlations — not the verbatim content.

What Gets Lost

Every lossy compressor decides what to discard. JPEG discards high-frequency spatial information. MP3 discards sounds masked by louder sounds. A video codec discards temporal detail where motion compensation fails.

What does an LLM discard?

Verbatim Text

The model almost certainly cannot reproduce most of the training data verbatim. It stores patterns, not photographic reproductions. Ask an LLM for the 73rd word of a specific Wikipedia article from 2022, and it will either make something up or fail gracefully.

Exceptions: text that appears many times, or that was highly prominent in training, may be "memorized" — stored in the weights in a way that allows near-verbatim reproduction. This is how LLMs can reproduce famous poems, common code snippets, and heavily cited passages.

Precise Numbers

Numbers present a specific challenge. "The speed of light is approximately 300,000 km/s" is a pattern that appears everywhere and is reliably stored. "The GDP of Paraguay in Q3 2021 was $9.73 billion" is a specific fact that may or may not have appeared often enough to be reliably stored.

LLMs frequently hallucinate specific numbers. This is the compressor discarding precision: the "shape" of the fact (GDP of Paraguay is in the range $8–12 billion) is stored; the exact value is not.

Types of information and LLM retention:

Highly retained:
  Common patterns of language and grammar
  Widely repeated facts (speed of light, capital cities)
  Code patterns and API signatures (if in training data)
  General reasoning structures

Poorly retained (hallucination risk):
  Precise numerical values
  Specific dates and version numbers
  Exact quotations and attributions
  Events after training cutoff
  Obscure facts that appeared rarely in training

Provenance and Attribution

JPEG doesn't know which photographer took the photo it's compressing. It just stores the pixels. Similarly, an LLM during training doesn't (in base form) learn who said what — it learns what kinds of things tend to be said. The author, the source, the publication date, the context — these are not inherently preserved.

This is why LLMs confidently produce text that sounds authoritative but isn't: they're generating according to the statistical pattern "authoritative-sounding text about topic X," without any ground truth to constrain them.

The Lossy Compressor and Confabulation

The JPEG analogy is illuminating here. When you compress a JPEG heavily and then ask "what color is this pixel?", the JPEG decoder gives you an answer — but it might be wrong, because heavy quantization changed the color. It doesn't say "I'm not sure"; it says "blue."

LLMs behave similarly. When prompted about something that was poorly preserved in their weights, they don't (by default) say "I'm not sure about this." They generate text that conforms to the pattern "confident answer to a question about X" — because that's what was in the training data. People asking questions and confident people answering them.

This is confabulation — not deliberate lying, but generating text that's statistically plausible without a mechanism for checking whether it's factually correct.

The Model as a "Compressed Civilization"

There's a more optimistic way to view the same facts. The LLM hasn't just compressed text — it's compressed something like human knowledge and reasoning patterns. When you ask an LLM how to debug a TypeScript error, you're accessing the compressed form of thousands of Stack Overflow threads, documentation pages, blog posts, and GitHub issues about TypeScript debugging. The model synthesizes these into a response that no single document contained.

This is the sense in which LLMs are more useful than retrieval systems for many tasks: they've not just indexed the data, they've compressed it into a form where patterns combine. A user who asks "write a Python function that does X" gets something assembled from many training examples, not retrieved from any one of them.

Retrieval vs. Generation:

Retrieval (search engine / RAG):
  Query → find relevant documents → return verbatim excerpts
  Precise for specific known facts.
  Can't synthesize across sources.

Generation (LLM inference / "decompression"):
  Query → generate statistically appropriate response
  Synthesizes across everything in training.
  Hallucinations possible for specific facts.
  Better for synthesis, explanation, reasoning.

The tradeoff maps neatly to lossless vs. lossy compression:
  Retrieval = lossless (return the original)
  Generation = lossy (return a reconstruction, possibly imperfect)

Why "Really Expensive"

The chapter title calls LLMs "really expensive lossy compressors." Let's be specific about the costs.

Training costs: Frontier LLMs cost hundreds of millions to billions of dollars to train, requiring thousands of specialized accelerators (H100 GPUs, Google TPUs) running for months.

Storage costs: Model weights are large — 140GB for Llama 3 70B, estimated ~1TB+ for GPT-4 — requiring specialized storage and memory hierarchies.

Inference costs: Each forward pass requires enormous compute. Generating 1000 tokens from a 70B model requires approximately 70 billion multiply-accumulate operations per token — 70 trillion operations per request. At scale, this is the dominant cost for AI companies.

Memory bandwidth costs: The weights must be read from memory for every inference. For large models, memory bandwidth (how fast you can move data from GPU memory to compute units) is often the bottleneck. This is the "memory wall" problem.

Cost comparison to classical compression:

Decompress a 100MB gzip file:
  CPU: 1 second on a single core
  Memory: 32KB
  Power: ~10 watt-seconds

"Decompress" (inference) from a 70B LLM, 500 tokens output:
  Hardware: 4× A100 GPUs (required for model to fit in memory)
  Time: ~30 seconds
  Memory: 140GB GPU memory
  Power: ~150,000 watt-seconds (150 kWh equivalent... ok that's extreme,
         more like 150 watt-seconds per query on efficient hardware)

The cost gap is enormous. But the "compression ratio" is too:
  An LLM can synthesize knowledge from 20TB of training data
  in response to a 100-token query.
  No classical compressor does this.

Compression and Intelligence

The compression framing suggests something interesting about what intelligence might be. Marcus Hutter's "AIXI" framework and related theoretical work has long noted a connection between intelligence and compression: an agent that can compress data well implicitly has a model of the world that generated the data.

Hutter went further: he defined the Hutter Prize (the "Hutter Prize for Lossless Compression of Human Knowledge"): a prize for compressing 100MB of Wikipedia as small as possible. The prize is based on the argument that better compression implies better understanding, and the winner of the prize would effectively have a more concise model of human knowledge than anyone else.

LLMs have effectively won this argument by a wide margin — they're the best known compressors of human-generated text, by virtue of having (arguably) the best model of how humans generate text.

The compression-intelligence connection:

  Better world model  →  Better prediction  →  Better compression

  An agent that understands physics will predict the trajectory
  of a ball better than one that doesn't, and therefore can
  compress a video of a thrown ball more effectively.

  An agent that understands English grammar will predict the
  next word in a sentence better than one that doesn't.

  The model quality IS the compression quality.
  Evaluating a compressor IS (partially) evaluating the model.

What This Frame Gets Wrong

Every analogy breaks down somewhere. The LLM-as-compressor frame has real limits:

Training objectives differ: A lossy image compressor explicitly optimizes for minimum distortion at a given bit rate. An LLM optimizes for next-token prediction. These are related but not identical. An LLM trained to minimize perplexity is not the same as an LLM trained to be a good assistant.

Inference is not simply decoding: JPEG decoding is a deterministic algorithm. LLM inference is sampling from a distribution — the same weights can produce different outputs. This is more like a probabilistic generative model than a traditional decompressor.

LLMs can reason in ways compressors can't: A JPEG decoder cannot modify the compression algorithm based on what it decompresses. An LLM, through chain-of-thought reasoning and tool use, can in some sense "look up" information it's uncertain about. The analogy breaks when you extend LLMs with retrieval, tool use, and agentic capabilities.

Fine-tuning adds information: RLHF (Reinforcement Learning from Human Feedback) and fine-tuning modify the weights after pretraining. This is like post-processing a compressed file — not decompression, but further transformation. The compression analogy is cleanest for base models without fine-tuning.

Practical Implications

Whether or not you find the compression frame intellectually satisfying, it generates useful heuristics:

Use LLMs for synthesis, not lookup: Where the compression shines is combining patterns from many sources. Where it fails is verbatim recall of specific facts.

Retrieval Augmented Generation (RAG): RAG combines an LLM's synthesis capabilities with verbatim retrieval from a document store. This is like having both a compressor (for synthesis) and a lossless store (for facts). The combination is more powerful than either alone.

Hallucinations are a compression artifact, not a bug: They're the model generating statistically-plausible content where specific information was lost in compression. Treating them as bugs to be "fixed" misses the point — the solution is architecture (add retrieval), not just training.

Context is decompression guidance: When you provide context in a prompt ("here is the actual text of the relevant document..."), you're giving the "decompressor" ground truth to work from, reducing the chance it generates a plausible-but-wrong reconstruction.

Model size and data affect what's retained: Larger models (more parameters) have more capacity to retain detail. More training data on a topic increases the "effective compression ratio" for that topic — more training signal means the pattern is more reliably stored.

The Deeper Question

The compression frame raises a question that's genuinely unresolved: is a model that has compressed the world's text into weights understanding the world, or just very efficiently predicting patterns?

For classical compressors, we'd never ask this. A gzip stream doesn't understand the text it compressed. But language models do things that look like reasoning, analogy-making, and generalization — behaviors we don't see from gzip.

One answer: the difference is quantitative, not qualitative. gzip's model of the world (LZ77 with a 32KB window) is just too simple to do anything that looks like reasoning. A transformer with 400 billion parameters and 20 trillion training tokens has a rich enough model that the representations it builds internally support what looks like reasoning.

Another answer: "understanding" requires something that statistical compression doesn't provide — grounding in the world, embodied experience, intentionality. In this view, an LLM is forever just a very sophisticated pattern-matcher, not an understander.

This debate isn't resolved. The compression frame doesn't resolve it either. But it does clarify the terms: whatever LLMs are doing, they're doing it by building an exceptionally good model of the statistical structure of human-generated text. Whether that model captures "meaning" is a philosophical question that compression theory can't answer.

Summary

Language model training is a form of lossy compression: the training corpus (trillions of tokens) is compressed into model weights (billions of parameters).
LLMs, used as arithmetic coders, beat every classical lossless compressor on English text — they have a better probability model.
Inference is approximately decompression: generating text that conforms to the stored statistical patterns.
What gets lost: verbatim text, precise numbers, provenance, rare facts. Hallucinations are the model generating plausible-but-imprecise reconstructions of poorly-compressed information.
LLMs are expensive because the "compressor" and "decompressor" are enormous neural networks, not simple algorithms.
Better compression implies a better model of the data; this connects compression quality to something like intelligence.
The frame breaks down for fine-tuned models, reasoning capabilities, and tool use — but it's a useful lens for understanding base model behavior and limitations.

The link between compression and understanding runs throughout the history of information theory. Shannon's work was not just about sending bits efficiently — it was about the nature of information itself. The LLM is, in some sense, the culmination of that thread: a machine that has learned to model the information in human communication well enough to generate new instances of it.

Whether that constitutes understanding, or merely very sophisticated pattern-matching, is left as an exercise for the reader.

The Big Book of Compression Algorithms