Introduction
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
— Claude Shannon, A Mathematical Theory of Communication, 1948
Thanks to Georgiy Treyvus, CloudStreet Product Manager, whose idea started this book.
Compression is everywhere and mostly invisible. When you stream a video, your browser is decoding a stream that was compressed with H.264 or AV1. When you send a file over Slack, it's gzipped in transit. When you store a photograph, the JPEG algorithm made a hundred small decisions about which visual details your eye probably won't notice. When a Parquet file loads in ten milliseconds instead of ten seconds, it's because the columnar layout lets the reader decompress only the columns it needs and skip the rest.
Most developers interact with compression through flags and library calls: --level 9, Content-Encoding: gzip, COMPRESS=zstd. This book is for the developer who wants to understand what's happening underneath those flags — not just which algorithm to reach for, but why it works, where it breaks down, and what it's actually trading off.
What This Book Covers
We start at the foundation: information theory. Shannon's entropy tells us how much compression is theoretically possible for any given data. Everything else in this book is the story of how humanity has gotten closer and closer to that limit, in different contexts, with different constraints.
From there, we work through the major algorithm families:
- Statistical coders (Huffman, arithmetic coding) — assign shorter codes to more frequent symbols
- Dictionary methods (LZ77, LZ78, LZW, Deflate, Brotli, Zstandard) — replace repeated patterns with references
- Transform coders (JPEG's DCT, audio's MDCT) — convert data to a domain where redundancy is easier to remove
- Predictive coders (FLAC, video inter-frame) — model what comes next and encode only the surprise
We then look at how these primitives combine in real systems: zip files, PNG, JPEG, MP3, H.264, Parquet, HTTP compression. Each domain has its own constraints and exploits different structure in the data.
We close with a chapter that's half-serious, half-provocative: the argument that large language models are, at a deep level, just very expensive lossy compressors. It's a useful frame for understanding what LLMs are good at, what they hallucinate, and why.
How to Read This Book
Each chapter is designed to stand largely on its own after the first two foundational chapters. If you're already comfortable with entropy and information theory, you can skip ahead. If you're primarily interested in image compression, jump to Chapter 8 — but do read Chapters 1–4 first for the vocabulary.
Code examples are primarily in Python (for readability) and C (for the low-level parts that matter). They're illustrative, not production-ready. Where a real implementation differs significantly from the illustrative one, we'll say so.
ASCII diagrams appear throughout because compression algorithms are fundamentally about data structures and data transformations — and those are often clearest as pictures.
A Note on Benchmarks
Compression benchmarks are notoriously sensitive to the data being compressed. A compressor that wins on text will lose on binary. A compressor that wins on small files will lose on large ones. Whenever we cite numbers, we'll note the context. When in doubt: measure on your actual data.
Let's begin.
Chapter 1: What is Compression?
Before we can understand how compression algorithms work, we need to understand what they're working against: the information content of data itself. This is the domain of information theory, and it gives us something remarkable — a mathematical lower bound on how small any compressor can make a given piece of data.
The Basic Idea
Compression is the process of encoding information using fewer bits than the original representation. That's it. Everything else is detail about how to do it well and when the tradeoffs make sense.
Consider the string "AAAAAAAAAA" — ten A's. Stored naively as ASCII, that's 80 bits (10 bytes). But we could also encode it as "10×A" — just three symbols. We've compressed 10 symbols into 3. The decompressor knows how to reverse this: see N×C, output C repeated N times.
Now consider "AJKQMZXBWR" — ten random letters. There's no shorter description. Each character is unpredictable, so each one has to be transmitted in full.
The key insight: compressibility is about predictability. Data that has patterns, regularities, or redundancy can be compressed. Data that is already maximally random cannot be.
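The N×C scheme above is run-length encoding (RLE), the simplest compressor of all. A minimal sketch (illustrative, not any particular file format):

```python
def rle_encode(s: str) -> list[tuple[int, str]]:
    """Collapse a string into (count, char) runs."""
    runs: list[tuple[int, str]] = []
    for ch in s:
        if runs and runs[-1][1] == ch:
            runs[-1] = (runs[-1][0] + 1, ch)   # extend the current run
        else:
            runs.append((1, ch))               # start a new run
    return runs

def rle_decode(runs: list[tuple[int, str]]) -> str:
    """Reverse the encoding: see (N, C), output C repeated N times."""
    return ''.join(ch * n for n, ch in runs)

print(rle_encode("AAAAAAAAAA"))   # [(10, 'A')] — one run
print(rle_encode("AJKQMZXBWR"))   # ten runs of length 1
```

Note what happens on the random string: RLE produces ten runs of one character each, which takes more space than the input. That is our first glimpse of the incompressibility argument later in this chapter.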
Entropy: The Measure of Surprise
Claude Shannon formalized this intuition in 1948 with the concept of entropy — the average amount of information (surprise) in a message, measured in bits.
For a source that produces symbols with probabilities p₁, p₂, ..., pₙ, Shannon's entropy H is:
H = -Σ pᵢ × log₂(pᵢ)
Each term -pᵢ × log₂(pᵢ) is the contribution of symbol i to the total entropy: the probability of seeing it, times how surprising it is when we do.
Worked Example: A Biased Coin
Suppose we're encoding coin flips where heads comes up 90% of the time.
Entropy of a biased coin (p_heads = 0.9):
p_heads = 0.9 → information content = -log₂(0.9) ≈ 0.152 bits
p_tails = 0.1 → information content = -log₂(0.1) ≈ 3.322 bits
H = 0.9 × 0.152 + 0.1 × 3.322
  = 0.137 + 0.332
  ≈ 0.469 bits per flip
A fair coin has entropy of exactly 1 bit per flip (you need a full bit to represent each outcome). Our biased coin only has 0.469 bits per flip — because heads is so predictable, it carries very little information.
If we had a million flips of this coin, a naive encoding would use 1 million bits. An optimal compressor could theoretically use only ~469,000 bits — a 53% reduction, without losing a single flip.
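The formula is short enough to check directly. A small illustrative helper reproduces the biased-coin number:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits: H = -Σ pᵢ log₂(pᵢ)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit per flip
print(entropy([0.9, 0.1]))   # biased coin: ≈ 0.469 bits per flip
```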
Entropy of English Text
English has 26 letters (plus punctuation, spaces). If all were equally likely, entropy would be log₂(27) ≈ 4.75 bits per character, and we'd need 5 bits per character minimum.
But English letters aren't equally likely:
Letter frequencies in typical English text:
E: 12.7% ████████████▌
T: 9.1% █████████
A: 8.2% ████████▏
O: 7.5% ███████▌
I: 7.0% ███████
N: 6.7% ██████▋
S: 6.3% ██████▎
H: 6.1% ██████
R: 6.0% ██████
...
Z: 0.07% ▏
Shannon estimated English text at roughly 1.0–1.5 bits per character of actual entropy when accounting for context and word structure. ASCII uses 8 bits per character. That's a theoretical 5–8× compression ratio sitting there waiting to be claimed — and modern text compressors do claim most of it.
Shannon's Source Coding Theorem
Shannon proved two fundamental theorems about compression. The source coding theorem (his first theorem) states:
It is impossible to losslessly compress a message to fewer bits than its Shannon entropy, on average.
"On average" is important. For any specific message, you might get lucky — a short message might compress well. But you cannot build a compressor that always compresses. In fact:
The incompressibility theorem: For every lossless compression function, there exist inputs that make the output larger than the input. This is a counting argument: there are 2ᴺ strings of N bits but only 2ᴺ − 1 strings shorter than N bits, so no lossless (injective) compressor can shrink them all (pigeonhole principle). Some inputs must map to outputs at least as long, and in any real format with header overhead, strictly longer.
This is why zip files of already-compressed data (like .jpg or .mp4) don't get smaller — the data is already near its entropy limit. Attempting to compress it just adds overhead.
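You can watch this happen with any real compressor. A quick experiment with Python's standard zlib module (illustrative; exact sizes vary by run and zlib version, but the output is reliably no smaller than the input):

```python
import os
import zlib

data = os.urandom(100_000)            # random bytes: incompressible by construction
packed = zlib.compress(data, level=9)  # even at maximum effort...
print(len(data), len(packed))          # ...the output is slightly LARGER:
                                       # stored-block framing plus header/checksum
```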
Entropy ceiling — no compressor can beat this:
Bits per ┌─────────────────────────────────────────┐
symbol │ │
8 │────│─── ASCII (8 bits/char) ─────────────────│
│ │ │
5 │ │─── Uniform 26-letter alphabet ──────────│
│ │ ╲ room for compression │
1-2 │ │─── English entropy (Shannon estimate) ──│
│ │ │
0 └────┴─────────────────────────────────────────┘
Lossless vs. Lossy Compression
This is the fork in the road that divides the entire compression world:
Lossless compression guarantees that decompression produces the exact original data. Not approximately. Not "close enough." Bit-for-bit identical. This is required for:
- Text and source code (a single wrong bit can break a program)
- Executable files and archives
- Medical images and legal documents
- Any data where exact reconstruction is required
Lossy compression accepts some degradation in exchange for dramatically better compression ratios. It's used when:
- Human perception can't detect (or doesn't care about) the lost information
- A "good enough" reconstruction is acceptable
- The original data has more precision than needed
The key insight behind lossy compression: human perception is not a perfect measuring instrument. Our eyes are more sensitive to changes in brightness than color. Our ears can't hear certain frequencies above certain amplitudes. If we discard imperceptible detail, we're still delivering the full perceptual experience — at a fraction of the data.
Lossless round-trip:
Original Compress Compressed Decompress Reconstructed
Data ─────────────▶ Data ─────────────▶ Data
IDENTICAL (bit-for-bit)
Lossy round-trip:
Original Encode Compressed Decode Approximation
Data ─────────────▶ Data ─────────────▶ of Data
(information lost)
The Lossy Tradeoff
Lossy compression creates a tunable tradeoff between quality and size. JPEG, for instance, has a "quality" parameter from 0–100 that controls how aggressively to discard high-frequency image data. At quality 95, the difference from the original is invisible to most people. At quality 30, you can see blocky artifacts but the file is 10× smaller.
This isn't "cheating" — it's engineering. The goal isn't to preserve all the bits; the goal is to preserve all the value.
Compression Ratio and Related Metrics
When comparing algorithms, we use several metrics:
Compression ratio: original size / compressed size. A ratio of 3:1 means the compressed file is one-third the size.
Space savings: (1 - compressed/original) × 100%. A 3:1 ratio means 66.7% space savings.
Bits per symbol: the average number of bits used per input symbol. For text, lower is better.
Compression speed: how fast the compressor runs, usually in MB/s.
Decompression speed: often more important than compression speed in practice (you compress once, decompress many times).
Example: compressing a 100MB text file
Algorithm Comp.Size Ratio Comp.Speed Decomp.Speed
─────────────────────────────────────────────────────────────
gzip -6 36 MB 2.8× 25 MB/s 200 MB/s
gzip -9 34 MB 2.9× 8 MB/s 200 MB/s
Brotli-6 30 MB 3.3× 15 MB/s 200 MB/s
Zstd-6 33 MB 3.0× 220 MB/s 700 MB/s
Zstd-19 28 MB 3.6× 12 MB/s 700 MB/s
xz -6 27 MB 3.7× 4 MB/s 50 MB/s
Zstd at level 6 compresses nearly as well as gzip at level 9, but nearly 30× faster. Decompression is 3.5× faster. This is why Zstd has largely replaced zlib in performance-sensitive applications. More on this in Chapter 5.
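These metrics are trivial to compute. A sketch with hypothetical helper names, using the gzip -6 row above as input:

```python
def compression_ratio(original: int, compressed: int) -> float:
    """Original size divided by compressed size: 3.0 means one-third the size."""
    return original / compressed

def space_savings(original: int, compressed: int) -> float:
    """Percent of the original size eliminated."""
    return (1 - compressed / original) * 100

# gzip -6 on the 100 MB file: 36 MB out
print(f"{compression_ratio(100, 36):.1f}x")   # 2.8x
print(f"{space_savings(100, 36):.0f}%")       # 64%
```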
The Fundamental Limits
Shannon's entropy gives us the theoretical lower bound. Real compressors can't always reach it because:
- Context model complexity: To predict what comes next perfectly, you'd need an infinite-order model. Real compressors use finite-order models.
- Compressor overhead: The compressed format must include metadata for the decompressor (code tables, dictionary entries, etc.). For small inputs, this overhead dominates.
- Data structure mismatches: A compressor optimized for text will underperform on binary data, and vice versa.
- Finite block sizes: Compression improves with more context. A compressor working on a 4KB block can't exploit patterns across a 100MB file.
The gap between theoretical entropy and achievable compression is where most of the interesting engineering happens. The chapters ahead are about how different algorithms close that gap for different types of data.
Summary
- Entropy is the theoretical minimum size of any lossless encoding of a data source.
- Shannon's source coding theorem proves no compressor can beat entropy on average.
- Lossless compression preserves all information; lossy compression discards perceptually unimportant information for better ratios.
- Compressibility is about predictability: regular, redundant data compresses; truly random data does not.
- All compression involves modeling the data. Better models → better compression, usually at higher computational cost.
In the next chapter, we'll look at Huffman coding — the first practical algorithm for approaching the entropy limit, and a component that still ships in almost every file format you've ever used.
Chapter 2: Huffman Coding
In 1951, MIT graduate student David Huffman was given a choice in Robert Fano's information theory course: write a term paper or sit the final exam. The term paper problem was to find the most efficient binary code for a set of symbols with known probabilities. Fano and Shannon had attacked that problem for years without proving their method optimal. Huffman, not knowing it was considered open, invented an algorithm that was demonstrably optimal, and published it in 1952. It was simpler and better than his professor's.
The core idea: give shorter codes to more frequent symbols, longer codes to rarer ones. This is the same intuition behind Morse code (E is ·, Z is −−··). Huffman's contribution was an algorithm that produces provably optimal variable-length prefix codes.
Prefix Codes and the No-Prefix Property
Before the algorithm, let's understand what we're building. A prefix code (also called a prefix-free code) is a set of codewords where no codeword is a prefix of another.
Why does this matter? Because without it, decoding is ambiguous:
Bad code (not prefix-free):
Symbol Code
A 0
B 01
C 10
D 11
Receiving "010":
- Could be A then C (0, 10), or B then A (01, 0)
- Ambiguous!
A prefix-free code can be decoded greedily: read bits until you match a codeword, output the symbol, repeat. No backtracking needed.
Good code (prefix-free):
Symbol Code
A 00
B 01
C 10
D 11
Receiving "010011":
- 01 → B
- 00 → A
- 11 → D
- Clear!
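Greedy decoding is a few lines of code. An illustrative sketch, using the good code above:

```python
def greedy_decode(bits: str, code: dict[str, str]) -> str:
    """Read bits until a codeword matches, emit its symbol, repeat."""
    reverse = {v: k for k, v in code.items()}   # bitstring → symbol
    out, current = [], ''
    for b in bits:
        current += b
        if current in reverse:   # prefix-free ⇒ the first match is the only match
            out.append(reverse[current])
            current = ''
    return ''.join(out)

code = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
print(greedy_decode("010011", code))   # BAD
```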
Building the Huffman Tree
Huffman codes are built from a binary tree. Leaves are symbols; the path from root to leaf defines the codeword (left branch = 0, right branch = 1).
The algorithm:
- Count the frequency of each symbol.
- Create a leaf node for each symbol, with its frequency as its weight.
- Insert all nodes into a min-priority queue.
- While more than one node remains:
  a. Extract the two nodes with lowest weight.
  b. Create a new internal node with these two as children, weight = sum of children's weights.
  c. Insert the new node back into the queue.
- The remaining node is the root.
Step-by-Step Example
Input: "ABRACADABRA" (11 characters)
Step 1: Count frequencies
Symbol Count Probability
A 5 0.455
B 2 0.182
R 2 0.182
C 1 0.091
D 1 0.091
Step 2: Build initial priority queue (lowest weight first)
[C:1] [D:1] [B:2] [R:2] [A:5]
Step 3: Combine C and D
Extract C:1 and D:1 → create internal node CD:2
   CD:2
   /  \
 C:1  D:1
Queue: [B:2] [R:2] [CD:2] [A:5]
Step 4: Combine B and R
Extract B:2 and R:2 → create internal node BR:4
   BR:4
   /  \
 B:2  R:2
Queue: [CD:2] [A:5] [BR:4]
→ sorted: [CD:2] [BR:4] [A:5]
Step 5: Combine CD and BR
Extract CD:2 and BR:4 → create internal node CDBR:6
     CDBR:6
     /    \
   CD:2   BR:4
   /  \   /  \
 C:1 D:1 B:2 R:2
Queue: [A:5] [CDBR:6]
Step 6: Combine A and CDBR
Extract A:5 and CDBR:6 → create root node ALL:11
      ALL:11
      /    \
   A:5    CDBR:6
          /    \
        CD:2   BR:4
        /  \   /  \
      C:1 D:1 B:2 R:2
Step 7: Read off codewords (left=0, right=1)
Symbol Path Code Length
A left 0 1
C right,left,left 100 3
D right,left,right 101 3
B right,right,left 110 3
R right,right,right 111 3
Verify the compression:
Original (fixed 3-bit codes, since 5 symbols need 3 bits):
11 chars × 3 bits = 33 bits
Huffman:
A appears 5 times × 1 bit = 5 bits
B appears 2 times × 3 bits = 6 bits
R appears 2 times × 3 bits = 6 bits
C appears 1 time × 3 bits = 3 bits
D appears 1 time × 3 bits = 3 bits
Total = 23 bits
Savings: 33 - 23 = 10 bits (30% reduction)
How close are we to Shannon's entropy limit?
H = -(5/11)log₂(5/11) - 2×(2/11)log₂(2/11) - 2×(1/11)log₂(1/11)
≈ 0.455×1.14 + 2×0.182×2.46 + 2×0.091×3.46
≈ 0.518 + 0.895 + 0.630
≈ 2.043 bits per symbol
Entropy for 11 symbols: 11 × 2.043 ≈ 22.5 bits
Huffman achieved: 23 bits ✓ (within 2.2% of theoretical minimum)
Python Implementation
Here's a clean, readable implementation of Huffman coding:
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffNode:
    weight: int
    symbol: Optional[str] = field(default=None, compare=False)
    left: Optional['HuffNode'] = field(default=None, compare=False)
    right: Optional['HuffNode'] = field(default=None, compare=False)

def build_huffman_tree(text: str) -> HuffNode:
    """Build a Huffman tree from input text."""
    freq = Counter(text)
    # Initialize priority queue with leaf nodes
    heap = [HuffNode(weight=count, symbol=char)
            for char, count in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        internal = HuffNode(
            weight=left.weight + right.weight,
            left=left,
            right=right
        )
        heapq.heappush(heap, internal)
    return heap[0]

def build_codebook(root: HuffNode) -> dict[str, str]:
    """Extract symbol → bitstring mapping from tree."""
    codebook = {}
    def traverse(node: HuffNode, path: str):
        if node.symbol is not None:  # leaf node
            codebook[node.symbol] = path or '0'  # single-symbol edge case
        else:
            traverse(node.left, path + '0')
            traverse(node.right, path + '1')
    traverse(root, '')
    return codebook

def huffman_encode(text: str) -> tuple[str, dict]:
    """Encode text, return bitstring and codebook."""
    if not text:
        return '', {}
    tree = build_huffman_tree(text)
    codebook = build_codebook(tree)
    encoded = ''.join(codebook[char] for char in text)
    return encoded, codebook

def huffman_decode(bits: str, codebook: dict) -> str:
    """Decode bitstring using codebook."""
    # Invert codebook: bitstring → symbol
    reverse = {v: k for k, v in codebook.items()}
    result = []
    current = ''
    for bit in bits:
        current += bit
        if current in reverse:
            result.append(reverse[current])
            current = ''
    return ''.join(result)

# Demo
text = "ABRACADABRA"
encoded, codebook = huffman_encode(text)
print(f"Original: {len(text) * 8} bits (naive 8-bit ASCII)")
print(f"Encoded: {len(encoded)} bits")
print(f"Codebook: {codebook}")
print(f"Decoded: {huffman_decode(encoded, codebook)}")
Output:
Original: 88 bits (naive 8-bit ASCII)
Encoded: 23 bits
Codebook: {'A': '0', 'C': '100', 'D': '101', 'B': '110', 'R': '111'}
Decoded: ABRACADABRA
The Optimality Guarantee
Huffman coding is provably optimal among all prefix codes. The proof is an exchange argument: if you have an optimal code that doesn't satisfy Huffman's tree properties, you can always swap two codewords to make it "more Huffman" without increasing average code length.
Key properties of an optimal Huffman tree:
- More frequent symbols have shorter (or equal) codes.
- The two least frequent symbols have the same length.
- The two least frequent symbols differ only in the last bit.
The algorithm constructs a tree satisfying all three properties by construction.
One caveat: Huffman is optimal among prefix codes, but prefix codes are not always optimal encoders. Arithmetic coding (Chapter 3) can beat Huffman by encoding fractional bits per symbol — relevant when probabilities aren't nice powers of 1/2.
Canonical Huffman Codes
There's a problem with Huffman: the decompressor needs the codebook to decode. Sending the full tree with the compressed data adds overhead, and trees can be large.
Canonical Huffman solves this by defining a standard way to assign codewords given only the code lengths. The decompressor only needs to know the length of each symbol's codeword — not the codeword itself — to reconstruct the same canonical codebook.
The canonical assignment rule:
- Sort symbols by code length, then alphabetically within same length.
- Assign the numerically smallest codeword to the first symbol.
- Each subsequent symbol at the same length gets the next integer.
- When moving to a longer length, left-shift and continue.
From our ABRACADABRA example:
Sorted by length:
A: length 1
B: length 3
C: length 3
D: length 3
R: length 3
Canonical assignment (note the jump from length 1 straight to length 3):
Start code = 0, length = 1
A (length 1): code = 0b0 = "0"
Move to length 3: code = (0 + 1) << (3 - 1) = 1 << 2 = 4 = 0b100
B (length 3): code = 4 = "100"
C (length 3): code = 5 = "101"
D (length 3): code = 6 = "110"
R (length 3): code = 7 = "111"
Canonical codebook:
A → "0"
B → "100"
C → "101"
D → "110"
R → "111"
The decompressor only needs to store: the list of code lengths [1, 3, 3, 3, 3] (one per symbol, in sorted order). From this alone, it can reconstruct the entire codebook. This is how DEFLATE (the algorithm inside gzip and PNG) stores its Huffman tables.
def canonical_codebook(lengths: dict[str, int]) -> dict[str, str]:
    """Generate canonical Huffman codes from code lengths."""
    # Sort by length, then by symbol
    sorted_syms = sorted(lengths.items(), key=lambda x: (x[1], x[0]))
    codebook = {}
    code = 0
    prev_len = 0
    for symbol, length in sorted_syms:
        if length > prev_len:
            code <<= (length - prev_len)
        codebook[symbol] = format(code, f'0{length}b')
        code += 1
        prev_len = length
    return codebook
Where Huffman Is Used Today
Huffman coding is not usually used alone — it's a building block inside larger formats:
- DEFLATE (gzip, zlib, PNG, zip): Two Huffman trees, one for literals/lengths, one for distances.
- JPEG: DC coefficient differences and AC run-length/size symbols are Huffman coded.
- MP3: Quantized spectral values are Huffman coded from a set of predefined tables.
- HTTP/2 HPACK: A static Huffman table for header value compression.
Pure Huffman is rare in modern systems because it requires two passes (one to count frequencies, one to encode) and doesn't handle adaptive probability models well. But it remains foundational: almost every format that does entropy coding has Huffman somewhere inside it.
Huffman's Limitations
Two-pass requirement: You need to know all frequencies before building the tree, which means reading the data twice or buffering it all. Adaptive Huffman (not covered here) addresses this but is more complex.
Integer code lengths: Each symbol gets a whole number of bits. If a symbol has probability 0.4, it ideally gets 1.32 bits — but Huffman must assign it 1 or 2. This integer rounding is where arithmetic coding gains its advantage.
Block-based: Huffman builds one tree for the whole block. If the data's statistics change mid-block (common in real files), the single tree is suboptimal for parts of the data.
Despite these limitations, Huffman coding's simplicity, efficiency, and optimality guarantee have kept it relevant for over 70 years. It's one of those rare algorithms that was essentially perfect from day one.
Summary
- Huffman coding assigns shorter binary codewords to more frequent symbols.
- The algorithm builds a binary tree bottom-up using a min-priority queue.
- The resulting prefix-free code is provably optimal among all prefix codes.
- Canonical Huffman reduces decompressor overhead to just storing code lengths.
- Huffman is a building block inside DEFLATE, JPEG, MP3, and dozens of other formats.
Next, we'll look at arithmetic coding — which breaks the "integer bits" barrier and can approach Shannon's limit even more closely.
Chapter 3: Arithmetic Coding
Huffman coding has a fundamental limitation: it assigns a whole number of bits to each symbol. A symbol with probability 0.5 gets 1 bit. A symbol with probability 0.9 gets 1 bit too — even though the Shannon optimal encoding would use only 0.152 bits. This waste accumulates.
Arithmetic coding breaks free from this constraint. Instead of assigning a fixed codeword to each symbol, it encodes an entire message as a single number — a fraction between 0 and 1. The length of that fraction (in bits) approaches the entropy of the message, regardless of individual symbol probabilities.
The price: arithmetic coding is more complex to implement and slightly slower. The gain: it can encode at very close to Shannon's theoretical limit, even for symbols with probabilities that don't map cleanly to powers of 1/2.
The Core Idea: Interval Subdivision
Imagine the number line from 0 to 1. We'll encode a message by progressively narrowing down to a smaller and smaller interval.
Setup: Assign each symbol a subinterval of [0, 1) proportional to its probability.
For a two-symbol alphabet (A, B) where P(A) = 0.7 and P(B) = 0.3:
Initial interval [0, 1):
0.0 0.7 1.0
├──────────────────────┤────────────┤
│ A │ B │
│ [0.0, 0.7) │ [0.7,1.0) │
Encoding: For each symbol in the message, narrow the current interval to the subinterval corresponding to that symbol, scaled to fit within the current interval.
Let's encode the message "ABA":
Step 1: Encode 'A'
Current interval: [0.0, 1.0)
A's sub-interval: [0.0, 0.7) of the current interval
New interval: [0.0 + 0.0×1.0, 0.0 + 0.7×1.0) = [0.0, 0.7)
New interval [0.0, 0.7), subdivided for the next symbol:
0.0              0.49      0.7
├─────────────────┤─────────┤
│   A [0.0, 0.49) │ B [0.49, 0.7)
Step 2: Encode 'B'
Current interval: [0.0, 0.7)
Width: 0.7
B's sub-interval: [0.7, 1.0) of the current interval
New low = 0.0 + 0.7 × 0.7 = 0.49
New high = 0.0 + 1.0 × 0.7 = 0.70
New interval: [0.49, 0.70)
0.0    0.49           0.637  0.70   1.0
├───────┤───────────────┤──────┤─────┤
        │       A       │  B   │
        │ [0.49, 0.637) │[0.637, 0.70)
Step 3: Encode 'A'
Current interval: [0.49, 0.70)
Width: 0.21
A's sub-interval: [0.0, 0.7) of the current interval
New low = 0.49 + 0.0 × 0.21 = 0.490
New high = 0.49 + 0.7 × 0.21 = 0.637
New interval: [0.490, 0.637)
Final interval: [0.490, 0.637)
The output is any value in the final interval, written as a binary fraction. Shorter candidates do exist inside the interval (0.5, binary 0.1, is in range), but the decoder also has to know when the value is fully specified. A standard rule (from Shannon–Fano–Elias coding) is to emit ⌈log₂(1/width)⌉ + 1 bits; here the width is 0.147, so ⌈2.77⌉ + 1 = 4 bits.
We pick 0.5625 = binary 0.1001, which lies within [0.490, 0.637). Output: 1001 → 4 bits.
Compare to Huffman: if we assigned A=0, B=10, message "ABA" would be 0 10 0 = 4 bits. A tie here, but arithmetic coding generalizes much better for probabilities that aren't powers of 1/2, and its advantage grows with message length.
The Update Formula
For an interval [low, high) and a symbol with cumulative probability range [cdf_low, cdf_high):
new_low = low + (high - low) × cdf_low
new_high = low + (high - low) × cdf_high
The cdf (cumulative distribution function) values come from the probability model. If we have symbols A (prob 0.7) and B (prob 0.3):
Symbol Prob CDF_low CDF_high
A 0.70 0.00 0.70
B 0.30 0.70 1.00
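The update formula is easy to run directly. A floating-point sketch (fine for short messages, unusable for long ones, as the next section explains), using the two-symbol model from the example:

```python
# CDF ranges for P(A) = 0.7, P(B) = 0.3, as in the worked example
CDF = {'A': (0.0, 0.7), 'B': (0.7, 1.0)}

def encode_interval(message: str) -> tuple[float, float]:
    """Narrow [low, high) once per symbol using the update formula."""
    low, high = 0.0, 1.0
    for sym in message:
        cdf_low, cdf_high = CDF[sym]
        width = high - low
        low, high = low + width * cdf_low, low + width * cdf_high
    return low, high

low, high = encode_interval("ABA")
print(low, high)   # ≈ 0.49, ≈ 0.637 — matching the worked example
```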
Decoding
Decoding is the reverse: given a value v in [0, 1), repeatedly determine which symbol's interval contains v, output that symbol, and rescale v into the new interval.
Decoding 0.5625 ("0.1001" in binary):
Step 1: Which symbol does 0.5625 fall in?
A covers [0.0, 0.7) — yes, 0.5625 < 0.7
Output: 'A'
Rescale: v = (0.5625 - 0.0) / 0.7 = 0.80357...
Step 2: 0.80357 falls in which symbol's range?
A covers [0.0, 0.7) — no, 0.80357 > 0.7
B covers [0.7, 1.0) — yes
Output: 'B'
Rescale: v = (0.80357 - 0.7) / 0.3 = 0.34524...
Step 3: 0.34524 falls in:
A covers [0.0, 0.7) — yes
Output: 'A'
(Stop when we've decoded the expected number of symbols)
Result: "ABA" ✓
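The same model drives decoding. Again a floating-point sketch; the symbol count is passed in explicitly, matching the "stop at the expected number of symbols" convention above:

```python
# Same CDF ranges as the encoder: P(A) = 0.7, P(B) = 0.3
CDF = {'A': (0.0, 0.7), 'B': (0.7, 1.0)}

def decode_value(v: float, n_symbols: int) -> str:
    """Repeatedly find which symbol's interval contains v, then rescale v into it."""
    out = []
    for _ in range(n_symbols):
        for sym, (lo, hi) in CDF.items():
            if lo <= v < hi:
                out.append(sym)
                v = (v - lo) / (hi - lo)   # rescale into the chosen sub-interval
                break
    return ''.join(out)

print(decode_value(0.5625, 3))   # ABA
```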
Implementation: Integer Arithmetic Coding
Working with floating-point numbers is numerically unstable for long messages (precision runs out quickly). Real implementations use fixed-precision integers and a renormalization trick to keep the interval values in a workable range.
The key insight: when the high and low values share a leading bit, that bit is "settled" and can be output. We then shift both values left by one bit to maintain precision.
PRECISION = 32
FULL = 1 << PRECISION # 2^32
HALF = FULL >> 1 # 2^31
QUARTER = FULL >> 2 # 2^30
def arithmetic_encode(symbols: list[int], model: list[int]) -> list[int]:
    """
    Encode symbols using integer arithmetic coding; returns a list of bits.
    model: list of cumulative frequencies (CDF), length = alphabet_size + 1
    model[i] = cumulative count up to (not including) symbol i
    model[-1] = total count
    """
    low = 0
    high = FULL
    pending_bits = 0  # bits deferred while the interval straddles the middle
    output_bits = []

    def emit_bit(bit):
        nonlocal pending_bits
        output_bits.append(bit)
        # Emit any pending opposite bits
        for _ in range(pending_bits):
            output_bits.append(1 - bit)
        pending_bits = 0

    total = model[-1]
    for sym in symbols:
        range_size = high - low
        high = low + (range_size * model[sym + 1]) // total
        low = low + (range_size * model[sym]) // total
        # Renormalize: emit settled bits, then rescale the interval
        while True:
            if high <= HALF:
                emit_bit(0)           # both in lower half: emit 0
            elif low >= HALF:
                low -= HALF           # both in upper half: emit 1
                high -= HALF
                emit_bit(1)
            elif low >= QUARTER and high <= 3 * QUARTER:
                low -= QUARTER        # straddling the middle: defer
                high -= QUARTER
                pending_bits += 1
            else:
                break
            low <<= 1
            high <<= 1
    # Flush enough bits to disambiguate the final interval
    pending_bits += 1
    if low < QUARTER:
        emit_bit(0)
    else:
        emit_bit(1)
    return output_bits
def build_model(text: str) -> tuple[list[int], dict[str, int], list[str]]:
    """Build a simple frequency model: cumulative counts plus symbol mapping."""
    from collections import Counter
    freq = Counter(text)
    alphabet = sorted(freq.keys())
    # Map characters to indices
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    counts = [freq[c] for c in alphabet]
    cdf = [0]
    for c in counts:
        cdf.append(cdf[-1] + c)
    return cdf, char_to_idx, alphabet
Adaptive Arithmetic Coding
The version above requires knowing the full probability model upfront — which means two passes over the data, or buffering the entire message. Adaptive arithmetic coding solves this by updating the model after each symbol.
The decompressor can mirror the encoder's updates exactly, because it sees the same symbols in the same order. This makes adaptive arithmetic coding a powerful tool for streaming compression.
Adaptive model update (simplified):
Initialize: all symbols have count = 1 (Laplace smoothing)
After encoding/decoding each symbol s:
freq[s] += 1
The probability model continuously refines itself to match
the actual distribution of the data.
Time →
┌──────────────────────────────────────────────┐
│ Early: uniform priors, poor compression │
│ After 100 symbols: model adapting │
│ After 1000 symbols: model closely matches │
│ true distribution │
└──────────────────────────────────────────────┘
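A minimal adaptive order-0 model looks like this (a sketch of the idea, not the LZMA machinery): counts start at 1, and both encoder and decoder bump the count after each symbol, so their CDFs never diverge.

```python
class AdaptiveModel:
    """Order-0 adaptive frequency model with Laplace smoothing (counts start at 1)."""

    def __init__(self, alphabet: list[str]):
        self.alphabet = alphabet
        self.counts = {s: 1 for s in alphabet}

    def prob(self, sym: str) -> float:
        """Current probability estimate for sym."""
        return self.counts[sym] / sum(self.counts.values())

    def update(self, sym: str):
        """Called after encoding/decoding each symbol, keeping both sides in lockstep."""
        self.counts[sym] += 1

m = AdaptiveModel(['A', 'B'])
print(m.prob('A'))     # 0.5 — uniform prior
for _ in range(8):
    m.update('A')      # the stream turns out to be A-heavy
print(m.prob('A'))     # 0.9 — the model has adapted
```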
Real-world adaptive coders (like those in LZMA and context-mixing compressors) use much more sophisticated context models — they condition the probability of the current symbol on the last N symbols, the current bit position, and other features. This is the heart of state-of-the-art compression.
Range Coding: The Practical Variant
Arithmetic coding as described works on the interval [0, 1) with arbitrary precision. Range coding, introduced by G. Nigel N. Martin in 1979, is an equivalent but practically cleaner implementation that works on integer ranges directly. It's faster and avoids some of the floating-point subtleties.
Range coding is what's used inside LZMA (7-zip, XZ), and is the standard in high-performance entropy coders.
The key difference: instead of maintaining [low, high) ⊂ [0, 1), maintain a range [low, low + range) where range is kept in a fixed-precision integer window by outputting bytes when it becomes too large.
Range coder state machine:
range = 0xFFFFFFFF (initial range, 32-bit)
low = 0
For each symbol:
1. Compute symbol's frequency band within current range
2. Update low and range accordingly
3. If range < BOTTOM_VALUE (e.g., 1<<24):
- Output high byte of low
- Shift low and range left by 8 bits (renormalize)
Probability Models: The Real Differentiator
The arithmetic coder itself is just a mechanism for converting probabilities to bits. What separates good compressors from great ones is the probability model — how accurately it predicts the next symbol.
Order-0 model: Probability of each symbol is its frequency in the file. Simple, fast, works well for uniform text.
Order-N model: Probability of the next symbol conditioned on the last N symbols. An order-4 model for English text uses context like "tion" to predict that the next character is probably a space, punctuation, or another vowel.
Prediction by Partial Matching (PPM): Uses the longest available context, falling back to shorter contexts when the long one hasn't been seen. PPM was state-of-the-art for general compression through the 1990s and still produces excellent compression ratios.
Context Mixing (CM): Combines predictions from multiple models using mixing weights. PAQ, the reference-grade maximum-compression tool, uses this approach and achieves remarkable compression ratios at enormous computational cost.
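The payoff of context is easy to measure directly. The sketch below (a pedagogical estimate, not a full coder) computes the average bits per character a perfect entropy coder would spend if each character were coded from the statistics of its n-character context — order-0 versus order-2 on a repetitive sample:

```python
from collections import Counter, defaultdict
from math import log2

def order_n_entropy(text: str, n: int) -> float:
    """Average bits/char when each char is coded from its n-char context's statistics."""
    contexts = defaultdict(Counter)
    for i in range(len(text) - n):
        contexts[text[i:i + n]][text[i + n]] += 1
    bits, chars = 0.0, 0
    for counts in contexts.values():
        total = sum(counts.values())
        for count in counts.values():
            bits += count * -log2(count / total)  # ideal code length for this symbol
            chars += count
    return bits / chars

sample = "the theory of the thing " * 200
print(order_n_entropy(sample, 0))  # order-0: symbol frequencies only
print(order_n_entropy(sample, 2))  # order-2: most contexts are nearly deterministic
```

On text like this, the order-2 estimate drops well below one bit per character — the same characters, just predicted from better context.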
Context model comparison on Canterbury Corpus (text):
Model Bits/char Decoder complexity
────────────────────────────────────────────────
Order-0 4.6 O(1)
Order-1 3.5 O(1)
Order-4 2.4 O(alphabet)
PPM-C 1.9 O(alphabet × context)
PAQ8 1.7 O(many models × context)
Shannon limit ~1.3 -
Where Arithmetic Coding Is Used
Arithmetic coding never became as ubiquitous as Huffman, partly because of patent issues in the 1990s (IBM and others held key patents), and partly because Huffman is simpler to implement and fast enough for most uses.
But arithmetic coding (specifically range coding) is central to:
- LZMA/7-Zip: Range coding for the entropy stage
- HEVC/H.265 video: CABAC (Context-Adaptive Binary Arithmetic Coding) for syntax element coding
- JPEG 2000: Binary arithmetic coding (the MQ-coder) replaces JPEG's Huffman stage
- WebM/VP9: Optimized boolean arithmetic coder
- Zstandard's ANS backend: A modern alternative (see below)
ANS: Asymmetric Numeral Systems
Asymmetric Numeral Systems (ANS), invented by Jarosław Duda (first described in 2009, with the refined variants published in 2013), is the modern successor to arithmetic coding for high-performance systems. It achieves the same near-optimal compression as arithmetic coding but is faster because it can be implemented without division operations.
ANS comes in two main flavors:
- rANS (range ANS): Processes symbols serially, similar to arithmetic coding
- tANS (tabled ANS): Uses precomputed tables for fast lookup-based coding
Zstandard, Facebook's widely-deployed compression library, uses tANS (called FSE — Finite State Entropy — in its implementation). It encodes at speeds comparable to Huffman while approaching arithmetic coding's compression ratio.
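The core rANS transform is small enough to show in full. This toy version (a sketch, not a production coder) keeps the entire state in one big Python integer; real implementations renormalize the state into a 32- or 64-bit word and stream bytes out as they go:

```python
from collections import Counter

def _tables(freqs):
    """Cumulative frequency table. (The total M need not be a power of two here.)"""
    cdf, acc = {}, 0
    for s in sorted(freqs):
        cdf[s] = acc
        acc += freqs[s]
    return cdf, acc  # acc == M, the sum of all frequencies

def rans_encode(msg, freqs):
    """Toy rANS: push symbols onto one integer state, in reverse order."""
    cdf, M = _tables(freqs)
    x = M  # initial state
    for s in reversed(msg):
        x = (x // freqs[s]) * M + cdf[s] + (x % freqs[s])
    return x

def rans_decode(x, n, freqs):
    """Pop n symbols back off the state; they come out in original order."""
    cdf, M = _tables(freqs)
    out = []
    for _ in range(n):
        slot = x % M
        s = next(t for t in cdf if cdf[t] <= slot < cdf[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cdf[s]
    return ''.join(out)

msg = "ABRACADABRA"
freqs = dict(Counter(msg))
code = rans_encode(msg, freqs)
assert rans_decode(code, len(msg), freqs) == msg
```

Note the asymmetry that gives ANS its name: encoding divides by the symbol's frequency and decoding multiplies by it, so frequent symbols grow the state slowly — they cost few bits — while rare symbols grow it quickly.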
Rough comparison of entropy coder performance:
Coder Speed (encode) Compression ratio Complexity
──────────────────────────────────────────────────────────────────
Huffman Very fast Good Low
Arithmetic Moderate Excellent High
Range coding Moderate Excellent High
rANS Fast Excellent Medium
tANS/FSE Very fast Excellent Medium
ANS is why Zstandard can be both fast and compress well — it uses tANS where DEFLATE uses Huffman, and that's a meaningful part of why Zstd beats gzip at comparable speeds.
Summary
- Arithmetic coding encodes an entire message as a single number in [0, 1).
- Each symbol narrows the interval proportionally to its probability.
- It can encode at arbitrarily close to Shannon's limit, including fractional bits.
- Adaptive variants update the model on-the-fly — no two-pass requirement.
- Range coding is the practical integer implementation used in real compressors.
- ANS (especially tANS/FSE) is the modern high-speed alternative, used in Zstandard.
- The probability model determines compression quality; the entropy coder just converts probabilities to bits efficiently.
In the next chapter, we shift from statistical coding to dictionary methods — a completely different approach that replaces repeated patterns with compact references rather than adjusting code lengths.
Chapter 4: Dictionary Methods — LZ77, LZ78, LZW
In 1977 and 1978, Abraham Lempel and Jacob Ziv published two papers that fundamentally changed compression. Their insight was different from Huffman's: instead of assigning shorter codes to frequent symbols, replace repeated strings with references to where they appeared before.
This is the dictionary approach: build a dictionary of previously seen data and encode new data as references into that dictionary. It's the foundation of zip, gzip, zlib, PNG, WebP, and dozens more formats.
Why Dictionary Methods Work
Text, code, and many binary formats are full of repetition:
- Words and phrases repeat in text
- Function names repeat in source code
- Packet headers repeat in network captures
- UI element descriptions repeat in HTML
Statistical coders (Huffman, arithmetic) can only exploit symbol frequency — if 'e' is common, give it a short code. But they can't directly exploit the pattern "the " appearing 10,000 times in a document. Dictionary methods can: encode "the " as a single token referencing its first occurrence.
Example: "abcdeabcde"
Statistical coder sees: 10 mostly-equal-frequency symbols.
Not much to gain; each symbol might get ~3 bits.
Total: ~30 bits.
Dictionary coder sees:
"abcde" → encode literally (5 symbols)
"abcde" → reference to 5 characters back, length 5
Output: literal "abcde", then reference (offset=5, length=5)
Potentially much shorter!
LZ77: The Sliding Window
LZ77 (1977) uses a sliding window: a buffer of recently seen data that serves as the dictionary. When encoding, the algorithm looks backward in the window for the longest match to the current position.
LZ77 window structure:
◄──── Search buffer (past data) ────►◄── Lookahead buffer ──►
│ already encoded / "dictionary" │ data to encode next │
└───────────────────────────────────┴────────────────────────┘
typically 32KB or 64KB typically 258 bytes
Current position is the boundary between the two.
Encoding a position: Find the longest string in the search buffer that matches the beginning of the lookahead buffer. Output a triple:
(offset, length, next_char)
offset: how far back the match starts
length: how long the match is
next_char: the first character after the match (that didn't match)
If no match found, output (0, 0, current_char) — a literal.
Step-by-Step Example
Encoding "AABCBBABC" with a window size of 8:
Position 0: Lookahead = "AABCBBABC"
Search buffer is empty. No match possible.
Output: (0, 0, 'A') [literal A]
Position 1: Lookahead = "ABCBBABC"
Search buffer: "A"
Match: 'A' at offset 1, length 1
Next char after match: 'B'
Output: (1, 1, 'B') [go back 1, copy 1 char, then 'B']
Position 3: Lookahead = "CBBABC"
Search buffer: "AAB"
No match for 'C' in buffer.
Output: (0, 0, 'C') [literal C]
Position 4: Lookahead = "BBABC"
Search buffer: "AABC"
'B' found at offset 2 (two positions back).
Match length 1. Next char: 'B'
Output: (2, 1, 'B')
Position 6: Lookahead = "ABC"
Search buffer: "AABCBB"
'A' at various positions. Let's check for "AB":
"AB" found starting at offset 5 (the "AB" in "AABC").
Then 'C' follows — does the match extend to "ABC"? Yes!
Output: (5, 3, '') — but we're at the end, so maybe (5, 3)
(In practice, a special end-of-stream marker is used)
Full encoded sequence: (0,0,'A'), (1,1,'B'), (0,0,'C'), (2,1,'B'), (5,3,...)
Without accounting for encoding overhead, we encoded 9 characters using 5 tokens — some savings, more dramatic on longer repeated sequences.
Finding Matches Efficiently
Naively, finding the longest match requires scanning the entire search buffer for each position — O(window_size × lookahead_size) per position, which is very slow for a 32KB window.
Real LZ77 implementations use hash tables to find candidate match positions in O(1), then verify and extend. DEFLATE uses a hash on the first 3 characters of the lookahead to find candidates.
def lz77_encode(data: bytes, window_size: int = 255, max_match: int = 255):
    """Simple LZ77 encoder (pedagogical, not optimized)."""
    i = 0
    tokens = []
    while i < len(data):
        best_offset = 0
        best_length = 0
        # Search backward in the window for the longest match
        search_start = max(0, i - window_size)
        for j in range(search_start, i):
            # How long is the match starting at j vs i?
            # Overlapping matches (length > i - j) work naturally here,
            # because data[j + length] always lies behind data[i + length].
            length = 0
            while (length < max_match
                   and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_length = length
                best_offset = i - j
        if best_length > 2:  # Only use a reference if it saves bits
            next_char = data[i + best_length] if i + best_length < len(data) else None
            tokens.append(('ref', best_offset, best_length, next_char))
            i += best_length + (1 if next_char is not None else 0)
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens
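The decoder side shows why LZ77 decompression is so fast — there is no searching at all, just appending literals and copying earlier output. A minimal decoder for the same token shapes, ('lit', byte) and ('ref', offset, length, next_byte_or_None):

```python
def lz77_decode(tokens):
    """Decode ('lit', byte) and ('ref', offset, length, next_byte_or_None) tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, offset, length, nxt = tok
            start = len(out) - offset
            for k in range(length):          # byte-by-byte copy handles
                out.append(out[start + k])   # overlapping matches correctly
            if nxt is not None:
                out.append(nxt)
    return bytes(out)

# The tokens from the "AABCBBABC" walkthrough above:
tokens = [('lit', ord('A')), ('ref', 1, 1, ord('B')), ('lit', ord('C')),
          ('ref', 2, 1, ord('B')), ('ref', 5, 3, None)]
print(lz77_decode(tokens))  # b'AABCBBABC'
```

The byte-by-byte copy matters: when offset is smaller than length, the match copies bytes it has itself just produced — that's how LZ77 encodes runs.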
LZ78: The Growing Dictionary
LZ78 (1978) takes a different approach: instead of a sliding window over past data, it builds an explicit dictionary of phrases encountered so far.
The dictionary grows with the data. When a new string is seen:
- Look it up in the dictionary.
- If found, extend the lookup by one more character.
- When the extension isn't found, output the longest found phrase's index plus the new character, and add the new phrase to the dictionary.
LZ78 encoding of "AABABCABABCABC":
Dictionary starts empty.
Char 'A': Not in dict. Output (0, 'A'). Add "A" → entry 1.
Dict: {1:"A"}
Char 'A': In dict (entry 1). Try to extend.
Char 'B': "AB" not in dict. Output (1, 'B'). Add "AB" → entry 2.
Dict: {1:"A", 2:"AB"}
Char 'A': Entry 1. Extend.
Char 'B': "AB" is entry 2. Extend.
Char 'C': "ABC" not in dict. Output (2, 'C'). Add "ABC" → entry 3.
Dict: {1:"A", 2:"AB", 3:"ABC"}
Char 'A': Entry 1. Extend.
Char 'B': "AB" is entry 2. Extend.
Char 'A': "ABA" not in dict. Output (2, 'A'). Add "ABA" → entry 4.
...
Output sequence: (0,'A'), (1,'B'), (2,'C'), (2,'A'), (0,'B'), (0,'C'), (3, end) — the remaining input "BCABC" parses as the new phrases "B" and "C" plus the existing phrase "ABC".
LZ78's advantage: the dictionary can capture patterns from anywhere in the past, not just the most recent window. Its disadvantage: the dictionary can grow unbounded (or must be reset), and it's harder to implement efficiently.
LZW: LZ78 Without the Literal
LZW (Lempel-Ziv-Welch, 1984) simplifies LZ78 by initializing the dictionary with all single characters, eliminating the need to transmit literals.
Key change: Output is just the dictionary index, never a raw character. Since all single characters are pre-loaded, every output is an index reference.
LZW became famous because it was used in:
- GIF (the primary reason GIF images exist)
- TIFF (optional compression)
- PDF (early versions)
- Unix compress utility
It was also famously patented by Unisys, which caused the GIF controversy of the 1990s and drove developers toward PNG (which uses DEFLATE instead).
LZW encoding of "ABABABAB":
Initialize dict: {'A':0, 'B':1, 'C':2, ..., 'Z':25}
(all 256 ASCII values for binary data)
Next code: 256
i=0: Read 'A'. Match so far: "A"
Read 'B'. "AB" not in dict.
Output: 0 (code for 'A')
Add "AB" → 256. Match = "B"
i=1: Match: "B". Read 'A'. "BA" not in dict.
Output: 1 (code for 'B')
Add "BA" → 257. Match = "A"
i=2: Match: "A". Read 'B'. "AB" is in dict! (code 256). Match = "AB"
Read 'A'. "ABA" not in dict.
Output: 256 (code for "AB")
Add "ABA" → 258. Match = "A"
i=3: Match: "A". Read 'B'. "AB" is in dict (256). Match = "AB"
Read 'A'. "ABA" is now in dict (258)! Match = "ABA"
Read 'B'. "ABAB" not in dict.
Output: 258 (code for "ABA")
Add "ABAB" → 259. Match = "B"
End of input: Output: 1 (code for "B")
Output: 0, 1, 256, 258, 1
The elegant trick in LZW: the decompressor can reconstruct the dictionary in sync with the encoder, because each dictionary entry is added exactly when the encoder adds it. The decompressor needs no dictionary transmitted separately.
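Both directions fit in a few lines. The sketch below (pedagogical; real implementations pack codes into variable-width bitfields and cap the dictionary) includes the one famous corner case: a code that arrives before the decoder has added it, which can only be the "KwKwK" pattern:

```python
def lzw_encode(data: bytes):
    """LZW: dictionary pre-loaded with all single bytes; output is codes only."""
    dict_ = {bytes([i]): i for i in range(256)}
    next_code = 256
    w, out = b"", []
    for ch in data:
        wc = w + bytes([ch])
        if wc in dict_:
            w = wc                      # keep extending the current match
        else:
            out.append(dict_[w])        # emit longest known phrase
            dict_[wc] = next_code       # add the extension
            next_code += 1
            w = bytes([ch])
    if w:
        out.append(dict_[w])
    return out

def lzw_decode(codes):
    """Rebuild the dictionary in sync with the encoder; no table is transmitted."""
    dict_ = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dict_[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        if code in dict_:
            entry = dict_[code]
        else:                           # KwKwK case: code not yet in our dict
            entry = w + w[:1]
        out += entry
        dict_[next_code] = w + entry[:1]  # same entry the encoder just added
        next_code += 1
        w = entry
    return bytes(out)

codes = lzw_encode(b"ABABABAB")
print(codes)                  # [65, 66, 256, 258, 66]
print(lzw_decode(codes))      # b'ABABABAB'
```

Note that code 258 is emitted by the encoder one step after it was created — the decoder receives it before adding it, and the `w + w[:1]` fallback reconstructs it.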
How ZIP Actually Works
ZIP files use DEFLATE compression (Chapter 5), but understanding how a zip container is structured illuminates how dictionary compression is packaged for real use.
ZIP file structure:
┌─────────────────────────────────────────────────────────────┐
│ Local file header + compressed data (for each file) │
│ ┌──────────────────┬──────────────────────────────────┐ │
│ │ Local File Header│ Compressed file data │ │
│ │ - Magic: PK\x03\x04 │ │
│ │ - Version needed │ │
│ │ - Compression method (0=store, 8=DEFLATE) │ │
│ │ - Last-modified time/date │ │
│ │ - CRC-32 │ │
│ │ - Compressed size │ │
│ │ - Uncompressed size │ │
│ │ - Filename length + filename │ │
│ └──────────────────┴──────────────────────────────────┘ │
│ ... (repeated for each file) │
│ │
│ Central Directory (one entry per file) │
│ - Same metadata as local headers │
│ - Offset to local header in archive │
│ - Extra: file comment, external attributes │
│ │
│ End of Central Directory Record │
│ - Offset and size of central directory │
│ - Archive comment │
└─────────────────────────────────────────────────────────────┘
Key design decisions in ZIP:
- Per-file compression: Each file is compressed independently. This allows random access to individual files without decompressing the whole archive.
- No archive-level dictionary: Files don't share a dictionary. This is why .tar.gz (archive everything into one stream first, then compress it) often beats .zip for collections of similar files — tar concatenates the files, so gzip can find cross-file patterns.
- Central directory at the end: Lets ZIP writers stream — append each file's header and data as they go, then write all the metadata in one place at the end — and lets readers list the archive quickly without decompressing any data.
import zipfile
import io

# Creating a ZIP in Python
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('hello.txt', 'Hello, world! ' * 1000)
    zf.writestr('data.csv', 'id,name,value\n' * 500)

# Reading it back
buf.seek(0)
with zipfile.ZipFile(buf, 'r') as zf:
    print(zf.namelist())                          # ['hello.txt', 'data.csv']
    print(zf.getinfo('hello.txt').compress_size)  # much smaller than original
Comparing the LZ Variants
Algorithm Dictionary Window Best for
──────────────────────────────────────────────────────────────────
LZ77 Sliding Recent data Streaming, simple decoder
LZ78 Growing All past Better adaptation over time
LZW Growing All past Simple implementation, patents
DEFLATE Sliding+Hash 32KB Universal (see Chapter 5)
LZMA Sliding+Markov Up to 4GB Maximum compression (Ch 6)
The LZ77 family (sliding window) is much more common in practice because:
- Bounded memory: a fixed-size window means predictable memory use
- Simple decoder: just copy bytes from the output buffer
- Better for streaming: doesn't require dictionary reset
The LZ78 family excels when data has patterns that are far apart, but the unbounded dictionary is a practical challenge.
The Asymmetry That Matters
One of the most important properties of LZ-based compression: decompression is much faster than compression.
Compression requires searching for matches — expensive. Decompression just follows back-references — basically a memcpy. This asymmetry is intentional and valuable: you compress once (or slowly in batch), you decompress many times (fast, on demand).
Typical LZ77-family speed asymmetry:
gzip compress: ~25 MB/s decompress: ~250 MB/s (10× asymmetry)
Zstd compress: ~400 MB/s decompress: ~1200 MB/s (3× asymmetry)
LZ4 compress: ~700 MB/s decompress: ~4000 MB/s (6× asymmetry)
LZ4 is interesting: it's designed so that decompression is essentially never the bottleneck — on most systems it decompresses faster than storage can deliver the compressed bytes. That is exactly the design goal for in-memory and database compression.
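You can see the asymmetry on your own machine with the standard library's zlib (exact timings vary by hardware, so treat the printed numbers as illustrative):

```python
import time
import zlib

data = b"2024-01-01 INFO request handled in 12ms\n" * 100_000  # ~4MB of log-like text

t0 = time.perf_counter()
compressed = zlib.compress(data, 6)          # match searching: the slow direction
t_compress = time.perf_counter() - t0

t0 = time.perf_counter()
restored = zlib.decompress(compressed)       # copying back-references: the fast one
t_decompress = time.perf_counter() - t0

assert restored == data
print(f"compress {t_compress:.3f}s, decompress {t_decompress:.3f}s")
```

On typical hardware the decompress time is several times smaller, and the gap widens at higher compression levels, which spend more time searching without making the decoder do any more work.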
Summary
- LZ77 uses a sliding window as a dictionary; references are (offset, length) pairs.
- LZ78 builds an explicit, growing dictionary of seen phrases.
- LZW pre-initializes the dictionary and eliminates explicit literals.
- ZIP uses DEFLATE (LZ77 + Huffman) per file, enabling random access at the cost of cross-file compression.
- The asymmetry between compression and decompression speed is a feature, not a bug.
The next chapter shows how modern compressors — DEFLATE, Brotli, and Zstandard — took the LZ77 core and built major improvements around it.
Chapter 5: Modern Lossless — Deflate, Brotli, Zstandard
The LZ family gave us the dictionary method. Huffman coding gave us the entropy stage. The insight of DEFLATE was: combine them. Use LZ77 to remove repetition, then Huffman to compress the residue. The result became the compression workhorse of the internet — used inside gzip, zlib, PNG, PDF, and zip files.
Brotli and Zstandard improved on DEFLATE with better probability modeling, larger dictionaries, and faster algorithms. Zstd in particular has become the default choice for new systems: it's both faster than gzip and compresses better.
DEFLATE: LZ77 + Huffman
DEFLATE was designed by Phil Katz in the early 1990s (for PKZIP) and later formalized in RFC 1951. It's not one algorithm — it's a carefully designed combination:
DEFLATE pipeline:
Raw data
│
▼
LZ77 matching
(find back-references in 32KB sliding window)
│
▼
Token stream: literals and (offset, length) pairs
│
▼
Huffman encoding
(two trees: one for literals/lengths, one for offsets)
│
▼
Compressed bitstream
The Two Huffman Trees
DEFLATE uses two Huffman trees simultaneously:
Literal/Length tree: Encodes 286 possible values.
- 0–255: literal byte values
- 256: end-of-block marker
- 257–285: match length codes (covering lengths 3–258; most codes carry extra bits that select the exact length within a range)
Distance tree: Encodes 30 possible distance codes (representing the back-reference offset). Short distances get short codes; long distances (near 32KB) get longer codes.
DEFLATE symbol types:
Literal 'A' (0x41):
┌──────────┐
│ 0x41 │ → Literal/Length Huffman code for 0x41
└──────────┘
Back-reference (offset=7, length=5):
┌──────────┬─────────────┐
│ Len code │ Dist code │
│ (259) │ (5) │
└──────────┴─────────────┘
Length code 259 means "length 5" (no extra bits)
Distance code 5 means "offset 7–8" (1 extra bit selects 7)
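The length half of this mapping comes straight from RFC 1951's table: eight codes with no extra bits, then groups of four codes with 1, 2, 3, 4, and 5 extra bits, plus a dedicated code for length 258. A small sketch can reconstruct it:

```python
def deflate_length_code(length):
    """Map a match length (3-258) to (code, extra_bits, extra_value), per RFC 1951."""
    assert 3 <= length <= 258
    if length == 258:
        return 285, 0, 0              # length 258 gets its own code
    table = []                        # (code, base_length, extra_bits)
    code, base = 257, 3
    for extra in range(6):
        for _ in range(8 if extra == 0 else 4):
            table.append((code, base, extra))
            code += 1
            base += 1 << extra        # each code covers 2^extra lengths
    for c, b, e in reversed(table):
        if length >= b:
            return c, e, length - b

print(deflate_length_code(5))    # (259, 0, 0)
print(deflate_length_code(12))   # (265, 1, 1): base 11, extra bit selects 12
```

The distance table works the same way: 30 codes whose extra bits grow from 0 up to 13 as offsets approach 32KB, so nearby (common) offsets cost fewer bits.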
Block Types
DEFLATE has three block types, and the compressor chooses per-block which to use:
- Type 0 (stored): No compression. Used when the input is already compressed or random.
- Type 1 (fixed Huffman): Uses predefined, hardcoded Huffman codes. Fast to encode, slightly suboptimal.
- Type 2 (dynamic Huffman): Stores custom Huffman trees at the start of each block, then uses them. Optimal but has header overhead.
gzip file structure:
┌─────────────────────────────────────────────────────┐
│ gzip header (10 bytes) │
│ - Magic: \x1f\x8b │
│ - Compression method: 8 (DEFLATE) │
│ - Flags, modification time, OS │
├─────────────────────────────────────────────────────┤
│ DEFLATE compressed data (multiple blocks) │
│ Block 1: [type][Huffman tables if type 2][data] │
│ Block 2: [type][Huffman tables if type 2][data] │
│ ... │
│ Final block: BFINAL bit set │
├─────────────────────────────────────────────────────┤
│ CRC-32 of original data (4 bytes) │
│ Original size mod 2^32 (4 bytes) │
└─────────────────────────────────────────────────────┘
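The header and trailer are easy to verify from Python, since the gzip module produces exactly this layout:

```python
import gzip
import struct
import zlib

payload = b"hello gzip " * 200
blob = gzip.compress(payload)

assert blob[:2] == b"\x1f\x8b"   # magic bytes
assert blob[2] == 8              # compression method: 8 = DEFLATE

# Trailer: CRC-32 of the original data, then original size mod 2^32
crc, isize = struct.unpack("<II", blob[-8:])
assert crc == zlib.crc32(payload)
assert isize == len(payload)
```

The size-mod-2^32 field is why naive tools misreport the uncompressed size of gzip files over 4GB — only the low 32 bits are stored.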
DEFLATE's Limitations
DEFLATE was designed in 1991 and it shows:
- 32KB window: Modern files are often megabytes or gigabytes. Patterns farther than 32KB apart can't be referenced.
- Huffman entropy coding: As we saw in Chapter 3, arithmetic coding and ANS can do better for skewed probabilities.
- Static per-block model: The Huffman code lengths are fixed for an entire block, so probabilities can't adapt symbol-by-symbol or condition on context the way arithmetic and ANS coders can.
- Sequential only: No parallelism; compressor must process the stream serially.
Despite these limitations, DEFLATE is everywhere. The decompressor is simple enough to implement in firmware, the format is standardized and stable, and it's good enough for most uses.
Brotli: Google's Web Compressor
Brotli was released by Google in 2013 and standardized in RFC 7932 (2016). It's designed primarily for compressing web content (HTML, CSS, JS), and it has a trick that DEFLATE doesn't: a built-in static dictionary.
The Static Dictionary
Brotli ships with a 120KB dictionary of common web strings. Things like "<!DOCTYPE html>", "Content-Type: ", "text/javascript", "function ", and thousands of common JavaScript, HTML, and CSS tokens.
References to this static dictionary can be made without the dictionary appearing in the compressed stream at all — the decompressor already has it. This gives Brotli a massive advantage on small web resources where a traditional compressor would waste most of its window learning the format.
Brotli static dictionary example:
Compressing: '<html lang="en">'
Without static dict:
Encode each character literally, or hope it appears earlier in the window.
With Brotli static dict:
'<html ' → static dict entry #1234 (immediate reference, no prior occurrence needed)
'lang="en"' → static dict entry #5678
'>' → literal
Much better compression for the first occurrence of common patterns.
Brotli's Other Improvements
Beyond the static dictionary, Brotli improves on DEFLATE with:
- Larger sliding window: Up to 16MB (vs 32KB for DEFLATE)
- Context modeling: Uses 64 context maps, conditioning symbol probabilities on the context (previous bytes, literal vs. length/distance)
- Distance ring buffer: Tracks the four most recent distances; referencing a recent distance costs fewer bits
- Better match finding: More sophisticated algorithms for finding long matches
Compression ratios on web content (lower bits/byte = better):
Format HTML JS CSS Average
────────────────────────────────────────────────────
Uncompressed (bytes) 100% 100% 100% 100%
gzip level 6 27% 32% 23% 27%
Brotli level 6 22% 26% 19% 22%
Brotli level 11 19% 23% 17% 20%
Brotli compresses better, but it's slower than gzip at comparable quality settings. This is acceptable for static assets that are compressed once and served many times. For dynamic content, gzip or Zstd are better choices.
Brotli's decompression speed is comparable to gzip — fast enough for real-time web serving.
Zstandard: The New General Purpose
Zstandard (zstd) was developed at Facebook and open-sourced in 2016. It's designed to be:
- Fast at both compression and decompression
- Competitive with gzip at comparable quality
- Better than gzip at higher quality settings
- A drop-in replacement for zlib in many contexts
Zstandard has become a default across the ecosystem: the Linux kernel (kernel image, btrfs, zram), Fedora's and Ubuntu's package formats, and Facebook's storage infrastructure all use it.
Zstd's Core Architecture
Zstandard encoding pipeline:
Raw data
│
├──► LZ77 matching (larger window than DEFLATE, configurable)
│
├──► Sequence encoding (literals + match tokens)
│
├──► Literal compression (Huffman entropy coding)
│
├──► Sequence compression (tANS/FSE for the literal-length, match-length, and offset codes)
│
▼
zstd frame (with optional frame header, checksums)
The key innovations in Zstd:
tANS (Finite State Entropy): Zstd uses tANS — its FSE implementation — for the sequence symbols, and Huffman for literal bytes. As discussed in Chapter 3, tANS approaches arithmetic coding quality at Huffman speed.
Training dictionaries: Zstd supports trained dictionaries — if you're compressing many similar small files (like JSON API responses), you can train a dictionary on a sample set. References to the dictionary dramatically improve compression of small inputs.
Configurable window size: DEFLATE is stuck at 32KB. Zstd can use windows up to 2GB, enabling better compression of large files with distant repetitions.
Parallel compression: zstd can split the input into blocks and compress them in parallel using multiple cores (with the --threads flag). Output is still a single stream, fully compatible with single-threaded decompression.
Zstd Compression Levels
Zstd's level system is broader than gzip's 1-9:
Zstd compression levels:
Level Speed (encode) Ratio Notes
──────────────────────────────────────────────────────
1 Very fast Moderate Competitive with LZ4
3 Fast Good Default; beats gzip speed
6 Moderate Good Near gzip -6 quality, faster
9 Moderate Better Better than gzip -9
12 Slow Best fast Starts using more memory
19 Very slow Excellent Near Brotli quality
22 Extremely slow Maximum (--ultra flag required)
Negative levels (e.g. -1, -5): faster than level 1, less compression.
Useful for real-time or log compression where CPU is the bottleneck.
# Python example using the zstandard library
import zstandard as zstd

data = b'example payload ' * 1000

# Basic compression
cctx = zstd.ZstdCompressor(level=3)
compressed = cctx.compress(data)

dctx = zstd.ZstdDecompressor()
original = dctx.decompress(compressed)

# Streaming (memory-efficient for large data)
cctx = zstd.ZstdCompressor(level=6)
with open('large_file.txt', 'rb') as fin:
    with open('large_file.zst', 'wb') as fout:
        cctx.copy_stream(fin, fout)

# With a trained dictionary (for many small files)
# Train on samples:
samples = [b'{"user":"alice","action":"login"}',
           b'{"user":"bob","action":"view"}',
           ...]  # many samples
dict_data = zstd.train_dictionary(8192, samples)  # 8KB dict

# Compress with dictionary:
cctx = zstd.ZstdCompressor(dict_data=dict_data, level=3)
compressed = cctx.compress(b'{"user":"carol","action":"login"}')
# Much smaller than without dictionary!
Why Zstd Won
Head-to-head: Zstd vs. gzip on a 1GB log file
Metric gzip -6 zstd -3 zstd -6
──────────────────────────────────────────────────────────
Compression time 28s 1.8s 4.2s
Compressed size 312MB 320MB 298MB
Decompression time 3.2s 0.9s 0.9s
Compression ratio 3.2× 3.1× 3.4×
Memory usage (compress) 256KB 8MB 16MB
Memory usage (decompress) 32KB 128KB 128KB
Zstd is ~15× faster to compress at comparable ratio, and ~3.5× faster to decompress. The cost is slightly higher memory usage — acceptable for almost every modern context.
For static web assets, Brotli at level 11 is better (higher compression ratio). For real-time or batch compression of dynamic data, Zstd is the choice.
Putting It Together: When to Use What
Decision tree for lossless compression:
Is compatibility with old tools required?
├─ Yes → gzip/DEFLATE (ubiquitous)
└─ No → continue
Is compression speed the primary constraint?
├─ Yes (real-time, < 100ms) → LZ4 or Zstd level 1
└─ No → continue
Is decompression speed critical?
├─ Very (database, in-memory) → LZ4 or Snappy
└─ No → continue
Is the data primarily web content?
├─ Yes, static assets → Brotli level 6-11
├─ Yes, dynamic content → Zstd level 3 or gzip
└─ No → continue
Is maximum compression ratio needed?
├─ Yes, time doesn't matter → LZMA/7-zip (Chapter 6)
└─ No → Zstd level 6-9 (sweet spot)
Under the Hood: The zlib API
Most code doesn't call DEFLATE directly — it calls zlib, the C library that wraps DEFLATE and exposes a streaming API. Understanding the zlib API helps when debugging compression issues:
#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

// Compress in one shot
int compress_buffer(const uint8_t *src, size_t src_len,
                    uint8_t *dst, size_t *dst_len) {
    z_stream strm = {0};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)  // default maps to level 6
        return Z_STREAM_ERROR;
    strm.next_in = (Bytef *)src;
    strm.avail_in = src_len;
    strm.next_out = dst;
    strm.avail_out = *dst_len;
    int ret = deflate(&strm, Z_FINISH);
    *dst_len = strm.total_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}
The streaming API (deflate/inflate called repeatedly) handles data larger than memory by processing in chunks. This is how nginx compresses HTTP responses on-the-fly without buffering the entire response body.
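The same chunk-at-a-time pattern is available from Python via zlib's compressobj. A minimal sketch of a streaming gzip compressor (the wbits value 16 + MAX_WBITS selects gzip framing rather than raw zlib):

```python
import gzip
import zlib

def gzip_stream(chunks, level=6):
    """Compress an iterable of byte chunks without buffering the whole input."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip framing
    for chunk in chunks:
        out = comp.compress(chunk)  # may return b'' while zlib buffers internally
        if out:
            yield out
    yield comp.flush()              # emit remaining buffered data + gzip trailer

body = [b"chunk-a" * 1000, b"chunk-b" * 1000]
blob = b"".join(gzip_stream(body))
restored = gzip.decompress(blob)
assert restored == b"".join(body)
```

Note that compress() is allowed to return nothing for a given chunk — zlib holds data back until it has enough to emit a block, which is why the flush() at the end is mandatory.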
Summary
- DEFLATE = LZ77 + Huffman, 32KB window. Universal, fast to decompress, shows its age.
- Brotli adds a static web dictionary and better entropy coding. Ideal for static web assets.
- Zstandard uses tANS entropy coding, configurable windows, and trained dictionaries. The modern default for general-purpose compression — faster than gzip, better compression at high levels.
- For most new systems: use Zstd. For web serving: serve Brotli for supporting clients, gzip for older ones.
Next up: LZMA and 7-Zip — what happens when you push lossless compression to its practical limit.
Chapter 6: LZMA and 7-Zip
LZMA (Lempel-Ziv-Markov chain Algorithm) is the compression algorithm behind 7-Zip's .7z format, and also used in .xz — the format that replaced .bz2 in most Linux distributions. It achieves compression ratios that gzip can't touch, at the cost of significantly more time and memory.
Understanding LZMA means understanding what happens when you take every technique we've discussed and combine them: large sliding windows, Markov chain probability models, and range coding for entropy. The result is a compressor that's slow by design — but for cold storage, software distribution, and archiving, slow-to-compress is fine.
The LZMA Architecture
LZMA is best understood as a three-layer system:
LZMA compression layers:
Raw data
│
▼
┌─────────────────────────────────┐
│ Layer 1: LZ77 Matching │
│ - Sliding window (up to 4GB) │
│ - Match finder (hash chains, │
│ binary trees, or BT4) │
│ - Output: literal bytes + │
│ (offset, length) pairs │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Layer 2: Markov Model │
│ - Conditions each bit on │
│ context (recent bits, │
│ match/literal state machine) │
│ - Hundreds of probability │
│ models, one per context │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Layer 3: Range Coding │
│ - Arithmetic-coding-equivalent │
│ - Each bit encoded using its │
│ model's probability estimate │
└─────────────────────────────────┘
│
▼
.7z / .xz compressed stream
Layer 1: The Large Sliding Window
DEFLATE uses a 32KB sliding window. Brotli goes up to 16MB. LZMA goes up to 4GB (configurable, with common settings at 4MB–64MB for a balance of memory and compression).
Why does window size matter so much?
Example: Compressing a 10MB C source code archive
Pattern " return NULL;" appears 847 times throughout the file.
The first occurrence is at byte 0, the last at byte 9,900,000.
DEFLATE (32KB window):
First occurrence: encode literally
Subsequent occurrences within 32KB: back-reference (cheap)
After moving 32KB past last reference: must re-encode (expensive!)
Result: can only reference the ~last 32,000 bytes.
LZMA (64MB window):
First occurrence: encode literally
ALL 847 subsequent occurrences: back-reference (cheap)
Even occurrences 9MB later can reference the first.
For software distributions — which typically contain many similar files, headers, and boilerplate — large windows dramatically improve compression.
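The window effect is directly observable with the standard library. This sketch plants two identical incompressible blocks 1MB apart — far outside DEFLATE's 32KB window but well inside LZMA's dictionary (sizes will vary slightly run to run because the data is random):

```python
import lzma
import os
import zlib

block = os.urandom(100_000)        # incompressible on its own
filler = os.urandom(1_000_000)     # 1MB separating the two copies
data = block + filler + block      # repeat is far outside DEFLATE's 32KB window

deflate_size = len(zlib.compress(data, 9))       # can't see the repeat at all
lzma_size = len(lzma.compress(data, preset=6))   # dictionary >> 1MB: repeat is nearly free

print(deflate_size, lzma_size)  # LZMA saves roughly the size of one block
```

DEFLATE's output stays around the full input size, while LZMA encodes the second block as a single long back-reference — the saving is almost exactly the 100KB block.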
Match Finders
Finding the best match in a large window requires sophisticated data structures. LZMA uses several:
Hash chains (fast, less optimal): Hash the first 2-3 characters of each potential match position; store chains of positions with the same hash.
Binary trees (BT3, BT4): For each hash bucket, maintain a binary tree sorted by the actual string content at each position. The 4 in BT4 means the first 4 bytes are hashed before tree lookup. This finds near-optimal matches but requires more memory and CPU.
Match finder complexity tradeoff:
Finder Memory Speed Match quality
─────────────────────────────────────────────
HC4 Medium Fast Good
BT2 High Slow Excellent
BT3 Higher Slower Excellent
BT4 Highest Slowest Excellent (best for text)
Layer 2: Markov Chain Models
This is where LZMA gets unusual. After the LZ77 stage produces a stream of tokens (literals and back-references), LZMA doesn't just Huffman-code them. It uses a rich set of probability models, one per context.
The insight: the probability of the next bit depends heavily on what state the encoder is in. A bit that's part of a match length is statistically different from a bit that's part of a literal. A bit that's the high bit of a match offset is different from a bit that's the low bit.
LZMA defines a finite state machine with states like:
- LIT: Last token was a literal
- MATCH: Last token was a match
- LONGREP: Last token was a "long repeat" match
- SHORTREP: Last token was a short repeat (single-byte repeat)
And within each state, each bit position within each field (literal byte, offset, length) has its own probability model.
LZMA state machine (simplified):
┌─────────────────────────────────────────┐
│ STATES │
│ │
│ LIT ──────► MATCH ──────► LONGREP │
│ ▲ │ ▲ │ ▲ │
│ │ ▼ │ ▼ │ │
│ └────── SHORTREP ◄──── LONGREP1 │
└─────────────────────────────────────────┘
Each state has its own probability tables for the next token type.
Each bit within each token field has its own probability model
conditioned on the current state + previously coded bits.
Total probability models: hundreds
This is why LZMA compresses so well — and why it's slow. The price of accurate probability models is maintaining and querying hundreds of them.
Layer 3: Range Coding
The probability estimates from the Markov models feed directly into a range coder (the arithmetic coding variant from Chapter 3). Each bit is encoded using its own probability, approaching the Shannon limit.
Range coding here operates at the bit level (binary arithmetic coding): each bit is either 0 or 1, and the model gives the probability of 0. The range coder narrows its interval accordingly.
# Simplified LZMA range coder core (pedagogical)
class RangeCoder:
    RANGE_TOP = 1 << 24
    MODEL_BITS = 11
    MODEL_TOTAL = 1 << MODEL_BITS  # 2048

    def __init__(self):
        self.low = 0
        self.range = 0xFFFFFFFF
        self.output = []

    def encode_bit(self, prob: int, bit: int) -> int:
        """
        Encode one bit using probability model.
        prob: probability of bit=0, in range [0, 2048)
        """
        bound = (self.range >> self.MODEL_BITS) * prob
        if bit == 0:
            self.range = bound
            # Update model toward 0 (increase prob of 0)
            prob += (self.MODEL_TOTAL - prob) >> 5
        else:
            self.low += bound
            self.range -= bound
            # Update model toward 1 (decrease prob of 0)
            prob -= prob >> 5
        # Renormalize
        while self.range < self.RANGE_TOP:
            self.output.append((self.low >> 24) & 0xFF)
            self.low = (self.low << 8) & 0xFFFFFFFF
            self.range <<= 8
        return prob  # return updated probability for next call
The model update (prob += (MODEL_TOTAL - prob) >> 5) is an exponential moving average — recent bits influence the probability more than old ones. The >> 5 factor controls the adaptation rate: faster adaptation tracks local statistics, slower adaptation tracks global statistics.
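To see the adaptation rate in isolation, here is the update rule on its own (a standalone sketch, not part of the coder above): feeding it a run of identical bits drives the model toward certainty, with the shift-5 factor setting how fast.

```python
MODEL_TOTAL = 1 << 11  # 2048, as in the range coder

def update(prob: int, bit: int) -> int:
    """One LZMA-style probability update (shift-5 exponential moving average)."""
    if bit == 0:
        return prob + ((MODEL_TOTAL - prob) >> 5)
    return prob - (prob >> 5)

prob = MODEL_TOTAL // 2   # start undecided: p(0) = 0.5
for _ in range(50):       # observe fifty 0-bits in a row
    prob = update(prob, 0)
# prob has climbed well above 1024, approaching MODEL_TOTAL
```

Note that the integer shift means the probability never quite reaches 0 or 2048, so neither symbol ever becomes "impossible" for the range coder.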
The 7-Zip Container Format (.7z)
LZMA (and its enhanced version LZMA2) is the primary codec in the .7z container format.
.7z file structure:
┌─────────────────────────────────────────────────┐
│ Signature: '7z\xBC\xAF\x27\x1C' (6 bytes) │
│ Version + CRC of "Next Header Info" (12 bytes) │
├─────────────────────────────────────────────────┤
│ Compressed data streams │
│ - Optionally solid (all files in one stream) │
│ - Multiple codecs can be chained: │
│ e.g., BCJ (x86 filter) → LZMA2 │
├─────────────────────────────────────────────────┤
│ Header (compressed) │
│ - File list, sizes, timestamps, attributes │
│ - Offsets to each file's data in stream │
└─────────────────────────────────────────────────┘
Solid archives: A key 7-Zip feature is solid compression — all files are concatenated into one stream before compression. This allows the compressor to exploit similarities across files (e.g., many .c files sharing headers). Solid archives compress much better but require decompressing from the beginning to extract a single file.
Archive type comparison (100 similar text files, 10MB total):
Ratio Extract-single-file
──────────────────────────────────────────
zip (per-file) 40% Fast (seek to offset)
7z non-solid 42% Fast
7z solid 28% Slow (decompress from start)
For software distribution archives (read-once, extract-all), solid is clearly better. For random-access archives (extract individual files frequently), non-solid or per-file compression wins.
The BCJ Filter
Executable files (x86, ARM, etc.) are full of branch instructions — jumps and calls whose targets are stored as displacements relative to the instruction's own address. Two calls to the same function from different locations therefore store different byte patterns, which looks like noise to a compressor even though it has structure.
LZMA uses preprocessor filters (BCJ = Branch-Call-Jump filters) that convert these relative displacements into absolute target addresses before compression. Absolute addresses are much more redundant: calls to the same function become identical byte sequences, which the LZ77 stage can then match.
BCJ filter effect on x86 binary:
Without BCJ:
CALL instruction at offset 0x1000 targeting 0x5000:
Bytes: E8 FB 3F 00 00 (relative: 0x3FFB = 0x5000 - 0x1000 - 5)
CALL instruction at offset 0x1100 targeting 0x5000:
Bytes: E8 FB 3E 00 00 (relative: 0x3EFB = 0x5000 - 0x1100 - 5)
These two look different to the compressor.
With BCJ (convert to absolute):
Both become: E8 00 50 00 00 (stores absolute target 0x5000)
Now they're identical! Back-reference saves 4 bytes.
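To make the idea concrete, here is a toy encoder handling only the x86 E8 (CALL rel32) opcode; the real BCJ filter adds alignment and range heuristics to avoid rewriting bytes that merely look like call instructions.

```python
def bcj_x86_encode(code: bytes) -> bytearray:
    """Toy BCJ filter: rewrite rel32 CALL displacements (opcode E8)
    as absolute target addresses, in place."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32
            rel = int.from_bytes(out[i + 1:i + 5], 'little', signed=True)
            # Target = displacement + address of the *next* instruction
            absolute = (rel + i + 5) & 0xFFFFFFFF
            out[i + 1:i + 5] = absolute.to_bytes(4, 'little')
            i += 5
        else:
            i += 1
    return out
```

Running this on the two CALL sites from the example above rewrites both operand fields to the same four bytes (the absolute target 0x5000), turning them into matchable repeats.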
BCJ filters can reduce compressed executable size by 10-30% compared to raw LZMA.
XZ: LZMA2 for Unix
The .xz format packages LZMA2 (LZMA plus a chunking layer that adds flushing, dictionary resets, and uncompressed chunks, which enables multithreading and graceful handling of incompressible data) with a Unix-friendly wrapper:
.xz structure:
┌────────────────────────────────────────────┐
│ Stream header │
│ - Magic: \xFD7zXZ\x00 │
│ - Stream flags (check type: CRC32/CRC64) │
├────────────────────────────────────────────┤
│ Blocks (LZMA2 compressed data) │
│ Each block has its own compressed size │
│ and uncompressed size │
├────────────────────────────────────────────┤
│ Index (optional block index for seeking) │
├────────────────────────────────────────────┤
│ Stream footer │
│ - Size of index │
│ - Stream flags │
│ - Magic: YZ │
└────────────────────────────────────────────┘
XZ replaced bz2 as the standard Linux source distribution format because it compresses significantly better at comparable decompression cost (bzip2's 900KB maximum block size caps how much redundancy it can exploit on large files).
The Real Cost: Memory and Time
The reason LZMA isn't used everywhere is cost:
Compression resource requirements:
Memory (compress) Memory (decompress) Time
─────────────────────────────────────────────────────────────────
gzip -6 256KB 32KB 1×
Brotli-6 8MB 1MB 4×
Zstd-6 16MB 128KB 0.3×
xz -6 97MB 65MB 20×
xz -9 683MB 65MB 50×
xz at -9 needs 683MB to compress and 65MB to decompress. For a server decompressing a package install, 65MB is fine. For an embedded device — it might not have 65MB free.
The decompression memory requirement is the key constraint for LZMA in embedded contexts. This is why most Linux distributions ship packages compressed with xz (because the build server has plenty of memory and time), but the package manager's memory footprint during installation is bounded.
When LZMA/7-Zip Makes Sense
Use LZMA when:
✓ Cold storage or archival (compress once, rare access)
✓ Software distribution (compress once, download many times)
✓ Solid archives of similar files
✓ Target has adequate decompression memory (~64MB+)
✓ Compression time is not a constraint
✓ Maximizing compression ratio is the priority
Avoid LZMA when:
✗ Real-time or low-latency compression required
✗ Memory is severely constrained (< 64MB available)
✗ Data changes frequently (re-compression would be prohibitive)
✗ Random access to parts of the archive is needed
✗ Compression speed matters (CI/CD pipelines, logging, etc.)
Tuning LZMA
The main knobs:
Dictionary size (-md in 7z, --lzma2=dict= in xz): Bigger window = better compression, more memory. Common values: 4MB, 8MB, 16MB, 64MB.
Word size / fast bytes (-mfb in 7z, nice= in xz): Maximum match length the match finder will consider. Higher = better matches, slower.
Match cycles / depth (-mmc in 7z, depth= in xz): How hard to search for matches. Diminishing returns past a certain point.
xz preset breakdown:
Preset Dict Word Depth Memory
──────────────────────────────────────
-0 256KB 32 2 2MB
-1 1MB 32 3 4MB
-3 4MB 32 4 15MB
-6 8MB 64 4 97MB
-9 64MB 273 unlimited 683MB
Summary
- LZMA combines large sliding windows (up to 4GB), hundreds of Markov chain probability models, and range coding.
- The Markov models condition each bit's probability on the encoder's current state.
- 7-Zip uses solid archives (compressing all files as one stream) to exploit cross-file redundancy.
- BCJ filters convert absolute addresses in executables to relative ones, dramatically improving compression.
- XZ wraps LZMA2 for Unix with multi-stream support and block-level CRC integrity checks.
- The cost: decompression needs ~64MB; compression can need hundreds of MB. Slow by design.
For everyday use, Zstd is a better choice. For software distribution archives where compression ratio is paramount and compress-once-decompress-many economics apply, LZMA is hard to beat.
Next, we step back to some simpler methods — run-length encoding and its friends — and find out where they still shine.
Chapter 7: Run-Length Encoding and Simple Methods
Not every compression problem requires a sophisticated algorithm. Some data has structure so obvious that a simple encoder can compress it optimally. Run-length encoding is the oldest compression technique — simple enough to implement in an afternoon, yet still used in fax machines, printer languages, and as a preprocessing step in far more complex compressors.
This chapter covers the "simple" methods: RLE, delta encoding, and the Burrows-Wheeler Transform. None of these alone produces state-of-the-art compression, but they each illuminate an important principle, and BWT in particular is a beautiful rearrangement that enables bzip2's remarkable compression ratios.
Run-Length Encoding (RLE)
The idea is almost embarrassingly simple: instead of encoding a run of repeated values literally, encode the value once and the count.
RLE example:
Input: AAAAAABBBCCDDDDDDDDDD
        (6 A's, 3 B's, 2 C's, 10 D's)
Output: 6A 3B 2C 10D
Compression: 21 bytes → 8 bytes
The Encoding Ambiguity Problem
Naive RLE has an ambiguity problem: how do you tell runs from literals? If your encoded byte 6A means "6 A's", what does 1B mean — one B (literal), or a run of 1 B? And what if the data contains the digit '6'?
Different formats resolve this differently:
Fixed format (always a pair): Always output (count, value). Problem: single non-repeated bytes still cost 2 bytes to encode.
Input: ABCAAAA
With fixed (count, value):
Output: 1A 1B 1C 4A = 8 bytes (worse than 7-byte input!)
Run/literal flag: Use the high bit of the count byte to distinguish runs from literal sequences.
PackBits format (used in TIFF, Macintosh PICT):
Count byte n:
n = 0..127: copy next (n+1) bytes literally
n = -1..-127 (signed, or 0x81..0xFF): repeat next byte (-n+1) times
n = -128: no-op
Input: AAAAAABBBCCDDDDDDDDDD
As PackBits:
-5, A → repeat A 6 times
-2, B → repeat B 3 times
-1, C → repeat C 2 times
-9, D → repeat D 10 times
Output: 4 pairs = 8 bytes ✓
Escape-based: Use an escape character to signal a run. All non-escape bytes are literals.
def rle_encode(data: bytes, min_run: int = 3) -> bytes:
    """
    RLE encoder. A run is emitted as (escape 0x90, count, value).
    Runs of the escape byte itself are always escape-encoded, so a
    bare 0x90 in the output can only mean "run follows".
    """
    ESCAPE = 0x90
    result = bytearray()
    i = 0
    while i < len(data):
        run_char = data[i]
        run_len = 1
        while i + run_len < len(data) and data[i + run_len] == run_char:
            run_len += 1
        if run_len >= min_run or run_char == ESCAPE:
            # Encode as run(s) — max run length 255 per chunk
            remaining = run_len
            while remaining > 0:
                chunk = min(remaining, 255)
                if chunk == 1 and run_char != ESCAPE:
                    result.append(run_char)
                else:
                    result.append(ESCAPE)
                    result.append(chunk)
                    result.append(run_char)
                remaining -= chunk
        else:
            # Too short to be worth a run token: copy literally
            result.extend(data[i:i + run_len])
        i += run_len
    return bytes(result)
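A matching decoder for this escape convention (assuming the run form escape, count, value used above) inverts it in a dozen lines:

```python
def rle_decode(encoded: bytes) -> bytes:
    """Decode the (0x90, count, value) escape format."""
    ESCAPE = 0x90
    result = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            count, value = encoded[i + 1], encoded[i + 2]
            result.extend(bytes([value]) * count)
            i += 3
        else:
            result.append(encoded[i])  # literal byte
            i += 1
    return bytes(result)
```

Because the encoder always escape-encodes 0x90 itself, the decoder never has to guess whether a 0x90 is a literal.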
Where RLE Shines (and Doesn't)
RLE is optimal for highly repetitive data and terrible for random data:
Compression ratio by data type:
Data type RLE performance
─────────────────────────────────────────────────
Solid color regions Excellent (100:1 possible)
Black/white fax pages Excellent (used in G3/G4 fax)
Palette-based graphics Good (large solid areas)
Natural photographs Terrible (rarely adjacent pixels match)
Binary executable Terrible to neutral
Compressed data Terrible (already near-random)
Overhead for random data:
With a fixed (count, value) format, every isolated byte costs 2 bytes: up to 100% expansion. Escape-based schemes only pay overhead when the escape byte itself appears in the input.
Group 4 fax compression (the ITU-T standard for fax transmission) is essentially 2D RLE on black-and-white bit images, combined with a reference line. A typical fax page of text compresses from ~1MB to ~10KB — a 100:1 ratio — because pages of text are overwhelmingly white with small black runs for characters.
BMP and TGA image formats optionally use RLE for palette-based images where large areas share the same color.
Delta Encoding
Delta encoding doesn't compress repeated values — it compresses slowly changing values. Instead of storing absolute values, store the difference between consecutive values.
Delta encoding of time-series data:
Raw timestamps (Unix epoch, seconds):
1700000000
1700000060
1700000120
1700000180
...
Deltas:
1700000000 ← first value, store literally (4 bytes)
60 ← delta, fits in 1 byte!
60
60
...
Savings: 4 bytes per value → 1 byte per value after the first.
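In code, delta encoding and decoding are nearly one-liners (a minimal sketch; real columnar formats also pack the small deltas into variable-length integers):

```python
from itertools import accumulate

def delta_encode(values: list[int]) -> list[int]:
    """First value stored literally, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    """A running sum restores the original values."""
    return list(accumulate(deltas))

timestamps = [1700000000, 1700000060, 1700000120, 1700000180]
deltas = delta_encode(timestamps)  # [1700000000, 60, 60, 60]
assert delta_decode(deltas) == timestamps
```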
Delta encoding is the foundation of many domain-specific compressors:
Audio: CD audio is 16-bit samples at 44,100 Hz. Consecutive samples differ by small amounts (audio signals are smooth). Delta encoding turns these into small numbers that compress well with RLE or Huffman. FLAC uses a generalization: it fits a polynomial to blocks of samples and stores the residual (error).
Image compression: PNG uses delta encoding as a filter before compression:
- Sub filter: subtract the adjacent (left) pixel
- Up filter: subtract the pixel above
- Average filter: subtract the average of left and above
- Paeth filter: subtract the Paeth predictor (best linear predictor from left, above, upper-left)
PNG filter example (Sub filter):
Original row: 100 102 105 108 112 115
Filtered row: 100 2 3 3 4 3 (subtract previous pixel)
The filtered row has small values — easier to compress.
Columnar databases: Sorted timestamp or monotonically increasing ID columns compress extremely well with delta encoding + RLE.
Columnar delta + RLE example (user IDs in a sorted table):
User IDs (sorted): 1001, 1002, 1003, 1004, 1005, ...
After delta: 1001, 1, 1, 1, 1, ...
After RLE on the delta stream:
1001 (literal), then (count=N-1, value=1)
Compression: (N × 4 bytes) → 8 bytes for any sorted sequential range.
The Burrows-Wheeler Transform (BWT)
The Burrows-Wheeler Transform is remarkable: it's a reversible rearrangement of data that makes it much easier to compress. It doesn't compress anything by itself — but after a BWT, the data often compresses dramatically better with simple methods like RLE and Huffman.
BWT is the core of bzip2, which consistently beats gzip for text compression despite being simpler conceptually.
How BWT Works
Given input string S, the BWT:
- Forms all rotations of S (appended with a special end-of-string marker $)
- Sorts them lexicographically
- Outputs the last column of the sorted matrix
BWT of "banana$":
All rotations:
banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana
Sorted:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Last column (BWT output): annb$aa
The magic: the last column tends to have long runs of repeated characters, because rotations that sort together often differ only at the end — meaning the character before the current sorted context is often the same.
For "banana": after BWT, we get "annb$aa" — two runs of 2+ characters. Not impressive here, but on real text:
BWT effect on English text (first 100KB of War and Peace):
Before BWT: Character runs of length 1 dominate (~99%)
After BWT: Character runs of length 3+ become common (~40%)
BWT output example fragment:
...eeeeeeeeeeeeeeeeeeetttttttttttttt...
(long runs of common characters that follow similar contexts)
Inverting the BWT
The inversion is the elegant part. Given only the last column (and the position of the original string's end-of-string marker), you can reconstruct the full original string.
The key insight: you can reconstruct the first column by sorting the last column. And the first column plus last column together give you enough to "walk" the permutation:
def bwt_encode(s: str) -> tuple[str, int]:
    """Compute BWT. Returns (transformed, index of original)."""
    s = s + '\x00'  # Append null byte as end-of-string marker
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    bwt = ''.join(r[-1] for r in rotations)
    idx = rotations.index(s)  # row of the original string in the sorted matrix
    return bwt, idx

def bwt_decode(bwt: str, idx: int) -> str:
    """Invert BWT. Returns original string (without end-of-string marker)."""
    n = len(bwt)
    # Sorting the last column yields the first column; the attached
    # indices record where each character sits in the last column.
    table = sorted((char, i) for i, char in enumerate(bwt))
    # Walk the permutation, starting from the original string's row
    row = idx
    result = []
    for _ in range(n):
        char, row = table[row]
        result.append(char)
    return ''.join(result)[:-1]  # strip the null marker
The Full bzip2 Pipeline
bzip2 applies BWT as part of a pipeline:
bzip2 compression pipeline:
Raw data
│
▼
Run-Length Encoding (pass 1)
- Pre-compress runs of 4+ identical bytes
- Prevents the BWT from sorting very long runs badly
│
▼
Burrows-Wheeler Transform
- Sort all rotations, output last column
- Groups similar contexts → long runs of same character
│
▼
Move-to-Front Transform
- Convert characters to ranks in an MRU (Most Recently Used) list
- Recently seen characters get small indices (0, 1, 2, ...)
- Makes output dominated by small integers
│
▼
Run-Length Encoding (pass 2)
- Compress the runs of 0s that MTF produces
│
▼
Huffman Coding
- Entropy code the MTF+RLE output
- bzip2 uses multiple Huffman tables, switching between them
│
▼
.bz2 file
Move-to-Front (MTF)
MTF is the step that makes BWT+Huffman work well together. It converts characters into indices of an MRU list:
MTF encoding of "aaabbbcccaaa":
Initial list: [a, b, c, d, e, ...] (all possible symbols)
Symbol 'a': index 0. Output 0. Move 'a' to front: [a,b,c,...]
Symbol 'a': index 0. Output 0. List unchanged.
Symbol 'a': index 0. Output 0.
Symbol 'b': index 1. Output 1. Move 'b' to front: [b,a,c,...]
Symbol 'b': index 0. Output 0. (b is now at front)
Symbol 'b': index 0. Output 0.
Symbol 'c': index 2. Output 2. Move 'c' to front: [c,b,a,...]
...
MTF output: 0, 0, 0, 1, 0, 0, 2, 0, 0, 2, 0, 0
The BWT output has long character runs.
MTF converts those runs to long runs of 0s.
RLE compresses the 0s.
Huffman codes the remaining non-zero values.
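The MTF walk above, as runnable code (alphabet restricted to lowercase letters for the example):

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def mtf_encode(data: str, alphabet: str = ALPHABET) -> list[int]:
    """Emit each symbol's index in a most-recently-used list,
    then move that symbol to the front."""
    mru = list(alphabet)
    out = []
    for ch in data:
        i = mru.index(ch)
        out.append(i)
        mru.insert(0, mru.pop(i))  # move to front
    return out

def mtf_decode(indices: list[int], alphabet: str = ALPHABET) -> str:
    """Inverse: index into the MRU list, then move to front."""
    mru = list(alphabet)
    out = []
    for i in indices:
        ch = mru.pop(i)
        out.append(ch)
        mru.insert(0, ch)
    return ''.join(out)
```

Running the encoder on the chapter's "aaabbbcccaaa" reproduces the 0-dominated output shown above.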
bzip2 in Practice
bzip2 vs. gzip on common file types:
File type gzip -6 bzip2 xz -6
────────────────────────────────────────────
English text 28% 22% 18%
C source code 22% 18% 15%
HTML files 25% 21% 17%
Binary executable 47% 40% 32%
JPEG (already compressed) 100% 100% 100%
bzip2 is typically 10-15% smaller than gzip on text, but compresses and decompresses more slowly. xz (LZMA) is typically another 15-20% smaller again.
bzip2 also has a useful property: its output consists of independent 900KB blocks, each with its own Huffman tables. This means:
- Parallel decompression is straightforward (process blocks independently)
- Corruption only destroys affected blocks, not the whole archive
- Random access is possible at block boundaries (with a block index)
When Simple Methods Still Win
It's tempting to think that simple methods have been made obsolete by sophisticated ones. Not so:
Sensor data: IoT temperature sensors produce slowly-changing values. Delta encoding (1-2 bytes per reading) + Huffman easily beats generic lossless compression and is implementable on a microcontroller.
Screen recordings / VNC: Consecutive frames differ only in regions of mouse activity. A simple delta between frames (XOR, then RLE), followed by Huffman, achieves 50-100:1 compression with negligible CPU.
Database WAL logs: Write-ahead logs are sequences of delta operations (insert row, update field). Delta encoding the operations themselves, not the data, can achieve 10:1 compression.
Palette images: 8-bit-per-pixel images with large solid areas compress better with RLE than JPEG or even PNG deflate.
The general rule: if you can characterize the structure of your specific data, a targeted simple method beats a general-purpose complex one.
Summary
- RLE encodes runs of repeated values as (count, value). Excellent for highly repetitive data, terrible for random data.
- Delta encoding stores differences between consecutive values. Excellent for slowly-changing sequences (timestamps, sensor data, sorted IDs).
- BWT rearranges data by sorting all rotations — a reversible transform that groups similar contexts and enables long runs of common characters.
- MTF converts character sequences into MRU indices, turning character runs into 0-runs that compress extremely well.
- bzip2 chains RLE → BWT → MTF → RLE → Huffman into a surprisingly effective pipeline.
- Simple methods often beat general-purpose compressors for domain-specific data with known structure.
Next, we enter the world of lossy compression with images — where the interesting question shifts from "can we reconstruct it exactly?" to "what can we safely throw away?"
Chapter 8: Image Compression
Images are where compression theory meets human perception in the most visible way. A JPEG artifact, a PNG that's larger than the original bitmap, a WebP that's visually indistinguishable from the original at a third the size — these are compression algorithms at work in a domain where "quality" is defined by what humans can see.
Image compression splits cleanly along the lossless/lossy divide. PNG and related formats preserve every pixel exactly. JPEG, WebP in lossy mode, and AVIF discard information your visual system is unlikely to notice. Understanding which approach to use — and why — requires understanding a bit about both the algorithms and the human visual system.
The Raw Numbers: Why Image Compression Matters
A 4000×3000 pixel photo, 24 bits per pixel (8 bits each R, G, B):
Uncompressed: 4000 × 3000 × 3 bytes = 36,000,000 bytes ≈ 36MB
PNG (lossless): ~18MB (2× compression on typical photos)
JPEG quality 90: ~4MB (9× compression)
JPEG quality 75: ~2MB (18× compression, still good quality)
AVIF quality 80: ~1MB (36× compression, excellent quality)
Modern smartphone photos are 10–20MB because they're stored as JPEG. Without compression, your photo library would be terabytes.
PNG: Lossless Compression for Images
PNG (Portable Network Graphics) was created in 1995 to replace GIF (which had LZW patent issues). It's lossless: decompressing a PNG gives you the exact original pixel values.
PNG's Filter Stage
PNG applies a filter to each row of pixels before compression. Filters exploit spatial correlation — adjacent pixels tend to be similar — by predicting each pixel from its neighbors and encoding the prediction error instead of the raw value.
PNG filter types (applied per row):
None: Px = raw value
Sub: Px = raw - left
Up: Px = raw - above
Average: Px = raw - floor((left + above) / 2)
Paeth: Px = raw - paeth_predictor(left, above, upper_left)
Paeth predictor (Alan W. Paeth, 1991):
a = left pixel, b = above pixel, c = upper-left pixel
p = a + b - c
pa = |p - a|, pb = |p - b|, pc = |p - c|
if pa ≤ pb and pa ≤ pc: return a
elif pb ≤ pc: return b
else: return c
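The pseudocode translates almost line-for-line into Python:

```python
def paeth_predictor(a: int, b: int, c: int) -> int:
    """PNG Paeth predictor. a = left, b = above, c = upper-left."""
    p = a + b - c                      # initial linear estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:          # tie-break order: left, above, upper-left
        return a
    if pb <= pc:
        return b
    return c
```

The tie-breaking order matters: encoder and decoder must agree on it exactly, which is why the spec pins it down.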
The PNG encoder typically tries all five filter types for each row and picks the one that produces the smallest values (easiest to compress). This adaptive filtering is one reason PNG compresses reasonably well despite using generic DEFLATE.
Filter example (a gradient row):
Raw pixels: 100, 102, 104, 106, 108, 110
Sub filter (subtract left):
100, 2, 2, 2, 2, 2
DEFLATE will easily compress the run of 2s.
Without filtering, DEFLATE has to match the gradually-changing values.
PNG Internals
PNG file structure:
┌─────────────────────────────────────────┐
│ PNG Signature: \x89PNG\r\n\x1a\n │
├─────────────────────────────────────────┤
│ IHDR chunk (13 bytes) │
│ - Width, height (4 bytes each) │
│ - Bit depth, color type │
│ - Compression method (always 0=DEFLATE)│
│ - Filter method, interlace method │
├─────────────────────────────────────────┤
│ Optional chunks: tEXt, gAMA, cHRM, etc.│
├─────────────────────────────────────────┤
│ IDAT chunk(s) — compressed image data │
│ - Filtered rows, DEFLATE compressed │
│ - May be split across multiple IDATs │
├─────────────────────────────────────────┤
│ IEND chunk — marks end of file │
└─────────────────────────────────────────┘
All PNG chunks have a 4-byte CRC. This makes PNG moderately resilient to corruption — a bad chunk can be detected (though not corrected).
Optimizing PNGs
Because PNG uses DEFLATE, the compressor's settings matter. Default PNG encoders often don't use the best DEFLATE settings. Tools like oxipng and pngcrush can reduce PNG file sizes by 20–50% without any quality loss by:
- Trying all filter combinations and picking the best per-row
- Using maximum DEFLATE settings
- Removing unnecessary metadata chunks (GPS, camera info)
# Optimize in place, removing unnecessary metadata
oxipng --opt 4 --strip all image.png
# Typical improvement
Before: 1.2MB
After: 800KB (33% smaller, identical pixels)
When to Use PNG
- Icons, logos, UI elements: Exact colors matter; flat regions with solid colors compress well.
- Screenshots: Text is sharp-edged; JPEG would produce visible ringing artifacts.
- Images with transparency: PNG supports alpha channels; JPEG does not.
- Workflow/intermediate images: When you'll process the image further, preserve all information.
- Synthetic images: Computer-generated graphics often have exact flat regions where PNG excels.
When not to use PNG: Photographs. A typical photo in PNG is 3–5× larger than a good JPEG at perceptually equivalent quality. The DEFLATE compression engine isn't designed for the statistical properties of photographic data.
JPEG: Lossy DCT Compression
JPEG (Joint Photographic Experts Group, 1992) is the format that stores almost every photograph in existence. It achieves dramatic compression by:
- Converting to a perceptually-motivated color space
- Applying the Discrete Cosine Transform (DCT) to 8×8 blocks
- Quantizing the DCT coefficients (this is where information is lost)
- Entropy coding the quantized coefficients
Step 1: Color Space Conversion (RGB → YCbCr)
The human visual system is much more sensitive to brightness (luminance) changes than to color (chrominance) changes. JPEG exploits this by separating luminance (Y) from two chrominance channels (Cb = blue-yellow, Cr = red-green).
RGB to YCbCr conversion:
Y = 0.299R + 0.587G + 0.114B
Cb = -0.168R - 0.331G + 0.500B + 128
Cr = 0.500R - 0.418G - 0.081B + 128
Why this matters:
- Y channel: human eye is highly sensitive → compress carefully
- Cb, Cr channels: human eye is less sensitive → compress aggressively
Chroma subsampling (4:2:0): Keep all Y values, but only keep
1 Cb and 1 Cr value per 2×2 block of pixels.
┌───┬───┬───┬───┐ Y: full resolution
│ Y │ Y │ Y │ Y │
├───┼───┼───┼───┤ Cb, Cr: half resolution in each direction
│ Y │ Y │ Y │ Y │ (one value per 4 pixels)
└───┴───┴───┴───┘
This halves chroma data before any further compression.
The result is usually invisible at normal viewing distances.
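As a sketch, the conversion with the rounded coefficients from above (real codecs carry a few more decimal places):

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """JPEG-style RGB -> YCbCr, coefficients rounded to three decimals."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168 * r - 0.331 * g + 0.500 * b + 128
    cr =  0.500 * r - 0.418 * g - 0.081 * b + 128
    return y, cb, cr

# Neutral colors (r == g == b) produce Cb and Cr near the midpoint 128,
# so grayscale images carry almost no chroma information at all.
```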
Step 2: The Discrete Cosine Transform
Each 8×8 pixel block is transformed using the DCT — related to the Fourier transform, but working with cosines only (which avoids the phase issues of the full Fourier transform for real-valued data).
The DCT converts 64 spatial values (pixel intensities) into 64 frequency coefficients. The top-left coefficient (DC) represents the average brightness of the block. The remaining coefficients (AC) represent progressively higher spatial frequencies.
An 8×8 pixel block (luminance values):
139 144 149 153 155 155 155 155
144 151 153 156 159 156 156 156
150 155 160 163 158 156 156 156
159 161 162 160 160 159 159 159
159 160 161 162 162 155 155 155
161 161 161 161 160 157 157 157
162 162 161 163 162 157 157 157
162 162 161 163 163 158 158 158
After DCT (simplified, showing the low-frequency dominance):
1260 -1 -12 -5 2 -2 0 0 ← DC + low freq
-23 -17 -6 -3 2 0 0 0
-11 -9 -2 2 0 1 0 0
-7 -2 0 1 0 0 0 0
2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 ← high freq
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Most energy is concentrated in the top-left (low-frequency) coefficients.
High-frequency coefficients are near zero for natural images.
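Where does a DC value like 1260 come from? A naive 2D DCT-II in the JPEG normalization makes it concrete; production codecs use fast factored algorithms, not this quadruple loop:

```python
import math

def dct2_8x8(block: list[list[float]]) -> list[list[float]]:
    """Naive 2D DCT-II over an 8x8 block (JPEG's normalization)."""
    def c(u: int) -> float:
        return 1 / math.sqrt(2) if u == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# A flat block puts all its energy in the DC slot: DC = 8 x average value,
# which is why a block averaging ~157 yields a DC near 1260.
```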
Step 3: Quantization (The Lossy Step)
Each DCT coefficient is divided by a quantization value from a quantization table, then rounded to the nearest integer. This is where information is permanently lost.
Quantization table (JPEG standard luminance, quality 50):
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
Each DCT coefficient is divided by the corresponding table value and rounded.
Large quantization values → coarse rounding → more information lost (smaller file).
Small quantization values → fine rounding → less information lost (larger file).
"Quality" in JPEG is just a scalar applied to this table:
Quality 90: divide table by ~3 (fine quantization)
Quality 50: use table as-is (moderate quantization)
Quality 10: multiply table by ~8 (coarse quantization)
The visual effect of quantization is visible as "blocking" — 8×8 pixel blocks with slightly wrong colors that become obvious at low quality settings.
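Quantization itself is one division per coefficient (a sketch; real encoders derive the scaled table from the quality setting, e.g. libjpeg scales by 5000/Q below quality 50 and by 200 - 2Q above):

```python
def quantize(coeffs: list[list[float]], qtable: list[list[int]]) -> list[list[int]]:
    """Divide each DCT coefficient by its table entry and round — the lossy step."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]

def dequantize(quantized: list[list[int]], qtable: list[list[int]]) -> list[list[int]]:
    """Decoder side: multiply back. The rounding error is gone for good."""
    return [[c * q for c, q in zip(crow, qrow)]
            for crow, qrow in zip(quantized, qtable)]
```

The round trip makes the loss visible: a coefficient of 1260 quantized by 16 comes back as 1264, and small coefficients quantized by large table entries come back as exactly zero.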
Step 4: Zigzag Scan and Entropy Coding
After quantization, the 64 coefficients are read out in a zigzag pattern that puts low-frequency (large) coefficients first and high-frequency (near-zero) coefficients last:
Zigzag scan order:
┌──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 1│ 5│ 6│14│15│27│28│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 4│ 7│13│16│26│29│42│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 3│ 8│12│17│25│30│41│43│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 9│11│18│24│31│40│44│53│
├──┼──┼──┼──┼──┼──┼──┼──┤
│10│19│23│32│39│45│52│54│
├──┼──┼──┼──┼──┼──┼──┼──┤
│20│22│33│38│46│51│55│60│
├──┼──┼──┼──┼──┼──┼──┼──┤
│21│34│37│47│50│56│59│61│
├──┼──┼──┼──┼──┼──┼──┼──┤
│35│36│48│49│57│58│62│63│
└──┴──┴──┴──┴──┴──┴──┴──┘
Reading zigzag order from quantized coefficients:
1260, -1, -23, -11, -17, -12, -5, -6, -9, -7, ... 0, 0, 0, 0, 0, 0, 0, 0
The long run of zeros at the end is RLE-encoded with a special
end-of-block (EOB) symbol. Then Huffman coding compresses everything.
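The zigzag table does not need to be hard-coded; it can be generated by sorting cells by anti-diagonal and alternating the traversal direction:

```python
def zigzag_order(n: int = 8) -> list[tuple[int, int]]:
    """(row, col) positions in JPEG zigzag scan order.

    Cells are sorted by anti-diagonal (row + col); odd diagonals are
    traversed top-to-bottom, even diagonals bottom-to-top.
    """
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda p: (p[0] + p[1],
                       p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )
```

The first few entries, (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), match the scan table above.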
JPEG Artifacts
The DCT-based approach produces characteristic artifacts:
- Blocking: 8×8 grid visible, especially at edges and in smooth gradients
- Ringing: Dark halos around sharp edges (Gibbs phenomenon from truncating high-frequency content)
- Mosquito noise: Temporal noise in areas of fine texture in videos
- Color bleeding: Chroma subsampling creates color fringing at high-contrast edges
Where you'll see JPEG artifacts:
Sharp text on colored background → heavy ringing
──────────────────────────────────────────────
│ HELLO │ ← Dark halos around the letters
──────────────────────────────────────────────
Flat blue sky → visible 8×8 banding
Fine diagonal lines → "staircase" pattern
JPEG is optimized for natural photographic content, not:
- Text or diagrams with sharp edges
- Flat synthetic colors
- Images with text overlaid on photos
WebP: VP8 Intra-Frame Coding
WebP was developed by Google (from the VP8 video codec) and released in 2010. It supports both lossless and lossy modes.
Lossy WebP uses VP8's intra-frame prediction with smaller blocks than JPEG (4×4 in addition to 16×16), better chroma handling, and an arithmetic coder instead of Huffman — getting roughly 25–35% better compression than JPEG at equivalent perceptual quality.
Lossless WebP is more interesting: it uses a completely different algorithm from PNG, with:
- Spatial prediction (like PNG filters but more sophisticated, including multiple predictors per block)
- Color transform (subtract color correlation between channels)
- Subtract green transform (subtract the G channel from R and B)
- Palette coding (for images with limited colors)
- LZ77 + Huffman on the resulting values
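As a flavor of how simple some of these transforms are, here is a per-pixel sketch of the subtract-green idea (the real WebP transform operates on whole image buffers, with optional per-block color transforms on top):

```python
def subtract_green(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Decorrelate channels: green predicts red and blue surprisingly well,
    so the residuals (r - g) and (b - g) cluster near zero."""
    return (r - g) % 256, g, (b - g) % 256

def add_green(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Inverse transform, applied on decode."""
    return (r + g) % 256, g, (b + g) % 256
```

For near-gray pixels the residual channels become tiny values that the downstream LZ77 + Huffman stage compresses far better than the raw R and B planes.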
WebP lossless vs. PNG on typical web images:
Image type PNG size WebP lossless Savings
──────────────────────────────────────────────────────────
Icon/logo 24KB 18KB 25%
Screenshot 450KB 320KB 29%
Illustration 1.2MB 820KB 32%
Photo (no alpha) 18MB (rare) 15MB 17%
WebP lossless is generally 26% smaller than equivalent PNG, with slightly slower encoding and comparable decoding speed.
AVIF: The Modern Challenger
AVIF (AV1 Image File Format) uses the AV1 video codec's intra-frame encoding to compress still images. Released in 2019 by the Alliance for Open Media (Apple, Google, Netflix, Amazon, et al.), it's the current state of the art for image compression.
What Makes AVIF Better
AVIF inherits AV1's modern encoding toolkit:
- Variable block sizes: 4×4 to 128×128 (vs JPEG's fixed 8×8)
- More sophisticated intra-prediction: 56 directional modes + DC and smooth predictors
- Better chroma subsampling: 4:4:4, 4:2:2, and 4:2:0, all handled natively
- Film grain synthesis: Rather than preserving noise in the bitstream, AVIF can store grain parameters and resynthesize natural-looking film grain on decode
- HDR and wide gamut: Native 10-bit and 12-bit support
- Alpha channel: Compressed as an auxiliary plane in the lossy bitstream, not a separate lossless channel
Compression comparison at equivalent quality (DSSIM ≈ 0.001):
Format File size Encoding time Decoding time
──────────────────────────────────────────────────────────
JPEG (MozJPEG) 100% 1× 1×
WebP lossy 72% 1.5× 1×
AVIF 45% 50× 0.5× (hardware decode)
(software encode is slow)
AVIF encodes at roughly half the size of JPEG for equivalent perceptual quality. The catch: encoding is currently slow in software (AV1 was designed with hardware in mind). Browser support arrived in 2021 (Chrome, Firefox), with Safari following in 2023.
AVIF Deployment in Practice
# Using Pillow to encode AVIF (requires a Pillow build with AVIF support,
# e.g. via the pillow-avif-plugin package)
from PIL import Image
img = Image.open('photo.jpg')
# AVIF with quality control
img.save('photo.avif', quality=80)
# quality=100: visually lossless, ~50% smaller than JPEG 95
# quality=80: good quality, ~50% size of JPEG 80
# quality=50: noticeable artifacts, very small
# For web serving: offer AVIF with JPEG fallback
# <picture>
# <source srcset="photo.avif" type="image/avif">
# <source srcset="photo.webp" type="image/webp">
# <img src="photo.jpg">
# </picture>
Format Selection Guide
Image format decision tree:
Does the image require transparency (alpha)?
├─ Yes, lossless alpha required → PNG (or WebP lossless)
├─ Yes, lossy acceptable → WebP lossy or AVIF
└─ No → continue
Is it a photograph or complex scene?
├─ No (icon, logo, diagram, screenshot) → PNG or WebP lossless
└─ Yes → continue
Is encoding speed critical?
├─ Yes (dynamic generation, on-the-fly) → JPEG or WebP lossy
└─ No (batch encode, CDN delivery) → AVIF
Is browser support a constraint?
├─ Must support IE/old browsers → JPEG/PNG only
├─ Modern browsers (2021+) → AVIF with WebP/JPEG fallback
└─ Latest browsers only → AVIF
File size priority?
├─ Maximum compression → AVIF
├─ Good compression, fast encode → WebP lossy
└─ Compatibility → JPEG quality 75-85
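The decision tree above can be encoded as a small function. This is a toy sketch; the parameter names and return strings are illustrative, not any real library's API.

```python
def pick_format(has_alpha: bool, lossless_alpha: bool, is_photo: bool,
                fast_encode: bool, legacy_browsers: bool) -> str:
    """Toy encoding of the image-format decision tree (illustrative only)."""
    if legacy_browsers:
        return "PNG" if has_alpha or not is_photo else "JPEG"
    if has_alpha and lossless_alpha:
        return "PNG or WebP lossless"
    if not is_photo:                     # icon, logo, diagram, screenshot
        return "PNG or WebP lossless"
    if fast_encode:                      # dynamic / on-the-fly generation
        return "JPEG or WebP lossy"
    return "AVIF (with WebP/JPEG fallback)"

# Batch-encoded photo for modern browsers:
print(pick_format(has_alpha=False, lossless_alpha=False, is_photo=True,
                  fast_encode=False, legacy_browsers=False))
# → AVIF (with WebP/JPEG fallback)
```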
The Quality-Size Curve
Every lossy format has a quality dial. Understanding how quality translates to visual fidelity helps set appropriate targets:
JPEG quality guidelines:
Quality 95+: Visually lossless for most humans. Use for source masters.
Typical size: 4-5× the AVIF equivalent.
Quality 85: High quality. Artifacts invisible at normal viewing distance.
Good for photos where quality is a priority.
Quality 75: Good quality. Slight artifacts visible at 1:1 pixel.
Standard web quality for photographs.
Quality 60: Moderate quality. Visible artifacts on close inspection.
OK for thumbnails, preview images.
Quality 40: Low quality. Obvious blocking on smooth gradients.
Only appropriate for very small sizes or previews.
Quality 20: Very low. Clear blocking, ringing, color errors.
Rarely appropriate.
Summary
- PNG uses predictive filtering (Paeth, Sub, etc.) + DEFLATE. Lossless, ideal for graphics with flat regions, transparency, and text. Too large for photographs.
- JPEG converts to YCbCr, applies DCT to 8×8 blocks, quantizes (lossily), then Huffman-codes. The standard for photos; characteristic blocking and ringing artifacts at low quality.
- WebP offers both lossless (26% better than PNG) and lossy (25% better than JPEG) modes. Good choice when broad support is available.
- AVIF uses AV1 intra-frame coding. ~50% smaller than JPEG at equivalent quality. Slow to encode in software; the future of web image compression.
- The human visual system cares more about luminance than chrominance — all lossy formats exploit this.
Next: audio compression, where the same DCT-based approach runs on time, and psychoacoustics determines what we can safely throw away.
Chapter 9: Audio Compression
Audio compression has a split personality that mirrors the lossless/lossy divide more starkly than almost any other domain. On one side: FLAC, where audiophiles insist on bit-perfect fidelity, and every sample is preserved exactly. On the other: MP3, which threw away half the bits and convinced the world it sounded fine.
Both are right, for different contexts. Understanding audio compression means understanding what the ear actually hears — and that turns out to be a surprisingly limited subset of what a microphone records.
The Physics of Audio Data
Digital audio is a sequence of samples, each representing the amplitude of a sound wave at a point in time. CD quality is:
- 44,100 samples per second (44.1 kHz)
- 16 bits per sample
- 2 channels (stereo)
That's 44100 × 16 × 2 = 1,411,200 bits per second, or about 10MB per minute uncompressed. A typical song is ~50MB of raw PCM data.
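The arithmetic is worth checking by hand:

```python
sample_rate = 44_100        # samples per second (44.1 kHz)
bits_per_sample = 16
channels = 2

bits_per_second = sample_rate * bits_per_sample * channels
assert bits_per_second == 1_411_200

bytes_per_minute = bits_per_second * 60 // 8
print(f"{bytes_per_minute / 1e6:.1f} MB per minute")  # → 10.6 MB per minute
```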
Audio signal visualization (one cycle of a 440 Hz sine wave, ~2.3ms):
Amplitude
+32767 ┤ ╭─╮
┤ ╭──╯ ╰──╮
┤ ╭╯ ╰╮
┤ ╭╯ ╰╮
0 ┤──────────╯──────────────╰──────────
┤ ╭╯ ╰╮
┤ ╭╯ ╰╮
┤ ╭─╯ ╰─╮
-32768 ┤ ╭─╯ ╰─╮
└──────────────────────────────────────→
         0                                  2.3ms
         (44.1 samples per ms → ~100 samples per cycle)
FLAC: Lossless Audio Compression
FLAC (Free Lossless Audio Codec, 2001) is the standard for lossless audio. It compresses CD-quality audio to roughly 50-60% of the original size without losing a single sample. You can decompress FLAC and compare it byte-by-byte to the original PCM — they're identical.
FLAC's Prediction + Residual Architecture
Audio signals are predictable. Given a few past samples, you can estimate the current sample quite accurately. FLAC exploits this through linear prediction:
Linear prediction:
Given past samples: x[n-1], x[n-2], ..., x[n-p]
Predict current: x̂[n] = a₁x[n-1] + a₂x[n-2] + ... + aₚx[n-p]
The residual (prediction error):
e[n] = x[n] - x̂[n]
For audio signals, residuals are much smaller than samples.
They compress much better.
Example: gradual sine wave
Samples: 1000, 1050, 1095, 1134, 1165, ...
Predicted: 1000, 1050, 1095, 1133, 1163, ... (from coefficients)
Residuals: 0, 0, 0, 1, 2, ... (tiny errors)
FLAC uses Linear Predictive Coding (LPC) with:
- Up to 32 predictor coefficients
- Coefficients quantized to integers (for exact arithmetic)
- Multiple sub-frame types (constant, verbatim, fixed predictor, LPC)
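A minimal sketch of the core idea, using FLAC's order-2 fixed predictor (real FLAC fits LPC coefficients per block; this only shows why residuals are small):

```python
import math

# A smooth "audio-like" signal: a 440 Hz tone sampled at 44.1 kHz, 16-bit.
samples = [round(20000 * math.sin(2 * math.pi * 440 * n / 44100))
           for n in range(256)]

# FLAC's order-2 fixed predictor: x̂[n] = 2·x[n-1] − x[n-2]
residuals = [samples[n] - (2 * samples[n - 1] - samples[n - 2])
             for n in range(2, len(samples))]

print(max(abs(s) for s in samples))    # samples span roughly ±20000
print(max(abs(r) for r in residuals))  # residuals are two orders of magnitude smaller
```

Encoding the residuals instead of the samples is a free win: the signal's information content is unchanged, but the numbers are now tiny and tightly clustered around zero.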
Rice Coding for Residuals
Residuals cluster around zero — they're much more likely to be small than large. FLAC compresses them using Rice coding, a variant of Golomb coding:
Rice coding with parameter k:
Every integer n is encoded as:
quotient q = floor(n / 2^k) → encoded in unary (q zeros, then a 1)
remainder r = n mod 2^k → encoded in k bits (binary)
FLAC first maps signed residuals to non-negative integers (zigzag):
    n ≥ 0 → 2n        n < 0 → -2n - 1
With k=0 the code is pure unary; small numbers get short codes:
     0 → "1"
    -1 → "01"
     1 → "001"
    -2 → "0001"
     2 → "00001"
Larger k (e.g. k=3) works better when residuals are larger.
FLAC chooses k adaptively per block to minimize bits used.
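A toy Rice coder over bit strings makes the scheme concrete (a sketch for clarity; real FLAC packs bits rather than building strings):

```python
def rice_encode(n: int, k: int) -> str:
    """Rice-code one signed integer as a bit string."""
    m = 2 * n if n >= 0 else -2 * n - 1          # zigzag map: signed → non-negative
    q, r = m >> k, m & ((1 << k) - 1)
    rem = format(r, "b").zfill(k) if k else ""   # k-bit binary remainder
    return "0" * q + "1" + rem                   # unary quotient: q zeros, then a 1

def rice_decode(bits: str, k: int) -> int:
    q = bits.index("1")                          # count leading zeros
    r = int(bits[q + 1:q + 1 + k] or "0", 2)
    m = (q << k) | r
    return m // 2 if m % 2 == 0 else -(m + 1) // 2   # undo the zigzag map

print(rice_encode(0, 0), rice_encode(-1, 0), rice_encode(1, 0))  # → 1 01 001

# Round-trips for a range of residuals and parameters:
for n in range(-200, 201):
    for k in (0, 2, 4):
        assert rice_decode(rice_encode(n, k), k) == n
```

Note the trade-off the parameter controls: small k is cheap for tiny residuals but pays one unary bit per step of magnitude, so noisier blocks want a larger k.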
FLAC Frame Structure
FLAC file:
┌───────────────────────────────────────────┐
│ fLaC marker (4 bytes: 0x66 0x4C 0x61 0x43)│
├───────────────────────────────────────────┤
│ Metadata blocks: │
│ STREAMINFO (mandatory, always first) │
│ - Sample rate, channels, bit depth │
│ - Total samples, MD5 of uncompressed data│
│ SEEKTABLE (optional, enables fast seek) │
│ VORBIS_COMMENT (tags: artist, title...) │
│ PICTURE (embedded album art) │
├───────────────────────────────────────────┤
│ Audio frames │
│ Frame header: sample count, channels, │
│ blocking strategy │
│ Subframes (one per channel): │
│ Type: CONSTANT, VERBATIM, FIXED, LPC │
│ LPC coefficients (if LPC type) │
│ Rice-coded residuals │
│ Frame footer: CRC-16 │
└───────────────────────────────────────────┘
Typical FLAC compression ratios:
Genre Ratio
─────────────────────────────────
Classical (sparse) 45% of original size
Speech 48%
Rock/pop 52%
Electronic/dance 55%
Already-lossy re-encode 90%+ (can't reclaim what's gone)
FLAC is also streamable — each frame is independently decodable. This enables gapless playback, random seeking, and robust error handling.
MP3: Psychoacoustic Compression
MP3 (MPEG-1 Audio Layer 3, 1993) achieved something remarkable: it convinced millions of people that 128kbps audio sounded "good enough" — a 90% reduction from CD quality. It did this by understanding what the human auditory system actually processes.
The Psychoacoustic Model
Human hearing has several exploitable limitations:
Frequency masking: A loud tone at one frequency masks quieter tones at nearby frequencies. If a 1 kHz tone is playing at 80dB, you can't hear a 1.1 kHz tone at 60dB — the louder one masks it.
Frequency masking:
 Sound
 level (dB)
    │            ╭──╮
    │            │  │ ← loud tone at 1 kHz (audible)
    │          ╭─╯  ╰─╮
    │        ╭─╯      ╰──╮      ← masking threshold, raised
    │      ╭─╯           ╰──╮     around the loud tone
    │    ╭─╯    ┌──┐        ╰──╮
    │ ───╯      │  │ ← quiet tone at 1.1 kHz: below the
    │           └──┘   threshold, so it is inaudible; it
    │                  would need to rise above the curve
    │                  to be heard
    └───────────────────────────────────→ frequency
Temporal masking: A loud sound masks quieter sounds that occur slightly before and after it. The ear needs time to recover from loud stimuli.
- Pre-masking: ~5ms before the loud sound
- Post-masking: ~100ms after the loud sound
Absolute threshold of hearing: Some frequencies can't be heard at low volumes regardless (humans can't hear 20kHz at 20dB SPL, for example).
Critical bands: The cochlea divides the frequency spectrum into roughly 24 overlapping bands. Masking effects are strongest within the same critical band.
MP3 Encoding Pipeline
MP3 encoding pipeline:
PCM input (44100 Hz, 16-bit stereo)
│
▼
Filter bank (32 sub-bands) AND psychoacoustic model run in parallel:
┌────────────────────┐ ┌──────────────────────────────────┐
│ Polyphase filter │ │ FFT-based psychoacoustic analysis │
│ bank (32 sub-bands)│ │ - Compute masking thresholds │
│ │ │ - Determine allowed noise per band│
└────────────────────┘ └──────────────────────────────────┘
│ │
└─────────────┬───────────┘
▼
MDCT (Modified Discrete Cosine Transform)
- 576 frequency coefficients per granule
- Better frequency resolution than polyphase alone
│
▼
Quantization
- Each coefficient divided by step size
- Step sizes chosen so quantization noise stays below masking threshold
- Bit allocation: louder/more-masked bands get fewer bits
│
▼
Huffman coding
- Quantized coefficients coded using Huffman tables
- Quad-point, pair, and singleton encoders for different value ranges
│
▼
Bit reservoir management
- Smooth out bit usage across frames
- Complex sections use bits saved from simple sections
│
▼
MP3 frame output (each ~26ms of audio)
The MDCT: Frequency Analysis Over Time
The Modified Discrete Cosine Transform is a variant of the DCT designed for audio:
- Processes overlapping blocks (half-overlap to avoid blocking artifacts)
- Converts 1152 time-domain samples to 576 frequency coefficients
- "Long blocks" for steady-state content, "short blocks" for transients
MDCT overlapping blocks:
Block N: ├────────────────────┤
Block N+1: ├────────────────────┤
Block N+2: ├────────────────────┤
50% overlap
This overlap-add approach avoids blocking artifacts at the
boundaries of analysis windows.
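The MDCT and its inverse are compact enough to implement directly. The sketch below (pure Python with a sine window; real codecs use fast algorithms and switch window shapes) demonstrates the time-domain aliasing cancellation property: each block's output is aliased, yet the overlap-add of adjacent blocks reconstructs the signal exactly.

```python
import math
import random

def mdct(block, N):
    """2N windowed time samples → N frequency coefficients."""
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [sum(w[n] * block[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, N):
    """N coefficients → 2N windowed samples, ready for overlap-add."""
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [w[n] * (2 / N) * sum(coeffs[k]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

N = 32
random.seed(1)
x = [random.uniform(-1, 1) for _ in range(4 * N)]

# Three half-overlapped analysis blocks; overlap-add reconstructs the middle.
outs = [imdct(mdct(x[i:i + 2 * N], N), N) for i in (0, N, 2 * N)]
recon = [outs[0][N + i] + outs[1][i] for i in range(N)]   # samples x[N:2N]

err = max(abs(a - b) for a, b in zip(recon, x[N:2 * N]))
print(err)  # ≈ 0: the aliasing introduced by each block cancels on overlap-add
```

This is why the MDCT is critically sampled despite the 50% overlap: 2N input samples per block yield only N coefficients, and no information is lost.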
MP3 Quality at Different Bitrates
MP3 bitrate guide:
64 kbps: Acceptable for speech, AM-radio quality audio
Ratio: ~22:1 vs CD
128 kbps: "Transparent" for most listeners on most content
Ratio: ~11:1 vs CD
Long considered the minimum for music
192 kbps: High quality; differences from original audible only
on extreme material (cymbals, sustained strings)
Ratio: ~7:1 vs CD
256 kbps: Very high quality; audibly transparent on almost
all content for most listeners
320 kbps: Maximum MP3 bitrate; effectively transparent
Ratio: ~4:1 vs CD
VBR V0: Variable bitrate, targeting ~245 kbps average;
better quality than 256 kbps CBR at similar size
The "128kbps is transparent" belief is now considered outdated — controlled blind tests show many listeners can distinguish 128kbps from lossless on quality headphones. 192kbps or VBR V2 (~190 kbps) is the current pragmatic minimum for music.
AAC: The MP3 Successor
Advanced Audio Coding (AAC, 1997) was developed as the successor to MP3. It achieves better audio quality than MP3 at the same bitrate, or equal quality at lower bitrates.
Key improvements over MP3:
- Longer MDCT windows: 2048 coefficients (vs 576), better frequency resolution
- More flexible Huffman coding: AAC selects spectral codebooks per scalefactor band, a finer granularity than MP3's three coarse regions
- Temporal noise shaping (TNS): Shapes quantization noise in time, reducing pre-echo artifacts
- Perceptual noise substitution: For bands where noise is fully masked, replace the signal with synthesized noise (cheaper to encode)
- Better stereo handling: Mid/Side stereo coding, intensity stereo
Quality comparison (bitrate for transparent quality):
Format Bitrate for transparency
──────────────────────────────────
MP3 ~192 kbps
AAC ~128 kbps (AAC-LC, "Low Complexity" profile)
AAC-HE ~64 kbps (High Efficiency AAC, adds SBR)
AAC-HEv2 ~48 kbps (adds PS: Parametric Stereo)
Opus ~96 kbps (see below)
AAC is the standard for Apple Music, YouTube, streaming services, smartphones (AAC over Bluetooth), and much digital TV broadcasting (where Dolby's AC-3 is the main competing format).
Spectral Band Replication (HE-AAC)
AAC-HE adds Spectral Band Replication (SBR) — a clever trick for encoding at very low bitrates:
- Encode only the low-frequency half of the spectrum at full quality
- Encode a compact "envelope" describing the high-frequency energy distribution
- On decode, extrapolate high frequencies from the low-frequency signal using the envelope
At 48kbps, HE-AAC produces reasonable quality by letting the decoder reconstruct the "feel" of the high frequencies without transmitting them explicitly.
Opus: The Modern All-Purpose Codec
Opus (2012, standardized in RFC 6716) is the newest audio codec and in many ways the most impressive. It was designed for internet use — specifically WebRTC and streaming — and it outperforms every other audio codec across the bitrate range.
Codec performance by use case:
Low bitrate (8–24 kbps): Voice, narrow-band
Winner: Opus (uses SILK mode, speech-optimized)
Runner-up: AMR-WB
Medium bitrate (32–96 kbps): Music, wideband voice
Winner: Opus (uses CELT mode, MDCT-based)
Runner-up: AAC-HE
High bitrate (96+ kbps): High-quality music
Winner: Opus / AAC-LC (comparable)
Lossless: FLAC
For voice/teleconference: Always Opus
For streaming music: AAC-LC or Opus
For downloaded music: FLAC (lossless) or AAC-LC (lossy)
Opus works by combining two underlying codecs:
- SILK: Originally developed by Skype for voice, optimized for speech signals
- CELT: A high-quality audio codec based on MDCT with perceptual coding
Opus dynamically switches between them based on content type, and can also use a hybrid mode. This makes it uniquely suited for mixed voice-and-music content (like a podcast with intro music).
# Encoding audio with Opus in Python (using opuslib)
import opuslib
# Create encoder: 48000 Hz, 2 channels, audio application
enc = opuslib.Encoder(48000, 2, opuslib.APPLICATION_AUDIO)
enc.bitrate = 128000 # 128 kbps
# Encode a 20ms frame (960 samples at 48kHz)
pcm_frame = get_pcm_samples(960) # your audio source
encoded = enc.encode(pcm_frame, 960)
# Decode
dec = opuslib.Decoder(48000, 2)
decoded = dec.decode(encoded, 960)
The Container vs. Codec Distinction
A common confusion: "MP3" names both a codec and its file format. "AAC," by contrast, is only a codec; it can be stored in several containers: .aac (raw ADTS stream), .m4a (MPEG-4 audio), or .mp4 (the general MPEG-4 container).
Container / Codec combinations:
Container Extension Common codecs
─────────────────────────────────────────────
MPEG-4 Audio .m4a AAC-LC, HE-AAC
Ogg .ogg Vorbis, Opus, FLAC
WebM .webm Vorbis, Opus
MPEG audio .mp3 MP3 only (format = codec)
FLAC .flac FLAC only
WAV .wav PCM (uncompressed), sometimes others
AIFF .aiff PCM (uncompressed)
The container matters for:
- Chapter markers and metadata
- Gapless playback (requires container support)
- Video synchronization
- DRM integration
What Happens to Transcoded Audio
A critical warning: transcoding between lossy formats (e.g., MP3 → AAC) does not improve quality. It makes it worse.
Transcoding degradation:
Original PCM → MP3 128kbps → re-encoded AAC 128kbps
Quality drop: step 1 step 2 (additional loss)
Total loss = loss1 + loss2 > loss1
The artifacts from MP3 become inputs to the AAC encoder,
which can't distinguish "this artifact was deliberately encoded"
from "this is the original signal" and may make different
(equally wrong) decisions about what to discard.
Always transcode from the highest-quality source available. For music production: record lossless, distribute lossy.
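Generation loss is easy to simulate. The toy model below stands in for two lossy codecs using two differently aligned uniform quantizers (a drastic simplification; real codecs quantize in the frequency domain), and shows the error growing across generations:

```python
import math
import random

random.seed(2)
signal = [random.uniform(-1.0, 1.0) for _ in range(5000)]

def lossy(xs, step, offset=0.0):
    """Toy 'codec': a uniform quantizer standing in for real lossy encoding."""
    return [round((x - offset) / step) * step + offset for x in xs]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

once = lossy(signal, 0.10)                # first lossy generation ("MP3")
twice = lossy(once, 0.10, offset=0.037)   # transcode on a misaligned grid ("AAC")

print(rms_error(signal, once))
print(rms_error(signal, twice))  # larger: the two generations' errors compound
```

The second encoder cannot undo the first one's rounding; it can only add its own, so fidelity is monotonically non-increasing across lossy generations.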
Summary
- FLAC uses LPC prediction + Rice coding for bit-perfect audio compression at ~50% of original size.
- MP3 uses psychoacoustic masking models to discard frequencies the human ear can't hear, achieving 10:1+ compression.
- AAC improves on MP3 with better MDCT windows, temporal noise shaping, and spectral band replication in HE-AAC variants.
- Opus is the modern all-purpose codec — better than everything else across all bitrates, especially for voice.
- Never transcode between lossy formats; always use a lossless source when re-encoding.
- Psychoacoustic models are the key to lossy audio: they define what can be safely discarded by modelling the limits of human perception.
Next: video compression, where everything from this chapter applies, plus an entirely new dimension — time.
Chapter 10: Video Compression
Video is where the stakes are highest and the algorithms are most complex. A single minute of uncompressed 1080p video at 30fps is about 6GB. Streaming services deliver that minute in roughly 50MB at good quality — a 120:1 compression ratio — in real time, to billions of devices.
Understanding video compression means understanding everything from the previous chapters, applied to a new dimension: time. Video has spatial redundancy (within a frame, like images) and temporal redundancy (between frames, since most of the scene doesn't move much between consecutive frames). The second kind is where the dramatic compression ratios come from.
The Scale of the Problem
Uncompressed video data rates:
Resolution FPS Bit depth Data rate
──────────────────────────────────────────────
720p 30 8-bit 664 Mbps
1080p 30 8-bit 1.5 Gbps
1080p 60 8-bit 3.0 Gbps
4K UHD 30 8-bit 6.0 Gbps
4K UHD 60 10-bit 14.9 Gbps
8K UHD 60 10-bit 59.7 Gbps
(assuming 24 bits/pixel for 8-bit RGB, 30 bits/pixel for 10-bit)
Target streaming bitrates (compressed):
1080p at good quality: 4–8 Mbps (H.264)
2–4 Mbps (H.265/HEVC)
1.5–3 Mbps (AV1)
4K at good quality: 15–25 Mbps (H.264, rarely used)
7–12 Mbps (H.265)
4–8 Mbps (AV1)
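These raw rates are straightforward to derive:

```python
def uncompressed_mbps(width, height, fps, bits_per_pixel=24):
    """Raw video data rate in megabits/s (24 bpp = 8-bit RGB or 4:4:4)."""
    return width * height * fps * bits_per_pixel / 1e6

print(f"{uncompressed_mbps(1920, 1080, 30):.0f} Mbps")      # → 1493 Mbps
print(f"{uncompressed_mbps(3840, 2160, 60, 30):.0f} Mbps")  # → 14930 Mbps (10-bit 4K60)
```

Dividing these figures by the target streaming bitrates above gives the required compression ratios: a few hundred to one, which is why temporal prediction is indispensable.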
Spatial Compression: Intra-Frame Coding
Each individual video frame can be compressed like an image — using DCT, quantization, and entropy coding. This is called intra-frame coding (I-frames in MPEG terminology).
Modern video codecs apply intra-prediction before the transform: they predict each block from its neighbors within the same frame, then transform and quantize the prediction error (residual).
Intra-prediction directions (H.264/H.265):
For each block, try predicting from:
┌─────────────────────────────────────────┐
│ ←─ left ─→ ↑─ above ─↑ ↖diagonal │
│ ↙ down-left ↘ down-right │
│ DC (flat, average of neighbors) │
│ + many more in H.265 (33 directions) │
│ + even more in AV1 (56 directions) │
└─────────────────────────────────────────┘
Pick the prediction that leaves the smallest residual.
Encode: prediction mode + residual.
This gives modern codecs a significant advantage over JPEG even on still images: JPEG has no spatial intra-prediction (only DPCM of each block's DC coefficient).
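Mode selection can be sketched in a few lines. This is a toy with three modes and SAD-only cost; real encoders try dozens of directional modes and weigh bit cost as well as distortion.

```python
def best_intra_mode(block, left, above):
    """Pick the predictor (DC / left / above) leaving the smallest residual."""
    n = len(block)
    dc = sum(left + above) // (len(left) + len(above))
    candidates = {
        "DC":    [[dc] * n for _ in range(n)],
        "left":  [[left[r]] * n for r in range(n)],   # extend left column rightward
        "above": [above[:] for _ in range(n)],        # extend top row downward
    }
    def sad(pred):  # Sum of Absolute Differences
        return sum(abs(block[r][c] - pred[r][c])
                   for r in range(n) for c in range(n))
    return min(candidates, key=lambda m: sad(candidates[m]))

# A block of horizontal stripes is predicted perfectly from its left neighbors.
block = [[10] * 4, [50] * 4, [90] * 4, [130] * 4]
print(best_intra_mode(block, left=[10, 50, 90, 130], above=[70] * 4))  # → left
```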
Temporal Compression: Inter-Frame Coding
The big win. Between consecutive video frames, most pixels are the same or have just moved slightly. Instead of encoding each frame independently, encode only the difference from a previous frame.
Temporal redundancy visualization:
Frame N: Frame N+1 (person moves slightly right):
┌──────────────────────┐ ┌──────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
│░░░░░[PERSON]░░░░░░░░░│ │░░░░░░░[PERSON]░░░░░░░│
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────┘ └──────────────────────┘
Background: identical (background compression is essentially free)
Person: same pixels, shifted 2 pixels right
→ encode as "motion vector (2,0), copy region"
The difference frame: almost entirely zero.
Motion Estimation and Compensation
Motion estimation: Find where each block of the current frame came from in the reference frame. Motion compensation: Use the found match to predict the current block.
The encoder tries to find, for each block, a motion vector (dx, dy) such that the block in the reference frame at (x+dx, y+dy) is as similar as possible to the current block.
Motion estimation search pattern:
For block at current position (cx, cy),
search reference frame within search range:
Reference frame search window:
┌─────────────────────────────┐
│ . . . . . . . . . . . . │ Each '.' is a candidate
│ . . . . . . . . . . . . │ position to check.
│ . . . . ┌─────┐ . . . . │
│ . . . . │BLOCK│ . . . . │ Find the candidate with
│ . . . . └─────┘ . . . . │ smallest difference (SAD,
│ . . . . . . . . . . . . │ Sum of Absolute Differences)
│ . . . . . . . . . . . . │
└─────────────────────────────┘
Exhaustive search: O(search_range²) per block position → too slow
Fast search algorithms:
Three-step search: O(log(search_range)) candidates
Diamond search: starts large, zooms in
Hierarchical (multi-resolution): search at coarse scale first
After motion compensation, the residual (actual block minus predicted block) is typically small — it can be transformed, quantized, and encoded like an intra-frame block, but with fewer bits needed because the values are close to zero.
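A minimal exhaustive (full) search, the baseline that the fast algorithms approximate:

```python
def sad(frame, x, y, block, bs):
    """Sum of Absolute Differences between `block` and frame region at (x, y)."""
    return sum(abs(frame[y + r][x + c] - block[r][c])
               for r in range(bs) for c in range(bs))

def full_search(ref, block, cx, cy, search_range, bs):
    """Exhaustive block matching: try every candidate offset, keep the best SAD."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = cx + dx, cy + dy
            if 0 <= x <= len(ref[0]) - bs and 0 <= y <= len(ref) - bs:
                cost = sad(ref, x, y, block, bs)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best  # (SAD, dx, dy)

# Reference frame: a bright square on a dark background.
W = H = 16
ref = [[0] * W for _ in range(H)]
for r in range(4, 8):
    for c in range(4, 8):
        ref[r][c] = 200

# Current 4×4 block at (6, 6): the square has moved +2,+2 since the reference.
cur_block = [[ref[r - 2][c - 2] for c in range(6, 10)] for r in range(6, 10)]
print(full_search(ref, cur_block, 6, 6, 4, 4))  # → (0, -2, -2)
```

The motion vector (-2, -2) points back to where the block came from in the reference, and the SAD of 0 means the residual for this block is entirely free.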
Sub-pixel Motion
Real motion isn't always aligned to integer pixel boundaries. A block might have moved 2.5 pixels to the right. Modern codecs support sub-pixel motion vectors:
- Half-pixel precision: Intermediate values computed by bilinear interpolation of adjacent pixels
- Quarter-pixel precision: Bicubic or similar interpolation (H.264 and later)
- 1/8 pixel precision: Used in AV1
Better sub-pixel accuracy → better motion compensation → smaller residuals → better compression. Cost: more computation for both encoder and decoder.
Frame Types
Video streams use three types of frames:
GOP (Group of Pictures) structure:
I  B  B  P  B  B  P  B  B  P  B  B  I
│           │                       │
│           └─ P/B frames           └─ next I-frame (keyframe),
│              (inter-predicted)       start of the next GOP
│
└─ I-frame (Intra-coded, keyframe):
     Self-contained; decoder can start here.
     Largest frames (no inter-prediction).
P-frame (Predicted):
Predicted from ONE past reference frame.
Contains motion vectors + residuals.
Much smaller than I-frames.
B-frame (Bi-directional):
Predicted from BOTH past AND future frames.
Smallest of the three — uses best of both references.
Requires decoder to buffer future frames → adds latency.
I-frames (keyframes) are where seeking can land. Every I-frame is fully self-contained. Streaming services insert I-frames every few seconds to allow seeks and fast channel-changing.
P-frames reference one past frame. A decoder must have correctly decoded the reference frame to decode a P-frame.
B-frames reference frames in both directions. Better compression, but adds decode complexity and latency. Live streaming often avoids B-frames; stored video uses them extensively.
H.264 / AVC: The Universal Codec
H.264 (Advanced Video Coding, 2003) is the codec that made HD video streaming practical. It's in every smartphone, streaming service, browser, and camera. Even as newer codecs have emerged, H.264 remains dominant because it has 15+ years of hardware decoder support.
H.264 Key Features
Entropy coding: Two options:
- CAVLC (Context-Adaptive Variable-Length Coding): Simpler, faster, baseline profile
- CABAC (Context-Adaptive Binary Arithmetic Coding): Better compression (~10-15%), requires more computation, used in main/high profiles
Reference frames: Can reference up to 16 past frames (not just the immediately previous one), enabling better motion compensation for complex motion.
Deblocking filter: Applied after reconstruction to smooth blocking artifacts at transform-block boundaries. H.264's deblocking filter is in-loop: it runs before reference frames are stored, so future frames are predicted from already-deblocked references.
Profiles and levels: H.264 defines multiple profiles (Baseline, Main, High) with increasing feature sets, and levels controlling maximum resolution, bitrate, and buffer sizes.
H.264 profiles:
Baseline: No B-frames, CAVLC only. Low-power decoders.
Used for: video calling, older devices
Main: B-frames, CABAC. Better compression.
Used for: standard SD/HD video, some streaming
High: 8×8 intra DCT, adaptive quantization, CABAC.
Best compression. Used for: Blu-ray, streaming HD
High 10: 10-bit depth. HDR content.
High 4:2:2, High 4:4:4: Professional production.
H.264 Encoder Presets (x264)
The x264 encoder has presets from ultrafast to veryslow. The slower the preset, the more encoding time is spent looking for better compression, with diminishing returns.
x264 preset comparison (1080p at CRF 23):
Preset Encode speed File size PSNR
──────────────────────────────────────────────
ultrafast ~1500 fps Large Low
fast ~350 fps Medium Good
medium ~150 fps Medium Good
slow ~70 fps Small Better
veryslow ~20 fps Smaller Best
CRF (Constant Rate Factor) = quality target.
Lower CRF = higher quality, larger file.
CRF 23 = default; CRF 18 ≈ visually lossless.
H.265 / HEVC: Twice the Compression
H.265 (High Efficiency Video Coding, 2013) targets the same quality as H.264 at half the bitrate, or better quality at the same bitrate.
Key Differences from H.264
CTUs instead of macroblocks: H.264 uses 16×16 macroblocks. H.265 uses Coding Tree Units (CTUs) of up to 64×64 pixels, subdivided recursively using a quadtree structure.
CTU quadtree structure:
64×64 CTU:
┌──────────────────────────────────────┐
│ │ │CU│CU│ │
│ Single 32×32 CU │ 16×16 ├──┼──┤ │
│ (flat region) │ CU │CU│CU│ │
│ │ ├──┴──┤ │
│ ├─────────┤ 8×8│ │
│ │ 16×16 │ │ │
│ │ CU │ │ │
└──────────────────┴─────────┴─────┴───┘
Large blocks for uniform regions (efficient).
Small blocks for detailed/complex regions (accurate).
More intra prediction modes: 35 directions (vs 9 in H.264 high profile).
SAO (Sample Adaptive Offset): An additional in-loop filter that reduces banding and ringing artifacts using lookup tables.
Better motion compensation: Asymmetric partition shapes, more reference frames.
H.265 Licensing Problem
H.265 achieves excellent compression but has a complex patent licensing situation with multiple patent pools. This fragmentation discouraged browser support and pushed Google, Mozilla, Cisco, and others to form the Alliance for Open Media to develop AV1 as a royalty-free alternative.
AV1: The Open Codec
AV1 (2018) was designed by the Alliance for Open Media specifically to be:
- Royalty-free (no per-unit licensing fees)
- ~50% lower bitrate than H.264 at equivalent quality (comparable to H.265, sometimes better)
- Better than VP9 (its predecessor from Google)
AV1 Innovations
Larger, more flexible block partitions: Up to 128×128, with 10 different split patterns (including T-shaped and L-shaped).
More intra prediction modes: 56 directional angles, plus "smooth" predictors, "palette mode" for flat-colored regions, and "CfL" (Chroma from Luma).
Frame-level film grain synthesis: Instead of encoding film grain in each frame (expensive), AV1 can encode grain parameters and resynthesize plausible grain on decode. Saves bits, preserves perceptual "texture."
Loop Restoration filter: More advanced post-processing than H.265's SAO.
Constrained Directional Enhancement Filter (CDEF): Reduces ringing artifacts along directional edges.
Codec comparison at equivalent quality:
Bitrate for comparable quality (normalized to H.264):
H.264: 100% (baseline)
VP9: 65% (~35% savings)
H.265: 55% (~45% savings)
AV1: 50% (~50% savings)
At 4K/60fps content:
H.264 needs ~50 Mbps
H.265 needs ~25 Mbps
AV1 needs ~18 Mbps
Encoding speed (software, 1080p):
H.264 (x264 fast): ~600 fps
H.265 (x265 fast): ~150 fps
AV1 (libaom fast): ~8 fps ← Very slow software encoder
AV1 (SVT-AV1 fast): ~120 fps ← Production-grade fast encoder
AV1 hardware encoders (in Intel Arc, NVIDIA RTX 40xx, AMD RDNA3) bring encoding speed to real-time. Hardware decoders are in smartphones (since 2021), smart TVs, and recent Intel/AMD/NVIDIA graphics cards. Netflix, YouTube, and Twitch all use AV1.
The Complete Video Compression Pipeline
Video codec encoding pipeline (H.264/H.265/AV1):
Raw video frames
│
▼
Scene change detection
→ Insert I-frame at scene changes
│
▼
For each frame, for each block:
┌──────────────────────────────────────────────┐
│ 1. INTRA PREDICTION or INTER PREDICTION │
│ Intra: predict from neighbors in frame │
│ Inter: motion estimation against refs, │
│ find best motion vector │
│ │
│ 2. TRANSFORM (DCT or integer approximation)│
│ Convert residual to frequency domain │
│ │
│ 3. QUANTIZATION │
│ Divide by quantization step (lossy!) │
│ Rate-distortion optimization decides │
│ which blocks get more/fewer bits │
│ │
│ 4. IN-LOOP FILTERING │
│ Deblocking, SAO (H.265), CDEF (AV1) │
│ │
│ 5. ENTROPY CODING │
│     CABAC (H.264/H.265) or multi-symbol arithmetic coding (AV1)   │
└──────────────────────────────────────────────┘
│
▼
Video bitstream (NAL units / OBUs for AV1)
│
▼
Container: MP4 (H.264/H.265), WebM (AV1/VP9), MKV, TS, ...
Rate Control: The Invisible Algorithm
Rate control decides how many bits to allocate per frame. It's as important as the codec itself for real-world quality.
Rate control modes:
CRF (Constant Rate Factor): Target a quality level, let bitrate vary.
Best for: archiving, local storage
Downside: unknown output file size
CBR (Constant Bit Rate): Exactly N bits per second, always.
Best for: live streaming, broadcast with fixed bandwidth
Downside: wastes bits on simple scenes, degrades quality on complex ones
ABR (Average Bit Rate): Target average, let it vary.
Middle ground. Used by most streaming services.
CQ (Constant Quantizer): Same QP for every frame.
Simple, rarely optimal. Educational use mainly.
2-pass encoding:
Pass 1: Analyze video, estimate complexity per scene.
Pass 2: Allocate bits based on analysis.
Best quality at target filesize. 2× encoding time.
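In its simplest form, two-pass allocation reduces to proportional budgeting. The sketch below is a toy; real rate control also respects decoder buffer constraints and per-frame limits.

```python
def two_pass_allocate(complexities, total_bits):
    """Toy 2-pass rate control: pass 1 measured each scene's complexity;
    pass 2 hands out bits in proportion, within a fixed total budget."""
    total_c = sum(complexities)
    return [total_bits * c / total_c for c in complexities]

# Pass 1 found one complex scene (action) among simple ones (talking heads).
scene_complexity = [1.0, 1.0, 6.0, 2.0]
bits = two_pass_allocate(scene_complexity, total_bits=10_000_000)
print([round(b) for b in bits])  # → [1000000, 1000000, 6000000, 2000000]
```

A single-pass CBR encoder would give every scene 2.5 Mb here, starving the action scene and wasting bits on the talking heads.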
Why B-Frames Hurt Live Streaming
B-frames reference future frames — which means the encoder can't output a B-frame until it has seen the future frames it references, and the decoder can't display it until it has received those future frames.
B-frame latency:
Without B-frames (low latency):
Capture → Encode → Transmit → Decode → Display
Latency: ~1–2 frames
With B-frames (better compression):
Capture → [wait for future frames] → Encode → Transmit
→ [buffer future frames] → Decode → Display
Latency: 3+ frames minimum
For interactive video (gaming, video calling):
B-frames are disabled. Latency matters more than compression.
For stored video / VOD streaming:
B-frames are used. Viewers don't notice 2-second buffer.
Codec Support Matrix
Codec support in major contexts (as of 2025):
Context H.264 H.265 AV1 VP9
────────────────────────────────────────────────────
Chrome/Firefox ✓ ✓ ✓ ✓
Safari ✓ ✓ ✓ —
Mobile (iOS) ✓ ✓ ✓(hw) —
Mobile (Android) ✓ ✓ ✓(hw) ✓
Smart TVs ✓ ✓ Varies —
YouTube ✓ — ✓ ✓
Netflix ✓ ✓ ✓ —
Blu-ray ✓ ✓ — —
FFmpeg encode ✓ ✓ ✓ ✓
Summary
- Video compression combines spatial (intra-frame, like images) and temporal (inter-frame) compression.
- Motion estimation finds where blocks moved between frames; motion compensation predicts current blocks from past frames.
- I-frames are keyframes (self-contained), P-frames predict from past, B-frames predict from past and future.
- H.264 is the universal standard — hardware support everywhere, reasonable compression.
- H.265/HEVC is ~50% more efficient than H.264, hampered by patent complexity.
- AV1 is royalty-free, ~50% more efficient than H.264, hardware acceleration arriving. The future for streaming.
- Rate control and encoder presets are as important as codec choice for real-world quality.
- Live streaming avoids B-frames to minimize latency; stored video uses them for compression.
Next: how databases compress data, where the same LZ and Huffman ideas run at a completely different scale — and where column-oriented storage changes everything.
Chapter 11: Database and Columnar Compression
Databases have a compression problem that's different from the problems we've seen so far. A video compressor can take minutes to encode a single frame. A database compressor runs on every write, every read, potentially millions of times per second. The constraint shifts from "best possible ratio" to "best ratio achievable in microseconds."
But databases have an advantage too: they know something about their data. A database knows that a column contains timestamps, or integers, or strings of bounded length. It knows the data type, the sort order, the cardinality. This domain knowledge enables compression techniques that generic algorithms can't match.
Two Storage Models
Before compression, we need to understand the fundamental distinction between row-oriented and column-oriented storage, because it completely determines which compression techniques work.
The same data, stored two ways:
Logical table:
user_id country age plan
─────────────────────────────
1001 US 34 pro
1002 UK 27 free
1003 US 41 pro
1004 DE 33 free
1005 US 29 pro
Row-oriented (traditional OLTP):
[1001,US,34,pro][1002,UK,27,free][1003,US,41,pro][1004,DE,33,free]...
Column-oriented (OLAP / analytics):
user_id: [1001][1002][1003][1004][1005]...
country: [US][UK][US][DE][US]...
age: [34][27][41][33][29]...
plan: [pro][free][pro][free][pro]...
Why Columns Compress Better
When data is stored in columns, each column contains only one type of data — and real-world data has low cardinality (few distinct values) and high correlation within columns:
- All country values are strings from a 195-element set
- All age values are integers in [0, 150]
- All plan values are one of {free, pro, enterprise}
- Adjacent rows in a time-sorted table have similar timestamps
A generic compressor on row data sees: 1001, US, 34, pro, 1002, UK, 27, free — a heterogeneous mix. A columnar compressor on the plan column sees: pro, free, pro, free, pro — highly compressible.
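The effect is easy to demonstrate with a generic compressor. The sketch below (synthetic data, zlib standing in for any LZ-family compressor) serializes the same records row-major and column-major and compares compressed sizes:

```python
# Same records, serialized row-major vs column-major, then compressed
# with zlib. Column-major groups like values together, so it compresses
# noticeably better.
import random
import zlib

random.seed(0)
countries = ["US", "UK", "DE", "FR"]
plans = ["pro", "free", "enterprise"]
rows = [(1000 + i, random.choice(countries), random.randint(18, 65),
         random.choice(plans)) for i in range(10_000)]

# Row-oriented: heterogeneous values interleaved per record
row_major = "".join(f"{u},{c},{a},{p};" for u, c, a, p in rows).encode()

# Column-oriented: each column's values stored contiguously
col_major = "|".join([
    ",".join(str(u) for u, _, _, _ in rows),
    ",".join(c for _, c, _, _ in rows),
    ",".join(str(a) for _, _, a, _ in rows),
    ",".join(p for _, _, _, p in rows),
]).encode()

row_size = len(zlib.compress(row_major, 6))
col_size = len(zlib.compress(col_major, 6))
print("row-major:   ", row_size, "bytes")
print("column-major:", col_size, "bytes")  # smaller
```

The exact ratio depends on the data, but the ordering is robust: sequential IDs share prefixes, the country column has a four-symbol alphabet, and the plan column has three.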
Encoding Schemes for Columnar Data
Dictionary Encoding
Replace string values with integer codes. Store the code → string mapping once.
Dictionary encoding of the 'country' column:
Dictionary: {0: "US", 1: "UK", 2: "DE", 3: "FR", ...}
Raw strings: "US", "UK", "US", "DE", "US", "US", "FR", "US"
Encoded: 0, 1, 0, 2, 0, 0, 3, 0
Savings:
Raw: 8 × 2 bytes = 16 bytes (average country code is 2 chars)
Dict: 8 × 1 byte + 4 × 2 bytes (dict) ≈ 16 bytes
→ No savings when the strings are as short as the codes!
But for longer strings ("United States", "United Kingdom"):
Raw: 8 × 13 bytes average = 104 bytes
Dict: 8 × 1 byte + 4 × 13 bytes (only the 4 distinct values) = 60 bytes
The real win: dict encoding enables faster comparisons and aggregations.
Filter "country = 'US'" becomes "code = 0" — integer comparison on
compact integer arrays. No string hashing or comparison needed.
Dictionary encoding is the foundation of most columnar formats. It enables:
- Compact representation when cardinality is low
- Fast filtering without decoding (compare integers to integer codes)
- Late materialization (operate on codes, decode only when outputting results)
Run-Length Encoding (Columnar Style)
After sorting, columns often have long runs of identical values. RLE stores (value, count) pairs:
RLE on sorted 'plan' column:
Sorted data: free, free, free, ...(10000×)..., pro, pro, ...(5000×)..., enterprise, ...
RLE: [(free, 10000), (pro, 5000), (enterprise, 500)]
Query "count where plan = 'pro'": → read single RLE entry → 5000
No decompression needed!
This is "predicate pushdown into the encoding":
The compression format IS the index.
Parquet, ORC, and most columnar databases use RLE on sorted data extensively.
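A minimal sketch of columnar RLE, including the count query answered directly from the runs (rle_encode is an illustrative helper, not a library function):

```python
# RLE on a sorted column: store (value, count) pairs, then answer a
# count query by reading one run, without decoding the column.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

plan = ["free"] * 10000 + ["pro"] * 5000 + ["enterprise"] * 500
runs = rle_encode(plan)
print(runs)  # [('free', 10000), ('pro', 5000), ('enterprise', 500)]

# "count where plan = 'pro'": read a single run entry
count_pro = sum(n for v, n in runs if v == "pro")
print(count_pro)  # 5000
```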
Bit-Packing
If a column's values fit in fewer bits than the storage type, pack them tightly.
Bit-packing a column of values 0–15 (4 bits needed, stored in 8):
Naive storage (8 bits each):
0x03 0x07 0x0B 0x0F 0x01 0x09 0x02 0x0E
= 8 bytes for 8 values
Bit-packed (4 bits each):
0x37 0xBF 0x19 0x2E
= 4 bytes for 8 values (50% savings)
Vectorized unpacking:
Modern CPUs can unpack bit-packed integers using SIMD instructions
at rates exceeding 1 billion integers per second.
PFOR (Patched Frame of Reference) is a sophisticated variant: most values in a frame fit in B bits, with occasional outliers ("exceptions"). Store the base values in B bits, exceptions separately.
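The nibble-packing example above can be reproduced in a few lines (pack4/unpack4 are illustrative helpers; real implementations pack arbitrary bit widths with SIMD):

```python
# Pack 4-bit values two per byte, first value in the high nibble,
# matching the worked example above.
def pack4(values):
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack4(packed):
    out = []
    for b in packed:
        out += [b >> 4, b & 0x0F]
    return out

vals = [0x3, 0x7, 0xB, 0xF, 0x1, 0x9, 0x2, 0xE]
packed = pack4(vals)
print(packed.hex())              # 37bf192e (4 bytes for 8 values)
print(unpack4(packed) == vals)   # round-trips
```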
Delta Encoding for Sorted Numerics
Sorted ID or timestamp columns benefit enormously from delta encoding:
User ID column (sorted, sequential):
Raw: 10000001, 10000002, 10000003, ..., 10000000+N
Delta: 10000001, 1, 1, 1, ..., 1
After delta: bit-pack the 1s into 1 bit each.
Savings: 32-bit integers → effectively 1 bit each.
Timestamp column (milliseconds, sorted):
Raw: 1700000000000, 1700000000060, 1700000000120...
Delta: 1700000000000, 60, 60, 60...
After delta: values fit in 6 bits (60 < 64).
Savings: 64-bit timestamps → 6 bits each.
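The delta step is trivial to implement; the interesting part is that the bit width needed afterward is just the bit length of the largest delta. A sketch (delta_encode/delta_decode are illustrative helpers):

```python
# Delta-encode a sorted timestamp column and compute the bit-packing
# width the deltas would need.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1_700_000_000_000 + 60 * i for i in range(1000)]  # 60ms apart
deltas = delta_encode(ts)
assert delta_decode(deltas) == ts   # lossless round-trip

max_delta = max(deltas[1:])
bits = max_delta.bit_length()
print(bits)  # 6 -- each delta fits in 6 bits instead of 64
```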
Parquet: Columnar Storage Done Right
Apache Parquet (2013) is the de facto standard for columnar data at rest in the Hadoop ecosystem. It's used by Spark, Hive, Athena, Redshift Spectrum, BigQuery (via external tables), and most modern data lake tools.
Parquet File Structure
Parquet file layout:
┌──────────────────────────────────────────────┐
│ Magic: PAR1 (4 bytes) │
├──────────────────────────────────────────────┤
│ Row Group 1 (e.g., 128MB of row data) │
│ ┌─────────────────────────────────────────┐│
│ │ Column Chunk: user_id ││
│ │ Page 1 (Dictionary page, if any) ││
│ │ Page 2 (Data pages) ││
│ │ Page 3 ... ││
│ ├─────────────────────────────────────────┤│
│ │ Column Chunk: country ││
│ │ Dictionary page + data pages ││
│ ├─────────────────────────────────────────┤│
│ │ Column Chunk: age ││
│ │ Data pages (RLE/bit-packed) ││
│ └─────────────────────────────────────────┘│
├──────────────────────────────────────────────┤
│ Row Group 2... │
├──────────────────────────────────────────────┤
│ File Footer (Thrift-encoded metadata) │
│ - Schema (column names, types) │
│ - Row group offsets and sizes │
│ - Column statistics (min, max, null count) │
│ - Encoding types used per column │
├──────────────────────────────────────────────┤
│ Footer length (4 bytes) │
│ Magic: PAR1 (4 bytes) │
└──────────────────────────────────────────────┘
Parquet Encodings
Parquet supports multiple encoding schemes per column chunk, chosen by the writer:
Parquet encoding types:
PLAIN: Raw values, no encoding. Fallback.
DICTIONARY: Dict page + integer codes. Default for strings.
RLE_DICTIONARY: RLE on top of dict codes. Sorted low-cardinality.
PLAIN_DICTIONARY: Deprecated (same as RLE_DICTIONARY).
RLE: Run-length encoding.
BIT_PACKED: Bit-packing. Used for repetition/definition levels.
DELTA_BINARY_PACKED: Delta + bit-packing for sorted integers.
DELTA_LENGTH_BYTE_ARRAY: Delta coding of string lengths.
DELTA_BYTE_ARRAY: Delta coding of byte arrays (prefix compression for sorted strings).
BYTE_STREAM_SPLIT: Split float bytes across separate streams for better GZIP compression.
The most important: DELTA_BINARY_PACKED for timestamp columns. A column of 1 billion sorted timestamps takes roughly 8GB naive; with delta + bit-packing it can compress to under 1GB before any additional gzip/Snappy compression.
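A back-of-envelope check of that claim, assuming (as in the earlier example) that the deltas fit in 6 bits:

```python
# 1 billion sorted int64 timestamps: naive vs delta + bit-packed size.
n = 1_000_000_000
naive_bytes = n * 8                      # int64 storage
print(naive_bytes / 1e9)                 # 8.0 GB

bits_per_delta = 6                       # assumed regular spacing
packed_bytes = n * bits_per_delta / 8
print(packed_bytes / 1e9)                # 0.75 GB, under 1 GB
```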
Parquet + Compression Codec
After columnar encoding, Parquet applies a general-purpose compressor to each page:
Encoding pipeline for a string column:
"US", "UK", "US", "US", "DE", ...
│
▼ Dictionary encoding
Dict: {0:"US", 1:"UK", 2:"DE"}
Codes: 0, 1, 0, 0, 2, ...
│
▼ RLE_DICTIONARY
(run-length encode the codes)
(0, 3), (1, 1), (0, 47), (2, 2), ...
│
▼ Snappy / GZIP / Zstd / LZ4
Generic compression on the encoded bytes
│
▼ Parquet page bytes
Choosing the Parquet compression codec:
Parquet codec comparison:
Codec Compress speed Decompress Ratio Use case
───────────────────────────────────────────────────────────────
UNCOMPRESSED — Fastest 1.0× Development/debug
SNAPPY Very fast Very fast 1.5× General Hadoop
LZ4_RAW Very fast Fastest 1.5× Performance-critical
GZIP Slow Fast 2.0× Cold storage, S3
ZSTD Fast Very fast 2.0× Modern default choice
BROTLI Very slow Fast 2.2× Static web data
Zstd has become the recommended default for new Parquet files in 2023+: better compression than Snappy, much faster than GZIP, good decompression speed.
LZ4: Speed Above All Else
LZ4 was designed by Yann Collet (who also wrote Zstandard) with a single goal: be the fastest possible LZ-family compressor. Not the best ratio — the fastest.
LZ4 design philosophy:
"The minimum information needed to decompress is:
- A stream of literal bytes (copy these literally)
- A stream of (offset, length) match references
Encode these as simply as possible.
No Huffman. No arithmetic coding. Just token headers."
LZ4 token format:
┌──────────────────────────────────────────────────┐
│ 1 byte token: │
│ [literal_len: 4 bits][match_len: 4 bits] │
│ │
│ If literal_len == 15: read more bytes (255+) │
│ Then: literal_len literal bytes │
│ Then: 2-byte offset (little-endian) │
│ If match_len == 15: read more bytes (255+4) │
└──────────────────────────────────────────────────┘
No entropy coding = decompressor does almost nothing.
Modern CPUs can decompress LZ4 at 4+ GB/s.
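The token format above is simple enough that a toy decoder fits in a page. This sketch handles the block format only (no frame header, no checksums) and ignores the spec's end-of-block restrictions, so it illustrates the idea rather than replacing the real library:

```python
# Minimal decoder for the LZ4 block token format described above.
def lz4_block_decompress(src: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        lit_len = token >> 4                  # high nibble
        if lit_len == 15:                     # extended literal length
            while True:
                b = src[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        out += src[i:i + lit_len]; i += lit_len
        if i >= len(src):                     # last sequence: literals only
            break
        offset = src[i] | (src[i + 1] << 8); i += 2   # little-endian
        match_len = (token & 0x0F) + 4        # minimum match is 4 bytes
        if (token & 0x0F) == 15:              # extended match length
            while True:
                b = src[i]; i += 1
                match_len += b
                if b != 255:
                    break
        for _ in range(match_len):            # byte-wise copy: overlap is legal
            out.append(out[-offset])
    return bytes(out)

# Hand-built block: token 0x35 = 3 literals, match length 5+4=9, offset 3
block = bytes([0x35]) + b"abc" + bytes([0x03, 0x00])
print(lz4_block_decompress(block))  # b'abcabcabcabc'
```

Note the byte-by-byte match copy: an offset smaller than the match length is an overlapping copy, which is how LZ4 expresses run-like repetition.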
LZ4 compresses roughly 2:1 on typical data — much less than Zstd or gzip. But its decompression speed is so high that using LZ4 to compress RAM can actually increase effective memory bandwidth:
LZ4 in-memory caching:
Without compression:
Read 100MB from disk → store 100MB in RAM → serve 100MB
Network → 100MB/s → Cache hit reads: 100MB/s
With LZ4 compression:
Read 100MB from disk → compress to 50MB → store 50MB in RAM
Network → 100MB/s → Cache hit reads: 50MB at 4GB/s decompression
Effective cache size: 2× (same RAM, more data)
Latency: slightly higher (decompress time ~12ms for 50MB)
This is why Redis, Memcached, and many in-memory databases offer LZ4 as an optional compression layer.
Snappy: Google's In-Memory Compressor
Snappy (originally Zippy, by Google, open-sourced 2011) is similar to LZ4 — designed for speed over ratio. It's the default codec in Hadoop, Cassandra, and LevelDB/RocksDB.
Snappy vs. LZ4:
Metric Snappy LZ4
─────────────────────────────────────
Compression speed ~500 MB/s ~700 MB/s
Decompression ~1800 MB/s ~4000 MB/s
Compression ratio ~1.7× ~2.1×
Framing format Yes (spec) Flexible
LZ4 has mostly superseded Snappy in new systems — it's faster and compresses better. Snappy is still dominant in existing deployments due to ecosystem inertia.
Columnar Compression in Practice
Reading Only What You Need
The biggest benefit of columnar storage isn't compression ratio — it's I/O efficiency:
Query: SELECT avg(age) FROM users WHERE country = 'US'
Row-oriented storage:
Must read ALL bytes of ALL rows (user_id, country, age, plan, ...)
to extract country and age for each row.
For 10M rows, 4 columns, 8 bytes each: 320MB read.
Column-oriented storage:
Read ONLY the 'country' column and 'age' column.
For 10M rows, 2 columns, 4 bytes each: 80MB read.
4× less I/O before even considering compression.
With compression:
Country column (dictionary): ~5MB (dictionary codes are tiny)
Age column (bit-packed): ~20MB (values 0-150 fit in 8 bits)
Total: 25MB read — 13× less than row-oriented.
Predicate Pushdown and Bloom Filters
Parquet stores column statistics (min, max, null count) and optional Bloom filter data in the row group footer. Query engines use this to skip entire row groups without reading the column data:
Row group pruning example:
Query: SELECT * FROM orders WHERE order_date = '2024-01-15'
Row group 1: date range [2024-01-01, 2024-01-10] → SKIP
Row group 2: date range [2024-01-11, 2024-01-20] → READ
Row group 3: date range [2024-01-21, 2024-01-31] → SKIP
Row group 4: date range [2024-02-01, 2024-02-15] → SKIP
Result: read 25% of the data, skip 75%.
Combined with columnar projection, potentially read 3-5% of bytes.
Bloom filters provide set membership tests ("is this order_id in this row group?") without false negatives, at the cost of ~1% false positives. Combined with statistics, they enable very high skip rates for point queries.
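Min/max pruning itself is just an interval check per row group. A sketch of what a query engine does with the footer statistics (the row_groups structure here is illustrative, not Parquet's actual metadata layout):

```python
# Statistics-based row group pruning: read a group only if the
# predicate value could fall inside its [min, max] range.
row_groups = [
    {"id": 1, "min": "2024-01-01", "max": "2024-01-10"},
    {"id": 2, "min": "2024-01-11", "max": "2024-01-20"},
    {"id": 3, "min": "2024-01-21", "max": "2024-01-31"},
    {"id": 4, "min": "2024-02-01", "max": "2024-02-15"},
]

def groups_to_read(predicate_value, groups):
    return [g["id"] for g in groups
            if g["min"] <= predicate_value <= g["max"]]

# order_date = '2024-01-15': only row group 2 can match
print(groups_to_read("2024-01-15", row_groups))  # [2]
```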
RocksDB and LSM-Tree Compression
RocksDB (Facebook, 2012) is an embedded key-value store used by MySQL (via MyRocks), Kafka Streams (for state stores), and many others. It uses an LSM (Log-Structured Merge) tree, which has interesting compression characteristics.
RocksDB LSM structure:
Level 0: Small SSTables (sorted string tables)
Written by memtable flush, not sorted between files
Level 1: Larger SSTables, sorted, non-overlapping
Level 2: 10× larger, same rules
...
Compression per level (typical config):
L0, L1: Snappy (fast, writes happen often, frequent compaction)
L2+: Zstd or ZLIB (better ratio, these levels change less often)
Data size at each level:
L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4+: most of the data
Better compression on lower levels = smaller footprint.
The tiered compression strategy makes sense: levels that are frequently compacted (rewritten) benefit from fast compression; levels that are rarely compacted benefit from better ratio.
Summary
- Columnar storage puts values of the same type together, making them dramatically more compressible.
- Dictionary encoding replaces string values with integer codes; enables fast filtering without decoding.
- RLE on sorted columnar data is the most powerful technique — can skip entire operations.
- Bit-packing and delta encoding exploit bounded ranges and sorted numeric columns.
- Parquet stacks columnar encodings + general-purpose compression (GZIP, Snappy, Zstd) per column chunk.
- LZ4 and Snappy prioritize decompression speed for in-memory and real-time use cases.
- Predicate pushdown and statistics-based row group pruning multiply the benefit of columnar formats beyond compression ratio alone.
Next: network compression — where we add the constraint that both ends of the wire must agree on the algorithm, and examine when not to compress at all.
Chapter 12: Network Compression
Network compression has a constraint that no other compression context has: both sender and receiver must agree on the algorithm, negotiate it at connection time, and do it all quickly enough not to increase latency. The compression must be faster than the network would take without it — otherwise you've made things slower.
This creates a different set of tradeoffs. The question isn't just "what compresses best?" but "what compresses fast enough that the savings in transmission time outweigh the cost of compression itself?"
The Break-Even Analysis
For compression to help, the time saved transmitting fewer bytes must exceed the time spent compressing:
Break-even calculation:
Original size: S bytes
Compressed size: C bytes (where C < S)
Network bandwidth: B bytes/second
Compression speed: V bytes/second (throughput of compressor)
Time to send without compression: S / B
Time with compression: S / V + C / B
Compression helps when:
S / V + C / B < S / B
S / V < S/B - C/B = (S-C)/B
S / V < (S-C) / B
B < V × (S-C)/S = V × compression_savings_ratio
Compression always helps if B << V (slow network, fast compressor).
Compression never helps if B >> V (fast network, slow compressor).
For a 1 Gbps LAN with gzip at 25 MB/s: compression speed (200 Mbps) << network speed (1 Gbps). Gzip would hurt. For a 10 Mbps WAN connection with gzip at 25 MB/s (200 Mbps): network is the bottleneck, gzip helps dramatically.
This is why HTTP compression is standard for internet-facing services but rare for internal datacenter traffic (unless the data is very large or the network is saturated).
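The break-even rule is mechanical enough to write down as code. This sketch checks the two scenarios above with consistent units (bytes and bytes/second):

```python
# Break-even: compression helps when S/V + C/B < S/B.
def compression_helps(size, compressed_size, bandwidth, compress_speed):
    time_without = size / bandwidth
    time_with = size / compress_speed + compressed_size / bandwidth
    return time_with < time_without

MB = 1_000_000
# 1 Gbps LAN (~125 MB/s) vs gzip at 25 MB/s, 2.5:1 ratio: hurts
print(compression_helps(100 * MB, 40 * MB, 125 * MB, 25 * MB))   # False
# 10 Mbps WAN (1.25 MB/s), same compressor: helps dramatically
print(compression_helps(100 * MB, 40 * MB, 1.25 * MB, 25 * MB))  # True
```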
HTTP Content Negotiation
HTTP compression is opt-in and negotiated per-request. The client advertises what it supports; the server picks what it sends.
HTTP compression negotiation:
Client request:
GET /api/data HTTP/1.1
Host: example.com
Accept-Encoding: gzip, deflate, br, zstd
Server response:
HTTP/1.1 200 OK
Content-Encoding: br
Content-Type: application/json
Content-Length: 4832
[Brotli-compressed body]
Client: decompress Brotli body, present to application.
Content-Encoding specifies the compression applied to the response body. Multiple encodings can be chained (rarely used):
Content-Encoding: gzip, identity
Accept-Encoding priorities: Clients can give quality factors: Accept-Encoding: gzip;q=1.0, br;q=0.9, identity;q=0.5. In practice, most servers ignore q-factors and just pick the best supported algorithm.
gzip in HTTP
gzip is the universal HTTP compression format — supported since HTTP/1.1 and by essentially every client and server in existence. Nginx and Apache enable it with:
# nginx gzip configuration
gzip on;
gzip_comp_level 6; # Level 1-9; 6 is the sweet spot
gzip_min_length 1000; # Don't compress tiny responses
gzip_types
text/plain
text/css
text/javascript
application/javascript
application/json
application/xml
image/svg+xml;
# Don't compress responses that are already compressed
# (JPEG, PNG, MP4 etc. are not in the list above)
Pitfall: Compressing already-compressed data wastes CPU and sometimes increases size. The types list should include only text-based formats.
Brotli in HTTP
Brotli (br) is supported by all major browsers (since 2016/2017) and produces ~15-25% better compression than gzip on typical web content.
# nginx Brotli configuration (requires ngx_brotli module)
brotli on;
brotli_comp_level 6; # 0-11; 6 is reasonable default
brotli_min_length 1000;
brotli_types text/plain text/css application/javascript application/json;
Brotli is slower to compress than gzip at equivalent levels, making it more suitable for:
- Pre-compressed static assets: compress at build time, serve from cache. file.js.br stored alongside file.js.gz.
- Moderately dynamic content: where the compression overhead is acceptable.
# Pre-compress static assets
for f in dist/*.js dist/*.css; do
  brotli --best -o "${f}.br" "$f"
  gzip --best --keep "$f"   # writes "${f}.gz" alongside the original
done
The server then serves .br or .gz based on Accept-Encoding, without compressing on the fly.
Zstd in HTTP
Zstd HTTP compression is specified in RFC 8878 (2021) and increasingly supported:
- Chrome 118+ (2023)
- Firefox 116+ (2023)
- Safari (as of 2024)
Compression ratio comparison on a typical 100KB JSON API response:
Encoding Size Ratio Encode time
──────────────────────────────────────────────
None 100KB 1.0× —
gzip-6 28KB 3.6× 4ms
br-6 24KB 4.2× 12ms
zstd-6 26KB 3.8× 1ms
For dynamic API responses where compress-on-the-fly matters:
zstd wins: better than gzip, much faster than Brotli.
For pre-compressed static assets:
Brotli wins: compress once offline, better ratio.
Zstd also has a killer feature for HTTP: trained dictionaries. If you pre-train a dictionary on samples of your API responses, Zstd can compress small JSON responses to a fraction of their gzip size.
# Zstd shared compression dictionary for API responses
import json
import zstandard as zstd

# Training (offline, during deployment).
# sample_api_responses: a list of representative response dicts (not shown).
samples = [json.dumps(resp).encode() for resp in sample_api_responses]
dict_data = zstd.train_dictionary(32768, samples)  # 32KB dictionary
with open('api_response.dict', 'wb') as f:
    f.write(dict_data.as_bytes())

# Server-side compression (with dictionary)
cdict = zstd.ZstdCompressionDict(dict_data.as_bytes())
compressor = zstd.ZstdCompressor(dict_data=cdict, level=3)

def compress_api_response(data: dict) -> bytes:
    return compressor.compress(json.dumps(data).encode())

# Client must have the same dictionary (served once, cached).
# This is nonstandard: Zstd dict IDs would need custom negotiation.
# In practice, shared dicts are used for specific known clients (mobile apps).
HPACK: HTTP/2 Header Compression
HTTP headers are surprisingly expensive. A typical HTTP request has 500-800 bytes of headers (cookies, User-Agent, Accept, Authorization, etc.), and for short API responses, headers can be larger than the body.
HTTP/2 introduced HPACK (Header Compression for HTTP/2, RFC 7541) specifically for this problem.
HPACK's Approach
HPACK is not a generic compression algorithm — it's purpose-built for HTTP headers. It uses:
- Static table: 61 predefined header name-value pairs (:method GET, :status 200, content-type application/json, etc.)
- Dynamic table: a FIFO buffer of recently seen header name-value pairs, shared between encoder and decoder
- Huffman coding: A static Huffman table for header values (based on frequency of characters in HTTP headers)
HPACK encoding examples:
:method: GET (static table entry #2):
Representation: 0x82 (1 byte! Full header in 1 byte)
content-type: text/html (static table entry #31 for name only):
Literal header, indexed name, literal value
0x5f 0x1d [huffman-encoded "text/html"]
Authorization: Bearer eyJhbGc... (not in tables):
Literal header, incremental indexing (add to dynamic table)
[huffman-encoded "authorization"] [huffman-encoded "Bearer eyJ..."]
Future requests reuse this from the dynamic table.
The result: many common headers compress to 1-3 bytes. A typical HTTPS API request that would have 800 bytes of headers in HTTP/1.1 might have 100-200 bytes in HTTP/2 after the first request (once the dynamic table is populated).
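The 1-byte case comes from HPACK's prefix-integer encoding (RFC 7541, section 5.1). A sketch of the "indexed header field" representation (pattern bit 1, then a 7-bit index, with continuation bytes for large values; encode_integer/encode_indexed are illustrative helpers):

```python
# HPACK prefix-integer encoding applied to indexed header fields.
def encode_integer(value, prefix_bits, pattern):
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return bytes([pattern | value])       # fits in the prefix
    out = [pattern | limit]                   # prefix saturated
    value -= limit
    while value >= 128:                       # 7-bit continuation bytes
        out.append((value % 128) | 0x80)
        value //= 128
    out.append(value)
    return bytes(out)

def encode_indexed(index):
    return encode_integer(index, 7, 0x80)     # '1' pattern bit + 7-bit index

# ":method: GET" is static table entry 2 -> one byte
print(encode_indexed(2).hex())    # 82
# Large (dynamic table) indices spill into continuation bytes
print(encode_indexed(200).hex())  # ff49
```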
HPACK Security: CRIME and BREACH
HPACK deliberately keeps header compression separate from body compression and avoids a generic adaptive compressor. This design responds to the CRIME attack (2012), in which TLS-level and SPDY header compression let attackers recover session cookies by observing compressed request sizes.
If the compression context is shared between attacker-controlled input and secret data, an attacker can determine the secret by crafting inputs that compress differently based on whether they share patterns with the secret. HPACK addresses this by separating header tables from body compression.
HTTP/3 and QPACK
HTTP/3 runs over QUIC (UDP-based) instead of TCP. HPACK had a limitation: it required headers to be processed in order (head-of-line blocking). QPACK (RFC 9204) fixes this for HTTP/3, allowing headers to be processed out of order while maintaining equivalent compression.
When NOT to Compress in Transit
Already-Compressed Content
Never compress data that's already compressed. The overhead of attempting to compress JPEG, MP4, ZIP, GZIP, or any already-compressed format is pure waste:
Attempting to gzip an already-gzipped response:
Original JSON: 50KB
gzip compressed: 15KB
"Re-gzip" the gzip stream: 15.1KB (slightly larger due to headers!)
CPU cost: 2× (compress + decompress the inner layer)
Configure your server to not compress Content-Type: image/*, video/*, audio/*, .zip, .gz, .br, .zst, .7z, and similar binary formats.
Small Responses
For responses under ~1KB, the overhead of compression headers and minimum compressor block sizes can exceed the savings:
Compression overhead for small responses:
gzip headers + trailer: ~18 bytes
For a 200-byte response:
Compressed: 150 bytes + 18 overhead = 168 bytes
Improvement: 200 → 168 (16% savings, cost: CPU)
For a 50-byte response:
Compressed: 45 bytes + 18 overhead = 63 bytes
LARGER than original! Don't compress.
gzip_min_length 1000 (as in the nginx config above; nginx's built-in default is a much lower 20) is a reasonable threshold.
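The effect is easy to reproduce with the standard library: gzip a short JSON body and a longer, repetitive one and compare:

```python
# Tiny responses expand under gzip (header/trailer overhead dominates);
# larger repetitive responses shrink.
import gzip

tiny = b'{"ok":true,"id":7}'                               # 18 bytes
big = b'{"user_id": 1001, "country": "US", "plan": "pro"},' * 100

tiny_gz = gzip.compress(tiny)
big_gz = gzip.compress(big)

print(len(tiny), "->", len(tiny_gz))   # expands
print(len(big), "->", len(big_gz))     # shrinks dramatically
```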
High-Traffic Real-Time APIs
For APIs handling thousands of requests per second with sub-10ms response time requirements, on-the-fly compression at gzip-6 can add 2-5ms of latency. Options:
- Pre-compress: For cacheable responses, compress once and cache the compressed response.
- Lower compression level: gzip-1 is 3-5× faster than gzip-6 with 10-20% worse ratio.
- Zstd-1: Faster than gzip-1, better ratio than gzip-6.
- No compression: For internal microservice communication on high-speed networks.
TLS and Compression (CRIME)
After the CRIME and BREACH vulnerabilities, compressing HTTPS responses that include both attacker-controlled content and sensitive cookies/tokens requires care. BREACH showed that even response body compression (not just header compression) can leak secrets if:
- The response contains a secret (CSRF token, session ID)
- The response also contains attacker-controlled data (reflected parameter)
- The response is compressed
BREACH mitigations: disable compression on mixed-content responses, or use response padding/masking. For most modern applications that don't reflect user input in responses containing secrets, the risk is low — but understanding it matters for security-sensitive APIs.
Compression in Transit vs. At Rest
A common confusion: content-encoding (transit compression) and storage compression are orthogonal:
Content lifecycle and compression:
Origin server               CDN / Cache                Client
┌────────────┐             ┌────────────┐             ┌──────────────┐
│  file.js   │ ──Brotli──▶ │  file.js   │ ──Brotli──▶ │   file.js    │
│ (at rest:  │ (in transit)│ (stored    │ (in transit)│ (in memory:  │
│   zstd)    │             │  as .br)   │             │ uncompressed)│
└────────────┘             └────────────┘             └──────────────┘
At rest: zstd (storage format, high compression, rarely accessed)
In transit: Brotli (browser understands this encoding)
In memory: uncompressed (browser needs raw JS to parse and execute)
A CDN might:
- Store assets at rest in an efficient format (Zstd, high level)
- Decompress and re-compress as Brotli when serving to browsers that support it
- Decompress and re-compress as gzip when serving to older browsers
Or better: pre-generate .br and .gz versions at build time, store all three, serve the appropriate one based on Accept-Encoding.
Summary
- Compress when network bandwidth << compressor throughput. Don't compress when bandwidth is fast or compressor is slow.
- gzip is universal for HTTP; use level 6 as the default.
- Brotli is 15-25% better than gzip on web content; ideal for pre-compressed static assets.
- Zstd is the modern choice for dynamic API compression — faster than gzip, better than Brotli at equivalent speed.
- HPACK compresses HTTP/2 headers using static+dynamic tables and Huffman coding; common headers drop to 1-3 bytes.
- Never compress already-compressed data types (JPEG, MP4, ZIP, etc.).
- Small responses (<1KB) often expand under compression due to format overhead.
- Understand CRIME/BREACH before enabling compression on responses mixing user input with secrets.
Next: putting it all together with a decision framework for choosing the right algorithm for your context.
Chapter 13: Choosing the Right Algorithm
After twelve chapters of algorithms, the practical question remains: when you sit down with a compression problem, how do you pick? The answer depends on your constraints, and compression problems have many possible constraints: speed, ratio, memory, compatibility, streaming, random access, data type, and more.
This chapter offers a structured decision framework, a set of guiding principles, and a reference matrix you can return to.
The Four Dimensions
Every compression decision lives in a four-dimensional space:
Compression decision dimensions:
┌─────────────────────────────────────────────────────────┐
│ 1. SPEED │
│ Compression speed: how fast can you compress? │
│ Decompression speed: how fast must you decompress? │
│ Are they the same process? (usually not) │
│ │
│ 2. RATIO │
│ How important is file size? │
│ Is there a hard size budget? │
│ What's the cost of additional bytes? (bandwidth $) │
│ │
│ 3. COMPATIBILITY │
│ Who reads the compressed data? │
│ Can you control both compressor and decompressor? │
│ Is this a public format or internal? │
│ │
│ 4. DATA CHARACTERISTICS │
│ Text? Binary? Images? Video? Sensor data? │
│ Already compressed? │
│ Regular structure? Random? │
└─────────────────────────────────────────────────────────┘
The Decision Tree
START HERE: What kind of data?
│
├── MEDIA (images, audio, video)
│ │
│ ├── Images
│ │ ├── Must be lossless → PNG (photos: use WebP lossless)
│ │ ├── Photographs, quality matters → AVIF (WebP lossy fallback)
│ │ ├── Photographs, compatibility → JPEG quality 80-90
│ │ ├── Icons, UI, transparency → PNG or WebP lossless
│ │ └── Already compressed → don't re-compress
│ │
│ ├── Audio
│ │ ├── Lossless archival → FLAC
│ │ ├── Distribution/streaming → Opus (or AAC-LC)
│ │ ├── Voice/communications → Opus
│ │ └── Legacy compatibility → MP3 VBR V2
│ │
│ └── Video
│ ├── New production/streaming → AV1 (H.265 fallback)
│ ├── Broad compatibility required → H.264
│ ├── Archival/lossless → FFV1 or uncompressed
│ └── Screen recording → VP9 or AV1 (high-ratio intra)
│
├── GENERAL PURPOSE (files, archives, backups)
│ │
│ ├── Need maximum compression, time doesn't matter
│ │ └── 7-Zip / xz (LZMA2) with solid archive
│ │
│ ├── Good compression, reasonable speed
│ │ └── Zstd level 6-9, or Brotli level 6 for web content
│ │
│ ├── Fast compression (logs, telemetry, backup streams)
│ │ └── Zstd level 1-3
│ │
│ ├── Legacy tools / universal compatibility
│ │ └── gzip level 6
│ │
│ └── Already compressed content (JPEG, MP4, ZIP, etc.)
│ └── Store, don't compress
│
├── DATABASE / IN-MEMORY
│ │
│ ├── Fastest possible decompression (cache, OLTP)
│ │ └── LZ4 or Snappy
│ │
│ ├── Good compression + fast decompress (OLAP, analytics)
│ │ └── Zstd level 3 + columnar encodings (Parquet)
│ │
│ ├── Maximum compression, read-heavy (cold storage)
│ │ └── GZIP or Zstd level 9+ in Parquet
│ │
│ └── Time-series / sorted numerics
│ └── Delta + bit-packing + Zstd (columnar format)
│
└── NETWORK / HTTP
│
├── Static web assets (compress once, serve many)
│ └── Brotli level 9-11 (+ gzip fallback)
│
├── Dynamic API responses
│ └── Zstd level 3 (or gzip level 6 for compatibility)
│
├── Internal microservice traffic
│ └── Zstd level 1 (if network is bottleneck), else none
│
└── Already-compressed responses (images, video)
└── No Content-Encoding (pass through as-is)
Guiding Principles
1. Measure on Your Actual Data
The most important advice in this chapter. Every benchmark in this book uses synthetic or corpus data. Your data is different. Run actual measurements.
# Quick benchmark: compare Zstd levels on your data
for level in 1 3 6 9 12 19; do
echo -n "Level $level: "
time zstd -$level -q -f your_data.bin -o /tmp/test.zst
ls -lh /tmp/test.zst | awk '{print $5}'
done
# Compare decompression speeds
for f in your_data.gz your_data.br your_data.zst; do
echo -n "$f: "
time zcat $f > /dev/null 2>&1 # adjust command per format
done
# The lzbench tool benchmarks many algorithms at once:
lzbench -t16,16 your_data.bin
lzbench (https://github.com/inikep/lzbench) is the standard tool for this — it benchmarks 70+ compression algorithms and levels on any file you provide.
2. Compress Once, Decompress Many
The compression/decompression speed asymmetry matters enormously for economics. If a file will be downloaded 100,000 times:
Economic calculation for a 100MB file downloaded 100K times:
Option A: gzip-9 (slow compress, good ratio)
Compress time: 200 seconds (one-time cost)
Compressed size: 30MB → save 70MB per download
Total bandwidth saved: 70MB × 100K = 7TB
At $0.09/GB: $630 saved
Option B: gzip-6 (faster compress, slightly worse ratio)
Compress time: 40 seconds (one-time cost)
Compressed size: 32MB → save 68MB per download
Total bandwidth saved: 68MB × 100K = 6.8TB
At $0.09/GB: $612 saved
The 2MB per download difference is worth it for popular content.
For rarely-accessed files: faster compression is better economics.
3. Know When Compatibility Is Non-Negotiable
If you control both the writer and the reader, you can use any format you want. If the reader is a third party (a browser, a legacy system, a standards-mandated format), you're constrained.
Format compatibility grid (a reminder):
Format gzip br zstd lzma zlib
──────────────────────────────────────────────
curl ✓ ✓ ✓ — —
Chrome ✓ ✓ ✓ — —
Safari ✓ ✓ ✓ — —
Python ✓ ✓ ✓ ✓ ✓
Java (JDK) ✓ — — — ✓
Node.js ✓ ✓ ✓ — ✓
Most Unix ✓ — ✓ ✓ —
Windows ✓ — — — —
For files that must be openable by non-technical users without installing software: ZIP or GZIP. For everything else: Zstd.
4. Pre-Compress When You Can
Pre-compressing at high levels (and storing the result) always beats on-the-fly compression:
- Better ratio (more CPU time available)
- No latency penalty when serving
- Consistent performance under load
Static site generators, CDN pipelines, and build systems should always pre-compress at the highest practical level.
5. Don't Compress Already-Compressed Data
This deserves repeating because it's a common mistake. Adding a gzip step after a JPEG, MP4, or ZIP step is pure waste. The result will be approximately the same size, and you've wasted CPU.
Signs you're doing this wrong:
- Your archive contains .jpg, .mp4, or .pdf files and they compress to ~100% of their original size
- Your gzip ratio on "all files" is 1.01:1
6. Test Extreme Cases
Whatever algorithm you pick, test what happens when it gets input that's worst-case for that algorithm:
- Zstd on already-compressed data (should be ~1.0 ratio, not expand)
- RLE on random data (should not expand catastrophically)
- LZMA on a 10GB file (memory usage OK?)
- LZ4 on a 1-byte file (what's the overhead?)
Algorithm Reference Matrix
Algorithm Best for Ratio Speed(c) Speed(d) Memory
────────────────────────────────────────────────────────────────────────────
PLAIN/None Random, already cmp 1.0× ∞ ∞ 0
RLE Highly repetitive 10-100× Very fast Very fast Tiny
Delta Time-series, sorted 2-20× Very fast Very fast Tiny
BWT+MTF Text (bzip2) 3-4× Slow Slow 8MB
LZ4 In-memory, database 2-3× 700 MB/s 4000 MB/s KB
Snappy Database (Hadoop) 1.7-2× 500 MB/s 1800 MB/s KB
Zstd-1 Fast general purpose 3-4× 500 MB/s 1200 MB/s 1MB
Zstd-3 General purpose 3-5× 350 MB/s 1200 MB/s 4MB
Zstd-9 High ratio + speed 4-6× 50 MB/s 1200 MB/s 16MB
Zstd-19 Near-max lossless 4-7× 8 MB/s 1200 MB/s 64MB
gzip-6 Universal compat 2.5-4× 25 MB/s 200 MB/s 256KB
Brotli-6 Web content 3-5× 15 MB/s 200 MB/s 8MB
Brotli-11 Pre-compressed web 3.5-5.5× 1 MB/s 200 MB/s 8MB
xz-6 (LZMA) Archival, distrib. 4-8× 4 MB/s 50 MB/s 97MB
xz-9 (LZMA) Maximum lossless 5-9× 1 MB/s 50 MB/s 683MB
PNG Lossless images 1.5-3× Medium Fast Medium
JPEG-90 Photographs 5-10× Fast Fast Tiny
WebP-lossy Web photographs 8-15× Medium Fast Tiny
AVIF Modern web images 15-30× Very slow Fast Medium
FLAC Lossless audio ~2× Medium Fast Small
MP3-192 Lossy audio 7× Medium Fast Small
Opus-128 Lossy audio (modern) 10× Fast Fast Small
H.264 Video (universal) 100-200× Slow Fast Medium
H.265 Video (high eff.) 150-300× Very slow Fast Medium
AV1 Video (open, modern) 200-400× Very slow Fast Large
Speed(c) and Speed(d) are compression and decompression throughput. Speed values are order-of-magnitude estimates on commodity hardware; actual speeds vary significantly by data type, CPU, and implementation.
The Compression Ratio vs. Speed Tradeoff Curve
One of the most useful mental models: every algorithm lives somewhere on a ratio-vs-speed tradeoff curve. Faster algorithms sacrifice ratio; better-ratio algorithms sacrifice speed.
Compression ratio vs. speed landscape:
Ratio
9× │ ·xz-9
│ ·xz-6
7× │ ·Zstd-19
│ ·Brotli-11
5× │ ·Brotli-6 ·Zstd-9
│ ·gzip-6
3× │ ·Zstd-3 (Modern sweet spot:
│ ·Zstd-1 ·Zstd-3)
2× │·Snappy
│·LZ4
1× │·None
└────────────────────────────────────────── Speed
0.1 MB/s 10 MB/s 100 MB/s 1000 MB/s
(xz-9) (Brotli) (gzip) (LZ4)
The "efficient frontier": no algorithm is simultaneously better
at ratio AND speed than the frontier algorithms.
Choosing an algorithm is choosing where on this curve you live.
Zstd's genius is that it offers points along this curve with a single tool: -1 for LZ4-comparable speed, -19 for xz-comparable ratio, and everything in between with the same binary and API.
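You can watch the curve emerge on your own machine. A sketch using zlib's levels (a stdlib stand-in for Zstd's level range; the payload is synthetic log-like text):

```python
import time
import zlib

# Synthetic log-like payload: repetitive structure, varying values.
data = "".join(
    f'{{"ts":{1700000000 + i},"level":"INFO","msg":"request {i % 50} ok"}}\n'
    for i in range(50_000)
).encode()

def measure(level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds) for one zlib level."""
    start = time.perf_counter()
    out = zlib.compress(data, level)
    return len(data) / len(out), time.perf_counter() - start

for level in (1, 6, 9):
    r, t = measure(level)
    print(f"level {level}: {r:5.1f}x in {t * 1000:6.1f} ms")
```

Higher levels buy ratio with time, never the reverse: that's the curve in miniature. Run the same loop over Zstd's levels 1 through 19 (via the zstandard package) and the spread is far wider.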
A Worked Example: Designing a Log Storage System
Problem: Store application logs. 50GB/day written, kept for 90 days (4.5TB). Queries: recent logs accessed frequently, old logs accessed rarely. Must compress to keep storage costs under $50/month.
Step 1: Characterize the data
Logs are structured text (JSON or key=value), with high repetition of field names and common values. Very compressible.
Step 2: Define the constraints
- Write speed: must handle ~600KB/s sustained write rate
- Read speed: recent logs must be queryable with <100ms latency
- Cost: must fit in ~5TB storage (4.5TB uncompressed → compress to < 1.5TB)
- Compatibility: in-house tools, no compatibility constraint
Step 3: Tiered approach
Log storage tiered compression:
Hot tier (< 24 hours): Zstd level 1
- 600KB/s write → easily within Zstd-1 speed (500+ MB/s)
- Ratio: ~4× → 12.5GB/day stored
- Fast decompression for recent queries
Warm tier (1-7 days): Zstd level 6
- Recompressed during off-peak (batch job)
- Better ratio: ~5× → 10GB/day stored
- Still fast decompress for query latency
Cold tier (> 7 days): Zstd level 19
- Recompressed during off-peak weekly
- Best ratio: ~6× → 8.3GB/day stored
- Slower decompress acceptable for historical queries
Total storage:
Day 1: 12.5GB (hot)
Days 2-7: 10GB/day × 6 = 60GB (warm)
Days 8-90: 8.3GB/day × 83 = 689GB (cold)
Total: ~762GB for 90 days (vs 4.5TB uncompressed = 83% savings)
Cost at $0.02/GB/month: ~$15/month. ✓
Step 4: Validate on real data
Run lzbench on a sample of actual log files. Adjust tier boundaries based on actual ratios and speeds.
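The tier arithmetic from Step 3 is worth scripting, so it can be re-run whenever measured ratios replace the assumed ones (the ratios and tier boundaries below are the assumptions from above):

```python
DAILY_GB = 50.0                # raw log volume per day
PRICE_PER_GB_MONTH = 0.02      # assumed storage price

# (days spent in tier, assumed compression ratio)
tiers = {
    "hot  (Zstd-1) ": (1, 4.0),
    "warm (Zstd-6) ": (6, 5.0),
    "cold (Zstd-19)": (83, 6.0),
}

total_gb = 0.0
for name, (days, r) in tiers.items():
    stored = days * DAILY_GB / r
    total_gb += stored
    print(f"{name}: {stored:6.1f} GB")

savings = 1 - total_gb / (90 * DAILY_GB)
print(f"total: {total_gb:.0f} GB ({savings:.0%} savings), "
      f"~${total_gb * PRICE_PER_GB_MONTH:.2f}/month")
```

Plug in the ratios lzbench reports on your real logs and the script tells you immediately whether the cost target still holds.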
Summary
- The four dimensions of every compression decision: speed, ratio, compatibility, and data characteristics.
- Measure on your actual data. Benchmarks on synthetic data are guides, not guarantees.
- Compress once, decompress many: justify expensive compression for frequently-accessed data.
- Pre-compress when possible: build-time compression beats run-time compression in almost every case.
- Never compress already-compressed formats: waste of CPU, rarely any size savings.
- Zstd is the modern general-purpose default: one tool, configurable from LZ4-speed to xz-ratio.
- Brotli for static web assets, gzip for universal compatibility, LZ4/Snappy for in-memory.
- LZMA/7-Zip when maximum compression ratio is the only goal and time doesn't matter.
- Tiered compression (different algorithms for hot/warm/cold data) is the practical answer for large-scale systems.
The next (and final) chapter is something different: a look at the argument that large language models are, at their core, a very sophisticated form of lossy compression.
Chapter 14: LLMs as Lossy Compressors
This chapter is different. It's speculative, philosophical, and only partially serious. Take it as a frame for thinking about LLMs, not as a rigorous technical argument. The frame is useful even if imperfect.
Having spent thirteen chapters understanding how compressors work — what they keep, what they throw away, how they model data, and what the theoretical limits are — we're in a surprisingly good position to think about large language models. Because the argument that LLMs are compressors is not just a metaphor. It holds up under scrutiny, and it explains several things about LLM behavior that are otherwise puzzling.
The Compressor Framing
Here's the core claim: a language model's training process is an act of lossy compression, squeezing a large corpus of text into a set of floating-point weights.
Let's break this down piece by piece.
Training = Compression
A large language model is trained on something like the entire public internet — trillions of tokens of text. The model itself contains some number of parameters: GPT-3 had 175 billion parameters; Llama 3 405B has 405 billion; the trajectory is toward hundreds of billions, maybe trillions.
The raw training data is vastly larger than the model's weights. The pre-training corpus for a frontier model is on the order of 10–100 trillion tokens; at roughly two bytes per token ID, that's tens of terabytes. The weights are a fixed-size array of floats.
LLM as a lossy compressor:
Training corpus: ~10 trillion tokens × 2 bytes = ~20 TB
GPT-4 weights (estimated ~1.7T params at fp8): ~1.7TB
Llama 3 70B (fp16): ~140GB
Llama 3 405B (fp8): ~400GB
Gemma 2 9B (fp16): ~18GB
GPT-4 estimated compression ratio: ~12:1 minimum
But the ratio understates the compression. The model doesn't just store the text — it captures something more like the structure of the text. It learns that code tends to have matching brackets, that English sentences tend to have subject-verb agreement, that numbers that appear in certain financial contexts tend to follow certain distributions.
This is better than the training data itself for a compressor's purposes — a good model of the structure enables encoding at near-entropy.
Actually, this exact idea has been tested, most rigorously in DeepMind's 2023 paper Language Modeling Is Compression (see Further Reading): use the model's probability estimate for each token to arithmetic-code that token. A model that assigns high probability to the actual next token spends few bits encoding it; a perfect model achieves the Shannon entropy.
The result: large language models, used as arithmetic coders, beat every other general-purpose compressor on text.
LLM-as-compressor results (approximate, on enwik8 — 100MB of Wikipedia):
Compressor Bits per byte
─────────────────────────────────
gzip-6 3.20
bzip2 2.43
xz (LZMA) 2.19
Brotli-11 2.11
PPM 2.04
LSTM language model 1.77 ← early result (2019)
GPT-3 (large) ~1.30 ← estimated
GPT-4 (estimated) ~1.10 ← approaching Shannon limit for English
Shannon limit English ~1.0 ← theoretical minimum
An LLM beats every non-LLM compressor on English text,
by a wide margin, because it models the language so much better.
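The mechanism is easy to sketch. Below, a toy adaptive order-0 byte model stands in for the language model; under an ideal arithmetic coder, a symbol with probability p costs -log2(p) bits, so a better model means fewer bits:

```python
import math

def bits_to_encode(text: bytes) -> float:
    """Total arithmetic-coding cost under an adaptive order-0 byte model."""
    counts = {b: 1 for b in range(256)}    # Laplace-smoothed counts
    total = 256
    bits = 0.0
    for byte in text:
        p = counts[byte] / total           # model's probability for this byte
        bits += -math.log2(p)              # ideal arithmetic-coding cost
        counts[byte] += 1                  # update the model as we go
        total += 1
    return bits

sample = b"the cat sat on the mat. the cat sat on the mat." * 20
bpb = bits_to_encode(sample) / len(sample)
print(f"{bpb:.2f} bits/byte")
```

Swap the order-0 counts for a transformer's conditional next-token distribution and the same loop is, in essence, how the LLM-as-compressor numbers are obtained.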
Inference = Decompression
If training is compression, then inference is decompression — roughly. The model "stored" the patterns from the training data in its weights. When you prompt it, you're asking it to "decompress" the relevant portion: generate text that conforms to those stored patterns.
Classical lossy compression:
Input data Lossy encoder Compressed bits
(photograph) ─────────────────▶ (JPEG bytes)
│
│ Decoder
▼
Approximate reconstruction
(slight artifacts, close to original)
LLM "compression":
Training corpus Gradient descent Model weights
(internet text) ─────────────────▶ (fp16/fp8 params)
│
│ Inference
▼
Approximate "reconstruction"
(statistically similar to training)
This framing explains something curious: why language models produce text that sounds like their training data. They don't retrieve it — they generate text that's statistically consistent with what was stored. The compression preserved the structure, the patterns, the correlations — not the verbatim content.
What Gets Lost
Every lossy compressor decides what to discard. JPEG discards high-frequency spatial information. MP3 discards sounds masked by louder sounds. A video codec discards temporal detail where motion compensation fails.
What does an LLM discard?
Verbatim Text
The model almost certainly cannot reproduce most of the training data verbatim. It stores patterns, not photographic reproductions. Ask an LLM for the 73rd word of a specific Wikipedia article from 2022, and it will either make something up or fail gracefully.
Exceptions: text that appears many times, or that was highly prominent in training, may be "memorized" — stored in the weights in a way that allows near-verbatim reproduction. This is how LLMs can reproduce famous poems, common code snippets, and heavily cited passages.
Precise Numbers
Numbers present a specific challenge. "The speed of light is approximately 300,000 km/s" is a pattern that appears everywhere and is reliably stored. "The GDP of Paraguay in Q3 2021 was $9.73 billion" is a specific fact that may or may not have appeared often enough to be reliably stored.
LLMs frequently hallucinate specific numbers. This is the compressor discarding precision: the "shape" of the fact (GDP of Paraguay is in the range $8–12 billion) is stored; the exact value is not.
Types of information and LLM retention:
Highly retained:
Common patterns of language and grammar
Widely repeated facts (speed of light, capital cities)
Code patterns and API signatures (if in training data)
General reasoning structures
Poorly retained (hallucination risk):
Precise numerical values
Specific dates and version numbers
Exact quotations and attributions
Events after training cutoff
Obscure facts that appeared rarely in training
Provenance and Attribution
JPEG doesn't know which photographer took the photo it's compressing. It just stores the pixels. Similarly, an LLM during training doesn't (in base form) learn who said what — it learns what kinds of things tend to be said. The author, the source, the publication date, the context — these are not inherently preserved.
This is why LLMs confidently produce text that sounds authoritative but isn't: they're generating according to the statistical pattern "authoritative-sounding text about topic X," without any ground truth to constrain them.
The Lossy Compressor and Confabulation
The JPEG analogy is illuminating here. When you compress a JPEG heavily and then ask "what color is this pixel?", the JPEG decoder gives you an answer — but it might be wrong, because heavy quantization changed the color. It doesn't say "I'm not sure"; it says "blue."
LLMs behave similarly. When prompted about something that was poorly preserved in their weights, they don't (by default) say "I'm not sure about this." They generate text that conforms to the pattern "confident answer to a question about X" — because that's what was in the training data. People asking questions and confident people answering them.
This is confabulation — not deliberate lying, but generating text that's statistically plausible without a mechanism for checking whether it's factually correct.
The Model as a "Compressed Civilization"
There's a more optimistic way to view the same facts. The LLM hasn't just compressed text — it's compressed something like human knowledge and reasoning patterns. When you ask an LLM how to debug a TypeScript error, you're accessing the compressed form of thousands of Stack Overflow threads, documentation pages, blog posts, and GitHub issues about TypeScript debugging. The model synthesizes these into a response that no single document contained.
This is the sense in which LLMs are more useful than retrieval systems for many tasks: they've not just indexed the data, they've compressed it into a form where patterns combine. A user who asks "write a Python function that does X" gets something assembled from many training examples, not retrieved from any one of them.
Retrieval vs. Generation:
Retrieval (search engine / RAG):
Query → find relevant documents → return verbatim excerpts
Precise for specific known facts.
Can't synthesize across sources.
Generation (LLM inference / "decompression"):
Query → generate statistically appropriate response
Synthesizes across everything in training.
Hallucinations possible for specific facts.
Better for synthesis, explanation, reasoning.
The tradeoff maps neatly to lossless vs. lossy compression:
Retrieval = lossless (return the original)
Generation = lossy (return a reconstruction, possibly imperfect)
Why "Really Expensive"
If LLMs are lossy compressors, they are spectacularly expensive ones. Let's be specific about the costs.
Training costs: Frontier LLMs cost hundreds of millions to billions of dollars to train, requiring thousands of specialized accelerators (H100 GPUs, Google TPUs) running for months.
Storage costs: Model weights are large — 140GB for Llama 3 70B, estimated ~1TB+ for GPT-4 — requiring specialized storage and memory hierarchies.
Inference costs: Each forward pass requires enormous compute. Generating 1000 tokens from a 70B model requires approximately 70 billion multiply-accumulate operations per token — 70 trillion operations per request. At scale, this is the dominant cost for AI companies.
Memory bandwidth costs: The weights must be read from memory for every inference. For large models, memory bandwidth (how fast you can move data from GPU memory to compute units) is often the bottleneck. This is the "memory wall" problem.
Cost comparison to classical compression:
Decompress a 100MB gzip file:
CPU: 1 second on a single core
Memory: 32KB
Power: ~10 watt-seconds
"Decompress" (inference) from a 70B LLM, 500 tokens output:
Hardware: 4× A100 GPUs (required for model to fit in memory)
Time: ~30 seconds
Memory: 140GB GPU memory
Power: ~50,000 watt-seconds (4 GPUs at ~400W for 30 seconds, about 0.014 kWh)
The cost gap is enormous. But the "compression ratio" is too:
An LLM can synthesize knowledge from 20TB of training data
in response to a 100-token query.
No classical compressor does this.
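The arithmetic behind those figures, as a back-of-envelope script (parameter and token counts are from the text; the GPU wattage and duration are my rough assumptions):

```python
# Back-of-envelope inference cost; every number is a rough assumption.
params = 70e9            # 70B-parameter model
tokens = 1000            # output length
macs = params * tokens   # ~1 multiply-accumulate per parameter per token
print(f"{macs:.0e} MACs per request")

# Energy: assume 4 GPUs drawing ~400 W each over ~30 s of generation.
joules = 4 * 400 * 30
print(f"~{joules // 1000} kJ per request, vs ~0.01 kJ to gunzip 100 MB")
```

Even with generous assumptions about hardware efficiency, the gap to a classical decompressor is five to six orders of magnitude.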
Compression and Intelligence
The compression framing suggests something interesting about what intelligence might be. Marcus Hutter's "AIXI" framework and related theoretical work have long noted a connection between intelligence and compression: an agent that can compress data well implicitly has a model of the world that generated the data.
Hutter went further: he established the Hutter Prize for Lossless Compression of Human Knowledge, a prize for compressing 100MB (expanded to 1GB in 2020) of Wikipedia as small as possible. The premise is that better compression implies better understanding, so the prize winner effectively holds a more concise model of human knowledge than anyone else.
LLMs have effectively won this argument by a wide margin — they're the best known compressors of human-generated text, by virtue of having (arguably) the best model of how humans generate text.
The compression-intelligence connection:
Better world model → Better prediction → Better compression
An agent that understands physics will predict the trajectory
of a ball better than one that doesn't, and therefore can
compress a video of a thrown ball more effectively.
An agent that understands English grammar will predict the
next word in a sentence better than one that doesn't.
The model quality IS the compression quality.
Evaluating a compressor IS (partially) evaluating the model.
What This Frame Gets Wrong
Every analogy breaks down somewhere. The LLM-as-compressor frame has real limits:
Training objectives differ: A lossy image compressor explicitly optimizes for minimum distortion at a given bit rate. An LLM optimizes for next-token prediction. These are related but not identical. An LLM trained to minimize perplexity is not the same as an LLM trained to be a good assistant.
Inference is not simply decoding: JPEG decoding is a deterministic algorithm. LLM inference is sampling from a distribution — the same weights can produce different outputs. This is more like a probabilistic generative model than a traditional decompressor.
LLMs can reason in ways compressors can't: A JPEG decoder cannot modify the compression algorithm based on what it decompresses. An LLM, through chain-of-thought reasoning and tool use, can in some sense "look up" information it's uncertain about. The analogy breaks when you extend LLMs with retrieval, tool use, and agentic capabilities.
Fine-tuning adds information: RLHF (Reinforcement Learning from Human Feedback) and fine-tuning modify the weights after pretraining. This is like post-processing a compressed file — not decompression, but further transformation. The compression analogy is cleanest for base models without fine-tuning.
Practical Implications
Whether or not you find the compression frame intellectually satisfying, it generates useful heuristics:
Use LLMs for synthesis, not lookup: Where the compression shines is combining patterns from many sources. Where it fails is verbatim recall of specific facts.
Retrieval Augmented Generation (RAG): RAG combines an LLM's synthesis capabilities with verbatim retrieval from a document store. This is like having both a compressor (for synthesis) and a lossless store (for facts). The combination is more powerful than either alone.
Hallucinations are a compression artifact, not a bug: They're the model generating statistically-plausible content where specific information was lost in compression. Treating them as bugs to be "fixed" misses the point — the solution is architecture (add retrieval), not just training.
Context is decompression guidance: When you provide context in a prompt ("here is the actual text of the relevant document..."), you're giving the "decompressor" ground truth to work from, reducing the chance it generates a plausible-but-wrong reconstruction.
Model size and data affect what's retained: Larger models (more parameters) have more capacity to retain detail. More training data on a topic increases the "effective compression ratio" for that topic — more training signal means the pattern is more reliably stored.
The Deeper Question
The compression frame raises a question that's genuinely unresolved: is a model that has compressed the world's text into weights understanding the world, or just very efficiently predicting patterns?
For classical compressors, we'd never ask this. A gzip stream doesn't understand the text it compressed. But language models do things that look like reasoning, analogy-making, and generalization — behaviors we don't see from gzip.
One answer: the difference is quantitative, not qualitative. gzip's model of the world (LZ77 with a 32KB window) is just too simple to do anything that looks like reasoning. A transformer with 400 billion parameters and 20 trillion training tokens has a rich enough model that the representations it builds internally support what looks like reasoning.
Another answer: "understanding" requires something that statistical compression doesn't provide — grounding in the world, embodied experience, intentionality. In this view, an LLM is forever just a very sophisticated pattern-matcher, not an understander.
This debate isn't resolved. The compression frame doesn't resolve it either. But it does clarify the terms: whatever LLMs are doing, they're doing it by building an exceptionally good model of the statistical structure of human-generated text. Whether that model captures "meaning" is a philosophical question that compression theory can't answer.
Summary
- Language model training is a form of lossy compression: the training corpus (trillions of tokens) is compressed into model weights (billions of parameters).
- LLMs, used as arithmetic coders, beat every classical lossless compressor on English text — they have a better probability model.
- Inference is approximately decompression: generating text that conforms to the stored statistical patterns.
- What gets lost: verbatim text, precise numbers, provenance, rare facts. Hallucinations are the model generating plausible-but-imprecise reconstructions of poorly-compressed information.
- LLMs are expensive because the "compressor" and "decompressor" are enormous neural networks, not simple algorithms.
- Better compression implies a better model of the data; this connects compression quality to something like intelligence.
- The frame breaks down for fine-tuned models, reasoning capabilities, and tool use — but it's a useful lens for understanding base model behavior and limitations.
The link between compression and understanding runs throughout the history of information theory. Shannon's work was not just about sending bits efficiently — it was about the nature of information itself. The LLM is, in some sense, the culmination of that thread: a machine that has learned to model the information in human communication well enough to generate new instances of it.
Whether that constitutes understanding, or merely very sophisticated pattern-matching, is left as an exercise for the reader.
Further Reading
Foundational Papers
Shannon, C.E. (1948) A Mathematical Theory of Communication Bell System Technical Journal, 27(3), 379–423. The paper that started everything. Defines entropy, channel capacity, and the source coding theorem. Still readable and worth it.
Huffman, D.A. (1952) A Method for the Construction of Minimum-Redundancy Codes Proceedings of the IRE, 40(9), 1098–1101. The original paper. Three pages. Elegant.
Ziv, J. & Lempel, A. (1977) A Universal Algorithm for Sequential Data Compression IEEE Transactions on Information Theory, 23(3), 337–343. LZ77. The sliding window.
Ziv, J. & Lempel, A. (1978) Compression of Individual Sequences via Variable-Rate Coding IEEE Transactions on Information Theory, 24(5), 530–536. LZ78. The explicit dictionary.
Burrows, M. & Wheeler, D.J. (1994) A Block-sorting Lossless Data Compression Algorithm Digital SRC Research Report 124. The BWT. Makes bzip2 work. Clever sorting trick.
Books
Sayood, K. (2017) Introduction to Data Compression, 5th edition. Morgan Kaufmann. The textbook. Comprehensive, mathematically rigorous. Good for going deep on entropy coding.
Salomon, D. & Motta, G. (2010) Handbook of Data Compression, 5th edition. Springer. Reference encyclopedia. Almost everything is in here somewhere.
Blelloch, G. (2001) Introduction to Data Compression Carnegie Mellon University lecture notes. Free, concise, excellent. https://www.cs.cmu.edu/~guyb/realworld/compression.pdf
Nelson, M. & Gailly, J.L. (1995) The Data Compression Book, 2nd edition. M&T Books. Older but has detailed DEFLATE and LZW implementation walkthroughs.
Modern Algorithm Papers
Deutsch, P. (1996) DEFLATE Compressed Data Format Specification (RFC 1951). The standard. Short, precise, implementable from this document alone.
Collet, Y. & Kucherawy, M. (2021) Zstandard Compression and the 'application/zstd' Media Type (RFC 8878). Zstandard specification with good explanation of design decisions.
Alakuijala, J. et al. (2016) Brotli: A General-Purpose Data Compressor Google open-source documentation and original paper.
Duda, J. (2009, 2013) Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding arXiv:1311.2540. The ANS paper. Dense but important.
Image and Media
Wallace, G.K. (1991) The JPEG Still Picture Compression Standard IEEE Transactions on Consumer Electronics. How JPEG actually works, from the people who designed it.
Richardson, I. (2003) H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia Wiley. The most readable treatment of H.264 internals.
Alliance for Open Media AV1 Bitstream & Decoding Process Specification https://aomediacodec.github.io/av1-spec/ The full AV1 spec. Long.
Tools and Benchmarks
lzbench — Benchmark 70+ compressors on your own data https://github.com/inikep/lzbench
Squash compression benchmark — Web-based benchmark with many algorithms https://quixdb.github.io/squash-benchmark/
Silesia corpus — Standard benchmark dataset (various file types) https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
Canterbury corpus — Classic benchmark for text compression https://corpus.canterbury.ac.nz/
Hutter Prize — Compressing 100MB of Wikipedia http://prize.hutter1.net/
LLMs and Compression
Hutter, M. (2005) Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability Springer. The theoretical framework connecting compression and intelligence.
Deletang, G. et al. (2023) Language Modeling Is Compression arXiv:2309.10668. The DeepMind paper that made this argument rigorous: LLMs as compressors and the connection to intelligence.
Ram, O. et al. (2023) In-Context Retrieval-Augmented Language Models arXiv:2302.00083. One of the early RAG papers, addressing the "what gets lost" problem.
Implementation References
zlib source code https://github.com/madler/zlib Jean-loup Gailly and Mark Adler's reference implementation. Readable, annotated, definitive.
zstd source code https://github.com/facebook/zstd Facebook's implementation. Fast, well-commented, with extensive documentation.
The reference bzip2 implementation https://sourceware.org/bzip2/ Small, clear C code. Good way to understand BWT in practice.
FFmpeg — The reference for audio and video codecs https://ffmpeg.org/ Contains implementations of nearly every audio/video codec. Source is large but well-organized.
libvpx / libaom — VP9 and AV1 reference encoders/decoders https://chromium.googlesource.com/webm/libvpx https://aomedia.googlesource.com/aom/
Going Further
Matt Mahoney's data compression FAQ http://mattmahoney.net/dc/dce.html Matt Mahoney runs the Hutter Prize and has an encyclopedic knowledge of data compression. His FAQ is one of the most detailed resources available.
Fabian Giesen's blog https://fgiesen.wordpress.com/ Deep technical posts on entropy coding, ANS, and low-level compression implementation. His ANS posts are the best English-language explanation of the algorithm.
Charles Bloom's blog https://cbloomrants.blogspot.com/ Compression engineering from someone who has built production codecs (Oodle, used in many game engines). Practical, opinionated, excellent.