Introduction
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
— Claude Shannon, A Mathematical Theory of Communication, 1948
Thanks to Georgiy Treyvus, CloudStreet Product Manager, whose idea started this book.
Compression is everywhere and mostly invisible. When you stream a video, your browser is decoding a stream that was compressed with H.264 or AV1. When you send a file over Slack, it's gzipped in transit. When you store a photograph, the JPEG algorithm made a hundred small decisions about which visual details your eye probably won't notice. When a Parquet file loads in ten milliseconds instead of ten seconds, it's because the columnar layout lets the reader decompress only the columns it needs and skip the rest.
Most developers interact with compression through flags and library calls: --level 9, Content-Encoding: gzip, COMPRESS=zstd. This book is for the developer who wants to understand what's happening underneath those flags — not just which algorithm to reach for, but why it works, where it breaks down, and what it's actually trading off.
What This Book Covers
We start at the foundation: information theory. Shannon's entropy tells us how much compression is theoretically possible for any given data. Everything else in this book is the story of how humanity has gotten closer and closer to that limit, in different contexts, with different constraints.
From there, we work through the major algorithm families:
- Statistical coders (Huffman, arithmetic coding) — assign shorter codes to more frequent symbols
- Dictionary methods (LZ77, LZ78, LZW, Deflate, Brotli, Zstandard) — replace repeated patterns with references
- Transform coders (JPEG's DCT, audio's MDCT) — convert data to a domain where redundancy is easier to remove
- Predictive coders (FLAC, video inter-frame) — model what comes next and encode only the surprise
We then look at how these primitives combine in real systems: zip files, PNG, JPEG, MP3, H.264, Parquet, HTTP compression. Each domain has its own constraints and exploits different structure in the data.
We close with a chapter that's half-serious, half-provocative: the argument that large language models are, at a deep level, just very expensive lossy compressors. It's a useful frame for understanding what LLMs are good at, what they hallucinate, and why.
How to Read This Book
Each chapter is designed to stand largely on its own after the first two foundational chapters. If you're already comfortable with entropy and information theory, you can skip ahead. If you're primarily interested in image compression, jump to Chapter 8 — but do read Chapters 1–4 first for the vocabulary.
Code examples are primarily in Python (for readability) and C (for the low-level parts that matter). They're illustrative, not production-ready. Where a real implementation differs significantly from the illustrative one, we'll say so.
ASCII diagrams appear throughout because compression algorithms are fundamentally about data structures and data transformations — and those are often clearest as pictures.
A Note on Benchmarks
Compression benchmarks are notoriously sensitive to the data being compressed. A compressor that wins on text will lose on binary. A compressor that wins on small files will lose on large ones. Whenever we cite numbers, we'll note the context. When in doubt: measure on your actual data.
Let's begin.
Chapter 1: What is Compression?
Before we can understand how compression algorithms work, we need to understand what they're working against: the information content of data itself. This is the domain of information theory, and it gives us something remarkable — a mathematical lower bound on how small any compressor can make a given piece of data.
The Basic Idea
Compression is the process of encoding information using fewer bits than the original representation. That's it. Everything else is detail about how to do it well and when the tradeoffs make sense.
Consider the string "AAAAAAAAAA" — ten A's. Stored naively as ASCII, that's 80 bits (10 bytes). But we could also encode it as "10×A" — just three symbols. We've compressed 10 symbols into 3. The decompressor knows how to reverse this: see N×C, output C repeated N times.
Now consider "AJKQMZXBWR" — ten random letters. There's no shorter description. Each character is unpredictable, so each one has to be transmitted in full.
The key insight: compressibility is about predictability. Data that has patterns, regularities, or redundancy can be compressed. Data that is already maximally random cannot be.
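The N×C scheme above is run-length encoding (RLE), the simplest compressor of all. A minimal sketch (illustrative, not any particular file format):

```python
def rle_encode(s: str) -> list[tuple[int, str]]:
    """Collapse a string into (count, char) runs."""
    runs: list[tuple[int, str]] = []
    for ch in s:
        if runs and runs[-1][1] == ch:
            runs[-1] = (runs[-1][0] + 1, ch)   # extend the current run
        else:
            runs.append((1, ch))               # start a new run
    return runs

def rle_decode(runs: list[tuple[int, str]]) -> str:
    """Reverse the encoding: see (N, C), output C repeated N times."""
    return ''.join(ch * n for n, ch in runs)

print(rle_encode("AAAAAAAAAA"))   # [(10, 'A')] — one run
print(rle_encode("AJKQMZXBWR"))   # ten runs of length 1
```

Note what happens on the random string: RLE produces ten runs of one character each, which takes more space than the input. That is our first glimpse of the incompressibility argument later in this chapter.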
Entropy: The Measure of Surprise
Claude Shannon formalized this intuition in 1948 with the concept of entropy — the average amount of information (surprise) in a message, measured in bits.
For a source that produces symbols with probabilities p₁, p₂, ..., pₙ, Shannon's entropy H is:
H = -Σ pᵢ × log₂(pᵢ)
Each term -pᵢ × log₂(pᵢ) is the contribution of symbol i to the total entropy: the probability of seeing it, times how surprising it is when we do.
Worked Example: A Biased Coin
Suppose we're encoding coin flips where heads comes up 90% of the time.
Entropy of a biased coin (p_heads = 0.9):
p_heads = 0.9 → information content = -log₂(0.9) ≈ 0.152 bits
p_tails = 0.1 → information content = -log₂(0.1) ≈ 3.322 bits
H = 0.9 × 0.152 + 0.1 × 3.322
  = 0.137 + 0.332
  ≈ 0.469 bits per flip
A fair coin has entropy of exactly 1 bit per flip (you need a full bit to represent each outcome). Our biased coin only has 0.469 bits per flip — because heads is so predictable, it carries very little information.
If we had a million flips of this coin, a naive encoding would use 1 million bits. An optimal compressor could theoretically use only ~469,000 bits — a 53% reduction, without losing a single flip.
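The formula is short enough to check directly. A small illustrative helper reproduces the biased-coin number:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits: H = -Σ pᵢ log₂(pᵢ)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit per flip
print(entropy([0.9, 0.1]))   # biased coin: ≈ 0.469 bits per flip
```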
Entropy of English Text
English has 26 letters (plus punctuation, spaces). If all were equally likely, entropy would be log₂(27) ≈ 4.75 bits per character, and we'd need 5 bits per character minimum.
But English letters aren't equally likely:
Letter frequencies in typical English text:
E: 12.7% ████████████▌
T: 9.1% █████████
A: 8.2% ████████▏
O: 7.5% ███████▌
I: 7.0% ███████
N: 6.7% ██████▋
S: 6.3% ██████▎
H: 6.1% ██████
R: 6.0% ██████
...
Z: 0.07% ▏
Shannon estimated English text at roughly 1.0–1.5 bits per character of actual entropy when accounting for context and word structure. ASCII uses 8 bits per character. That's a theoretical 5–8× compression ratio sitting there waiting to be claimed — and modern text compressors do claim most of it.
Shannon's Source Coding Theorem
Shannon proved two fundamental theorems about compression. The source coding theorem (his first theorem) states:
It is impossible to losslessly compress a message to fewer bits than its Shannon entropy, on average.
"On average" is important. For any specific message, you might get lucky — a short message might compress well. But you cannot build a compressor that always compresses. In fact:
The incompressibility theorem: For every lossless compression function, there exist inputs that make the output larger than the input. This is a counting argument: there are 2ᴺ strings of N bits but only 2ᴺ − 1 strings shorter than N bits, so no lossless (injective) compressor can shrink them all (pigeonhole principle). Some inputs must map to outputs at least as long, and in any real format with header overhead, strictly longer.
This is why zip files of already-compressed data (like .jpg or .mp4) don't get smaller — the data is already near its entropy limit. Attempting to compress it just adds overhead.
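You can watch this happen with any real compressor. A quick experiment with Python's standard zlib module (illustrative; exact sizes vary by run and zlib version, but the output is reliably no smaller than the input):

```python
import os
import zlib

data = os.urandom(100_000)            # random bytes: incompressible by construction
packed = zlib.compress(data, level=9)  # even at maximum effort...
print(len(data), len(packed))          # ...the output is slightly LARGER:
                                       # stored-block framing plus header/checksum
```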
Entropy ceiling — no compressor can beat this:
Bits per ┌─────────────────────────────────────────┐
symbol │ │
8 │────│─── ASCII (8 bits/char) ─────────────────│
│ │ │
5 │ │─── Uniform 26-letter alphabet ──────────│
│ │ ╲ room for compression │
1-2 │ │─── English entropy (Shannon estimate) ──│
│ │ │
0 └────┴─────────────────────────────────────────┘
Lossless vs. Lossy Compression
This is the fork in the road that divides the entire compression world:
Lossless compression guarantees that decompression produces the exact original data. Not approximately. Not "close enough." Bit-for-bit identical. This is required for:
- Text and source code (a single wrong bit can break a program)
- Executable files and archives
- Medical images and legal documents
- Any data where exact reconstruction is required
Lossy compression accepts some degradation in exchange for dramatically better compression ratios. It's used when:
- Human perception can't detect (or doesn't care about) the lost information
- A "good enough" reconstruction is acceptable
- The original data has more precision than needed
The key insight behind lossy compression: human perception is not a perfect measuring instrument. Our eyes are more sensitive to changes in brightness than color. Our ears can't hear certain frequencies above certain amplitudes. If we discard imperceptible detail, we're still delivering the full perceptual experience — at a fraction of the data.
Lossless round-trip:
Original Compress Compressed Decompress Reconstructed
Data ─────────────▶ Data ─────────────▶ Data
IDENTICAL (bit-for-bit)
Lossy round-trip:
Original Encode Compressed Decode Approximation
Data ─────────────▶ Data ─────────────▶ of Data
(information lost)
The Lossy Tradeoff
Lossy compression creates a tunable tradeoff between quality and size. JPEG, for instance, has a "quality" parameter from 0–100 that controls how aggressively to discard high-frequency image data. At quality 95, the difference from the original is invisible to most people. At quality 30, you can see blocky artifacts but the file is 10× smaller.
This isn't "cheating" — it's engineering. The goal isn't to preserve all the bits; the goal is to preserve all the value.
Compression Ratio and Related Metrics
When comparing algorithms, we use several metrics:
Compression ratio: original size / compressed size. A ratio of 3:1 means the compressed file is one-third the size.
Space savings: (1 - compressed/original) × 100%. A 3:1 ratio means 66.7% space savings.
Bits per symbol: the average number of bits used per input symbol. For text, lower is better.
Compression speed: how fast the compressor runs, usually in MB/s.
Decompression speed: often more important than compression speed in practice (you compress once, decompress many times).
Example: compressing a 100MB text file
Algorithm Comp.Size Ratio Comp.Speed Decomp.Speed
─────────────────────────────────────────────────────────────
gzip -6 36 MB 2.8× 25 MB/s 200 MB/s
gzip -9 34 MB 2.9× 8 MB/s 200 MB/s
Brotli-6 30 MB 3.3× 15 MB/s 200 MB/s
Zstd-6 33 MB 3.0× 220 MB/s 700 MB/s
Zstd-19 28 MB 3.6× 12 MB/s 700 MB/s
xz -6 27 MB 3.7× 4 MB/s 50 MB/s
Zstd at level 6 compresses nearly as well as gzip at level 9, but nearly 30× faster. Decompression is 3.5× faster. This is why Zstd has largely replaced zlib in performance-sensitive applications. More on this in Chapter 5.
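These metrics are trivial to compute. A sketch with hypothetical helper names, using the gzip -6 row above as input:

```python
def compression_ratio(original: int, compressed: int) -> float:
    """Original size divided by compressed size: 3.0 means one-third the size."""
    return original / compressed

def space_savings(original: int, compressed: int) -> float:
    """Percent of the original size eliminated."""
    return (1 - compressed / original) * 100

# gzip -6 on the 100 MB file: 36 MB out
print(f"{compression_ratio(100, 36):.1f}x")   # 2.8x
print(f"{space_savings(100, 36):.0f}%")       # 64%
```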
The Fundamental Limits
Shannon's entropy gives us the theoretical lower bound. Real compressors can't always reach it because:
- Context model complexity: To predict what comes next perfectly, you'd need an infinite-order model. Real compressors use finite-order models.
- Compressor overhead: The compressed format must include metadata for the decompressor (code tables, dictionary entries, etc.). For small inputs, this overhead dominates.
- Data structure mismatches: A compressor optimized for text will underperform on binary data, and vice versa.
- Finite block sizes: Compression improves with more context. A compressor working on a 4KB block can't exploit patterns across a 100MB file.
The gap between theoretical entropy and achievable compression is where most of the interesting engineering happens. The chapters ahead are about how different algorithms close that gap for different types of data.
Summary
- Entropy is the theoretical minimum size of any lossless encoding of a data source.
- Shannon's source coding theorem proves no compressor can beat entropy on average.
- Lossless compression preserves all information; lossy compression discards perceptually unimportant information for better ratios.
- Compressibility is about predictability: regular, redundant data compresses; truly random data does not.
- All compression involves modeling the data. Better models → better compression, usually at higher computational cost.
In the next chapter, we'll look at Huffman coding — the first practical algorithm for approaching the entropy limit, and a component that still ships in almost every file format you've ever used.
Chapter 2: Huffman Coding
In 1951, MIT graduate student David Huffman was given a choice in Robert Fano's information theory course: write a term paper or sit the final exam. The term paper problem was to find the most efficient binary code for a set of symbols with known probabilities. Fano and Shannon had attacked that problem for years without proving their method optimal. Huffman, not knowing it was considered open, invented an algorithm that was demonstrably optimal, and published it in 1952. It was simpler and better than his professor's.
The core idea: give shorter codes to more frequent symbols, longer codes to rarer ones. This is the same intuition behind Morse code (E is ·, Z is −−··). Huffman's contribution was an algorithm that produces provably optimal variable-length prefix codes.
Prefix Codes and the No-Prefix Property
Before the algorithm, let's understand what we're building. A prefix code (also called a prefix-free code) is a set of codewords where no codeword is a prefix of another.
Why does this matter? Because without it, decoding is ambiguous:
Bad code (not prefix-free):
Symbol Code
A 0
B 01
C 10
D 11
Receiving "010":
- Could be A then C (0, 10), or B then A (01, 0)
- Ambiguous!
A prefix-free code can be decoded greedily: read bits until you match a codeword, output the symbol, repeat. No backtracking needed.
Good code (prefix-free):
Symbol Code
A 00
B 01
C 10
D 11
Receiving "010011":
- 01 → B
- 00 → A
- 11 → D
- Clear!
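Greedy decoding is a few lines of code. An illustrative sketch, using the good code above:

```python
def greedy_decode(bits: str, code: dict[str, str]) -> str:
    """Read bits until a codeword matches, emit its symbol, repeat."""
    reverse = {v: k for k, v in code.items()}   # bitstring → symbol
    out, current = [], ''
    for b in bits:
        current += b
        if current in reverse:   # prefix-free ⇒ the first match is the only match
            out.append(reverse[current])
            current = ''
    return ''.join(out)

code = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
print(greedy_decode("010011", code))   # BAD
```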
Building the Huffman Tree
Huffman codes are built from a binary tree. Leaves are symbols; the path from root to leaf defines the codeword (left branch = 0, right branch = 1).
The algorithm:
- Count the frequency of each symbol.
- Create a leaf node for each symbol, with its frequency as its weight.
- Insert all nodes into a min-priority queue.
- While more than one node remains:
  a. Extract the two nodes with lowest weight.
  b. Create a new internal node with these two as children, weight = sum of children's weights.
  c. Insert the new node back into the queue.
- The remaining node is the root.
Step-by-Step Example
Input: "ABRACADABRA" (11 characters)
Step 1: Count frequencies
Symbol Count Probability
A 5 0.455
B 2 0.182
R 2 0.182
C 1 0.091
D 1 0.091
Step 2: Build initial priority queue (lowest weight first)
[C:1] [D:1] [B:2] [R:2] [A:5]
Step 3: Combine C and D
Extract C:1 and D:1 → create internal node CD:2
   CD:2
   /  \
 C:1  D:1
Queue: [B:2] [R:2] [CD:2] [A:5]
Step 4: Combine B and R
Extract B:2 and R:2 → create internal node BR:4
   BR:4
   /  \
 B:2  R:2
Queue: [CD:2] [A:5] [BR:4]
→ sorted: [CD:2] [BR:4] [A:5]
Step 5: Combine CD and BR
Extract CD:2 and BR:4 → create internal node CDBR:6
     CDBR:6
     /    \
   CD:2   BR:4
   /  \   /  \
 C:1 D:1 B:2 R:2
Queue: [A:5] [CDBR:6]
Step 6: Combine A and CDBR
Extract A:5 and CDBR:6 → create root node ALL:11
      ALL:11
      /    \
   A:5    CDBR:6
          /    \
        CD:2   BR:4
        /  \   /  \
      C:1 D:1 B:2 R:2
Step 7: Read off codewords (left=0, right=1)
Symbol Path Code Length
A left 0 1
C right,left,left 100 3
D right,left,right 101 3
B right,right,left 110 3
R right,right,right 111 3
Verify the compression:
Original (fixed 3-bit codes, since 5 symbols need 3 bits):
11 chars × 3 bits = 33 bits
Huffman:
A appears 5 times × 1 bit = 5 bits
B appears 2 times × 3 bits = 6 bits
R appears 2 times × 3 bits = 6 bits
C appears 1 time × 3 bits = 3 bits
D appears 1 time × 3 bits = 3 bits
Total = 23 bits
Savings: 33 - 23 = 10 bits (30% reduction)
How close are we to Shannon's entropy limit?
H = -(5/11)log₂(5/11) - 2×(2/11)log₂(2/11) - 2×(1/11)log₂(1/11)
≈ 0.455×1.14 + 2×0.182×2.46 + 2×0.091×3.46
≈ 0.518 + 0.895 + 0.630
≈ 2.043 bits per symbol
Entropy for 11 symbols: 11 × 2.043 ≈ 22.5 bits
Huffman achieved: 23 bits ✓ (within 2.2% of theoretical minimum)
Python Implementation
Here's a clean, readable implementation of Huffman coding:
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass(order=True)
class HuffNode:
    weight: int
    symbol: Optional[str] = field(default=None, compare=False)
    left: Optional['HuffNode'] = field(default=None, compare=False)
    right: Optional['HuffNode'] = field(default=None, compare=False)

def build_huffman_tree(text: str) -> HuffNode:
    """Build a Huffman tree from input text."""
    freq = Counter(text)
    # Initialize priority queue with leaf nodes
    heap = [HuffNode(weight=count, symbol=char)
            for char, count in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        internal = HuffNode(
            weight=left.weight + right.weight,
            left=left,
            right=right
        )
        heapq.heappush(heap, internal)
    return heap[0]

def build_codebook(root: HuffNode) -> dict[str, str]:
    """Extract symbol → bitstring mapping from tree."""
    codebook = {}
    def traverse(node: HuffNode, path: str):
        if node.symbol is not None:  # leaf node
            codebook[node.symbol] = path or '0'  # single-symbol edge case
        else:
            traverse(node.left, path + '0')
            traverse(node.right, path + '1')
    traverse(root, '')
    return codebook

def huffman_encode(text: str) -> tuple[str, dict]:
    """Encode text, return bitstring and codebook."""
    if not text:
        return '', {}
    tree = build_huffman_tree(text)
    codebook = build_codebook(tree)
    encoded = ''.join(codebook[char] for char in text)
    return encoded, codebook

def huffman_decode(bits: str, codebook: dict) -> str:
    """Decode bitstring using codebook."""
    # Invert codebook: bitstring → symbol
    reverse = {v: k for k, v in codebook.items()}
    result = []
    current = ''
    for bit in bits:
        current += bit
        if current in reverse:
            result.append(reverse[current])
            current = ''
    return ''.join(result)

# Demo
text = "ABRACADABRA"
encoded, codebook = huffman_encode(text)
print(f"Original: {len(text) * 8} bits (naive 8-bit ASCII)")
print(f"Encoded: {len(encoded)} bits")
print(f"Codebook: {codebook}")
print(f"Decoded: {huffman_decode(encoded, codebook)}")
Output:
Original: 88 bits (naive 8-bit ASCII)
Encoded: 23 bits
Codebook: {'A': '0', 'C': '100', 'D': '101', 'B': '110', 'R': '111'}
Decoded: ABRACADABRA
The Optimality Guarantee
Huffman coding is provably optimal among all prefix codes. The proof is an exchange argument: if you have an optimal code that doesn't satisfy Huffman's tree properties, you can always swap two codewords to make it "more Huffman" without increasing average code length.
Key properties of an optimal Huffman tree:
- More frequent symbols have shorter (or equal) codes.
- The two least frequent symbols have the same length.
- The two least frequent symbols differ only in the last bit.
The algorithm constructs a tree satisfying all three properties by construction.
One caveat: Huffman is optimal among prefix codes, but prefix codes are not always optimal encoders. Arithmetic coding (Chapter 3) can beat Huffman by encoding fractional bits per symbol — relevant when probabilities aren't nice powers of 1/2.
Canonical Huffman Codes
There's a problem with Huffman: the decompressor needs the codebook to decode. Sending the full tree with the compressed data adds overhead, and trees can be large.
Canonical Huffman solves this by defining a standard way to assign codewords given only the code lengths. The decompressor only needs to know the length of each symbol's codeword — not the codeword itself — to reconstruct the same canonical codebook.
The canonical assignment rule:
- Sort symbols by code length, then alphabetically within same length.
- Assign the numerically smallest codeword to the first symbol.
- Each subsequent symbol at the same length gets the next integer.
- When moving to a longer length, left-shift and continue.
From our ABRACADABRA example:
Sorted by length:
A: length 1
B: length 3
C: length 3
D: length 3
R: length 3
Canonical assignment (note the jump from length 1 straight to length 3):
Start code = 0, length = 1
A (length 1): code = 0b0 = "0"
Move to length 3: code = (0 + 1) << (3 - 1) = 1 << 2 = 4 = 0b100
B (length 3): code = 4 = "100"
C (length 3): code = 5 = "101"
D (length 3): code = 6 = "110"
R (length 3): code = 7 = "111"
Canonical codebook:
A → "0"
B → "100"
C → "101"
D → "110"
R → "111"
The decompressor only needs to store: the list of code lengths [1, 3, 3, 3, 3] (one per symbol, in sorted order). From this alone, it can reconstruct the entire codebook. This is how DEFLATE (the algorithm inside gzip and PNG) stores its Huffman tables.
def canonical_codebook(lengths: dict[str, int]) -> dict[str, str]:
    """Generate canonical Huffman codes from code lengths."""
    # Sort by length, then by symbol
    sorted_syms = sorted(lengths.items(), key=lambda x: (x[1], x[0]))
    codebook = {}
    code = 0
    prev_len = 0
    for symbol, length in sorted_syms:
        if length > prev_len:
            code <<= (length - prev_len)
        codebook[symbol] = format(code, f'0{length}b')
        code += 1
        prev_len = length
    return codebook
Where Huffman Is Used Today
Huffman coding is not usually used alone — it's a building block inside larger formats:
- DEFLATE (gzip, zlib, PNG, zip): Two Huffman trees, one for literals/lengths, one for distances.
- JPEG: DC coefficient differences and AC run-length/size symbols are Huffman coded.
- MP3: Quantized spectral values are Huffman coded from a set of predefined tables.
- HTTP/2 HPACK: A static Huffman table for header value compression.
Pure Huffman is rare in modern systems because it requires two passes (one to count frequencies, one to encode) and doesn't handle adaptive probability models well. But it remains foundational: almost every format that does entropy coding has Huffman somewhere inside it.
Huffman's Limitations
Two-pass requirement: You need to know all frequencies before building the tree, which means reading the data twice or buffering it all. Adaptive Huffman (not covered here) addresses this but is more complex.
Integer code lengths: Each symbol gets a whole number of bits. If a symbol has probability 0.4, it ideally gets 1.32 bits — but Huffman must assign it 1 or 2. This integer rounding is where arithmetic coding gains its advantage.
Block-based: Huffman builds one tree for the whole block. If the data's statistics change mid-block (common in real files), the single tree is suboptimal for parts of the data.
Despite these limitations, Huffman coding's simplicity, efficiency, and optimality guarantee have kept it relevant for over 70 years. It's one of those rare algorithms that was essentially perfect from day one.
Summary
- Huffman coding assigns shorter binary codewords to more frequent symbols.
- The algorithm builds a binary tree bottom-up using a min-priority queue.
- The resulting prefix-free code is provably optimal among all prefix codes.
- Canonical Huffman reduces decompressor overhead to just storing code lengths.
- Huffman is a building block inside DEFLATE, JPEG, MP3, and dozens of other formats.
Next, we'll look at arithmetic coding — which breaks the "integer bits" barrier and can approach Shannon's limit even more closely.
Chapter 3: Arithmetic Coding
Huffman coding has a fundamental limitation: it assigns a whole number of bits to each symbol. A symbol with probability 0.5 gets 1 bit. A symbol with probability 0.9 gets 1 bit too — even though the Shannon optimal encoding would use only 0.152 bits. This waste accumulates.
Arithmetic coding breaks free from this constraint. Instead of assigning a fixed codeword to each symbol, it encodes an entire message as a single number — a fraction between 0 and 1. The length of that fraction (in bits) approaches the entropy of the message, regardless of individual symbol probabilities.
The price: arithmetic coding is more complex to implement and slightly slower. The gain: it can encode at very close to Shannon's theoretical limit, even for symbols with probabilities that don't map cleanly to powers of 1/2.
The Core Idea: Interval Subdivision
Imagine the number line from 0 to 1. We'll encode a message by progressively narrowing down to a smaller and smaller interval.
Setup: Assign each symbol a subinterval of [0, 1) proportional to its probability.
For a two-symbol alphabet (A, B) where P(A) = 0.7 and P(B) = 0.3:
Initial interval [0, 1):
0.0 0.7 1.0
├──────────────────────┤────────────┤
│ A │ B │
│ [0.0, 0.7) │ [0.7,1.0) │
Encoding: For each symbol in the message, narrow the current interval to the subinterval corresponding to that symbol, scaled to fit within the current interval.
Let's encode the message "ABA":
Step 1: Encode 'A'
Current interval: [0.0, 1.0)
A's sub-interval: [0.0, 0.7) of the current interval
New interval: [0.0 + 0.0×1.0, 0.0 + 0.7×1.0) = [0.0, 0.7)
New interval [0.0, 0.7), subdivided for the next symbol:
0.0              0.49      0.7
├─────────────────┤─────────┤
│   A [0.0, 0.49) │ B [0.49, 0.7)
Step 2: Encode 'B'
Current interval: [0.0, 0.7)
Width: 0.7
B's sub-interval: [0.7, 1.0) of the current interval
New low = 0.0 + 0.7 × 0.7 = 0.49
New high = 0.0 + 1.0 × 0.7 = 0.70
New interval: [0.49, 0.70)
0.0    0.49           0.637  0.70   1.0
├───────┤───────────────┤──────┤─────┤
        │       A       │  B   │
        │ [0.49, 0.637) │[0.637, 0.70)
Step 3: Encode 'A'
Current interval: [0.49, 0.70)
Width: 0.21
A's sub-interval: [0.0, 0.7) of the current interval
New low = 0.49 + 0.0 × 0.21 = 0.490
New high = 0.49 + 0.7 × 0.21 = 0.637
New interval: [0.490, 0.637)
Final interval: [0.490, 0.637)
The output is any value in the final interval, written as a binary fraction. Shorter candidates do exist inside the interval (0.5, binary 0.1, is in range), but the decoder also has to know when the value is fully specified. A standard rule (from Shannon–Fano–Elias coding) is to emit ⌈log₂(1/width)⌉ + 1 bits; here the width is 0.147, so ⌈2.77⌉ + 1 = 4 bits.
We pick 0.5625 = binary 0.1001, which lies within [0.490, 0.637). Output: 1001 → 4 bits.
Compare to Huffman: if we assigned A=0, B=10, message "ABA" would be 0 10 0 = 4 bits. A tie here, but arithmetic coding generalizes much better for probabilities that aren't powers of 1/2, and its advantage grows with message length.
The Update Formula
For an interval [low, high) and a symbol with cumulative probability range [cdf_low, cdf_high):
new_low = low + (high - low) × cdf_low
new_high = low + (high - low) × cdf_high
The cdf (cumulative distribution function) values come from the probability model. If we have symbols A (prob 0.7) and B (prob 0.3):
Symbol Prob CDF_low CDF_high
A 0.70 0.00 0.70
B 0.30 0.70 1.00
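The update formula is easy to run directly. A floating-point sketch (fine for short messages, unusable for long ones, as the next section explains), using the two-symbol model from the example:

```python
# CDF ranges for P(A) = 0.7, P(B) = 0.3, as in the worked example
CDF = {'A': (0.0, 0.7), 'B': (0.7, 1.0)}

def encode_interval(message: str) -> tuple[float, float]:
    """Narrow [low, high) once per symbol using the update formula."""
    low, high = 0.0, 1.0
    for sym in message:
        cdf_low, cdf_high = CDF[sym]
        width = high - low
        low, high = low + width * cdf_low, low + width * cdf_high
    return low, high

low, high = encode_interval("ABA")
print(low, high)   # ≈ 0.49, ≈ 0.637 — matching the worked example
```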
Decoding
Decoding is the reverse: given a value v in [0, 1), repeatedly determine which symbol's interval contains v, output that symbol, and rescale v into the new interval.
Decoding 0.5625 ("0.1001" in binary):
Step 1: Which symbol does 0.5625 fall in?
A covers [0.0, 0.7) — yes, 0.5625 < 0.7
Output: 'A'
Rescale: v = (0.5625 - 0.0) / 0.7 = 0.80357...
Step 2: 0.80357 falls in which symbol's range?
A covers [0.0, 0.7) — no, 0.80357 > 0.7
B covers [0.7, 1.0) — yes
Output: 'B'
Rescale: v = (0.80357 - 0.7) / 0.3 = 0.34524...
Step 3: 0.34524 falls in:
A covers [0.0, 0.7) — yes
Output: 'A'
(Stop when we've decoded the expected number of symbols)
Result: "ABA" ✓
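The same model drives decoding. Again a floating-point sketch; the symbol count is passed in explicitly, matching the "stop at the expected number of symbols" convention above:

```python
# Same CDF ranges as the encoder: P(A) = 0.7, P(B) = 0.3
CDF = {'A': (0.0, 0.7), 'B': (0.7, 1.0)}

def decode_value(v: float, n_symbols: int) -> str:
    """Repeatedly find which symbol's interval contains v, then rescale v into it."""
    out = []
    for _ in range(n_symbols):
        for sym, (lo, hi) in CDF.items():
            if lo <= v < hi:
                out.append(sym)
                v = (v - lo) / (hi - lo)   # rescale into the chosen sub-interval
                break
    return ''.join(out)

print(decode_value(0.5625, 3))   # ABA
```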
Implementation: Integer Arithmetic Coding
Working with floating-point numbers is numerically unstable for long messages (precision runs out quickly). Real implementations use fixed-precision integers and a renormalization trick to keep the interval values in a workable range.
The key insight: when the high and low values share a leading bit, that bit is "settled" and can be output. We then shift both values left by one bit to maintain precision.
PRECISION = 32
FULL = 1 << PRECISION # 2^32
HALF = FULL >> 1 # 2^31
QUARTER = FULL >> 2 # 2^30
def arithmetic_encode(symbols: list[int], model: list[int]) -> list[int]:
    """
    Encode symbols using integer arithmetic coding; returns a list of bits.
    model: list of cumulative frequencies (CDF), length = alphabet_size + 1
    model[i] = cumulative count up to (not including) symbol i
    model[-1] = total count
    """
    low = 0
    high = FULL
    pending_bits = 0  # bits deferred while the interval straddles the middle
    output_bits = []

    def emit_bit(bit):
        nonlocal pending_bits
        output_bits.append(bit)
        # Emit any pending opposite bits
        for _ in range(pending_bits):
            output_bits.append(1 - bit)
        pending_bits = 0

    total = model[-1]
    for sym in symbols:
        range_size = high - low
        high = low + (range_size * model[sym + 1]) // total
        low = low + (range_size * model[sym]) // total
        # Renormalize: emit settled bits, then rescale the interval
        while True:
            if high <= HALF:
                emit_bit(0)           # both in lower half: emit 0
            elif low >= HALF:
                low -= HALF           # both in upper half: emit 1
                high -= HALF
                emit_bit(1)
            elif low >= QUARTER and high <= 3 * QUARTER:
                low -= QUARTER        # straddling the middle: defer
                high -= QUARTER
                pending_bits += 1
            else:
                break
            low <<= 1
            high <<= 1
    # Flush enough bits to disambiguate the final interval
    pending_bits += 1
    if low < QUARTER:
        emit_bit(0)
    else:
        emit_bit(1)
    return output_bits
def build_model(text: str) -> tuple[list[int], dict[str, int], list[str]]:
    """Build a simple frequency model: cumulative counts plus symbol mapping."""
    from collections import Counter
    freq = Counter(text)
    alphabet = sorted(freq.keys())
    # Map characters to indices
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    counts = [freq[c] for c in alphabet]
    cdf = [0]
    for c in counts:
        cdf.append(cdf[-1] + c)
    return cdf, char_to_idx, alphabet
Adaptive Arithmetic Coding
The version above requires knowing the full probability model upfront — which means two passes over the data, or buffering the entire message. Adaptive arithmetic coding solves this by updating the model after each symbol.
The decompressor can mirror the encoder's updates exactly, because it sees the same symbols in the same order. This makes adaptive arithmetic coding a powerful tool for streaming compression.
Adaptive model update (simplified):
Initialize: all symbols have count = 1 (Laplace smoothing)
After encoding/decoding each symbol s:
freq[s] += 1
The probability model continuously refines itself to match
the actual distribution of the data.
Time →
┌──────────────────────────────────────────────┐
│ Early: uniform priors, poor compression │
│ After 100 symbols: model adapting │
│ After 1000 symbols: model closely matches │
│ true distribution │
└──────────────────────────────────────────────┘
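A minimal adaptive order-0 model looks like this (a sketch of the idea, not the LZMA machinery): counts start at 1, and both encoder and decoder bump the count after each symbol, so their CDFs never diverge.

```python
class AdaptiveModel:
    """Order-0 adaptive frequency model with Laplace smoothing (counts start at 1)."""

    def __init__(self, alphabet: list[str]):
        self.alphabet = alphabet
        self.counts = {s: 1 for s in alphabet}

    def prob(self, sym: str) -> float:
        """Current probability estimate for sym."""
        return self.counts[sym] / sum(self.counts.values())

    def update(self, sym: str):
        """Called after encoding/decoding each symbol, keeping both sides in lockstep."""
        self.counts[sym] += 1

m = AdaptiveModel(['A', 'B'])
print(m.prob('A'))     # 0.5 — uniform prior
for _ in range(8):
    m.update('A')      # the stream turns out to be A-heavy
print(m.prob('A'))     # 0.9 — the model has adapted
```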
Real-world adaptive coders (like those in LZMA and context-mixing compressors) use much more sophisticated context models — they condition the probability of the current symbol on the last N symbols, the current bit position, and other features. This is the heart of state-of-the-art compression.
Range Coding: The Practical Variant
Arithmetic coding as described works on the interval [0, 1) with arbitrary precision. Range coding, introduced by G. Nigel N. Martin in 1979, is an equivalent but practically cleaner implementation that works on integer ranges directly. It's faster and avoids some of the floating-point subtleties.
Range coding is what's used inside LZMA (7-zip, XZ), and is the standard in high-performance entropy coders.
The key difference: instead of maintaining [low, high) ⊂ [0, 1), maintain a range [low, low + range) where range is kept in a fixed-precision integer window by outputting bytes when it becomes too large.
Range coder state machine:
range = 0xFFFFFFFF (initial range, 32-bit)
low = 0
For each symbol:
1. Compute symbol's frequency band within current range
2. Update low and range accordingly
3. If range < BOTTOM_VALUE (e.g., 1<<24):
- Output high byte of low
- Shift low and range left by 8 bits (renormalize)
Probability Models: The Real Differentiator
The arithmetic coder itself is just a mechanism for converting probabilities to bits. What separates good compressors from great ones is the probability model — how accurately it predicts the next symbol.
Order-0 model: Probability of each symbol is its frequency in the file. Simple, fast, works well for uniform text.
Order-N model: Probability of the next symbol conditioned on the last N symbols. An order-4 model for English text uses context like "tion" to predict that the next character is probably a space, punctuation, or another vowel.
Prediction by Partial Matching (PPM): Uses the longest available context, falling back to shorter contexts when the long one hasn't been seen. PPM was state-of-the-art for general compression through the 1990s and still produces excellent compression ratios.
Context Mixing (CM): Combines predictions from multiple models using mixing weights. PAQ, the reference-grade maximum-compression tool, uses this approach and achieves remarkable compression ratios at enormous computational cost.
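The payoff of context is easy to measure directly. The sketch below (a pedagogical estimate, not a full coder) computes the average bits per character a perfect entropy coder would spend if each character were coded from the statistics of its n-character context — order-0 versus order-2 on a repetitive sample:

```python
from collections import Counter, defaultdict
from math import log2

def order_n_entropy(text: str, n: int) -> float:
    """Average bits/char when each char is coded from its n-char context's statistics."""
    contexts = defaultdict(Counter)
    for i in range(len(text) - n):
        contexts[text[i:i + n]][text[i + n]] += 1
    bits, chars = 0.0, 0
    for counts in contexts.values():
        total = sum(counts.values())
        for count in counts.values():
            bits += count * -log2(count / total)  # ideal code length for this symbol
            chars += count
    return bits / chars

sample = "the theory of the thing " * 200
print(order_n_entropy(sample, 0))  # order-0: symbol frequencies only
print(order_n_entropy(sample, 2))  # order-2: most contexts are nearly deterministic
```

On text like this, the order-2 estimate drops well below one bit per character — the same characters, just predicted from better context.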
Context model comparison on Canterbury Corpus (text):
Model Bits/char Decoder complexity
────────────────────────────────────────────────
Order-0 4.6 O(1)
Order-1 3.5 O(1)
Order-4 2.4 O(alphabet)
PPM-C 1.9 O(alphabet × context)
PAQ8 1.7 O(many models × context)
Shannon limit ~1.3 -
Where Arithmetic Coding Is Used
Arithmetic coding never became as ubiquitous as Huffman, partly because of patent issues in the 1990s (IBM and others held key patents), and partly because Huffman is simpler to implement and fast enough for most uses.
But arithmetic coding (specifically range coding) is central to:
- LZMA/7-Zip: Range coding for the entropy stage
- HEVC/H.265 video: CABAC (Context-Adaptive Binary Arithmetic Coding) for syntax element coding
- JPEG 2000: Binary arithmetic coding (the MQ-coder) replaces JPEG's Huffman stage
- WebM/VP9: Optimized boolean arithmetic coder
- Zstandard's ANS backend: A modern alternative (see below)
ANS: Asymmetric Numeral Systems
Asymmetric Numeral Systems (ANS), invented by Jarosław Duda (first described in 2009, with the refined variants published in 2013), is the modern successor to arithmetic coding for high-performance systems. It achieves the same near-optimal compression as arithmetic coding but is faster because it can be implemented without division operations.
ANS comes in two main flavors:
- rANS (range ANS): Processes symbols serially, similar to arithmetic coding
- tANS (tabled ANS): Uses precomputed tables for fast lookup-based coding
Zstandard, Facebook's widely-deployed compression library, uses tANS (called FSE — Finite State Entropy — in its implementation). It encodes at speeds comparable to Huffman while approaching arithmetic coding's compression ratio.
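The core rANS transform is small enough to show in full. This toy version (a sketch, not a production coder) keeps the entire state in one big Python integer; real implementations renormalize the state into a 32- or 64-bit word and stream bytes out as they go:

```python
from collections import Counter

def _tables(freqs):
    """Cumulative frequency table. (The total M need not be a power of two here.)"""
    cdf, acc = {}, 0
    for s in sorted(freqs):
        cdf[s] = acc
        acc += freqs[s]
    return cdf, acc  # acc == M, the sum of all frequencies

def rans_encode(msg, freqs):
    """Toy rANS: push symbols onto one integer state, in reverse order."""
    cdf, M = _tables(freqs)
    x = M  # initial state
    for s in reversed(msg):
        x = (x // freqs[s]) * M + cdf[s] + (x % freqs[s])
    return x

def rans_decode(x, n, freqs):
    """Pop n symbols back off the state; they come out in original order."""
    cdf, M = _tables(freqs)
    out = []
    for _ in range(n):
        slot = x % M
        s = next(t for t in cdf if cdf[t] <= slot < cdf[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cdf[s]
    return ''.join(out)

msg = "ABRACADABRA"
freqs = dict(Counter(msg))
code = rans_encode(msg, freqs)
assert rans_decode(code, len(msg), freqs) == msg
```

Note the asymmetry that gives ANS its name: encoding divides by the symbol's frequency and decoding multiplies by it, so frequent symbols grow the state slowly — they cost few bits — while rare symbols grow it quickly.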
Rough comparison of entropy coder performance:
Coder Speed (encode) Compression ratio Complexity
──────────────────────────────────────────────────────────────────
Huffman Very fast Good Low
Arithmetic Moderate Excellent High
Range coding Moderate Excellent High
rANS Fast Excellent Medium
tANS/FSE Very fast Excellent Medium
ANS is why Zstandard can be both fast and compress well — it uses tANS where DEFLATE uses Huffman, and that's a meaningful part of why Zstd beats gzip at comparable speeds.
Summary
- Arithmetic coding encodes an entire message as a single number in [0, 1).
- Each symbol narrows the interval proportionally to its probability.
- It can encode at arbitrarily close to Shannon's limit, including fractional bits.
- Adaptive variants update the model on-the-fly — no two-pass requirement.
- Range coding is the practical integer implementation used in real compressors.
- ANS (especially tANS/FSE) is the modern high-speed alternative, used in Zstandard.
- The probability model determines compression quality; the entropy coder just converts probabilities to bits efficiently.
In the next chapter, we shift from statistical coding to dictionary methods — a completely different approach that replaces repeated patterns with compact references rather than adjusting code lengths.
Chapter 4: Dictionary Methods — LZ77, LZ78, LZW
In 1977 and 1978, Abraham Lempel and Jacob Ziv published two papers that fundamentally changed compression. Their insight was different from Huffman's: instead of assigning shorter codes to frequent symbols, replace repeated strings with references to where they appeared before.
This is the dictionary approach: build a dictionary of previously seen data and encode new data as references into that dictionary. It's the foundation of zip, gzip, zlib, PNG, WebP, and dozens more formats.
Why Dictionary Methods Work
Text, code, and many binary formats are full of repetition:
- Words and phrases repeat in text
- Function names repeat in source code
- Packet headers repeat in network captures
- UI element descriptions repeat in HTML
Statistical coders (Huffman, arithmetic) can only exploit symbol frequency — if 'e' is common, give it a short code. But they can't directly exploit the pattern "the " appearing 10,000 times in a document. Dictionary methods can: encode "the " as a single token referencing its first occurrence.
Example: "abcdeabcde"
Statistical coder sees: 10 mostly-equal-frequency symbols.
Not much to gain; each symbol might get ~3 bits.
Total: ~30 bits.
Dictionary coder sees:
"abcde" → encode literally (5 symbols)
"abcde" → reference to 5 characters back, length 5
Output: literal "abcde", then reference (offset=5, length=5)
Potentially much shorter!
LZ77: The Sliding Window
LZ77 (1977) uses a sliding window: a buffer of recently seen data that serves as the dictionary. When encoding, the algorithm looks backward in the window for the longest match to the current position.
LZ77 window structure:
◄──── Search buffer (past data) ────►◄── Lookahead buffer ──►
│ already encoded / "dictionary" │ data to encode next │
└───────────────────────────────────┴────────────────────────┘
typically 32KB or 64KB typically 258 bytes
Current position is the boundary between the two.
Encoding a position: Find the longest string in the search buffer that matches the beginning of the lookahead buffer. Output a triple:
(offset, length, next_char)
offset: how far back the match starts
length: how long the match is
next_char: the first character after the match (that didn't match)
If no match found, output (0, 0, current_char) — a literal.
Step-by-Step Example
Encoding "AABCBBABC" with a window size of 8:
Position 0: Lookahead = "AABCBBABC"
Search buffer is empty. No match possible.
Output: (0, 0, 'A') [literal A]
Position 1: Lookahead = "ABCBBABC"
Search buffer: "A"
Match: 'A' at offset 1, length 1
Next char after match: 'B'
Output: (1, 1, 'B') [go back 1, copy 1 char, then 'B']
Position 3: Lookahead = "CBBABC"
Search buffer: "AAB"
No match for 'C' in buffer.
Output: (0, 0, 'C') [literal C]
Position 4: Lookahead = "BBABC"
Search buffer: "AABC"
'B' found at offset 2 (two positions back).
Match length 1. Next char: 'B'
Output: (2, 1, 'B')
Position 6: Lookahead = "ABC"
Search buffer: "AABCBB"
'A' at various positions. Let's check for "AB":
"AB" found starting at offset 5 (the "AB" in "AABC").
Then 'C' follows — does the match extend to "ABC"? Yes!
Output: (5, 3, '') — but we're at the end, so maybe (5, 3)
(In practice, a special end-of-stream marker is used)
Full encoded sequence: (0,0,'A'), (1,1,'B'), (0,0,'C'), (2,1,'B'), (5,3,...)
Without accounting for encoding overhead, we encoded 9 characters using 5 tokens — some savings, more dramatic on longer repeated sequences.
Finding Matches Efficiently
Naively, finding the longest match requires scanning the entire search buffer for each position — O(window_size × lookahead_size) per position, which is very slow for a 32KB window.
Real LZ77 implementations use hash tables to find candidate match positions in O(1), then verify and extend. DEFLATE uses a hash on the first 3 characters of the lookahead to find candidates.
def lz77_encode(data: bytes, window_size: int = 255, max_match: int = 255):
    """Simple LZ77 encoder (pedagogical, not optimized)."""
    i = 0
    tokens = []
    while i < len(data):
        best_offset = 0
        best_length = 0
        # Search backward in the window for the longest match
        search_start = max(0, i - window_size)
        for j in range(search_start, i):
            # How long is the match starting at j vs i?
            # Overlapping matches (length > i - j) work naturally here,
            # because data[j + length] always lies behind data[i + length].
            length = 0
            while (length < max_match
                   and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_length = length
                best_offset = i - j
        if best_length > 2:  # Only use a reference if it saves bits
            next_char = data[i + best_length] if i + best_length < len(data) else None
            tokens.append(('ref', best_offset, best_length, next_char))
            i += best_length + (1 if next_char is not None else 0)
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens
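The decoder side shows why LZ77 decompression is so fast — there is no searching at all, just appending literals and copying earlier output. A minimal decoder for the same token shapes, ('lit', byte) and ('ref', offset, length, next_byte_or_None):

```python
def lz77_decode(tokens):
    """Decode ('lit', byte) and ('ref', offset, length, next_byte_or_None) tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, offset, length, nxt = tok
            start = len(out) - offset
            for k in range(length):          # byte-by-byte copy handles
                out.append(out[start + k])   # overlapping matches correctly
            if nxt is not None:
                out.append(nxt)
    return bytes(out)

# The tokens from the "AABCBBABC" walkthrough above:
tokens = [('lit', ord('A')), ('ref', 1, 1, ord('B')), ('lit', ord('C')),
          ('ref', 2, 1, ord('B')), ('ref', 5, 3, None)]
print(lz77_decode(tokens))  # b'AABCBBABC'
```

The byte-by-byte copy matters: when offset is smaller than length, the match copies bytes it has itself just produced — that's how LZ77 encodes runs.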
LZ78: The Growing Dictionary
LZ78 (1978) takes a different approach: instead of a sliding window over past data, it builds an explicit dictionary of phrases encountered so far.
The dictionary grows with the data. When a new string is seen:
- Look it up in the dictionary.
- If found, extend the lookup by one more character.
- When the extension isn't found, output the longest found phrase's index plus the new character, and add the new phrase to the dictionary.
LZ78 encoding of "AABABCABABCABC":
Dictionary starts empty.
Char 'A': Not in dict. Output (0, 'A'). Add "A" → entry 1.
Dict: {1:"A"}
Char 'A': In dict (entry 1). Try to extend.
Char 'B': "AB" not in dict. Output (1, 'B'). Add "AB" → entry 2.
Dict: {1:"A", 2:"AB"}
Char 'A': Entry 1. Extend.
Char 'B': "AB" is entry 2. Extend.
Char 'C': "ABC" not in dict. Output (2, 'C'). Add "ABC" → entry 3.
Dict: {1:"A", 2:"AB", 3:"ABC"}
Char 'A': Entry 1. Extend.
Char 'B': "AB" is entry 2. Extend.
Char 'A': "ABA" not in dict. Output (2, 'A'). Add "ABA" → entry 4.
...
Output sequence: (0,'A'), (1,'B'), (2,'C'), (2,'A'), (0,'B'), (0,'C'), (3, end) — the remaining input "BCABC" parses as the new phrases "B" and "C" plus the existing phrase "ABC".
LZ78's advantage: the dictionary can capture patterns from anywhere in the past, not just the most recent window. Its disadvantage: the dictionary can grow unbounded (or must be reset), and it's harder to implement efficiently.
LZW: LZ78 Without the Literal
LZW (Lempel-Ziv-Welch, 1984) simplifies LZ78 by initializing the dictionary with all single characters, eliminating the need to transmit literals.
Key change: Output is just the dictionary index, never a raw character. Since all single characters are pre-loaded, every output is an index reference.
LZW became famous because it was used in:
- GIF (the primary reason GIF images exist)
- TIFF (optional compression)
- PDF (early versions)
- Unix compress utility
It was also famously patented by Unisys, which caused the GIF controversy of the 1990s and drove developers toward PNG (which uses DEFLATE instead).
LZW encoding of "ABABABAB":
Initialize dict: {'A':0, 'B':1, 'C':2, ..., 'Z':25}
(all 256 ASCII values for binary data)
Next code: 256
i=0: Read 'A'. Match so far: "A"
Read 'B'. "AB" not in dict.
Output: 0 (code for 'A')
Add "AB" → 256. Match = "B"
i=1: Match: "B". Read 'A'. "BA" not in dict.
Output: 1 (code for 'B')
Add "BA" → 257. Match = "A"
i=2: Match: "A". Read 'B'. "AB" is in dict! (code 256). Match = "AB"
Read 'A'. "ABA" not in dict.
Output: 256 (code for "AB")
Add "ABA" → 258. Match = "A"
i=3: Match: "A". Read 'B'. "AB" is in dict (256). Match = "AB"
Read 'A'. "ABA" is now in dict (258)! Match = "ABA"
Read 'B'. "ABAB" not in dict.
Output: 258 (code for "ABA")
Add "ABAB" → 259. Match = "B"
End of input: Output: 1 (code for "B")
Output: 0, 1, 256, 258, 1
The elegant trick in LZW: the decompressor can reconstruct the dictionary in sync with the encoder, because each dictionary entry is added exactly when the encoder adds it. The decompressor needs no dictionary transmitted separately.
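Both directions fit in a few lines. The sketch below (pedagogical; real implementations pack codes into variable-width bitfields and cap the dictionary) includes the one famous corner case: a code that arrives before the decoder has added it, which can only be the "KwKwK" pattern:

```python
def lzw_encode(data: bytes):
    """LZW: dictionary pre-loaded with all single bytes; output is codes only."""
    dict_ = {bytes([i]): i for i in range(256)}
    next_code = 256
    w, out = b"", []
    for ch in data:
        wc = w + bytes([ch])
        if wc in dict_:
            w = wc                      # keep extending the current match
        else:
            out.append(dict_[w])        # emit longest known phrase
            dict_[wc] = next_code       # add the extension
            next_code += 1
            w = bytes([ch])
    if w:
        out.append(dict_[w])
    return out

def lzw_decode(codes):
    """Rebuild the dictionary in sync with the encoder; no table is transmitted."""
    dict_ = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dict_[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        if code in dict_:
            entry = dict_[code]
        else:                           # KwKwK case: code not yet in our dict
            entry = w + w[:1]
        out += entry
        dict_[next_code] = w + entry[:1]  # same entry the encoder just added
        next_code += 1
        w = entry
    return bytes(out)

codes = lzw_encode(b"ABABABAB")
print(codes)                  # [65, 66, 256, 258, 66]
print(lzw_decode(codes))      # b'ABABABAB'
```

Note that code 258 is emitted by the encoder one step after it was created — the decoder receives it before adding it, and the `w + w[:1]` fallback reconstructs it.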
How ZIP Actually Works
ZIP files use DEFLATE compression (Chapter 5), but understanding how a zip container is structured illuminates how dictionary compression is packaged for real use.
ZIP file structure:
┌─────────────────────────────────────────────────────────────┐
│ Local file header + compressed data (for each file) │
│ ┌──────────────────┬──────────────────────────────────┐ │
│ │ Local File Header│ Compressed file data │ │
│ │ - Magic: PK\x03\x04 │ │
│ │ - Version needed │ │
│ │ - Compression method (0=store, 8=DEFLATE) │ │
│ │ - Last-modified time/date │ │
│ │ - CRC-32 │ │
│ │ - Compressed size │ │
│ │ - Uncompressed size │ │
│ │ - Filename length + filename │ │
│ └──────────────────┴──────────────────────────────────┘ │
│ ... (repeated for each file) │
│ │
│ Central Directory (one entry per file) │
│ - Same metadata as local headers │
│ - Offset to local header in archive │
│ - Extra: file comment, external attributes │
│ │
│ End of Central Directory Record │
│ - Offset and size of central directory │
│ - Archive comment │
└─────────────────────────────────────────────────────────────┘
Key design decisions in ZIP:
- Per-file compression: Each file is compressed independently. This allows random access to individual files without decompressing the whole archive.
- No archive-level dictionary: Files don't share a dictionary. This is why .tar.gz (archive everything into one stream first, then compress it) often beats .zip for collections of similar files — tar concatenates the files, so gzip can find cross-file patterns.
- Central directory at the end: Lets ZIP writers stream — append each file's header and data as they go, then write all the metadata in one place at the end — and lets readers list the archive quickly without decompressing any data.
import zipfile
import io

# Creating a ZIP in Python
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('hello.txt', 'Hello, world! ' * 1000)
    zf.writestr('data.csv', 'id,name,value\n' * 500)

# Reading it back
buf.seek(0)
with zipfile.ZipFile(buf, 'r') as zf:
    print(zf.namelist())                          # ['hello.txt', 'data.csv']
    print(zf.getinfo('hello.txt').compress_size)  # much smaller than original
Comparing the LZ Variants
Algorithm Dictionary Window Best for
──────────────────────────────────────────────────────────────────
LZ77 Sliding Recent data Streaming, simple decoder
LZ78 Growing All past Better adaptation over time
LZW Growing All past Simple implementation, patents
DEFLATE Sliding+Hash 32KB Universal (see Chapter 5)
LZMA Sliding+Markov Up to 4GB Maximum compression (Ch 6)
The LZ77 family (sliding window) is much more common in practice because:
- Bounded memory: a fixed-size window means predictable memory use
- Simple decoder: just copy bytes from the output buffer
- Better for streaming: doesn't require dictionary reset
The LZ78 family excels when data has patterns that are far apart, but the unbounded dictionary is a practical challenge.
The Asymmetry That Matters
One of the most important properties of LZ-based compression: decompression is much faster than compression.
Compression requires searching for matches — expensive. Decompression just follows back-references — basically a memcpy. This asymmetry is intentional and valuable: you compress once (or slowly in batch), you decompress many times (fast, on demand).
Typical LZ77-family speed asymmetry:
gzip compress: ~25 MB/s decompress: ~250 MB/s (10× asymmetry)
Zstd compress: ~400 MB/s decompress: ~1200 MB/s (3× asymmetry)
LZ4 compress: ~700 MB/s decompress: ~4000 MB/s (6× asymmetry)
LZ4 is interesting: it's designed so that decompression is essentially never the bottleneck — on most systems it decompresses faster than storage can deliver the compressed bytes. That is exactly the design goal for in-memory and database compression.
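You can see the asymmetry on your own machine with the standard library's zlib (exact timings vary by hardware, so treat the printed numbers as illustrative):

```python
import time
import zlib

data = b"2024-01-01 INFO request handled in 12ms\n" * 100_000  # ~4MB of log-like text

t0 = time.perf_counter()
compressed = zlib.compress(data, 6)          # match searching: the slow direction
t_compress = time.perf_counter() - t0

t0 = time.perf_counter()
restored = zlib.decompress(compressed)       # copying back-references: the fast one
t_decompress = time.perf_counter() - t0

assert restored == data
print(f"compress {t_compress:.3f}s, decompress {t_decompress:.3f}s")
```

On typical hardware the decompress time is several times smaller, and the gap widens at higher compression levels, which spend more time searching without making the decoder do any more work.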
Summary
- LZ77 uses a sliding window as a dictionary; references are (offset, length) pairs.
- LZ78 builds an explicit, growing dictionary of seen phrases.
- LZW pre-initializes the dictionary and eliminates explicit literals.
- ZIP uses DEFLATE (LZ77 + Huffman) per file, enabling random access at the cost of cross-file compression.
- The asymmetry between compression and decompression speed is a feature, not a bug.
The next chapter shows how modern compressors — DEFLATE, Brotli, and Zstandard — took the LZ77 core and built major improvements around it.
Chapter 5: Modern Lossless — Deflate, Brotli, Zstandard
The LZ family gave us the dictionary method. Huffman coding gave us the entropy stage. The insight of DEFLATE was: combine them. Use LZ77 to remove repetition, then Huffman to compress the residue. The result became the compression workhorse of the internet — used inside gzip, zlib, PNG, PDF, and zip files.
Brotli and Zstandard improved on DEFLATE with better probability modeling, larger dictionaries, and faster algorithms. Zstd in particular has become the default choice for new systems: it's both faster than gzip and compresses better.
DEFLATE: LZ77 + Huffman
DEFLATE was designed by Phil Katz in the early 1990s (for PKZIP) and later formalized in RFC 1951. It's not one algorithm — it's a carefully designed combination:
DEFLATE pipeline:
Raw data
│
▼
LZ77 matching
(find back-references in 32KB sliding window)
│
▼
Token stream: literals and (offset, length) pairs
│
▼
Huffman encoding
(two trees: one for literals/lengths, one for offsets)
│
▼
Compressed bitstream
The Two Huffman Trees
DEFLATE uses two Huffman trees simultaneously:
Literal/Length tree: Encodes 286 possible values.
- 0–255: literal byte values
- 256: end-of-block marker
- 257–285: match length codes (covering lengths 3–258; most codes carry extra bits that select the exact length within a range)
Distance tree: Encodes 30 possible distance codes (representing the back-reference offset). Short distances get short codes; long distances (near 32KB) get longer codes.
DEFLATE symbol types:
Literal 'A' (0x41):
┌──────────┐
│ 0x41 │ → Literal/Length Huffman code for 0x41
└──────────┘
Back-reference (offset=7, length=5):
┌──────────┬─────────────┐
│ Len code │ Dist code │
│ (259) │ (5) │
└──────────┴─────────────┘
Length code 259 means "length 5" (no extra bits)
Distance code 5 means "offset 7–8" (1 extra bit selects 7)
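The length half of this mapping comes straight from RFC 1951's table: eight codes with no extra bits, then groups of four codes with 1, 2, 3, 4, and 5 extra bits, plus a dedicated code for length 258. A small sketch can reconstruct it:

```python
def deflate_length_code(length):
    """Map a match length (3-258) to (code, extra_bits, extra_value), per RFC 1951."""
    assert 3 <= length <= 258
    if length == 258:
        return 285, 0, 0              # length 258 gets its own code
    table = []                        # (code, base_length, extra_bits)
    code, base = 257, 3
    for extra in range(6):
        for _ in range(8 if extra == 0 else 4):
            table.append((code, base, extra))
            code += 1
            base += 1 << extra        # each code covers 2^extra lengths
    for c, b, e in reversed(table):
        if length >= b:
            return c, e, length - b

print(deflate_length_code(5))    # (259, 0, 0)
print(deflate_length_code(12))   # (265, 1, 1): base 11, extra bit selects 12
```

The distance table works the same way: 30 codes whose extra bits grow from 0 up to 13 as offsets approach 32KB, so nearby (common) offsets cost fewer bits.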
Block Types
DEFLATE has three block types, and the compressor chooses per-block which to use:
- Type 0 (stored): No compression. Used when the input is already compressed or random.
- Type 1 (fixed Huffman): Uses predefined, hardcoded Huffman codes. Fast to encode, slightly suboptimal.
- Type 2 (dynamic Huffman): Stores custom Huffman trees at the start of each block, then uses them. Optimal but has header overhead.
gzip file structure:
┌─────────────────────────────────────────────────────┐
│ gzip header (10 bytes) │
│ - Magic: \x1f\x8b │
│ - Compression method: 8 (DEFLATE) │
│ - Flags, modification time, OS │
├─────────────────────────────────────────────────────┤
│ DEFLATE compressed data (multiple blocks) │
│ Block 1: [type][Huffman tables if type 2][data] │
│ Block 2: [type][Huffman tables if type 2][data] │
│ ... │
│ Final block: BFINAL bit set │
├─────────────────────────────────────────────────────┤
│ CRC-32 of original data (4 bytes) │
│ Original size mod 2^32 (4 bytes) │
└─────────────────────────────────────────────────────┘
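The header and trailer are easy to verify from Python, since the gzip module produces exactly this layout:

```python
import gzip
import struct
import zlib

payload = b"hello gzip " * 200
blob = gzip.compress(payload)

assert blob[:2] == b"\x1f\x8b"   # magic bytes
assert blob[2] == 8              # compression method: 8 = DEFLATE

# Trailer: CRC-32 of the original data, then original size mod 2^32
crc, isize = struct.unpack("<II", blob[-8:])
assert crc == zlib.crc32(payload)
assert isize == len(payload)
```

The size-mod-2^32 field is why naive tools misreport the uncompressed size of gzip files over 4GB — only the low 32 bits are stored.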
DEFLATE's Limitations
DEFLATE was designed in 1991 and it shows:
- 32KB window: Modern files are often megabytes or gigabytes. Patterns farther than 32KB apart can't be referenced.
- Huffman entropy coding: As we saw in Chapter 3, arithmetic coding and ANS can do better for skewed probabilities.
- Static per-block model: The Huffman code lengths are fixed for an entire block, so probabilities can't adapt symbol-by-symbol or condition on context the way arithmetic and ANS coders can.
- Sequential only: No parallelism; compressor must process the stream serially.
Despite these limitations, DEFLATE is everywhere. The decompressor is simple enough to implement in firmware, the format is standardized and stable, and it's good enough for most uses.
Brotli: Google's Web Compressor
Brotli was released by Google in 2013 and standardized in RFC 7932 (2016). It's designed primarily for compressing web content (HTML, CSS, JS), and it has a trick that DEFLATE doesn't: a built-in static dictionary.
The Static Dictionary
Brotli ships with a 120KB dictionary of common web strings. Things like "<!DOCTYPE html>", "Content-Type: ", "text/javascript", "function ", and thousands of common JavaScript, HTML, and CSS tokens.
References to this static dictionary can be made without the dictionary appearing in the compressed stream at all — the decompressor already has it. This gives Brotli a massive advantage on small web resources where a traditional compressor would waste most of its window learning the format.
Brotli static dictionary example:
Compressing: '<html lang="en">'
Without static dict:
Encode each character literally, or hope it appears earlier in the window.
With Brotli static dict:
'<html ' → static dict entry #1234 (immediate reference, no prior occurrence needed)
'lang="en"' → static dict entry #5678
'>' → literal
Much better compression for the first occurrence of common patterns.
Brotli's Other Improvements
Beyond the static dictionary, Brotli improves on DEFLATE with:
- Larger sliding window: Up to 16MB (vs 32KB for DEFLATE)
- Context modeling: Uses 64 context maps, conditioning symbol probabilities on the context (previous bytes, literal vs. length/distance)
- Distance ring buffer: Tracks the four most recent distances; referencing a recent distance costs fewer bits
- Better match finding: More sophisticated algorithms for finding long matches
Compression ratios on web content (lower bits/byte = better):
Format HTML JS CSS Average
────────────────────────────────────────────────────
Uncompressed (bytes) 100% 100% 100% 100%
gzip level 6 27% 32% 23% 27%
Brotli level 6 22% 26% 19% 22%
Brotli level 11 19% 23% 17% 20%
Brotli compresses better, but it's slower than gzip at comparable quality settings. This is acceptable for static assets that are compressed once and served many times. For dynamic content, gzip or Zstd are better choices.
Brotli's decompression speed is comparable to gzip — fast enough for real-time web serving.
Zstandard: The New General Purpose
Zstandard (zstd) was developed at Facebook and open-sourced in 2016. It's designed to be:
- Fast at both compression and decompression
- Competitive with gzip at comparable quality
- Better than gzip at higher quality settings
- A drop-in replacement for zlib in many contexts
Zstandard has become a default across the ecosystem: the Linux kernel (kernel image, btrfs, zram), Fedora's and Ubuntu's package formats, and Facebook's storage infrastructure all use it.
Zstd's Core Architecture
Zstandard encoding pipeline:
Raw data
│
├──► LZ77 matching (larger window than DEFLATE, configurable)
│
├──► Sequence encoding (literals + match tokens)
│
├──► Literal compression (Huffman entropy coding)
│
├──► Sequence compression (tANS/FSE for the literal-length, match-length, and offset codes)
│
▼
zstd frame (with optional frame header, checksums)
The key innovations in Zstd:
tANS (Finite State Entropy): Zstd uses tANS — its FSE implementation — for the sequence symbols, and Huffman for literal bytes. As discussed in Chapter 3, tANS approaches arithmetic coding quality at Huffman speed.
Training dictionaries: Zstd supports trained dictionaries — if you're compressing many similar small files (like JSON API responses), you can train a dictionary on a sample set. References to the dictionary dramatically improve compression of small inputs.
Configurable window size: DEFLATE is stuck at 32KB. Zstd can use windows up to 2GB, enabling better compression of large files with distant repetitions.
Parallel compression: zstd can split the input into blocks and compress them in parallel using multiple cores (with the --threads flag). Output is still a single stream, fully compatible with single-threaded decompression.
Zstd Compression Levels
Zstd's level system is broader than gzip's 1-9:
Zstd compression levels:
Level Speed (encode) Ratio Notes
──────────────────────────────────────────────────────
1 Very fast Moderate Competitive with LZ4
3 Fast Good Default; beats gzip speed
6 Moderate Good Near gzip -6 quality, faster
9 Moderate Better Better than gzip -9
12 Slow Best fast Starts using more memory
19 Very slow Excellent Near Brotli quality
22 Extremely slow Maximum (--ultra flag required)
Negative levels (e.g. -1, -5): faster than level 1, less compression.
Useful for real-time or log compression where CPU is the bottleneck.
# Python example using the zstandard library
import zstandard as zstd

data = b'example payload ' * 1000

# Basic compression
cctx = zstd.ZstdCompressor(level=3)
compressed = cctx.compress(data)

dctx = zstd.ZstdDecompressor()
original = dctx.decompress(compressed)

# Streaming (memory-efficient for large data)
cctx = zstd.ZstdCompressor(level=6)
with open('large_file.txt', 'rb') as fin:
    with open('large_file.zst', 'wb') as fout:
        cctx.copy_stream(fin, fout)

# With a trained dictionary (for many small files)
# Train on samples:
samples = [b'{"user":"alice","action":"login"}',
           b'{"user":"bob","action":"view"}',
           ...]  # many samples
dict_data = zstd.train_dictionary(8192, samples)  # 8KB dict

# Compress with dictionary:
cctx = zstd.ZstdCompressor(dict_data=dict_data, level=3)
compressed = cctx.compress(b'{"user":"carol","action":"login"}')
# Much smaller than without dictionary!
Why Zstd Won
Head-to-head: Zstd vs. gzip on a 1GB log file
Metric gzip -6 zstd -3 zstd -6
──────────────────────────────────────────────────────────
Compression time 28s 1.8s 4.2s
Compressed size 312MB 320MB 298MB
Decompression time 3.2s 0.9s 0.9s
Compression ratio 3.2× 3.1× 3.4×
Memory usage (compress) 256KB 8MB 16MB
Memory usage (decompress) 32KB 128KB 128KB
Zstd is ~15× faster to compress at comparable ratio, and ~3.5× faster to decompress. The cost is slightly higher memory usage — acceptable for almost every modern context.
For static web assets, Brotli at level 11 is better (higher compression ratio). For real-time or batch compression of dynamic data, Zstd is the choice.
Putting It Together: When to Use What
Decision tree for lossless compression:
Is compatibility with old tools required?
├─ Yes → gzip/DEFLATE (ubiquitous)
└─ No → continue
Is compression speed the primary constraint?
├─ Yes (real-time, < 100ms) → LZ4 or Zstd level 1
└─ No → continue
Is decompression speed critical?
├─ Very (database, in-memory) → LZ4 or Snappy
└─ No → continue
Is the data primarily web content?
├─ Yes, static assets → Brotli level 6-11
├─ Yes, dynamic content → Zstd level 3 or gzip
└─ No → continue
Is maximum compression ratio needed?
├─ Yes, time doesn't matter → LZMA/7-zip (Chapter 6)
└─ No → Zstd level 6-9 (sweet spot)
Under the Hood: The zlib API
Most code doesn't call DEFLATE directly — it calls zlib, the C library that wraps DEFLATE and exposes a streaming API. Understanding the zlib API helps when debugging compression issues:
#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

// Compress in one shot
int compress_buffer(const uint8_t *src, size_t src_len,
                    uint8_t *dst, size_t *dst_len) {
    z_stream strm = {0};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)  // default maps to level 6
        return Z_STREAM_ERROR;
    strm.next_in = (Bytef *)src;
    strm.avail_in = src_len;
    strm.next_out = dst;
    strm.avail_out = *dst_len;
    int ret = deflate(&strm, Z_FINISH);
    *dst_len = strm.total_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}
The streaming API (deflate/inflate called repeatedly) handles data larger than memory by processing in chunks. This is how nginx compresses HTTP responses on-the-fly without buffering the entire response body.
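The same chunk-at-a-time pattern is available from Python via zlib's compressobj. A minimal sketch of a streaming gzip compressor (the wbits value 16 + MAX_WBITS selects gzip framing rather than raw zlib):

```python
import gzip
import zlib

def gzip_stream(chunks, level=6):
    """Compress an iterable of byte chunks without buffering the whole input."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip framing
    for chunk in chunks:
        out = comp.compress(chunk)  # may return b'' while zlib buffers internally
        if out:
            yield out
    yield comp.flush()              # emit remaining buffered data + gzip trailer

body = [b"chunk-a" * 1000, b"chunk-b" * 1000]
blob = b"".join(gzip_stream(body))
restored = gzip.decompress(blob)
assert restored == b"".join(body)
```

Note that compress() is allowed to return nothing for a given chunk — zlib holds data back until it has enough to emit a block, which is why the flush() at the end is mandatory.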
Summary
- DEFLATE = LZ77 + Huffman, 32KB window. Universal, fast to decompress, shows its age.
- Brotli adds a static web dictionary and better entropy coding. Ideal for static web assets.
- Zstandard uses tANS entropy coding, configurable windows, and trained dictionaries. The modern default for general-purpose compression — faster than gzip, better compression at high levels.
- For most new systems: use Zstd. For web serving: serve Brotli for supporting clients, gzip for older ones.
Next up: LZMA and 7-Zip — what happens when you push lossless compression to its practical limit.
Chapter 6: LZMA and 7-Zip
LZMA (Lempel-Ziv-Markov chain Algorithm) is the compression algorithm behind 7-Zip's .7z format, and also used in .xz — the format that replaced .bz2 in most Linux distributions. It achieves compression ratios that gzip can't touch, at the cost of significantly more time and memory.
Understanding LZMA means understanding what happens when you take every technique we've discussed and combine them: large sliding windows, Markov chain probability models, and range coding for entropy. The result is a compressor that's slow by design — but for cold storage, software distribution, and archiving, slow-to-compress is fine.
The LZMA Architecture
LZMA is best understood as a three-layer system:
LZMA compression layers:
Raw data
│
▼
┌─────────────────────────────────┐
│ Layer 1: LZ77 Matching │
│ - Sliding window (up to 4GB) │
│ - Match finder (hash chains, │
│ binary trees, or BT4) │
│ - Output: literal bytes + │
│ (offset, length) pairs │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Layer 2: Markov Model │
│ - Conditions each bit on │
│ context (recent bits, │
│ match/literal state machine) │
│ - Hundreds of probability │
│ models, one per context │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Layer 3: Range Coding │
│ - Arithmetic-coding-equivalent │
│ - Each bit encoded using its │
│ model's probability estimate │
└─────────────────────────────────┘
│
▼
.7z / .xz compressed stream
Layer 1: The Large Sliding Window
DEFLATE uses a 32KB sliding window. Brotli goes up to 16MB. LZMA goes up to 4GB (configurable, with common settings at 4MB–64MB for a balance of memory and compression).
Why does window size matter so much?
Example: Compressing a 10MB C source code archive
Pattern " return NULL;" appears 847 times throughout the file.
The first occurrence is at byte 0, the last at byte 9,900,000.
DEFLATE (32KB window):
First occurrence: encode literally
Subsequent occurrences within 32KB: back-reference (cheap)
After moving 32KB past last reference: must re-encode (expensive!)
Result: can only reference the ~last 32,000 bytes.
LZMA (64MB window):
First occurrence: encode literally
ALL 847 subsequent occurrences: back-reference (cheap)
Even occurrences 9MB later can reference the first.
For software distributions — which typically contain many similar files, headers, and boilerplate — large windows dramatically improve compression.
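The window effect is directly observable with the standard library. This sketch plants two identical incompressible blocks 1MB apart — far outside DEFLATE's 32KB window but well inside LZMA's dictionary (sizes will vary slightly run to run because the data is random):

```python
import lzma
import os
import zlib

block = os.urandom(100_000)        # incompressible on its own
filler = os.urandom(1_000_000)     # 1MB separating the two copies
data = block + filler + block      # repeat is far outside DEFLATE's 32KB window

deflate_size = len(zlib.compress(data, 9))       # can't see the repeat at all
lzma_size = len(lzma.compress(data, preset=6))   # dictionary >> 1MB: repeat is nearly free

print(deflate_size, lzma_size)  # LZMA saves roughly the size of one block
```

DEFLATE's output stays around the full input size, while LZMA encodes the second block as a single long back-reference — the saving is almost exactly the 100KB block.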
Match Finders
Finding the best match in a large window requires sophisticated data structures. LZMA uses several:
Hash chains (fast, less optimal): Hash the first 2-3 characters of each potential match position; store chains of positions with the same hash.
Binary trees (BT3, BT4): For each hash bucket, maintain a binary tree sorted by the actual string content at each position. The 4 in BT4 means the first 4 bytes are hashed before tree lookup. This finds near-optimal matches but requires more memory and CPU.
Match finder complexity tradeoff:
Finder Memory Speed Match quality
─────────────────────────────────────────────
HC4 Medium Fast Good
BT2 High Slow Excellent
BT3 Higher Slower Excellent
BT4 Highest Slowest Excellent (best for text)
Layer 2: Markov Chain Models
This is where LZMA gets unusual. After the LZ77 stage produces a stream of tokens (literals and back-references), LZMA doesn't just Huffman-code them. It uses a rich set of probability models, one per context.
The insight: the probability of the next bit depends heavily on what state the encoder is in. A bit that's part of a match length is statistically different from a bit that's part of a literal. A bit that's the high bit of a match offset is different from a bit that's the low bit.
LZMA defines a finite state machine with states like:
- LIT: Last token was a literal
- MATCH: Last token was a match
- LONGREP: Last token was a "long repeat" match
- SHORTREP: Last token was a short repeat (single-byte repeat)
And within each state, each bit position within each field (literal byte, offset, length) has its own probability model.
LZMA state machine (simplified):
┌─────────────────────────────────────────┐
│ STATES │
│ │
│ LIT ──────► MATCH ──────► LONGREP │
│ ▲ │ ▲ │ ▲ │
│ │ ▼ │ ▼ │ │
│ └────── SHORTREP ◄──── LONGREP1 │
└─────────────────────────────────────────┘
Each state has its own probability tables for the next token type.
Each bit within each token field has its own probability model
conditioned on the current state + previously coded bits.
Total probability models: hundreds
This is why LZMA compresses so well — and why it's slow. The price of accurate probability models is maintaining and querying hundreds of them.
Layer 3: Range Coding
The probability estimates from the Markov models feed directly into a range coder (the arithmetic coding variant from Chapter 3). Each bit is encoded using its own probability, approaching the Shannon limit.
Range coding here operates at the bit level (binary arithmetic coding): each bit is either 0 or 1, and the model gives the probability of 0. The range coder narrows its interval accordingly.
# Simplified LZMA range coder core (pedagogical)
class RangeCoder:
    RANGE_TOP = 1 << 24
    MODEL_BITS = 11
    MODEL_TOTAL = 1 << MODEL_BITS  # 2048

    def __init__(self):
        self.low = 0
        self.range = 0xFFFFFFFF
        self.output = []

    def encode_bit(self, prob: int, bit: int) -> int:
        """
        Encode one bit using probability model.
        prob: probability of bit=0, in range [0, 2048)
        """
        bound = (self.range >> self.MODEL_BITS) * prob
        if bit == 0:
            self.range = bound
            # Update model toward 0 (increase prob of 0)
            prob += (self.MODEL_TOTAL - prob) >> 5
        else:
            self.low += bound
            self.range -= bound
            # Update model toward 1 (decrease prob of 0)
            prob -= prob >> 5
        # Renormalize
        while self.range < self.RANGE_TOP:
            self.output.append((self.low >> 24) & 0xFF)
            self.low = (self.low << 8) & 0xFFFFFFFF
            self.range <<= 8
        return prob  # return updated probability for next call
The model update (prob += (MODEL_TOTAL - prob) >> 5) is an exponential moving average — recent bits influence the probability more than old ones. The >> 5 factor controls the adaptation rate: faster adaptation tracks local statistics, slower adaptation tracks global statistics.
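To see the adaptation rate in isolation, here is the update rule on its own (a standalone sketch, not part of the coder above): feeding it a run of identical bits drives the model toward certainty, with the shift-5 factor setting how fast.

```python
MODEL_TOTAL = 1 << 11  # 2048, as in the range coder

def update(prob: int, bit: int) -> int:
    """One LZMA-style probability update (shift-5 exponential moving average)."""
    if bit == 0:
        return prob + ((MODEL_TOTAL - prob) >> 5)
    return prob - (prob >> 5)

prob = MODEL_TOTAL // 2   # start undecided: p(0) = 0.5
for _ in range(50):       # observe fifty 0-bits in a row
    prob = update(prob, 0)
# prob has climbed well above 1024, approaching MODEL_TOTAL
```

Note that the integer shift means the probability never quite reaches 0 or 2048, so neither symbol ever becomes "impossible" for the range coder.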
The 7-Zip Container Format (.7z)
LZMA (and its enhanced version LZMA2) is the primary codec in the .7z container format.
.7z file structure:
┌─────────────────────────────────────────────────┐
│ Signature: '7z\xBC\xAF\x27\x1C' (6 bytes) │
│ Version + CRC of "Next Header Info" (12 bytes) │
├─────────────────────────────────────────────────┤
│ Compressed data streams │
│ - Optionally solid (all files in one stream) │
│ - Multiple codecs can be chained: │
│ e.g., BCJ (x86 filter) → LZMA2 │
├─────────────────────────────────────────────────┤
│ Header (compressed) │
│ - File list, sizes, timestamps, attributes │
│ - Offsets to each file's data in stream │
└─────────────────────────────────────────────────┘
Solid archives: A key 7-Zip feature is solid compression — all files are concatenated into one stream before compression. This allows the compressor to exploit similarities across files (e.g., many .c files sharing headers). Solid archives compress much better but require decompressing from the beginning to extract a single file.
Archive type comparison (100 similar text files, 10MB total):
Ratio Extract-single-file
──────────────────────────────────────────
zip (per-file) 40% Fast (seek to offset)
7z non-solid 42% Fast
7z solid 28% Slow (decompress from start)
For software distribution archives (read-once, extract-all), solid is clearly better. For random-access archives (extract individual files frequently), non-solid or per-file compression wins.
The BCJ Filter
Executable files (x86, ARM, etc.) are full of branch instructions — jumps and calls whose targets are stored as displacements relative to the instruction's own address. Two calls to the same function from different locations therefore store different byte patterns, which looks like noise to a compressor even though it has structure.
LZMA uses preprocessor filters (BCJ = Branch-Call-Jump filters) that convert these relative displacements into absolute target addresses before compression. Absolute addresses are much more redundant: calls to the same function become identical byte sequences, which the LZ77 stage can then match.
BCJ filter effect on x86 binary:
Without BCJ:
CALL instruction at offset 0x1000 targeting 0x5000:
Bytes: E8 FB 3F 00 00 (relative: 0x3FFB = 0x5000 - 0x1000 - 5)
CALL instruction at offset 0x1100 targeting 0x5000:
Bytes: E8 FB 3E 00 00 (relative: 0x3EFB = 0x5000 - 0x1100 - 5)
These two look different to the compressor.
With BCJ (convert to absolute):
Both become: E8 00 50 00 00 (stores absolute target 0x5000)
Now they're identical! Back-reference saves 4 bytes.
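To make the idea concrete, here is a toy encoder handling only the x86 E8 (CALL rel32) opcode; the real BCJ filter adds alignment and range heuristics to avoid rewriting bytes that merely look like call instructions.

```python
def bcj_x86_encode(code: bytes) -> bytearray:
    """Toy BCJ filter: rewrite rel32 CALL displacements (opcode E8)
    as absolute target addresses, in place."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32
            rel = int.from_bytes(out[i + 1:i + 5], 'little', signed=True)
            # Target = displacement + address of the *next* instruction
            absolute = (rel + i + 5) & 0xFFFFFFFF
            out[i + 1:i + 5] = absolute.to_bytes(4, 'little')
            i += 5
        else:
            i += 1
    return out
```

Running this on the two CALL sites from the example above rewrites both operand fields to the same four bytes (the absolute target 0x5000), turning them into matchable repeats.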
BCJ filters can reduce compressed executable size by 10-30% compared to raw LZMA.
XZ: LZMA2 for Unix
The .xz format packages LZMA2 (LZMA plus a chunking layer that adds flushing, dictionary resets, and uncompressed chunks, which enables multithreading and graceful handling of incompressible data) with a Unix-friendly wrapper:
.xz structure:
┌────────────────────────────────────────────┐
│ Stream header │
│ - Magic: \xFD7zXZ\x00 │
│ - Stream flags (check type: CRC32/CRC64) │
├────────────────────────────────────────────┤
│ Blocks (LZMA2 compressed data) │
│ Each block has its own compressed size │
│ and uncompressed size │
├────────────────────────────────────────────┤
│ Index (optional block index for seeking) │
├────────────────────────────────────────────┤
│ Stream footer │
│ - Size of index │
│ - Stream flags │
│ - Magic: YZ │
└────────────────────────────────────────────┘
XZ replaced bz2 as the standard Linux source distribution format because it compresses significantly better at comparable decompression cost (bzip2's 900KB maximum block size caps how much redundancy it can exploit on large files).
The Real Cost: Memory and Time
The reason LZMA isn't used everywhere is cost:
Compression resource requirements:
Memory (compress) Memory (decompress) Time
─────────────────────────────────────────────────────────────────
gzip -6 256KB 32KB 1×
Brotli-6 8MB 1MB 4×
Zstd-6 16MB 128KB 0.3×
xz -6 97MB 65MB 20×
xz -9 683MB 65MB 50×
xz at -9 needs 683MB to compress and 65MB to decompress. For a server decompressing a package install, 65MB is fine. For an embedded device — it might not have 65MB free.
The decompression memory requirement is the key constraint for LZMA in embedded contexts. This is why most Linux distributions ship packages compressed with xz (because the build server has plenty of memory and time), but the package manager's memory footprint during installation is bounded.
When LZMA/7-Zip Makes Sense
Use LZMA when:
✓ Cold storage or archival (compress once, rare access)
✓ Software distribution (compress once, download many times)
✓ Solid archives of similar files
✓ Target has adequate decompression memory (~64MB+)
✓ Compression time is not a constraint
✓ Maximizing compression ratio is the priority
Avoid LZMA when:
✗ Real-time or low-latency compression required
✗ Memory is severely constrained (< 64MB available)
✗ Data changes frequently (re-compression would be prohibitive)
✗ Random access to parts of the archive is needed
✗ Compression speed matters (CI/CD pipelines, logging, etc.)
Tuning LZMA
The main knobs:
Dictionary size (-md in 7z, --lzma2=dict= in xz): Bigger window = better compression, more memory. Common values: 4MB, 8MB, 16MB, 64MB.
Word size / fast bytes (-mfb in 7z, nice= in xz): Maximum match length the match finder will consider. Higher = better matches, slower.
Match cycles / depth (-mmc in 7z, depth= in xz): How hard to search for matches. Diminishing returns past a certain point.
xz preset breakdown:
Preset Dict Word Depth Memory
──────────────────────────────────────
-0 256KB 32 2 2MB
-1 1MB 32 3 4MB
-3 4MB 32 4 15MB
-6 8MB 64 4 97MB
-9 64MB 273 unlimited 683MB
Summary
- LZMA combines large sliding windows (up to 4GB), hundreds of Markov chain probability models, and range coding.
- The Markov models condition each bit's probability on the encoder's current state.
- 7-Zip uses solid archives (compressing all files as one stream) to exploit cross-file redundancy.
- BCJ filters convert absolute addresses in executables to relative ones, dramatically improving compression.
- XZ wraps LZMA2 for Unix with multi-stream support and block-level CRC integrity checks.
- The cost: decompression needs ~64MB; compression can need hundreds of MB. Slow by design.
For everyday use, Zstd is a better choice. For software distribution archives where compression ratio is paramount and compress-once-decompress-many economics apply, LZMA is hard to beat.
Next, we step back to some simpler methods — run-length encoding and its friends — and find out where they still shine.
Chapter 7: Run-Length Encoding and Simple Methods
Not every compression problem requires a sophisticated algorithm. Some data has structure so obvious that a simple encoder can compress it optimally. Run-length encoding is the oldest compression technique — simple enough to implement in an afternoon, yet still used in fax machines, printer languages, and as a preprocessing step in far more complex compressors.
This chapter covers the "simple" methods: RLE, delta encoding, and the Burrows-Wheeler Transform. None of these alone produces state-of-the-art compression, but they each illuminate an important principle, and BWT in particular is a beautiful rearrangement that enables bzip2's remarkable compression ratios.
Run-Length Encoding (RLE)
The idea is almost embarrassingly simple: instead of encoding a run of repeated values literally, encode the value once and the count.
RLE example:
Input: AAAAAABBBCCDDDDDDDDDD
        (6 A's, 3 B's, 2 C's, 10 D's)
Output: 6A 3B 2C 10D
Compression: 21 bytes → 8 bytes
The Encoding Ambiguity Problem
Naive RLE has an ambiguity problem: how do you tell runs from literals? If your encoded byte 6A means "6 A's", what does 1B mean — one B (literal), or a run of 1 B? And what if the data contains the digit '6'?
Different formats resolve this differently:
Fixed format (always a pair): Always output (count, value). Problem: single non-repeated bytes still cost 2 bytes to encode.
Input: ABCAAAA
With fixed (count, value):
Output: 1A 1B 1C 4A = 8 bytes (worse than 7-byte input!)
Run/literal flag: Use the high bit of the count byte to distinguish runs from literal sequences.
PackBits format (used in TIFF, Macintosh PICT):
Count byte n:
n = 0..127: copy next (n+1) bytes literally
n = -1..-127 (signed, or 0x81..0xFF): repeat next byte (-n+1) times
n = -128: no-op
Input: AAAAAABBBCCDDDDDDDDDD
As PackBits:
-5, A → repeat A 6 times
-2, B → repeat B 3 times
-1, C → repeat C 2 times
-9, D → repeat D 10 times
Output: 4 pairs = 8 bytes ✓
Escape-based: Use an escape character to signal a run. All non-escape bytes are literals.
def rle_encode(data: bytes, min_run: int = 3) -> bytes:
    """
    RLE encoder. A run is emitted as (escape 0x90, count, value).
    Runs of the escape byte itself are always escape-encoded, so a
    bare 0x90 in the output can only mean "run follows".
    """
    ESCAPE = 0x90
    result = bytearray()
    i = 0
    while i < len(data):
        run_char = data[i]
        run_len = 1
        while i + run_len < len(data) and data[i + run_len] == run_char:
            run_len += 1
        if run_len >= min_run or run_char == ESCAPE:
            # Encode as run(s) — max run length 255 per chunk
            remaining = run_len
            while remaining > 0:
                chunk = min(remaining, 255)
                if chunk == 1 and run_char != ESCAPE:
                    result.append(run_char)
                else:
                    result.append(ESCAPE)
                    result.append(chunk)
                    result.append(run_char)
                remaining -= chunk
        else:
            # Too short to be worth a run token: copy literally
            result.extend(data[i:i + run_len])
        i += run_len
    return bytes(result)
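A matching decoder for this escape convention (assuming the run form escape, count, value used above) inverts it in a dozen lines:

```python
def rle_decode(encoded: bytes) -> bytes:
    """Decode the (0x90, count, value) escape format."""
    ESCAPE = 0x90
    result = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            count, value = encoded[i + 1], encoded[i + 2]
            result.extend(bytes([value]) * count)
            i += 3
        else:
            result.append(encoded[i])  # literal byte
            i += 1
    return bytes(result)
```

Because the encoder always escape-encodes 0x90 itself, the decoder never has to guess whether a 0x90 is a literal.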
Where RLE Shines (and Doesn't)
RLE is optimal for highly repetitive data and terrible for random data:
Compression ratio by data type:
Data type RLE performance
─────────────────────────────────────────────────
Solid color regions Excellent (100:1 possible)
Black/white fax pages Excellent (used in G3/G4 fax)
Palette-based graphics Good (large solid areas)
Natural photographs Terrible (rarely adjacent pixels match)
Binary executable Terrible to neutral
Compressed data Terrible (already near-random)
Overhead for random data:
With a fixed (count, value) format, every isolated byte costs 2 bytes: up to 100% expansion. Escape-based schemes only pay overhead when the escape byte itself appears in the input.
Group 4 fax compression (the ITU-T standard for fax transmission) is essentially 2D RLE on black-and-white bit images, combined with a reference line. A typical fax page of text compresses from ~1MB to ~10KB — a 100:1 ratio — because pages of text are overwhelmingly white with small black runs for characters.
BMP and TGA image formats optionally use RLE for palette-based images where large areas share the same color.
Delta Encoding
Delta encoding doesn't compress repeated values — it compresses slowly changing values. Instead of storing absolute values, store the difference between consecutive values.
Delta encoding of time-series data:
Raw timestamps (Unix epoch, seconds):
1700000000
1700000060
1700000120
1700000180
...
Deltas:
1700000000 ← first value, store literally (4 bytes)
60 ← delta, fits in 1 byte!
60
60
...
Savings: 4 bytes per value → 1 byte per value after the first.
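In code, delta encoding and decoding are nearly one-liners (a minimal sketch; real columnar formats also pack the small deltas into variable-length integers):

```python
from itertools import accumulate

def delta_encode(values: list[int]) -> list[int]:
    """First value stored literally, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    """A running sum restores the original values."""
    return list(accumulate(deltas))

timestamps = [1700000000, 1700000060, 1700000120, 1700000180]
deltas = delta_encode(timestamps)  # [1700000000, 60, 60, 60]
assert delta_decode(deltas) == timestamps
```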
Delta encoding is the foundation of many domain-specific compressors:
Audio: CD audio is 16-bit samples at 44,100 Hz. Consecutive samples differ by small amounts (audio signals are smooth). Delta encoding turns these into small numbers that compress well with RLE or Huffman. FLAC uses a generalization: it fits a polynomial to blocks of samples and stores the residual (error).
Image compression: PNG uses delta encoding as a filter before compression:
- Sub filter: subtract the adjacent (left) pixel
- Up filter: subtract the pixel above
- Average filter: subtract the average of left and above
- Paeth filter: subtract the Paeth predictor (best linear predictor from left, above, upper-left)
PNG filter example (Sub filter):
Original row: 100 102 105 108 112 115
Filtered row: 100 2 3 3 4 3 (subtract previous pixel)
The filtered row has small values — easier to compress.
Columnar databases: Sorted timestamp or monotonically increasing ID columns compress extremely well with delta encoding + RLE.
Columnar delta + RLE example (user IDs in a sorted table):
User IDs (sorted): 1001, 1002, 1003, 1004, 1005, ...
After delta: 1001, 1, 1, 1, 1, ...
After RLE on the delta stream:
1001 (literal), then (count=N-1, value=1)
Compression: (N × 4 bytes) → 8 bytes for any sorted sequential range.
The Burrows-Wheeler Transform (BWT)
The Burrows-Wheeler Transform is remarkable: it's a reversible rearrangement of data that makes it much easier to compress. It doesn't compress anything by itself — but after a BWT, the data often compresses dramatically better with simple methods like RLE and Huffman.
BWT is the core of bzip2, which consistently beats gzip for text compression despite being simpler conceptually.
How BWT Works
Given input string S, the BWT:
- Forms all rotations of S (appended with a special end-of-string marker $)
- Sorts them lexicographically
- Outputs the last column of the sorted matrix
BWT of "banana$":
All rotations:
banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana
Sorted:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Last column (BWT output): annb$aa
The magic: the last column tends to have long runs of repeated characters, because rotations that sort together often differ only at the end — meaning the character before the current sorted context is often the same.
For "banana": after BWT, we get "annb$aa" — two runs of 2+ characters. Not impressive here, but on real text:
BWT effect on English text (first 100KB of War and Peace):
Before BWT: Character runs of length 1 dominate (~99%)
After BWT: Character runs of length 3+ become common (~40%)
BWT output example fragment:
...eeeeeeeeeeeeeeeeeeetttttttttttttt...
(long runs of common characters that follow similar contexts)
Inverting the BWT
The inversion is the elegant part. Given only the last column (and the position of the original string's end-of-string marker), you can reconstruct the full original string.
The key insight: you can reconstruct the first column by sorting the last column. And the first column plus last column together give you enough to "walk" the permutation:
def bwt_encode(s: str) -> tuple[str, int]:
    """Compute BWT. Returns (transformed, index of original)."""
    s = s + '\x00'  # Append null byte as end-of-string marker
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    bwt = ''.join(r[-1] for r in rotations)
    idx = rotations.index(s)  # row of the original string in the sorted matrix
    return bwt, idx

def bwt_decode(bwt: str, idx: int) -> str:
    """Invert BWT. Returns original string (without end-of-string marker)."""
    n = len(bwt)
    # Sorting the last column yields the first column; the attached
    # indices record where each character sits in the last column.
    table = sorted((char, i) for i, char in enumerate(bwt))
    # Walk the permutation, starting from the original string's row
    row = idx
    result = []
    for _ in range(n):
        char, row = table[row]
        result.append(char)
    return ''.join(result)[:-1]  # strip the null marker
The Full bzip2 Pipeline
bzip2 applies BWT as part of a pipeline:
bzip2 compression pipeline:
Raw data
│
▼
Run-Length Encoding (pass 1)
- Pre-compress runs of 4+ identical bytes
- Prevents the BWT from sorting very long runs badly
│
▼
Burrows-Wheeler Transform
- Sort all rotations, output last column
- Groups similar contexts → long runs of same character
│
▼
Move-to-Front Transform
- Convert characters to ranks in an MRU (Most Recently Used) list
- Recently seen characters get small indices (0, 1, 2, ...)
- Makes output dominated by small integers
│
▼
Run-Length Encoding (pass 2)
- Compress the runs of 0s that MTF produces
│
▼
Huffman Coding
- Entropy code the MTF+RLE output
- bzip2 uses multiple Huffman tables, switching between them
│
▼
.bz2 file
Move-to-Front (MTF)
MTF is the step that makes BWT+Huffman work well together. It converts characters into indices of an MRU list:
MTF encoding of "aaabbbcccaaa":
Initial list: [a, b, c, d, e, ...] (all possible symbols)
Symbol 'a': index 0. Output 0. Move 'a' to front: [a,b,c,...]
Symbol 'a': index 0. Output 0. List unchanged.
Symbol 'a': index 0. Output 0.
Symbol 'b': index 1. Output 1. Move 'b' to front: [b,a,c,...]
Symbol 'b': index 0. Output 0. (b is now at front)
Symbol 'b': index 0. Output 0.
Symbol 'c': index 2. Output 2. Move 'c' to front: [c,b,a,...]
...
MTF output: 0, 0, 0, 1, 0, 0, 2, 0, 0, 2, 0, 0
The BWT output has long character runs.
MTF converts those runs to long runs of 0s.
RLE compresses the 0s.
Huffman codes the remaining non-zero values.
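The MTF walk above, as runnable code (alphabet restricted to lowercase letters for the example):

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def mtf_encode(data: str, alphabet: str = ALPHABET) -> list[int]:
    """Emit each symbol's index in a most-recently-used list,
    then move that symbol to the front."""
    mru = list(alphabet)
    out = []
    for ch in data:
        i = mru.index(ch)
        out.append(i)
        mru.insert(0, mru.pop(i))  # move to front
    return out

def mtf_decode(indices: list[int], alphabet: str = ALPHABET) -> str:
    """Inverse: index into the MRU list, then move to front."""
    mru = list(alphabet)
    out = []
    for i in indices:
        ch = mru.pop(i)
        out.append(ch)
        mru.insert(0, ch)
    return ''.join(out)
```

Running the encoder on the chapter's "aaabbbcccaaa" reproduces the 0-dominated output shown above.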
bzip2 in Practice
bzip2 vs. gzip on common file types:
File type gzip -6 bzip2 xz -6
────────────────────────────────────────────
English text 28% 22% 18%
C source code 22% 18% 15%
HTML files 25% 21% 17%
Binary executable 47% 40% 32%
JPEG (already compressed) 100% 100% 100%
bzip2 is typically 10-15% smaller than gzip on text, but compresses and decompresses more slowly. xz (LZMA) is typically another 15-20% smaller again.
bzip2 also has a useful property: its output consists of independent 900KB blocks, each with its own Huffman tables. This means:
- Parallel decompression is straightforward (process blocks independently)
- Corruption only destroys affected blocks, not the whole archive
- Random access is possible at block boundaries (with a block index)
When Simple Methods Still Win
It's tempting to think that simple methods have been made obsolete by sophisticated ones. Not so:
Sensor data: IoT temperature sensors produce slowly-changing values. Delta encoding (1-2 bytes per reading) + Huffman easily beats generic lossless compression and is implementable on a microcontroller.
Screen recordings / VNC: Consecutive frames differ only in regions of mouse activity. A simple delta between frames (XOR, then RLE), followed by Huffman, achieves 50-100:1 compression with negligible CPU.
Database WAL logs: Write-ahead logs are sequences of delta operations (insert row, update field). Delta encoding the operations themselves, not the data, can achieve 10:1 compression.
Palette images: 8-bit-per-pixel images with large solid areas compress better with RLE than JPEG or even PNG deflate.
The general rule: if you can characterize the structure of your specific data, a targeted simple method beats a general-purpose complex one.
Summary
- RLE encodes runs of repeated values as (count, value). Excellent for highly repetitive data, terrible for random data.
- Delta encoding stores differences between consecutive values. Excellent for slowly-changing sequences (timestamps, sensor data, sorted IDs).
- BWT rearranges data by sorting all rotations — a reversible transform that groups similar contexts and enables long runs of common characters.
- MTF converts character sequences into MRU indices, turning character runs into 0-runs that compress extremely well.
- bzip2 chains RLE → BWT → MTF → RLE → Huffman into a surprisingly effective pipeline.
- Simple methods often beat general-purpose compressors for domain-specific data with known structure.
Next, we enter the world of lossy compression with images — where the interesting question shifts from "can we reconstruct it exactly?" to "what can we safely throw away?"
Chapter 8: Image Compression
Images are where compression theory meets human perception in the most visible way. A JPEG artifact, a PNG that's larger than the original bitmap, a WebP that's visually indistinguishable from the original at a third the size — these are compression algorithms at work in a domain where "quality" is defined by what humans can see.
Image compression splits cleanly along the lossless/lossy divide. PNG and related formats preserve every pixel exactly. JPEG, WebP in lossy mode, and AVIF discard information your visual system is unlikely to notice. Understanding which approach to use — and why — requires understanding a bit about both the algorithms and the human visual system.
The Raw Numbers: Why Image Compression Matters
A 4000×3000 pixel photo, 24 bits per pixel (8 bits each R, G, B):
Uncompressed: 4000 × 3000 × 3 bytes = 36,000,000 bytes ≈ 36MB
PNG (lossless): ~18MB (2× compression on typical photos)
JPEG quality 90: ~4MB (9× compression)
JPEG quality 75: ~2MB (18× compression, still good quality)
AVIF quality 80: ~1MB (36× compression, excellent quality)
Modern smartphone photos are 10–20MB because they're stored as JPEG. Without compression, your photo library would be terabytes.
PNG: Lossless Compression for Images
PNG (Portable Network Graphics) was created in 1995 to replace GIF (which had LZW patent issues). It's lossless: decompressing a PNG gives you the exact original pixel values.
PNG's Filter Stage
PNG applies a filter to each row of pixels before compression. Filters exploit spatial correlation — adjacent pixels tend to be similar — by predicting each pixel from its neighbors and encoding the prediction error instead of the raw value.
PNG filter types (applied per row):
None: Px = raw value
Sub: Px = raw - left
Up: Px = raw - above
Average: Px = raw - floor((left + above) / 2)
Paeth: Px = raw - paeth_predictor(left, above, upper_left)
Paeth predictor (Alan W. Paeth, 1991):
a = left pixel, b = above pixel, c = upper-left pixel
p = a + b - c
pa = |p - a|, pb = |p - b|, pc = |p - c|
if pa ≤ pb and pa ≤ pc: return a
elif pb ≤ pc: return b
else: return c
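The pseudocode translates almost line-for-line into Python:

```python
def paeth_predictor(a: int, b: int, c: int) -> int:
    """PNG Paeth predictor. a = left, b = above, c = upper-left."""
    p = a + b - c                      # initial linear estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:          # tie-break order: left, above, upper-left
        return a
    if pb <= pc:
        return b
    return c
```

The tie-breaking order matters: encoder and decoder must agree on it exactly, which is why the spec pins it down.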
The PNG encoder typically tries all five filter types for each row and picks the one that produces the smallest values (easiest to compress). This adaptive filtering is one reason PNG compresses reasonably well despite using generic DEFLATE.
Filter example (a gradient row):
Raw pixels: 100, 102, 104, 106, 108, 110
Sub filter (subtract left):
100, 2, 2, 2, 2, 2
DEFLATE will easily compress the run of 2s.
Without filtering, DEFLATE has to match the gradually-changing values.
PNG Internals
PNG file structure:
┌─────────────────────────────────────────┐
│ PNG Signature: \x89PNG\r\n\x1a\n │
├─────────────────────────────────────────┤
│ IHDR chunk (13 bytes) │
│ - Width, height (4 bytes each) │
│ - Bit depth, color type │
│ - Compression method (always 0=DEFLATE)│
│ - Filter method, interlace method │
├─────────────────────────────────────────┤
│ Optional chunks: tEXt, gAMA, cHRM, etc.│
├─────────────────────────────────────────┤
│ IDAT chunk(s) — compressed image data │
│ - Filtered rows, DEFLATE compressed │
│ - May be split across multiple IDATs │
├─────────────────────────────────────────┤
│ IEND chunk — marks end of file │
└─────────────────────────────────────────┘
All PNG chunks have a 4-byte CRC. This makes PNG moderately resilient to corruption — a bad chunk can be detected (though not corrected).
Optimizing PNGs
Because PNG uses DEFLATE, the compressor's settings matter. Default PNG encoders often don't use the best DEFLATE settings. Tools like oxipng and pngcrush can reduce PNG file sizes by 20–50% without any quality loss by:
- Trying all filter combinations and picking the best per-row
- Using maximum DEFLATE settings
- Removing unnecessary metadata chunks (GPS, camera info)
# Optimize in place, removing unnecessary metadata
oxipng --opt 4 --strip all image.png
# Typical improvement
Before: 1.2MB
After: 800KB (33% smaller, identical pixels)
When to Use PNG
- Icons, logos, UI elements: Exact colors matter; flat regions with solid colors compress well.
- Screenshots: Text is sharp-edged; JPEG would produce visible ringing artifacts.
- Images with transparency: PNG supports alpha channels; JPEG does not.
- Workflow/intermediate images: When you'll process the image further, preserve all information.
- Synthetic images: Computer-generated graphics often have exact flat regions where PNG excels.
When not to use PNG: Photographs. A typical photo in PNG is 3–5× larger than a good JPEG at perceptually equivalent quality. The DEFLATE compression engine isn't designed for the statistical properties of photographic data.
JPEG: Lossy DCT Compression
JPEG (Joint Photographic Experts Group, 1992) is the format that stores almost every photograph in existence. It achieves dramatic compression by:
- Converting to a perceptually-motivated color space
- Applying the Discrete Cosine Transform (DCT) to 8×8 blocks
- Quantizing the DCT coefficients (this is where information is lost)
- Entropy coding the quantized coefficients
Step 1: Color Space Conversion (RGB → YCbCr)
The human visual system is much more sensitive to brightness (luminance) changes than to color (chrominance) changes. JPEG exploits this by separating luminance (Y) from two chrominance channels (Cb = blue-yellow, Cr = red-green).
RGB to YCbCr conversion:
Y = 0.299R + 0.587G + 0.114B
Cb = -0.168R - 0.331G + 0.500B + 128
Cr = 0.500R - 0.418G - 0.081B + 128
Why this matters:
- Y channel: human eye is highly sensitive → compress carefully
- Cb, Cr channels: human eye is less sensitive → compress aggressively
Chroma subsampling (4:2:0): Keep all Y values, but only keep
1 Cb and 1 Cr value per 2×2 block of pixels.
┌───┬───┬───┬───┐ Y: full resolution
│ Y │ Y │ Y │ Y │
├───┼───┼───┼───┤ Cb, Cr: half resolution in each direction
│ Y │ Y │ Y │ Y │ (one value per 4 pixels)
└───┴───┴───┴───┘
This halves chroma data before any further compression.
The result is usually invisible at normal viewing distances.
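As a sketch, the conversion with the rounded coefficients from above (real codecs carry a few more decimal places):

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """JPEG-style RGB -> YCbCr, coefficients rounded to three decimals."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168 * r - 0.331 * g + 0.500 * b + 128
    cr =  0.500 * r - 0.418 * g - 0.081 * b + 128
    return y, cb, cr

# Neutral colors (r == g == b) produce Cb and Cr near the midpoint 128,
# so grayscale images carry almost no chroma information at all.
```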
Step 2: The Discrete Cosine Transform
Each 8×8 pixel block is transformed using the DCT — related to the Fourier transform, but working with cosines only (which avoids the phase issues of the full Fourier transform for real-valued data).
The DCT converts 64 spatial values (pixel intensities) into 64 frequency coefficients. The top-left coefficient (DC) represents the average brightness of the block. The remaining coefficients (AC) represent progressively higher spatial frequencies.
An 8×8 pixel block (luminance values):
139 144 149 153 155 155 155 155
144 151 153 156 159 156 156 156
150 155 160 163 158 156 156 156
159 161 162 160 160 159 159 159
159 160 161 162 162 155 155 155
161 161 161 161 160 157 157 157
162 162 161 163 162 157 157 157
162 162 161 163 163 158 158 158
After DCT (simplified, showing the low-frequency dominance):
1260 -1 -12 -5 2 -2 0 0 ← DC + low freq
-23 -17 -6 -3 2 0 0 0
-11 -9 -2 2 0 1 0 0
-7 -2 0 1 0 0 0 0
2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 ← high freq
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Most energy is concentrated in the top-left (low-frequency) coefficients.
High-frequency coefficients are near zero for natural images.
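Where does a DC value like 1260 come from? A naive 2D DCT-II in the JPEG normalization makes it concrete; production codecs use fast factored algorithms, not this quadruple loop:

```python
import math

def dct2_8x8(block: list[list[float]]) -> list[list[float]]:
    """Naive 2D DCT-II over an 8x8 block (JPEG's normalization)."""
    def c(u: int) -> float:
        return 1 / math.sqrt(2) if u == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# A flat block puts all its energy in the DC slot: DC = 8 x average value,
# which is why a block averaging ~157 yields a DC near 1260.
```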
Step 3: Quantization (The Lossy Step)
Each DCT coefficient is divided by a quantization value from a quantization table, then rounded to the nearest integer. This is where information is permanently lost.
Quantization table (JPEG standard luminance, quality 50):
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
Each DCT coefficient is divided by the corresponding table value and rounded.
Large quantization values → coarse rounding → more information lost (smaller file).
Small quantization values → fine rounding → less information lost (larger file).
"Quality" in JPEG is just a scalar applied to this table:
Quality 90: divide table by ~3 (fine quantization)
Quality 50: use table as-is (moderate quantization)
Quality 10: multiply table by ~8 (coarse quantization)
The visual effect of quantization is visible as "blocking" — 8×8 pixel blocks with slightly wrong colors that become obvious at low quality settings.
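Quantization itself is one division per coefficient (a sketch; real encoders derive the scaled table from the quality setting, e.g. libjpeg scales by 5000/Q below quality 50 and by 200 - 2Q above):

```python
def quantize(coeffs: list[list[float]], qtable: list[list[int]]) -> list[list[int]]:
    """Divide each DCT coefficient by its table entry and round — the lossy step."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]

def dequantize(quantized: list[list[int]], qtable: list[list[int]]) -> list[list[int]]:
    """Decoder side: multiply back. The rounding error is gone for good."""
    return [[c * q for c, q in zip(crow, qrow)]
            for crow, qrow in zip(quantized, qtable)]
```

The round trip makes the loss visible: a coefficient of 1260 quantized by 16 comes back as 1264, and small coefficients quantized by large table entries come back as exactly zero.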
Step 4: Zigzag Scan and Entropy Coding
After quantization, the 64 coefficients are read out in a zigzag pattern that puts low-frequency (large) coefficients first and high-frequency (near-zero) coefficients last:
Zigzag scan order:
┌──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ 1│ 5│ 6│14│15│27│28│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 4│ 7│13│16│26│29│42│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 3│ 8│12│17│25│30│41│43│
├──┼──┼──┼──┼──┼──┼──┼──┤
│ 9│11│18│24│31│40│44│53│
├──┼──┼──┼──┼──┼──┼──┼──┤
│10│19│23│32│39│45│52│54│
├──┼──┼──┼──┼──┼──┼──┼──┤
│20│22│33│38│46│51│55│60│
├──┼──┼──┼──┼──┼──┼──┼──┤
│21│34│37│47│50│56│59│61│
├──┼──┼──┼──┼──┼──┼──┼──┤
│35│36│48│49│57│58│62│63│
└──┴──┴──┴──┴──┴──┴──┴──┘
Reading zigzag order from quantized coefficients:
1260, -1, -23, -11, -17, -12, -5, -6, -9, -7, ... 0, 0, 0, 0, 0, 0, 0, 0
The long run of zeros at the end is RLE-encoded with a special
end-of-block (EOB) symbol. Then Huffman coding compresses everything.
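The zigzag table does not need to be hard-coded; it can be generated by sorting cells by anti-diagonal and alternating the traversal direction:

```python
def zigzag_order(n: int = 8) -> list[tuple[int, int]]:
    """(row, col) positions in JPEG zigzag scan order.

    Cells are sorted by anti-diagonal (row + col); odd diagonals are
    traversed top-to-bottom, even diagonals bottom-to-top.
    """
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda p: (p[0] + p[1],
                       p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )
```

The first few entries, (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), match the scan table above.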
JPEG Artifacts
The DCT-based approach produces characteristic artifacts:
- Blocking: 8×8 grid visible, especially at edges and in smooth gradients
- Ringing: Dark halos around sharp edges (Gibbs phenomenon from truncating high-frequency content)
- Mosquito noise: Temporal noise in areas of fine texture in videos
- Color bleeding: Chroma subsampling creates color fringing at high-contrast edges
Where you'll see JPEG artifacts:
Sharp text on colored background → heavy ringing
──────────────────────────────────────────────
│ HELLO │ ← Dark halos around the letters
──────────────────────────────────────────────
Flat blue sky → visible 8×8 banding
Fine diagonal lines → "staircase" pattern
JPEG is optimized for natural photographic content, not:
- Text or diagrams with sharp edges
- Flat synthetic colors
- Images with text overlaid on photos
WebP: VP8 Intra-Frame Coding
WebP was developed by Google (from the VP8 video codec) and released in 2010. It supports both lossless and lossy modes.
Lossy WebP uses VP8's intra-frame prediction with smaller blocks than JPEG (4×4 in addition to 16×16), better chroma handling, and an arithmetic coder instead of Huffman — getting roughly 25–35% better compression than JPEG at equivalent perceptual quality.
Lossless WebP is more interesting: it uses a completely different algorithm from PNG, with:
- Spatial prediction (like PNG filters but more sophisticated, including multiple predictors per block)
- Color transform (subtract color correlation between channels)
- Subtract green transform (subtract the G channel from R and B)
- Palette coding (for images with limited colors)
- LZ77 + Huffman on the resulting values
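As a flavor of how simple some of these transforms are, here is a per-pixel sketch of the subtract-green idea (the real WebP transform operates on whole image buffers, with optional per-block color transforms on top):

```python
def subtract_green(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Decorrelate channels: green predicts red and blue surprisingly well,
    so the residuals (r - g) and (b - g) cluster near zero."""
    return (r - g) % 256, g, (b - g) % 256

def add_green(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Inverse transform, applied on decode."""
    return (r + g) % 256, g, (b + g) % 256
```

For near-gray pixels the residual channels become tiny values that the downstream LZ77 + Huffman stage compresses far better than the raw R and B planes.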
WebP lossless vs. PNG on typical web images:
Image type PNG size WebP lossless Savings
──────────────────────────────────────────────────────────
Icon/logo 24KB 18KB 25%
Screenshot 450KB 320KB 29%
Illustration 1.2MB 820KB 32%
Photo (no alpha) 18MB (rare) 15MB 17%
WebP lossless is generally 26% smaller than equivalent PNG, with slightly slower encoding and comparable decoding speed.
AVIF: The Modern Challenger
AVIF (AV1 Image File Format) uses the AV1 video codec's intra-frame encoding to compress still images. Released in 2019 by the Alliance for Open Media (Apple, Google, Netflix, Amazon, et al.), it's the current state of the art for image compression.
What Makes AVIF Better
AVIF inherits AV1's modern encoding toolkit:
- Variable block sizes: 4×4 to 128×128 (vs JPEG's fixed 8×8)
- More sophisticated intra-prediction: 56 directional modes + DC and smooth predictors
- Better chroma subsampling: 4:4:4, 4:2:2, and 4:2:0, all handled natively
- Film grain synthesis: Rather than preserving noise in the bitstream, AVIF can store grain parameters and resynthesize natural-looking film grain on decode
- HDR and wide gamut: Native 10-bit and 12-bit support
- Alpha channel: Compressed as an auxiliary plane in the lossy bitstream, not a separate lossless channel
Compression comparison at equivalent quality (DSSIM ≈ 0.001):
Format File size Encoding time Decoding time
──────────────────────────────────────────────────────────
JPEG (MozJPEG) 100% 1× 1×
WebP lossy 72% 1.5× 1×
AVIF 45% 50× 0.5× (hardware decode)
(software encode is slow)
AVIF encodes at roughly half the size of JPEG for equivalent perceptual quality. The catch: encoding is currently slow in software (AV1 was designed with hardware in mind). Browser support arrived in 2021 (Chrome, Firefox), with Safari following in 2023.
AVIF Deployment in Practice
# Using Pillow to encode AVIF (requires a Pillow build with AVIF support,
# e.g. via the pillow-avif-plugin package)
from PIL import Image
img = Image.open('photo.jpg')
# AVIF with quality control
img.save('photo.avif', quality=80)
# quality=100: visually lossless, ~50% smaller than JPEG 95
# quality=80: good quality, ~50% size of JPEG 80
# quality=50: noticeable artifacts, very small
# For web serving: offer AVIF with JPEG fallback
# <picture>
# <source srcset="photo.avif" type="image/avif">
# <source srcset="photo.webp" type="image/webp">
# <img src="photo.jpg">
# </picture>
Format Selection Guide
Image format decision tree:
Does the image require transparency (alpha)?
├─ Yes, lossless alpha required → PNG (or WebP lossless)
├─ Yes, lossy acceptable → WebP lossy or AVIF
└─ No → continue
Is it a photograph or complex scene?
├─ No (icon, logo, diagram, screenshot) → PNG or WebP lossless
└─ Yes → continue
Is encoding speed critical?
├─ Yes (dynamic generation, on-the-fly) → JPEG or WebP lossy
└─ No (batch encode, CDN delivery) → AVIF
Is browser support a constraint?
├─ Must support IE/old browsers → JPEG/PNG only
├─ Modern browsers (2021+) → AVIF with WebP/JPEG fallback
└─ Latest browsers only → AVIF
File size priority?
├─ Maximum compression → AVIF
├─ Good compression, fast encode → WebP lossy
└─ Compatibility → JPEG quality 75-85
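The decision tree above can be encoded as a small function. This is a toy sketch; the parameter names and return strings are illustrative, not any real library's API.

```python
def pick_format(has_alpha: bool, lossless_alpha: bool, is_photo: bool,
                fast_encode: bool, legacy_browsers: bool) -> str:
    """Toy encoding of the image-format decision tree (illustrative only)."""
    if legacy_browsers:
        return "PNG" if has_alpha or not is_photo else "JPEG"
    if has_alpha and lossless_alpha:
        return "PNG or WebP lossless"
    if not is_photo:                     # icon, logo, diagram, screenshot
        return "PNG or WebP lossless"
    if fast_encode:                      # dynamic / on-the-fly generation
        return "JPEG or WebP lossy"
    return "AVIF (with WebP/JPEG fallback)"

# Batch-encoded photo for modern browsers:
print(pick_format(has_alpha=False, lossless_alpha=False, is_photo=True,
                  fast_encode=False, legacy_browsers=False))
# → AVIF (with WebP/JPEG fallback)
```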
The Quality-Size Curve
Every lossy format has a quality dial. Understanding how quality translates to visual fidelity helps set appropriate targets:
JPEG quality guidelines:
Quality 95+: Visually lossless for most humans. Use for source masters.
Typical size: 4-5× the AVIF equivalent.
Quality 85: High quality. Artifacts invisible at normal viewing distance.
Good for photos where quality is a priority.
Quality 75: Good quality. Slight artifacts visible at 1:1 pixel.
Standard web quality for photographs.
Quality 60: Moderate quality. Visible artifacts on close inspection.
OK for thumbnails, preview images.
Quality 40: Low quality. Obvious blocking on smooth gradients.
Only appropriate for very small sizes or previews.
Quality 20: Very low. Clear blocking, ringing, color errors.
Rarely appropriate.
Summary
- PNG uses predictive filtering (Paeth, Sub, etc.) + DEFLATE. Lossless, ideal for graphics with flat regions, transparency, and text. Too large for photographs.
- JPEG converts to YCbCr, applies DCT to 8×8 blocks, quantizes (lossily), then Huffman-codes. The standard for photos; characteristic blocking and ringing artifacts at low quality.
- WebP offers both lossless (26% better than PNG) and lossy (25% better than JPEG) modes. Good choice when broad support is available.
- AVIF uses AV1 intra-frame coding. ~50% smaller than JPEG at equivalent quality. Slow to encode in software; the future of web image compression.
- The human visual system cares more about luminance than chrominance — all lossy formats exploit this.
Next: audio compression, where the same DCT-based approach runs on time, and psychoacoustics determines what we can safely throw away.
Chapter 9: Audio Compression
Audio compression has a split personality that mirrors the lossless/lossy divide more starkly than almost any other domain. On one side: FLAC, where audiophiles insist on bit-perfect fidelity, and every sample is preserved exactly. On the other: MP3, which threw away half the bits and convinced the world it sounded fine.
Both are right, for different contexts. Understanding audio compression means understanding what the ear actually hears — and that turns out to be a surprisingly limited subset of what a microphone records.
The Physics of Audio Data
Digital audio is a sequence of samples, each representing the amplitude of a sound wave at a point in time. CD quality is:
- 44,100 samples per second (44.1 kHz)
- 16 bits per sample
- 2 channels (stereo)
That's 44100 × 16 × 2 = 1,411,200 bits per second, or about 10MB per minute uncompressed. A typical song is ~50MB of raw PCM data.
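The arithmetic is worth checking by hand:

```python
sample_rate = 44_100        # samples per second (44.1 kHz)
bits_per_sample = 16
channels = 2

bits_per_second = sample_rate * bits_per_sample * channels
assert bits_per_second == 1_411_200

bytes_per_minute = bits_per_second * 60 // 8
print(f"{bytes_per_minute / 1e6:.1f} MB per minute")  # → 10.6 MB per minute
```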
Audio signal visualization (one cycle of a 440 Hz sine wave, ~2.3ms):
Amplitude
+32767 ┤ ╭─╮
┤ ╭──╯ ╰──╮
┤ ╭╯ ╰╮
┤ ╭╯ ╰╮
0 ┤──────────╯──────────────╰──────────
┤ ╭╯ ╰╮
┤ ╭╯ ╰╮
┤ ╭─╯ ╰─╮
-32768 ┤ ╭─╯ ╰─╮
└──────────────────────────────────────→
         0                                  2.3ms
         (44.1 samples per ms → ~100 samples per cycle)
FLAC: Lossless Audio Compression
FLAC (Free Lossless Audio Codec, 2001) is the standard for lossless audio. It compresses CD-quality audio to roughly 50-60% of the original size without losing a single sample. You can decompress FLAC and compare it byte-by-byte to the original PCM — they're identical.
FLAC's Prediction + Residual Architecture
Audio signals are predictable. Given a few past samples, you can estimate the current sample quite accurately. FLAC exploits this through linear prediction:
Linear prediction:
Given past samples: x[n-1], x[n-2], ..., x[n-p]
Predict current: x̂[n] = a₁x[n-1] + a₂x[n-2] + ... + aₚx[n-p]
The residual (prediction error):
e[n] = x[n] - x̂[n]
For audio signals, residuals are much smaller than samples.
They compress much better.
Example: gradual sine wave
Samples: 1000, 1050, 1095, 1134, 1165, ...
Predicted: 1000, 1050, 1095, 1133, 1163, ... (from coefficients)
Residuals: 0, 0, 0, 1, 2, ... (tiny errors)
FLAC uses Linear Predictive Coding (LPC) with:
- Up to 32 predictor coefficients
- Coefficients quantized to integers (for exact arithmetic)
- Multiple sub-frame types (constant, verbatim, fixed predictor, LPC)
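A minimal sketch of the core idea, using FLAC's order-2 fixed predictor (real FLAC fits LPC coefficients per block; this only shows why residuals are small):

```python
import math

# A smooth "audio-like" signal: a 440 Hz tone sampled at 44.1 kHz, 16-bit.
samples = [round(20000 * math.sin(2 * math.pi * 440 * n / 44100))
           for n in range(256)]

# FLAC's order-2 fixed predictor: x̂[n] = 2·x[n-1] − x[n-2]
residuals = [samples[n] - (2 * samples[n - 1] - samples[n - 2])
             for n in range(2, len(samples))]

print(max(abs(s) for s in samples))    # samples span roughly ±20000
print(max(abs(r) for r in residuals))  # residuals are two orders of magnitude smaller
```

Encoding the residuals instead of the samples is a free win: the signal's information content is unchanged, but the numbers are now tiny and tightly clustered around zero.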
Rice Coding for Residuals
Residuals cluster around zero — they're much more likely to be small than large. FLAC compresses them using Rice coding, a variant of Golomb coding:
Rice coding with parameter k:
Every integer n is encoded as:
quotient q = floor(n / 2^k) → encoded in unary (q zeros, then a 1)
remainder r = n mod 2^k → encoded in k bits (binary)
FLAC first maps signed residuals to non-negative integers (zigzag):
    n ≥ 0 → 2n        n < 0 → -2n - 1
With k=0 the code is pure unary; small numbers get short codes:
     0 → "1"
    -1 → "01"
     1 → "001"
    -2 → "0001"
     2 → "00001"
Larger k (e.g. k=3) works better when residuals are larger.
FLAC chooses k adaptively per block to minimize bits used.
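A toy Rice coder over bit strings makes the scheme concrete (a sketch for clarity; real FLAC packs bits rather than building strings):

```python
def rice_encode(n: int, k: int) -> str:
    """Rice-code one signed integer as a bit string."""
    m = 2 * n if n >= 0 else -2 * n - 1          # zigzag map: signed → non-negative
    q, r = m >> k, m & ((1 << k) - 1)
    rem = format(r, "b").zfill(k) if k else ""   # k-bit binary remainder
    return "0" * q + "1" + rem                   # unary quotient: q zeros, then a 1

def rice_decode(bits: str, k: int) -> int:
    q = bits.index("1")                          # count leading zeros
    r = int(bits[q + 1:q + 1 + k] or "0", 2)
    m = (q << k) | r
    return m // 2 if m % 2 == 0 else -(m + 1) // 2   # undo the zigzag map

print(rice_encode(0, 0), rice_encode(-1, 0), rice_encode(1, 0))  # → 1 01 001

# Round-trips for a range of residuals and parameters:
for n in range(-200, 201):
    for k in (0, 2, 4):
        assert rice_decode(rice_encode(n, k), k) == n
```

Note the trade-off the parameter controls: small k is cheap for tiny residuals but pays one unary bit per step of magnitude, so noisier blocks want a larger k.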
FLAC Frame Structure
FLAC file:
┌───────────────────────────────────────────┐
│ fLaC marker (4 bytes: 0x66 0x4C 0x61 0x43)│
├───────────────────────────────────────────┤
│ Metadata blocks: │
│ STREAMINFO (mandatory, always first) │
│ - Sample rate, channels, bit depth │
│ - Total samples, MD5 of uncompressed data│
│ SEEKTABLE (optional, enables fast seek) │
│ VORBIS_COMMENT (tags: artist, title...) │
│ PICTURE (embedded album art) │
├───────────────────────────────────────────┤
│ Audio frames │
│ Frame header: sample count, channels, │
│ blocking strategy │
│ Subframes (one per channel): │
│ Type: CONSTANT, VERBATIM, FIXED, LPC │
│ LPC coefficients (if LPC type) │
│ Rice-coded residuals │
│ Frame footer: CRC-16 │
└───────────────────────────────────────────┘
Typical FLAC compression ratios:
Genre Ratio
─────────────────────────────────
Classical (sparse) 45% of original size
Speech 48%
Rock/pop 52%
Electronic/dance 55%
Already-lossy re-encode 90%+ (can't reclaim what's gone)
FLAC is also streamable — each frame is independently decodable. This enables gapless playback, random seeking, and robust error handling.
MP3: Psychoacoustic Compression
MP3 (MPEG-1 Audio Layer 3, 1993) achieved something remarkable: it convinced millions of people that 128kbps audio sounded "good enough" — a 90% reduction from CD quality. It did this by understanding what the human auditory system actually processes.
The Psychoacoustic Model
Human hearing has several exploitable limitations:
Frequency masking: A loud tone at one frequency masks quieter tones at nearby frequencies. If a 1 kHz tone is playing at 80dB, you can't hear a 1.1 kHz tone at 60dB — the louder one masks it.
Frequency masking:
 Sound
 level (dB)
    │            ╭──╮
    │            │  │ ← loud tone at 1 kHz (audible)
    │          ╭─╯  ╰─╮
    │        ╭─╯      ╰──╮      ← masking threshold, raised
    │      ╭─╯           ╰──╮     around the loud tone
    │    ╭─╯    ┌──┐        ╰──╮
    │ ───╯      │  │ ← quiet tone at 1.1 kHz: below the
    │           └──┘   threshold, so it is inaudible; it
    │                  would need to rise above the curve
    │                  to be heard
    └───────────────────────────────────→ frequency
Temporal masking: A loud sound masks quieter sounds that occur slightly before and after it. The ear needs time to recover from loud stimuli.
- Pre-masking: ~5ms before the loud sound
- Post-masking: ~100ms after the loud sound
Absolute threshold of hearing: Some frequencies can't be heard at low volumes regardless (humans can't hear 20kHz at 20dB SPL, for example).
Critical bands: The cochlea divides the frequency spectrum into roughly 24 overlapping bands. Masking effects are strongest within the same critical band.
MP3 Encoding Pipeline
MP3 encoding pipeline:
PCM input (44100 Hz, 16-bit stereo)
│
▼
Filter bank (32 sub-bands) AND psychoacoustic model run in parallel:
┌────────────────────┐ ┌──────────────────────────────────┐
│ Polyphase filter │ │ FFT-based psychoacoustic analysis │
│ bank (32 sub-bands)│ │ - Compute masking thresholds │
│ │ │ - Determine allowed noise per band│
└────────────────────┘ └──────────────────────────────────┘
│ │
└─────────────┬───────────┘
▼
MDCT (Modified Discrete Cosine Transform)
- 576 frequency coefficients per granule
- Better frequency resolution than polyphase alone
│
▼
Quantization
- Each coefficient divided by step size
- Step sizes chosen so quantization noise stays below masking threshold
- Bit allocation: louder/more-masked bands get fewer bits
│
▼
Huffman coding
- Quantized coefficients coded using Huffman tables
- Quad-point, pair, and singleton encoders for different value ranges
│
▼
Bit reservoir management
- Smooth out bit usage across frames
- Complex sections use bits saved from simple sections
│
▼
MP3 frame output (each ~26ms of audio)
The MDCT: Frequency Analysis Over Time
The Modified Discrete Cosine Transform is a variant of the DCT designed for audio:
- Processes overlapping blocks (half-overlap to avoid blocking artifacts)
- Converts 1152 time-domain samples to 576 frequency coefficients
- "Long blocks" for steady-state content, "short blocks" for transients
MDCT overlapping blocks:
Block N: ├────────────────────┤
Block N+1: ├────────────────────┤
Block N+2: ├────────────────────┤
50% overlap
This overlap-add approach avoids blocking artifacts at the
boundaries of analysis windows.
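The MDCT and its inverse are compact enough to implement directly. The sketch below (pure Python with a sine window; real codecs use fast algorithms and switch window shapes) demonstrates the time-domain aliasing cancellation property: each block's output is aliased, yet the overlap-add of adjacent blocks reconstructs the signal exactly.

```python
import math
import random

def mdct(block, N):
    """2N windowed time samples → N frequency coefficients."""
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [sum(w[n] * block[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, N):
    """N coefficients → 2N windowed samples, ready for overlap-add."""
    w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
    return [w[n] * (2 / N) * sum(coeffs[k]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

N = 32
random.seed(1)
x = [random.uniform(-1, 1) for _ in range(4 * N)]

# Three half-overlapped analysis blocks; overlap-add reconstructs the middle.
outs = [imdct(mdct(x[i:i + 2 * N], N), N) for i in (0, N, 2 * N)]
recon = [outs[0][N + i] + outs[1][i] for i in range(N)]   # samples x[N:2N]

err = max(abs(a - b) for a, b in zip(recon, x[N:2 * N]))
print(err)  # ≈ 0: the aliasing introduced by each block cancels on overlap-add
```

This is why the MDCT is critically sampled despite the 50% overlap: 2N input samples per block yield only N coefficients, and no information is lost.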
MP3 Quality at Different Bitrates
MP3 bitrate guide:
64 kbps: Acceptable for speech, AM-radio quality audio
Ratio: ~22:1 vs CD
128 kbps: "Transparent" for most listeners on most content
Ratio: ~11:1 vs CD
Long considered the minimum for music
192 kbps: High quality; differences from original audible only
on extreme material (cymbals, sustained strings)
Ratio: ~7:1 vs CD
256 kbps: Very high quality; audibly transparent on almost
all content for most listeners
320 kbps: Maximum MP3 bitrate; effectively transparent
Ratio: ~4:1 vs CD
VBR V0: Variable bitrate, targeting ~245 kbps average;
better quality than 256 kbps CBR at similar size
The "128kbps is transparent" belief is now considered outdated — controlled blind tests show many listeners can distinguish 128kbps from lossless on quality headphones. 192kbps or VBR V2 (~190 kbps) is the current pragmatic minimum for music.
AAC: The MP3 Successor
Advanced Audio Coding (AAC, 1997) was developed as the successor to MP3. It achieves better audio quality than MP3 at the same bitrate, or equal quality at lower bitrates.
Key improvements over MP3:
- Longer MDCT windows: 2048 coefficients (vs 576), better frequency resolution
- More flexible Huffman coding: AAC selects spectral codebooks per scalefactor band, a finer granularity than MP3's three coarse regions
- Temporal noise shaping (TNS): Shapes quantization noise in time, reducing pre-echo artifacts
- Perceptual noise substitution: For bands where noise is fully masked, replace the signal with synthesized noise (cheaper to encode)
- Better stereo handling: Mid/Side stereo coding, intensity stereo
Quality comparison (bitrate for transparent quality):
Format Bitrate for transparency
──────────────────────────────────
MP3 ~192 kbps
AAC ~128 kbps (AAC-LC, "Low Complexity" profile)
AAC-HE ~64 kbps (High Efficiency AAC, adds SBR)
AAC-HEv2 ~48 kbps (adds PS: Parametric Stereo)
Opus ~96 kbps (see below)
AAC is the standard for Apple Music, YouTube, streaming services, smartphones (AAC over Bluetooth), and much digital TV broadcasting (where Dolby's AC-3 is the main competing format).
Spectral Band Replication (HE-AAC)
AAC-HE adds Spectral Band Replication (SBR) — a clever trick for encoding at very low bitrates:
- Encode only the low-frequency half of the spectrum at full quality
- Encode a compact "envelope" describing the high-frequency energy distribution
- On decode, extrapolate high frequencies from the low-frequency signal using the envelope
At 48kbps, HE-AAC produces reasonable quality by letting the decoder reconstruct the "feel" of the high frequencies without transmitting them explicitly.
Opus: The Modern All-Purpose Codec
Opus (2012, standardized in RFC 6716) is the newest audio codec and in many ways the most impressive. It was designed for internet use — specifically WebRTC and streaming — and it outperforms every other audio codec across the bitrate range.
Codec performance by use case:
Low bitrate (8–24 kbps): Voice, narrow-band
Winner: Opus (uses SILK mode, speech-optimized)
Runner-up: AMR-WB
Medium bitrate (32–96 kbps): Music, wideband voice
Winner: Opus (uses CELT mode, MDCT-based)
Runner-up: AAC-HE
High bitrate (96+ kbps): High-quality music
Winner: Opus / AAC-LC (comparable)
Lossless: FLAC
For voice/teleconference: Always Opus
For streaming music: AAC-LC or Opus
For downloaded music: FLAC (lossless) or AAC-LC (lossy)
Opus works by combining two underlying codecs:
- SILK: Originally developed by Skype for voice, optimized for speech signals
- CELT: A high-quality audio codec based on MDCT with perceptual coding
Opus dynamically switches between them based on content type, and can also use a hybrid mode. This makes it uniquely suited for mixed voice-and-music content (like a podcast with intro music).
# Encoding audio with Opus in Python (using opuslib)
import opuslib
# Create encoder: 48000 Hz, 2 channels, audio application
enc = opuslib.Encoder(48000, 2, opuslib.APPLICATION_AUDIO)
enc.bitrate = 128000 # 128 kbps
# Encode a 20ms frame (960 samples at 48kHz)
pcm_frame = get_pcm_samples(960) # your audio source
encoded = enc.encode(pcm_frame, 960)
# Decode
dec = opuslib.Decoder(48000, 2)
decoded = dec.decode(encoded, 960)
The Container vs. Codec Distinction
A common confusion: "MP3" names both a codec and its file format. "AAC," by contrast, is only a codec; it can be stored in several containers: .aac (raw ADTS stream), .m4a (MPEG-4 audio), or .mp4 (the general MPEG-4 container).
Container / Codec combinations:
Container Extension Common codecs
─────────────────────────────────────────────
MPEG-4 Audio .m4a AAC-LC, HE-AAC
Ogg .ogg Vorbis, Opus, FLAC
WebM .webm Vorbis, Opus
MPEG audio .mp3 MP3 only (format = codec)
FLAC .flac FLAC only
WAV .wav PCM (uncompressed), sometimes others
AIFF .aiff PCM (uncompressed)
The container matters for:
- Chapter markers and metadata
- Gapless playback (requires container support)
- Video synchronization
- DRM integration
What Happens to Transcoded Audio
A critical warning: transcoding between lossy formats (e.g., MP3 → AAC) does not improve quality. It makes it worse.
Transcoding degradation:
Original PCM → MP3 128kbps → re-encoded AAC 128kbps
Quality drop: step 1 step 2 (additional loss)
Total loss = loss1 + loss2 > loss1
The artifacts from MP3 become inputs to the AAC encoder,
which can't distinguish "this artifact was deliberately encoded"
from "this is the original signal" and may make different
(equally wrong) decisions about what to discard.
Always transcode from the highest-quality source available. For music production: record lossless, distribute lossy.
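Generation loss is easy to simulate. The toy model below stands in for two lossy codecs using two differently aligned uniform quantizers (a drastic simplification; real codecs quantize in the frequency domain), and shows the error growing across generations:

```python
import math
import random

random.seed(2)
signal = [random.uniform(-1.0, 1.0) for _ in range(5000)]

def lossy(xs, step, offset=0.0):
    """Toy 'codec': a uniform quantizer standing in for real lossy encoding."""
    return [round((x - offset) / step) * step + offset for x in xs]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

once = lossy(signal, 0.10)                # first lossy generation ("MP3")
twice = lossy(once, 0.10, offset=0.037)   # transcode on a misaligned grid ("AAC")

print(rms_error(signal, once))
print(rms_error(signal, twice))  # larger: the two generations' errors compound
```

The second encoder cannot undo the first one's rounding; it can only add its own, so fidelity is monotonically non-increasing across lossy generations.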
Summary
- FLAC uses LPC prediction + Rice coding for bit-perfect audio compression at ~50% of original size.
- MP3 uses psychoacoustic masking models to discard frequencies the human ear can't hear, achieving 10:1+ compression.
- AAC improves on MP3 with better MDCT windows, temporal noise shaping, and spectral band replication in HE-AAC variants.
- Opus is the modern all-purpose codec — better than everything else across all bitrates, especially for voice.
- Never transcode between lossy formats; always use a lossless source when re-encoding.
- Psychoacoustic models are the key to lossy audio: they define what can be safely discarded by modelling the limits of human perception.
Next: video compression, where everything from this chapter applies, plus an entirely new dimension — time.
Chapter 10: Video Compression
Video is where the stakes are highest and the algorithms are most complex. A single minute of uncompressed 1080p video at 30fps is about 6GB. Streaming services deliver that minute in roughly 50MB at good quality — a 120:1 compression ratio — in real time, to billions of devices.
Understanding video compression means understanding everything from the previous chapters, applied to a new dimension: time. Video has spatial redundancy (within a frame, like images) and temporal redundancy (between frames, since most of the scene doesn't move much between consecutive frames). The second kind is where the dramatic compression ratios come from.
The Scale of the Problem
Uncompressed video data rates:
Resolution FPS Bit depth Data rate
──────────────────────────────────────────────
720p 30 8-bit 664 Mbps
1080p 30 8-bit 1.5 Gbps
1080p 60 8-bit 3.0 Gbps
4K UHD 30 8-bit 6.0 Gbps
4K UHD 60 10-bit 14.9 Gbps
8K UHD 60 10-bit 59.7 Gbps
(assuming 24 bits/pixel for 8-bit RGB, 30 bits/pixel for 10-bit)
Target streaming bitrates (compressed):
1080p at good quality: 4–8 Mbps (H.264)
2–4 Mbps (H.265/HEVC)
1.5–3 Mbps (AV1)
4K at good quality: 15–25 Mbps (H.264, rarely used)
7–12 Mbps (H.265)
4–8 Mbps (AV1)
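These raw rates are straightforward to derive:

```python
def uncompressed_mbps(width, height, fps, bits_per_pixel=24):
    """Raw video data rate in megabits/s (24 bpp = 8-bit RGB or 4:4:4)."""
    return width * height * fps * bits_per_pixel / 1e6

print(f"{uncompressed_mbps(1920, 1080, 30):.0f} Mbps")      # → 1493 Mbps
print(f"{uncompressed_mbps(3840, 2160, 60, 30):.0f} Mbps")  # → 14930 Mbps (10-bit 4K60)
```

Dividing these figures by the target streaming bitrates above gives the required compression ratios: a few hundred to one, which is why temporal prediction is indispensable.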
Spatial Compression: Intra-Frame Coding
Each individual video frame can be compressed like an image — using DCT, quantization, and entropy coding. This is called intra-frame coding (I-frames in MPEG terminology).
Modern video codecs apply intra-prediction before the transform: they predict each block from its neighbors within the same frame, then transform and quantize the prediction error (residual).
Intra-prediction directions (H.264/H.265):
For each block, try predicting from:
┌─────────────────────────────────────────┐
│ ←─ left ─→ ↑─ above ─↑ ↖diagonal │
│ ↙ down-left ↘ down-right │
│ DC (flat, average of neighbors) │
│ + many more in H.265 (33 directions) │
│ + even more in AV1 (56 directions) │
└─────────────────────────────────────────┘
Pick the prediction that leaves the smallest residual.
Encode: prediction mode + residual.
This gives modern codecs a significant advantage over JPEG even on still images: JPEG has no spatial intra-prediction (only DPCM of each block's DC coefficient).
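Mode selection can be sketched in a few lines. This is a toy with three modes and SAD-only cost; real encoders try dozens of directional modes and weigh bit cost as well as distortion.

```python
def best_intra_mode(block, left, above):
    """Pick the predictor (DC / left / above) leaving the smallest residual."""
    n = len(block)
    dc = sum(left + above) // (len(left) + len(above))
    candidates = {
        "DC":    [[dc] * n for _ in range(n)],
        "left":  [[left[r]] * n for r in range(n)],   # extend left column rightward
        "above": [above[:] for _ in range(n)],        # extend top row downward
    }
    def sad(pred):  # Sum of Absolute Differences
        return sum(abs(block[r][c] - pred[r][c])
                   for r in range(n) for c in range(n))
    return min(candidates, key=lambda m: sad(candidates[m]))

# A block of horizontal stripes is predicted perfectly from its left neighbors.
block = [[10] * 4, [50] * 4, [90] * 4, [130] * 4]
print(best_intra_mode(block, left=[10, 50, 90, 130], above=[70] * 4))  # → left
```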
Temporal Compression: Inter-Frame Coding
The big win. Between consecutive video frames, most pixels are the same or have just moved slightly. Instead of encoding each frame independently, encode only the difference from a previous frame.
Temporal redundancy visualization:
Frame N: Frame N+1 (person moves slightly right):
┌──────────────────────┐ ┌──────────────────────┐
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
│░░░░░[PERSON]░░░░░░░░░│ │░░░░░░░[PERSON]░░░░░░░│
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
│░░░░░░░░░░░░░░░░░░░░░░│ │░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────┘ └──────────────────────┘
Background: identical (background compression is essentially free)
Person: same pixels, shifted 2 pixels right
→ encode as "motion vector (2,0), copy region"
The difference frame: almost entirely zero.
Motion Estimation and Compensation
Motion estimation: Find where each block of the current frame came from in the reference frame. Motion compensation: Use the found match to predict the current block.
The encoder tries to find, for each block, a motion vector (dx, dy) such that the block in the reference frame at (x+dx, y+dy) is as similar as possible to the current block.
Motion estimation search pattern:
For block at current position (cx, cy),
search reference frame within search range:
Reference frame search window:
┌─────────────────────────────┐
│ . . . . . . . . . . . . │ Each '.' is a candidate
│ . . . . . . . . . . . . │ position to check.
│ . . . . ┌─────┐ . . . . │
│ . . . . │BLOCK│ . . . . │ Find the candidate with
│ . . . . └─────┘ . . . . │ smallest difference (SAD,
│ . . . . . . . . . . . . │ Sum of Absolute Differences)
│ . . . . . . . . . . . . │
└─────────────────────────────┘
Exhaustive search: O(search_range²) per block position → too slow
Fast search algorithms:
Three-step search: O(log(search_range)) candidates
Diamond search: starts large, zooms in
Hierarchical (multi-resolution): search at coarse scale first
After motion compensation, the residual (actual block minus predicted block) is typically small — it can be transformed, quantized, and encoded like an intra-frame block, but with fewer bits needed because the values are close to zero.
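A minimal exhaustive (full) search, the baseline that the fast algorithms approximate:

```python
def sad(frame, x, y, block, bs):
    """Sum of Absolute Differences between `block` and frame region at (x, y)."""
    return sum(abs(frame[y + r][x + c] - block[r][c])
               for r in range(bs) for c in range(bs))

def full_search(ref, block, cx, cy, search_range, bs):
    """Exhaustive block matching: try every candidate offset, keep the best SAD."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = cx + dx, cy + dy
            if 0 <= x <= len(ref[0]) - bs and 0 <= y <= len(ref) - bs:
                cost = sad(ref, x, y, block, bs)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best  # (SAD, dx, dy)

# Reference frame: a bright square on a dark background.
W = H = 16
ref = [[0] * W for _ in range(H)]
for r in range(4, 8):
    for c in range(4, 8):
        ref[r][c] = 200

# Current 4×4 block at (6, 6): the square has moved +2,+2 since the reference.
cur_block = [[ref[r - 2][c - 2] for c in range(6, 10)] for r in range(6, 10)]
print(full_search(ref, cur_block, 6, 6, 4, 4))  # → (0, -2, -2)
```

The motion vector (-2, -2) points back to where the block came from in the reference, and the SAD of 0 means the residual for this block is entirely free.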
Sub-pixel Motion
Real motion isn't always aligned to integer pixel boundaries. A block might have moved 2.5 pixels to the right. Modern codecs support sub-pixel motion vectors:
- Half-pixel precision: Intermediate values computed by bilinear interpolation of adjacent pixels
- Quarter-pixel precision: Bicubic or similar interpolation (H.264 and later)
- 1/8 pixel precision: Used in AV1
Better sub-pixel accuracy → better motion compensation → smaller residuals → better compression. Cost: more computation for both encoder and decoder.
Frame Types
Video streams use three types of frames:
GOP (Group of Pictures) structure:
I  B  B  P  B  B  P  B  B  P  B  B  I
│           │                       │
│           └─ P/B frames           └─ next I-frame (keyframe),
│              (inter-predicted)       start of the next GOP
│
└─ I-frame (Intra-coded, keyframe):
     Self-contained; decoder can start here.
     Largest frames (no inter-prediction).
P-frame (Predicted):
Predicted from ONE past reference frame.
Contains motion vectors + residuals.
Much smaller than I-frames.
B-frame (Bi-directional):
Predicted from BOTH past AND future frames.
Smallest of the three — uses best of both references.
Requires decoder to buffer future frames → adds latency.
I-frames (keyframes) are where seeking can land. Every I-frame is fully self-contained. Streaming services insert I-frames every few seconds to allow seeks and fast channel-changing.
P-frames reference one past frame. A decoder must have correctly decoded the reference frame to decode a P-frame.
B-frames reference frames in both directions. Better compression, but adds decode complexity and latency. Live streaming often avoids B-frames; stored video uses them extensively.
H.264 / AVC: The Universal Codec
H.264 (Advanced Video Coding, 2003) is the codec that made HD video streaming practical. It's in every smartphone, streaming service, browser, and camera. Even as newer codecs have emerged, H.264 remains dominant because it has 15+ years of hardware decoder support.
H.264 Key Features
Entropy coding: Two options:
- CAVLC (Context-Adaptive Variable-Length Coding): Simpler, faster, baseline profile
- CABAC (Context-Adaptive Binary Arithmetic Coding): Better compression (~10-15%), requires more computation, used in main/high profiles
Reference frames: Can reference up to 16 past frames (not just the immediately previous one), enabling better motion compensation for complex motion.
Deblocking filter: Applied after reconstruction to smooth blocking artifacts at transform-block boundaries. H.264's deblocking filter is in-loop: it runs before reference frames are stored, so future frames are predicted from already-deblocked references.
Profiles and levels: H.264 defines multiple profiles (Baseline, Main, High) with increasing feature sets, and levels controlling maximum resolution, bitrate, and buffer sizes.
H.264 profiles:
Baseline: No B-frames, CAVLC only. Low-power decoders.
Used for: video calling, older devices
Main: B-frames, CABAC. Better compression.
Used for: standard SD/HD video, some streaming
High: 8×8 intra DCT, adaptive quantization, CABAC.
Best compression. Used for: Blu-ray, streaming HD
High 10: 10-bit depth. HDR content.
High 4:2:2, High 4:4:4: Professional production.
H.264 Encoder Presets (x264)
The x264 encoder has presets from ultrafast to veryslow. The slower the preset, the more encoding time is spent looking for better compression, with diminishing returns.
x264 preset comparison (1080p at CRF 23):
Preset Encode speed File size PSNR
──────────────────────────────────────────────
ultrafast ~1500 fps Large Low
fast ~350 fps Medium Good
medium ~150 fps Medium Good
slow ~70 fps Small Better
veryslow ~20 fps Smaller Best
CRF (Constant Rate Factor) = quality target.
Lower CRF = higher quality, larger file.
CRF 23 = default; CRF 18 ≈ visually lossless.
H.265 / HEVC: Twice the Compression
H.265 (High Efficiency Video Coding, 2013) targets the same quality as H.264 at half the bitrate, or better quality at the same bitrate.
Key Differences from H.264
CTUs instead of macroblocks: H.264 uses 16×16 macroblocks. H.265 uses Coding Tree Units (CTUs) of up to 64×64 pixels, subdivided recursively using a quadtree structure.
CTU quadtree structure:
64×64 CTU:
┌──────────────────────────────────────┐
│ │ │CU│CU│ │
│ Single 32×32 CU │ 16×16 ├──┼──┤ │
│ (flat region) │ CU │CU│CU│ │
│ │ ├──┴──┤ │
│ ├─────────┤ 8×8│ │
│ │ 16×16 │ │ │
│ │ CU │ │ │
└──────────────────┴─────────┴─────┴───┘
Large blocks for uniform regions (efficient).
Small blocks for detailed/complex regions (accurate).
More intra prediction modes: 35 directions (vs 9 in H.264 high profile).
SAO (Sample Adaptive Offset): An additional in-loop filter that reduces banding and ringing artifacts using lookup tables.
Better motion compensation: Asymmetric partition shapes, more reference frames.
H.265 Licensing Problem
H.265 achieves excellent compression but has a complex patent licensing situation with multiple patent pools. This fragmentation discouraged browser support and pushed Google, Mozilla, Cisco, and others to form the Alliance for Open Media to develop AV1 as a royalty-free alternative.
AV1: The Open Codec
AV1 (2018) was designed by the Alliance for Open Media specifically to be:
- Royalty-free (no per-unit licensing fees)
- ~50% lower bitrate than H.264 at equivalent quality (comparable to H.265, sometimes better)
- Better than VP9 (its predecessor from Google)
AV1 Innovations
Larger, more flexible block partitions: Up to 128×128, with 10 different split patterns (including T-shaped and L-shaped).
More intra prediction modes: 56 directional angles, plus "smooth" predictors, "palette mode" for flat-colored regions, and "CfL" (Chroma from Luma).
Frame-level film grain synthesis: Instead of encoding film grain in each frame (expensive), AV1 can encode grain parameters and resynthesize plausible grain on decode. Saves bits, preserves perceptual "texture."
Loop Restoration filter: More advanced post-processing than H.265's SAO.
Constrained Directional Enhancement Filter (CDEF): Reduces ringing artifacts along directional edges.
Codec comparison at equivalent quality:
Bitrate for comparable quality (normalized to H.264):
H.264: 100% (baseline)
VP9: 65% (~35% savings)
H.265: 55% (~45% savings)
AV1: 50% (~50% savings)
At 4K/60fps content:
H.264 needs ~50 Mbps
H.265 needs ~25 Mbps
AV1 needs ~18 Mbps
Encoding speed (software, 1080p):
H.264 (x264 fast): ~600 fps
H.265 (x265 fast): ~150 fps
AV1 (libaom fast): ~8 fps ← Very slow software encoder
AV1 (SVT-AV1 fast): ~120 fps ← Production-grade fast encoder
AV1 hardware encoders (in Intel Arc, NVIDIA RTX 40xx, AMD RDNA3) bring encoding speed to real-time. Hardware decoders are in smartphones (since 2021), smart TVs, and recent Intel/AMD/NVIDIA graphics cards. Netflix, YouTube, and Twitch all use AV1.
The Complete Video Compression Pipeline
Video codec encoding pipeline (H.264/H.265/AV1):
Raw video frames
│
▼
Scene change detection
→ Insert I-frame at scene changes
│
▼
For each frame, for each block:
┌──────────────────────────────────────────────┐
│ 1. INTRA PREDICTION or INTER PREDICTION │
│ Intra: predict from neighbors in frame │
│ Inter: motion estimation against refs, │
│ find best motion vector │
│ │
│ 2. TRANSFORM (DCT or integer approximation)│
│ Convert residual to frequency domain │
│ │
│ 3. QUANTIZATION │
│ Divide by quantization step (lossy!) │
│ Rate-distortion optimization decides │
│ which blocks get more/fewer bits │
│ │
│ 4. IN-LOOP FILTERING │
│ Deblocking, SAO (H.265), CDEF (AV1) │
│ │
│ 5. ENTROPY CODING │
│     CABAC (H.264/H.265) or multi-symbol arithmetic coding (AV1)   │
└──────────────────────────────────────────────┘
│
▼
Video bitstream (NAL units / OBUs for AV1)
│
▼
Container: MP4 (H.264/H.265), WebM (AV1/VP9), MKV, TS, ...
Rate Control: The Invisible Algorithm
Rate control decides how many bits to allocate per frame. It's as important as the codec itself for real-world quality.
Rate control modes:
CRF (Constant Rate Factor): Target a quality level, let bitrate vary.
Best for: archiving, local storage
Downside: unknown output file size
CBR (Constant Bit Rate): Exactly N bits per second, always.
Best for: live streaming, broadcast with fixed bandwidth
Downside: wastes bits on simple scenes, degrades quality on complex ones
ABR (Average Bit Rate): Target average, let it vary.
Middle ground. Used by most streaming services.
CQ (Constant Quantizer): Same QP for every frame.
Simple, rarely optimal. Educational use mainly.
2-pass encoding:
Pass 1: Analyze video, estimate complexity per scene.
Pass 2: Allocate bits based on analysis.
Best quality at target filesize. 2× encoding time.
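In its simplest form, two-pass allocation reduces to proportional budgeting. The sketch below is a toy; real rate control also respects decoder buffer constraints and per-frame limits.

```python
def two_pass_allocate(complexities, total_bits):
    """Toy 2-pass rate control: pass 1 measured each scene's complexity;
    pass 2 hands out bits in proportion, within a fixed total budget."""
    total_c = sum(complexities)
    return [total_bits * c / total_c for c in complexities]

# Pass 1 found one complex scene (action) among simple ones (talking heads).
scene_complexity = [1.0, 1.0, 6.0, 2.0]
bits = two_pass_allocate(scene_complexity, total_bits=10_000_000)
print([round(b) for b in bits])  # → [1000000, 1000000, 6000000, 2000000]
```

A single-pass CBR encoder would give every scene 2.5 Mb here, starving the action scene and wasting bits on the talking heads.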
Why B-Frames Hurt Live Streaming
B-frames reference future frames — which means the encoder can't output a B-frame until it has seen the future frames it references, and the decoder can't display it until it has received those future frames.
B-frame latency:
Without B-frames (low latency):
Capture → Encode → Transmit → Decode → Display
Latency: ~1–2 frames
With B-frames (better compression):
Capture → [wait for future frames] → Encode → Transmit
→ [buffer future frames] → Decode → Display
Latency: 3+ frames minimum
For interactive video (gaming, video calling):
B-frames are disabled. Latency matters more than compression.
For stored video / VOD streaming:
B-frames are used. Viewers don't notice 2-second buffer.
Codec Support Matrix
Codec support in major contexts (as of 2025):
Context H.264 H.265 AV1 VP9
────────────────────────────────────────────────────
Chrome/Firefox ✓ ✓ ✓ ✓
Safari ✓ ✓ ✓ —
Mobile (iOS) ✓ ✓ ✓(hw) —
Mobile (Android) ✓ ✓ ✓(hw) ✓
Smart TVs ✓ ✓ Varies —
YouTube ✓ — ✓ ✓
Netflix ✓ ✓ ✓ —
Blu-ray ✓ ✓ — —
FFmpeg encode ✓ ✓ ✓ ✓
Summary
- Video compression combines spatial (intra-frame, like images) and temporal (inter-frame) compression.
- Motion estimation finds where blocks moved between frames; motion compensation predicts current blocks from past frames.
- I-frames are keyframes (self-contained), P-frames predict from past, B-frames predict from past and future.
- H.264 is the universal standard — hardware support everywhere, reasonable compression.
- H.265/HEVC is ~50% more efficient than H.264, hampered by patent complexity.
- AV1 is royalty-free, ~50% more efficient than H.264, hardware acceleration arriving. The future for streaming.
- Rate control and encoder presets are as important as codec choice for real-world quality.
- Live streaming avoids B-frames to minimize latency; stored video uses them for compression.
Next: how databases compress data, where the same LZ and Huffman ideas run at a completely different scale — and where column-oriented storage changes everything.
Chapter 11: Database and Columnar Compression
Databases have a compression problem that's different from the problems we've seen so far. A video compressor can take minutes to encode a single frame. A database compressor runs on every write, every read, potentially millions of times per second. The constraint shifts from "best possible ratio" to "best ratio achievable in microseconds."
But databases have an advantage too: they know something about their data. A database knows that a column contains timestamps, or integers, or strings of bounded length. It knows the data type, the sort order, the cardinality. This domain knowledge enables compression techniques that generic algorithms can't match.
Two Storage Models
Before compression, we need to understand the fundamental distinction between row-oriented and column-oriented storage, because it completely determines which compression techniques work.
The same data, stored two ways:
Logical table:
user_id country age plan
─────────────────────────────
1001 US 34 pro
1002 UK 27 free
1003 US 41 pro
1004 DE 33 free
1005 US 29 pro
Row-oriented (traditional OLTP):
[1001,US,34,pro][1002,UK,27,free][1003,US,41,pro][1004,DE,33,free]...
Column-oriented (OLAP / analytics):
user_id: [1001][1002][1003][1004][1005]...
country: [US][UK][US][DE][US]...
age: [34][27][41][33][29]...
plan: [pro][free][pro][free][pro]...
Why Columns Compress Better
When data is stored in columns, each column contains only one type of data — and real-world data has low cardinality (few distinct values) and high correlation within columns:
- All country values are strings from a 195-element set
- All age values are integers in [0, 150]
- All plan values are one of {free, pro, enterprise}
- Adjacent rows in a time-sorted table have similar timestamps
A generic compressor on row data sees: 1001, US, 34, pro, 1002, UK, 27, free — a heterogeneous mix. A columnar compressor on the plan column sees: pro, free, pro, free, pro — highly compressible.
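The effect is easy to demonstrate with a generic compressor. The sketch below (synthetic data, zlib standing in for any LZ-family compressor) serializes the same records row-major and column-major and compares compressed sizes:

```python
# Same records, serialized row-major vs column-major, then compressed
# with zlib. Column-major groups like values together, so it compresses
# noticeably better.
import random
import zlib

random.seed(0)
countries = ["US", "UK", "DE", "FR"]
plans = ["pro", "free", "enterprise"]
rows = [(1000 + i, random.choice(countries), random.randint(18, 65),
         random.choice(plans)) for i in range(10_000)]

# Row-oriented: heterogeneous values interleaved per record
row_major = "".join(f"{u},{c},{a},{p};" for u, c, a, p in rows).encode()

# Column-oriented: each column's values stored contiguously
col_major = "|".join([
    ",".join(str(u) for u, _, _, _ in rows),
    ",".join(c for _, c, _, _ in rows),
    ",".join(str(a) for _, _, a, _ in rows),
    ",".join(p for _, _, _, p in rows),
]).encode()

row_size = len(zlib.compress(row_major, 6))
col_size = len(zlib.compress(col_major, 6))
print("row-major:   ", row_size, "bytes")
print("column-major:", col_size, "bytes")  # smaller
```

The exact ratio depends on the data, but the ordering is robust: sequential IDs share prefixes, the country column has a four-symbol alphabet, and the plan column has three.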
Encoding Schemes for Columnar Data
Dictionary Encoding
Replace string values with integer codes. Store the code → string mapping once.
Dictionary encoding of the 'country' column:
Dictionary: {0: "US", 1: "UK", 2: "DE", 3: "FR", ...}
Raw strings: "US", "UK", "US", "DE", "US", "US", "FR", "US"
Encoded: 0, 1, 0, 2, 0, 0, 3, 0
Savings:
Raw: 8 × 2 bytes = 16 bytes (average country code is 2 chars)
Dict: 8 × 1 byte + 4 × 2 bytes (dict) ≈ 16 bytes
→ No savings when the strings are as short as the codes!
But for longer strings ("United States", "United Kingdom"):
Raw: 8 × 13 bytes average = 104 bytes
Dict: 8 × 1 byte + 4 × 13 bytes (only the 4 distinct values) = 60 bytes
The real win: dict encoding enables faster comparisons and aggregations.
Filter "country = 'US'" becomes "code = 0" — integer comparison on
compact integer arrays. No string hashing or comparison needed.
Dictionary encoding is the foundation of most columnar formats. It enables:
- Compact representation when cardinality is low
- Fast filtering without decoding (compare integers to integer codes)
- Late materialization (operate on codes, decode only when outputting results)
Run-Length Encoding (Columnar Style)
After sorting, columns often have long runs of identical values. RLE stores (value, count) pairs:
RLE on sorted 'plan' column:
Sorted data: free, free, free, ...(10000×)..., pro, pro, ...(5000×)..., enterprise, ...
RLE: [(free, 10000), (pro, 5000), (enterprise, 500)]
Query "count where plan = 'pro'": → read single RLE entry → 5000
No decompression needed!
This is "predicate pushdown into the encoding":
The compression format IS the index.
Parquet, ORC, and most columnar databases use RLE on sorted data extensively.
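A minimal sketch of columnar RLE, including the count query answered directly from the runs (rle_encode is an illustrative helper, not a library function):

```python
# RLE on a sorted column: store (value, count) pairs, then answer a
# count query by reading one run, without decoding the column.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

plan = ["free"] * 10000 + ["pro"] * 5000 + ["enterprise"] * 500
runs = rle_encode(plan)
print(runs)  # [('free', 10000), ('pro', 5000), ('enterprise', 500)]

# "count where plan = 'pro'": read a single run entry
count_pro = sum(n for v, n in runs if v == "pro")
print(count_pro)  # 5000
```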
Bit-Packing
If a column's values fit in fewer bits than the storage type, pack them tightly.
Bit-packing a column of values 0–15 (4 bits needed, stored in 8):
Naive storage (8 bits each):
0x03 0x07 0x0B 0x0F 0x01 0x09 0x02 0x0E
= 8 bytes for 8 values
Bit-packed (4 bits each):
0x37 0xBF 0x19 0x2E
= 4 bytes for 8 values (50% savings)
Vectorized unpacking:
Modern CPUs can unpack bit-packed integers using SIMD instructions
at rates exceeding 1 billion integers per second.
PFOR (Patched Frame of Reference) is a sophisticated variant: most values in a frame fit in B bits, with occasional outliers ("exceptions"). Store the base values in B bits, exceptions separately.
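The nibble-packing example above can be reproduced in a few lines (pack4/unpack4 are illustrative helpers; real implementations pack arbitrary bit widths with SIMD):

```python
# Pack 4-bit values two per byte, first value in the high nibble,
# matching the worked example above.
def pack4(values):
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack4(packed):
    out = []
    for b in packed:
        out += [b >> 4, b & 0x0F]
    return out

vals = [0x3, 0x7, 0xB, 0xF, 0x1, 0x9, 0x2, 0xE]
packed = pack4(vals)
print(packed.hex())              # 37bf192e (4 bytes for 8 values)
print(unpack4(packed) == vals)   # round-trips
```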
Delta Encoding for Sorted Numerics
Sorted ID or timestamp columns benefit enormously from delta encoding:
User ID column (sorted, sequential):
Raw: 10000001, 10000002, 10000003, ..., 10000000+N
Delta: 10000001, 1, 1, 1, ..., 1
After delta: bit-pack the 1s into 1 bit each.
Savings: 32-bit integers → effectively 1 bit each.
Timestamp column (milliseconds, sorted):
Raw: 1700000000000, 1700000000060, 1700000000120...
Delta: 1700000000000, 60, 60, 60...
After delta: values fit in 6 bits (60 < 64).
Savings: 64-bit timestamps → 6 bits each.
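The delta step is trivial to implement; the interesting part is that the bit width needed afterward is just the bit length of the largest delta. A sketch (delta_encode/delta_decode are illustrative helpers):

```python
# Delta-encode a sorted timestamp column and compute the bit-packing
# width the deltas would need.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1_700_000_000_000 + 60 * i for i in range(1000)]  # 60ms apart
deltas = delta_encode(ts)
assert delta_decode(deltas) == ts   # lossless round-trip

max_delta = max(deltas[1:])
bits = max_delta.bit_length()
print(bits)  # 6 -- each delta fits in 6 bits instead of 64
```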
Parquet: Columnar Storage Done Right
Apache Parquet (2013) is the de facto standard for columnar data at rest in the Hadoop ecosystem. It's used by Spark, Hive, Athena, Redshift Spectrum, BigQuery (via external tables), and most modern data lake tools.
Parquet File Structure
Parquet file layout:
┌──────────────────────────────────────────────┐
│ Magic: PAR1 (4 bytes) │
├──────────────────────────────────────────────┤
│ Row Group 1 (e.g., 128MB of row data) │
│ ┌─────────────────────────────────────────┐│
│ │ Column Chunk: user_id ││
│ │ Page 1 (Dictionary page, if any) ││
│ │ Page 2 (Data pages) ││
│ │ Page 3 ... ││
│ ├─────────────────────────────────────────┤│
│ │ Column Chunk: country ││
│ │ Dictionary page + data pages ││
│ ├─────────────────────────────────────────┤│
│ │ Column Chunk: age ││
│ │ Data pages (RLE/bit-packed) ││
│ └─────────────────────────────────────────┘│
├──────────────────────────────────────────────┤
│ Row Group 2... │
├──────────────────────────────────────────────┤
│ File Footer (Thrift-encoded metadata) │
│ - Schema (column names, types) │
│ - Row group offsets and sizes │
│ - Column statistics (min, max, null count) │
│ - Encoding types used per column │
├──────────────────────────────────────────────┤
│ Footer length (4 bytes) │
│ Magic: PAR1 (4 bytes) │
└──────────────────────────────────────────────┘
Parquet Encodings
Parquet supports multiple encoding schemes per column chunk, chosen by the writer:
Parquet encoding types:
PLAIN: Raw values, no encoding. Fallback.
DICTIONARY: Dict page + integer codes. Default for strings.
RLE_DICTIONARY: RLE on top of dict codes. Sorted low-cardinality.
PLAIN_DICTIONARY: Deprecated (same as RLE_DICTIONARY).
RLE: Run-length encoding.
BIT_PACKED: Bit-packing. Used for repetition/definition levels.
DELTA_BINARY_PACKED: Delta + bit-packing for sorted integers.
DELTA_LENGTH_BYTE_ARRAY: Delta coding of string lengths.
DELTA_BYTE_ARRAY: Delta coding of byte arrays (prefix compression for sorted strings).
BYTE_STREAM_SPLIT: Split float bytes across separate streams for better GZIP compression.
The most important: DELTA_BINARY_PACKED for timestamp columns. A column of 1 billion sorted timestamps takes roughly 8GB naive; with delta + bit-packing it can compress to under 1GB before any additional gzip/Snappy compression.
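A back-of-envelope check of that claim, assuming (as in the earlier example) that the deltas fit in 6 bits:

```python
# 1 billion sorted int64 timestamps: naive vs delta + bit-packed size.
n = 1_000_000_000
naive_bytes = n * 8                      # int64 storage
print(naive_bytes / 1e9)                 # 8.0 GB

bits_per_delta = 6                       # assumed regular spacing
packed_bytes = n * bits_per_delta / 8
print(packed_bytes / 1e9)                # 0.75 GB, under 1 GB
```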
Parquet + Compression Codec
After columnar encoding, Parquet applies a general-purpose compressor to each page:
Encoding pipeline for a string column:
"US", "UK", "US", "US", "DE", ...
│
▼ Dictionary encoding
Dict: {0:"US", 1:"UK", 2:"DE"}
Codes: 0, 1, 0, 0, 2, ...
│
▼ RLE_DICTIONARY
(run-length encode the codes)
(0, 3), (1, 1), (0, 47), (2, 2), ...
│
▼ Snappy / GZIP / Zstd / LZ4
Generic compression on the encoded bytes
│
▼ Parquet page bytes
Choosing the Parquet compression codec:
Parquet codec comparison:
Codec Compress speed Decompress Ratio Use case
───────────────────────────────────────────────────────────────
UNCOMPRESSED — Fastest 1.0× Development/debug
SNAPPY Very fast Very fast 1.5× General Hadoop
LZ4_RAW Very fast Fastest 1.5× Performance-critical
GZIP Slow Fast 2.0× Cold storage, S3
ZSTD Fast Very fast 2.0× Modern default choice
BROTLI Very slow Fast 2.2× Static web data
Zstd has become the recommended default for new Parquet files in 2023+: better compression than Snappy, much faster than GZIP, good decompression speed.
LZ4: Speed Above All Else
LZ4 was designed by Yann Collet (who also wrote Zstandard) with a single goal: be the fastest possible LZ-family compressor. Not the best ratio — the fastest.
LZ4 design philosophy:
"The minimum information needed to decompress is:
- A stream of literal bytes (copy these literally)
- A stream of (offset, length) match references
Encode these as simply as possible.
No Huffman. No arithmetic coding. Just token headers."
LZ4 token format:
┌──────────────────────────────────────────────────┐
│ 1 byte token: │
│ [literal_len: 4 bits][match_len: 4 bits] │
│ │
│ If literal_len == 15: read more bytes (255+) │
│ Then: literal_len literal bytes │
│ Then: 2-byte offset (little-endian) │
│ If match_len == 15: read more bytes (255+4) │
└──────────────────────────────────────────────────┘
No entropy coding = decompressor does almost nothing.
Modern CPUs can decompress LZ4 at 4+ GB/s.
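The token format above is simple enough that a toy decoder fits in a page. This sketch handles the block format only (no frame header, no checksums) and ignores the spec's end-of-block restrictions, so it illustrates the idea rather than replacing the real library:

```python
# Minimal decoder for the LZ4 block token format described above.
def lz4_block_decompress(src: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        lit_len = token >> 4                  # high nibble
        if lit_len == 15:                     # extended literal length
            while True:
                b = src[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        out += src[i:i + lit_len]; i += lit_len
        if i >= len(src):                     # last sequence: literals only
            break
        offset = src[i] | (src[i + 1] << 8); i += 2   # little-endian
        match_len = (token & 0x0F) + 4        # minimum match is 4 bytes
        if (token & 0x0F) == 15:              # extended match length
            while True:
                b = src[i]; i += 1
                match_len += b
                if b != 255:
                    break
        for _ in range(match_len):            # byte-wise copy: overlap is legal
            out.append(out[-offset])
    return bytes(out)

# Hand-built block: token 0x35 = 3 literals, match length 5+4=9, offset 3
block = bytes([0x35]) + b"abc" + bytes([0x03, 0x00])
print(lz4_block_decompress(block))  # b'abcabcabcabc'
```

Note the byte-by-byte match copy: an offset smaller than the match length is an overlapping copy, which is how LZ4 expresses run-like repetition.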
LZ4 compresses roughly 2:1 on typical data — much less than Zstd or gzip. But its decompression speed is so high that using LZ4 to compress RAM can actually increase effective memory bandwidth:
LZ4 in-memory caching:
Without compression:
Read 100MB from disk → store 100MB in RAM → serve 100MB
Network → 100MB/s → Cache hit reads: 100MB/s
With LZ4 compression:
Read 100MB from disk → compress to 50MB → store 50MB in RAM
Network → 100MB/s → Cache hit reads: 50MB at 4GB/s decompression
Effective cache size: 2× (same RAM, more data)
Latency: slightly higher (decompress time ~12ms for 50MB)
This is why Redis, Memcached, and many in-memory databases offer LZ4 as an optional compression layer.
Snappy: Google's In-Memory Compressor
Snappy (originally Zippy, by Google, open-sourced 2011) is similar to LZ4 — designed for speed over ratio. It's the default codec in Hadoop, Cassandra, and LevelDB/RocksDB.
Snappy vs. LZ4:
Metric Snappy LZ4
─────────────────────────────────────
Compression speed ~500 MB/s ~700 MB/s
Decompression ~1800 MB/s ~4000 MB/s
Compression ratio ~1.7× ~2.1×
Framing format Yes (spec) Flexible
LZ4 has mostly superseded Snappy in new systems — it's faster and compresses better. Snappy is still dominant in existing deployments due to ecosystem inertia.
Columnar Compression in Practice
Reading Only What You Need
The biggest benefit of columnar storage isn't compression ratio — it's I/O efficiency:
Query: SELECT avg(age) FROM users WHERE country = 'US'
Row-oriented storage:
Must read ALL bytes of ALL rows (user_id, country, age, plan, ...)
to extract country and age for each row.
For 10M rows, 4 columns, 8 bytes each: 320MB read.
Column-oriented storage:
Read ONLY the 'country' column and 'age' column.
For 10M rows, 2 columns, 4 bytes each: 80MB read.
4× less I/O before even considering compression.
With compression:
Country column (dictionary): ~5MB (dictionary codes are tiny)
Age column (bit-packed): ~20MB (values 0-150 fit in 8 bits)
Total: 25MB read — 13× less than row-oriented.
Predicate Pushdown and Bloom Filters
Parquet stores column statistics (min, max, null count) and optional Bloom filter data in the row group footer. Query engines use this to skip entire row groups without reading the column data:
Row group pruning example:
Query: SELECT * FROM orders WHERE order_date = '2024-01-15'
Row group 1: date range [2024-01-01, 2024-01-10] → SKIP
Row group 2: date range [2024-01-11, 2024-01-20] → READ
Row group 3: date range [2024-01-21, 2024-01-31] → SKIP
Row group 4: date range [2024-02-01, 2024-02-15] → SKIP
Result: read 25% of the data, skip 75%.
Combined with columnar projection, potentially read 3-5% of bytes.
Bloom filters provide set membership tests ("is this order_id in this row group?") without false negatives, at the cost of ~1% false positives. Combined with statistics, they enable very high skip rates for point queries.
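Min/max pruning itself is just an interval check per row group. A sketch of what a query engine does with the footer statistics (the row_groups structure here is illustrative, not Parquet's actual metadata layout):

```python
# Statistics-based row group pruning: read a group only if the
# predicate value could fall inside its [min, max] range.
row_groups = [
    {"id": 1, "min": "2024-01-01", "max": "2024-01-10"},
    {"id": 2, "min": "2024-01-11", "max": "2024-01-20"},
    {"id": 3, "min": "2024-01-21", "max": "2024-01-31"},
    {"id": 4, "min": "2024-02-01", "max": "2024-02-15"},
]

def groups_to_read(predicate_value, groups):
    return [g["id"] for g in groups
            if g["min"] <= predicate_value <= g["max"]]

# order_date = '2024-01-15': only row group 2 can match
print(groups_to_read("2024-01-15", row_groups))  # [2]
```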
RocksDB and LSM-Tree Compression
RocksDB (Facebook, 2012) is an embedded key-value store used by MySQL (via MyRocks), Kafka Streams (for state stores), and many others. It uses an LSM (Log-Structured Merge) tree, which has interesting compression characteristics.
RocksDB LSM structure:
Level 0: Small SSTables (sorted string tables)
Written by memtable flush, not sorted between files
Level 1: Larger SSTables, sorted, non-overlapping
Level 2: 10× larger, same rules
...
Compression per level (typical config):
L0, L1: Snappy (fast, writes happen often, frequent compaction)
L2+: Zstd or ZLIB (better ratio, these levels change less often)
Data size at each level:
L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4+: most of the data
Better compression on lower levels = smaller footprint.
The tiered compression strategy makes sense: levels that are frequently compacted (rewritten) benefit from fast compression; levels that are rarely compacted benefit from better ratio.
Summary
- Columnar storage puts values of the same type together, making them dramatically more compressible.
- Dictionary encoding replaces string values with integer codes; enables fast filtering without decoding.
- RLE on sorted columnar data is the most powerful technique — can skip entire operations.
- Bit-packing and delta encoding exploit bounded ranges and sorted numeric columns.
- Parquet stacks columnar encodings + general-purpose compression (GZIP, Snappy, Zstd) per column chunk.
- LZ4 and Snappy prioritize decompression speed for in-memory and real-time use cases.
- Predicate pushdown and statistics-based row group pruning multiply the benefit of columnar formats beyond compression ratio alone.
Next: network compression — where we add the constraint that both ends of the wire must agree on the algorithm, and examine when not to compress at all.
Chapter 12: Network Compression
Network compression has a constraint that no other compression context has: both sender and receiver must agree on the algorithm, negotiate it at connection time, and do it all quickly enough not to increase latency. The compression must be faster than the network would take without it — otherwise you've made things slower.
This creates a different set of tradeoffs. The question isn't just "what compresses best?" but "what compresses fast enough that the savings in transmission time outweigh the cost of compression itself?"
The Break-Even Analysis
For compression to help, the time saved transmitting fewer bytes must exceed the time spent compressing:
Break-even calculation:
Original size: S bytes
Compressed size: C bytes (where C < S)
Network bandwidth: B bytes/second
Compression speed: V bytes/second (throughput of compressor)
Time to send without compression: S / B
Time with compression: S / V + C / B
Compression helps when:
S / V + C / B < S / B
S / V < S/B - C/B = (S-C)/B
S / V < (S-C) / B
B < V × (S-C)/S = V × compression_savings_ratio
Compression always helps if B << V (slow network, fast compressor).
Compression never helps if B >> V (fast network, slow compressor).
For a 1 Gbps LAN with gzip at 25 MB/s: compression speed (200 Mbps) << network speed (1 Gbps). Gzip would hurt. For a 10 Mbps WAN connection with gzip at 25 MB/s (200 Mbps): network is the bottleneck, gzip helps dramatically.
This is why HTTP compression is standard for internet-facing services but rare for internal datacenter traffic (unless the data is very large or the network is saturated).
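The break-even rule is mechanical enough to write down as code. This sketch checks the two scenarios above with consistent units (bytes and bytes/second):

```python
# Break-even: compression helps when S/V + C/B < S/B.
def compression_helps(size, compressed_size, bandwidth, compress_speed):
    time_without = size / bandwidth
    time_with = size / compress_speed + compressed_size / bandwidth
    return time_with < time_without

MB = 1_000_000
# 1 Gbps LAN (~125 MB/s) vs gzip at 25 MB/s, 2.5:1 ratio: hurts
print(compression_helps(100 * MB, 40 * MB, 125 * MB, 25 * MB))   # False
# 10 Mbps WAN (1.25 MB/s), same compressor: helps dramatically
print(compression_helps(100 * MB, 40 * MB, 1.25 * MB, 25 * MB))  # True
```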
HTTP Content Negotiation
HTTP compression is opt-in and negotiated per-request. The client advertises what it supports; the server picks what it sends.
HTTP compression negotiation:
Client request:
GET /api/data HTTP/1.1
Host: example.com
Accept-Encoding: gzip, deflate, br, zstd
Server response:
HTTP/1.1 200 OK
Content-Encoding: br
Content-Type: application/json
Content-Length: 4832
[Brotli-compressed body]
Client: decompress Brotli body, present to application.
Content-Encoding specifies the compression applied to the response body. Multiple encodings can be chained (rarely used):
Content-Encoding: gzip, identity
Accept-Encoding priorities: Clients can give quality factors: Accept-Encoding: gzip;q=1.0, br;q=0.9, identity;q=0.5. In practice, most servers ignore q-factors and just pick the best supported algorithm.
gzip in HTTP
gzip is the universal HTTP compression format — supported since HTTP/1.1 and by essentially every client and server in existence. Nginx and Apache enable it with:
# nginx gzip configuration
gzip on;
gzip_comp_level 6; # Level 1-9; 6 is the sweet spot
gzip_min_length 1000; # Don't compress tiny responses
gzip_types
text/plain
text/css
text/javascript
application/javascript
application/json
application/xml
image/svg+xml;
# Don't compress responses that are already compressed
# (JPEG, PNG, MP4 etc. are not in the list above)
Pitfall: Compressing already-compressed data wastes CPU and sometimes increases size. The types list should include only text-based formats.
Brotli in HTTP
Brotli (br) is supported by all major browsers (since 2016/2017) and produces ~15-25% better compression than gzip on typical web content.
# nginx Brotli configuration (requires ngx_brotli module)
brotli on;
brotli_comp_level 6; # 0-11; 6 is reasonable default
brotli_min_length 1000;
brotli_types text/plain text/css application/javascript application/json;
Brotli is slower to compress than gzip at equivalent levels, making it more suitable for:
- Pre-compressed static assets: compress at build time, serve from cache. file.js.br stored alongside file.js.gz.
- Moderately dynamic content: where the compression overhead is acceptable.
# Pre-compress static assets
for f in dist/*.js dist/*.css; do
  brotli --best -o "${f}.br" "$f"
  gzip --best --keep "$f"   # writes "${f}.gz" alongside the original
done
The server then serves .br or .gz based on Accept-Encoding, without compressing on the fly.
Zstd in HTTP
Zstd HTTP compression is specified in RFC 8878 (2021) and increasingly supported:
- Chrome 118+ (2023)
- Firefox 116+ (2023)
- Safari (as of 2024)
Compression ratio comparison on a typical 100KB JSON API response:
Encoding Size Ratio Encode time
──────────────────────────────────────────────
None 100KB 1.0× —
gzip-6 28KB 3.6× 4ms
br-6 24KB 4.2× 12ms
zstd-6 26KB 3.8× 1ms
For dynamic API responses where compress-on-the-fly matters:
zstd wins: better than gzip, much faster than Brotli.
For pre-compressed static assets:
Brotli wins: compress once offline, better ratio.
Zstd also has a killer feature for HTTP: trained dictionaries. If you pre-train a dictionary on samples of your API responses, Zstd can compress small JSON responses to a fraction of their gzip size.
# Zstd shared compression dictionary for API responses
import json
import zstandard as zstd

# Training (offline, during deployment).
# sample_api_responses: a list of representative response dicts (not shown).
samples = [json.dumps(resp).encode() for resp in sample_api_responses]
dict_data = zstd.train_dictionary(32768, samples)  # 32KB dictionary
with open('api_response.dict', 'wb') as f:
    f.write(dict_data.as_bytes())

# Server-side compression (with dictionary)
cdict = zstd.ZstdCompressionDict(dict_data.as_bytes())
compressor = zstd.ZstdCompressor(dict_data=cdict, level=3)

def compress_api_response(data: dict) -> bytes:
    return compressor.compress(json.dumps(data).encode())

# Client must have the same dictionary (served once, cached).
# This is nonstandard: Zstd dict IDs would need custom negotiation.
# In practice, shared dicts are used for specific known clients (mobile apps).
HPACK: HTTP/2 Header Compression
HTTP headers are surprisingly expensive. A typical HTTP request has 500-800 bytes of headers (cookies, User-Agent, Accept, Authorization, etc.), and for short API responses, headers can be larger than the body.
HTTP/2 introduced HPACK (Header Compression for HTTP/2, RFC 7541) specifically for this problem.
HPACK's Approach
HPACK is not a generic compression algorithm — it's purpose-built for HTTP headers. It uses:
- Static table: 61 predefined header name-value pairs (:method GET, :status 200, content-type application/json, etc.)
- Dynamic table: a FIFO buffer of recently seen header name-value pairs, shared between encoder and decoder
- Huffman coding: A static Huffman table for header values (based on frequency of characters in HTTP headers)
HPACK encoding examples:
:method: GET (static table entry #2):
Representation: 0x82 (1 byte! Full header in 1 byte)
content-type: text/html (static table entry #31 for name only):
Literal header, indexed name, literal value
0x5f 0x1d [huffman-encoded "text/html"]
Authorization: Bearer eyJhbGc... (not in tables):
Literal header, incremental indexing (add to dynamic table)
[huffman-encoded "authorization"] [huffman-encoded "Bearer eyJ..."]
Future requests reuse this from the dynamic table.
The result: many common headers compress to 1-3 bytes. A typical HTTPS API request that would have 800 bytes of headers in HTTP/1.1 might have 100-200 bytes in HTTP/2 after the first request (once the dynamic table is populated).
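The 1-byte case comes from HPACK's prefix-integer encoding (RFC 7541, section 5.1). A sketch of the "indexed header field" representation (pattern bit 1, then a 7-bit index, with continuation bytes for large values; encode_integer/encode_indexed are illustrative helpers):

```python
# HPACK prefix-integer encoding applied to indexed header fields.
def encode_integer(value, prefix_bits, pattern):
    limit = (1 << prefix_bits) - 1
    if value < limit:
        return bytes([pattern | value])       # fits in the prefix
    out = [pattern | limit]                   # prefix saturated
    value -= limit
    while value >= 128:                       # 7-bit continuation bytes
        out.append((value % 128) | 0x80)
        value //= 128
    out.append(value)
    return bytes(out)

def encode_indexed(index):
    return encode_integer(index, 7, 0x80)     # '1' pattern bit + 7-bit index

# ":method: GET" is static table entry 2 -> one byte
print(encode_indexed(2).hex())    # 82
# Large (dynamic table) indices spill into continuation bytes
print(encode_indexed(200).hex())  # ff49
```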
HPACK Security: CRIME and BREACH
HPACK deliberately keeps header compression separate from body compression and avoids a generic adaptive compressor. This design responds to the CRIME attack (2012), in which TLS-level and SPDY header compression let attackers recover session cookies by observing compressed request sizes.
If the compression context is shared between attacker-controlled input and secret data, an attacker can determine the secret by crafting inputs that compress differently based on whether they share patterns with the secret. HPACK addresses this by separating header tables from body compression.
HTTP/3 and QPACK
HTTP/3 runs over QUIC (UDP-based) instead of TCP. HPACK had a limitation: it required headers to be processed in order (head-of-line blocking). QPACK (RFC 9204) fixes this for HTTP/3, allowing headers to be processed out of order while maintaining equivalent compression.
When NOT to Compress in Transit
Already-Compressed Content
Never compress data that's already compressed. The overhead of attempting to compress JPEG, MP4, ZIP, GZIP, or any already-compressed format is pure waste:
Attempting to gzip an already-gzipped response:
Original JSON: 50KB
gzip compressed: 15KB
"Re-gzip" the gzip stream: 15.1KB (slightly larger due to headers!)
CPU cost: 2× (compress + decompress the inner layer)
Configure your server to not compress Content-Type: image/*, video/*, audio/*, .zip, .gz, .br, .zst, .7z, and similar binary formats.
Small Responses
For responses under ~1KB, the overhead of compression headers and minimum compressor block sizes can exceed the savings:
Compression overhead for small responses:
gzip headers + trailer: ~18 bytes
For a 200-byte response:
Compressed: 150 bytes + 18 overhead = 168 bytes
Improvement: 200 → 168 (16% savings, cost: CPU)
For a 50-byte response:
Compressed: 45 bytes + 18 overhead = 63 bytes
LARGER than original! Don't compress.
gzip_min_length 1000 (as in the nginx config above; nginx's built-in default is a much lower 20) is a reasonable threshold.
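The effect is easy to reproduce with the standard library: gzip a short JSON body and a longer, repetitive one and compare:

```python
# Tiny responses expand under gzip (header/trailer overhead dominates);
# larger repetitive responses shrink.
import gzip

tiny = b'{"ok":true,"id":7}'                               # 18 bytes
big = b'{"user_id": 1001, "country": "US", "plan": "pro"},' * 100

tiny_gz = gzip.compress(tiny)
big_gz = gzip.compress(big)

print(len(tiny), "->", len(tiny_gz))   # expands
print(len(big), "->", len(big_gz))     # shrinks dramatically
```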
High-Traffic Real-Time APIs
For APIs handling thousands of requests per second with sub-10ms response time requirements, on-the-fly compression at gzip-6 can add 2-5ms of latency. Options:
- Pre-compress: For cacheable responses, compress once and cache the compressed response.
- Lower compression level: gzip-1 is 3-5× faster than gzip-6 with 10-20% worse ratio.
- Zstd-1: Faster than gzip-1, better ratio than gzip-6.
- No compression: For internal microservice communication on high-speed networks.
TLS and Compression (CRIME)
After the CRIME and BREACH vulnerabilities, compressing HTTPS responses that include both attacker-controlled content and sensitive cookies/tokens requires care. BREACH showed that even response body compression (not just header compression) can leak secrets if:
- The response contains a secret (CSRF token, session ID)
- The response also contains attacker-controlled data (reflected parameter)
- The response is compressed
BREACH mitigations: disable compression on mixed-content responses, or use response padding/masking. For most modern applications that don't reflect user input in responses containing secrets, the risk is low — but understanding it matters for security-sensitive APIs.
Compression in Transit vs. At Rest
A common confusion: content-encoding (transit compression) and storage compression are orthogonal:
Content lifecycle and compression:
Origin server               CDN / Cache                Client
┌────────────┐             ┌────────────┐             ┌──────────────┐
│  file.js   │ ──Brotli──▶ │  file.js   │ ──Brotli──▶ │   file.js    │
│ (at rest:  │ (in transit)│ (stored    │ (in transit)│ (in memory:  │
│   zstd)    │             │  as .br)   │             │ uncompressed)│
└────────────┘             └────────────┘             └──────────────┘
At rest: zstd (storage format, high compression, rarely accessed)
In transit: Brotli (browser understands this encoding)
In memory: uncompressed (browser needs raw JS to parse and execute)
A CDN might:
- Store assets at rest in an efficient format (Zstd, high level)
- Decompress and re-compress as Brotli when serving to browsers that support it
- Decompress and re-compress as gzip when serving to older browsers
Or better: pre-generate .br and .gz versions at build time, store all three, serve the appropriate one based on Accept-Encoding.
Summary
- Compress when network bandwidth << compressor throughput. Don't compress when bandwidth is fast or compressor is slow.
- gzip is universal for HTTP; use level 6 as the default.
- Brotli is 15-25% better than gzip on web content; ideal for pre-compressed static assets.
- Zstd is the modern choice for dynamic API compression — faster than gzip, better than Brotli at equivalent speed.
- HPACK compresses HTTP/2 headers using static+dynamic tables and Huffman coding; common headers drop to 1-3 bytes.
- Never compress already-compressed data types (JPEG, MP4, ZIP, etc.).
- Small responses (<1KB) often expand under compression due to format overhead.
- Understand CRIME/BREACH before enabling compression on responses mixing user input with secrets.
Next: putting it all together with a decision framework for choosing the right algorithm for your context.
Chapter 13: Choosing the Right Algorithm
After twelve chapters of algorithms, the practical question remains: when you sit down with a compression problem, how do you pick? The answer depends on your constraints, and compression problems have many possible constraints: speed, ratio, memory, compatibility, streaming, random access, data type, and more.
This chapter offers a structured decision framework, a set of guiding principles, and a reference matrix you can return to.
The Four Dimensions
Every compression decision lives in a four-dimensional space:
Compression decision dimensions:
┌─────────────────────────────────────────────────────────┐
│ 1. SPEED │
│ Compression speed: how fast can you compress? │
│ Decompression speed: how fast must you decompress? │
│ Are they the same process? (usually not) │
│ │
│ 2. RATIO │
│ How important is file size? │
│ Is there a hard size budget? │
│ What's the cost of additional bytes? (bandwidth $) │
│ │
│ 3. COMPATIBILITY │
│ Who reads the compressed data? │
│ Can you control both compressor and decompressor? │
│ Is this a public format or internal? │
│ │
│ 4. DATA CHARACTERISTICS │
│ Text? Binary? Images? Video? Sensor data? │
│ Already compressed? │
│ Regular structure? Random? │
└─────────────────────────────────────────────────────────┘
The Decision Tree
START HERE: What kind of data?
│
├── MEDIA (images, audio, video)
│ │
│ ├── Images
│ │ ├── Must be lossless → PNG (photos: use WebP lossless)
│ │ ├── Photographs, quality matters → AVIF (WebP lossy fallback)
│ │ ├── Photographs, compatibility → JPEG quality 80-90
│ │ ├── Icons, UI, transparency → PNG or WebP lossless
│ │ └── Already compressed → don't re-compress
│ │
│ ├── Audio
│ │ ├── Lossless archival → FLAC
│ │ ├── Distribution/streaming → Opus (or AAC-LC)
│ │ ├── Voice/communications → Opus
│ │ └── Legacy compatibility → MP3 VBR V2
│ │
│ └── Video
│ ├── New production/streaming → AV1 (H.265 fallback)
│ ├── Broad compatibility required → H.264
│ ├── Archival/lossless → FFV1 or uncompressed
│ └── Screen recording → VP9 or AV1 (high-ratio intra)
│
├── GENERAL PURPOSE (files, archives, backups)
│ │
│ ├── Need maximum compression, time doesn't matter
│ │ └── 7-Zip / xz (LZMA2) with solid archive
│ │
│ ├── Good compression, reasonable speed
│ │ └── Zstd level 6-9, or Brotli level 6 for web content
│ │
│ ├── Fast compression (logs, telemetry, backup streams)
│ │ └── Zstd level 1-3
│ │
│ ├── Legacy tools / universal compatibility
│ │ └── gzip level 6
│ │
│ └── Already compressed content (JPEG, MP4, ZIP, etc.)
│ └── Store, don't compress
│
├── DATABASE / IN-MEMORY
│ │
│ ├── Fastest possible decompression (cache, OLTP)
│ │ └── LZ4 or Snappy
│ │
│ ├── Good compression + fast decompress (OLAP, analytics)
│ │ └── Zstd level 3 + columnar encodings (Parquet)
│ │
│ ├── Maximum compression, read-heavy (cold storage)
│ │ └── GZIP or Zstd level 9+ in Parquet
│ │
│ └── Time-series / sorted numerics
│ └── Delta + bit-packing + Zstd (columnar format)
│
└── NETWORK / HTTP
│
├── Static web assets (compress once, serve many)
│ └── Brotli level 9-11 (+ gzip fallback)
│
├── Dynamic API responses
│ └── Zstd level 3 (or gzip level 6 for compatibility)
│
├── Internal microservice traffic
│ └── Zstd level 1 (if network is bottleneck), else none
│
└── Already-compressed responses (images, video)
└── No Content-Encoding (pass through as-is)
Guiding Principles
1. Measure on Your Actual Data
The most important advice in this chapter. Every benchmark in this book uses synthetic or corpus data. Your data is different. Run actual measurements.
# Quick benchmark: compare Zstd levels on your data
for level in 1 3 6 9 12 19; do
echo -n "Level $level: "
time zstd -$level -q -f your_data.bin -o /tmp/test.zst
ls -lh /tmp/test.zst | awk '{print $5}'
done
# Compare decompression speeds
for f in your_data.gz your_data.br your_data.zst; do
echo -n "$f: "
time zcat $f > /dev/null 2>&1 # adjust command per format
done
# The lzbench tool benchmarks many algorithms at once:
lzbench -t16,16 your_data.bin
lzbench (https://github.com/inikep/lzbench) is the standard tool for this — it benchmarks 70+ compression algorithms and levels on any file you provide.
2. Compress Once, Decompress Many
The compression/decompression speed asymmetry matters enormously for economics. If a file will be downloaded 100,000 times:
Economic calculation for a 100MB file downloaded 100K times:
Option A: gzip-9 (slow compress, good ratio)
Compress time: 200 seconds (one-time cost)
Compressed size: 30MB → save 70MB per download
Total bandwidth saved: 70MB × 100K = 7TB
At $0.09/GB: $630 saved
Option B: gzip-6 (faster compress, slightly worse ratio)
Compress time: 40 seconds (one-time cost)
Compressed size: 32MB → save 68MB per download
Total bandwidth saved: 68MB × 100K = 6.8TB
At $0.09/GB: $612 saved
The 2MB per download difference is worth it for popular content.
For rarely-accessed files: faster compression is better economics.
3. Know When Compatibility Is Non-Negotiable
If you control both the writer and the reader, you can use any format you want. If the reader is a third party (a browser, a legacy system, a standards-mandated format), you're constrained.
Format compatibility grid (a reminder):
Format gzip br zstd lzma zlib
──────────────────────────────────────────────
curl ✓ ✓ ✓ — —
Chrome ✓ ✓ ✓ — —
Safari ✓ ✓ ✓ — —
Python ✓ ✓ ✓ ✓ ✓
Java (JDK) ✓ — — — ✓
Node.js ✓ ✓ ✓ — ✓
Most Unix ✓ — ✓ ✓ —
Windows ✓ — — — —
For files that must be openable by non-technical users without installing software: ZIP or GZIP. For everything else: Zstd.
4. Pre-Compress When You Can
Pre-compressing at high levels (and storing the result) always beats on-the-fly compression:
- Better ratio (more CPU time available)
- No latency penalty when serving
- Consistent performance under load
Static site generators, CDN pipelines, and build systems should always pre-compress at the highest practical level.
5. Don't Compress Already-Compressed Data
This deserves repeating because it's a common mistake. Adding a gzip step after a JPEG, MP4, or ZIP step is pure waste. The result will be approximately the same size, and you've wasted CPU.
Signs you're doing this wrong:
- Your archive contains .jpg, .mp4, or .pdf files and they compress to ~100% of their original size
- Your gzip ratio on "all files" is 1.01:1
6. Test Extreme Cases
Whatever algorithm you pick, test what happens when it gets input that's worst-case for that algorithm:
- Zstd on already-compressed data (should be ~1.0 ratio, not expand)
- RLE on random data (should not expand catastrophically)
- LZMA on a 10GB file (memory usage OK?)
- LZ4 on a 1-byte file (what's the overhead?)
Algorithm Reference Matrix
Algorithm Best for Ratio Speed(c) Speed(d) Memory
────────────────────────────────────────────────────────────────────────────
PLAIN/None Random, already cmp 1.0× ∞ ∞ 0
RLE Highly repetitive 10-100× Very fast Very fast Tiny
Delta Time-series, sorted 2-20× Very fast Very fast Tiny
BWT+MTF Text (bzip2) 3-4× Slow Slow 8MB
LZ4 In-memory, database 2-3× 700 MB/s 4000 MB/s KB
Snappy Database (Hadoop) 1.7-2× 500 MB/s 1800 MB/s KB
Zstd-1 Fast general purpose 3-4× 500 MB/s 1200 MB/s 1MB
Zstd-3 General purpose 3-5× 350 MB/s 1200 MB/s 4MB
Zstd-9 High ratio + speed 4-6× 50 MB/s 1200 MB/s 16MB
Zstd-19 Near-max lossless 4-7× 8 MB/s 1200 MB/s 64MB
gzip-6 Universal compat 2.5-4× 25 MB/s 200 MB/s 256KB
Brotli-6 Web content 3-5× 15 MB/s 200 MB/s 8MB
Brotli-11 Pre-compressed web 3.5-5.5× 1 MB/s 200 MB/s 8MB
xz-6 (LZMA) Archival, distrib. 4-8× 4 MB/s 50 MB/s 97MB
xz-9 (LZMA) Maximum lossless 5-9× 1 MB/s 50 MB/s 683MB
PNG Lossless images 1.5-3× Medium Fast Medium
JPEG-90 Photographs 5-10× Fast Fast Tiny
WebP-lossy Web photographs 8-15× Medium Fast Tiny
AVIF Modern web images 15-30× Very slow Fast Medium
FLAC Lossless audio ~2× Medium Fast Small
MP3-192 Lossy audio 7× Medium Fast Small
Opus-128 Lossy audio (modern) 10× Fast Fast Small
H.264 Video (universal) 100-200× Slow Fast Medium
H.265 Video (high eff.) 150-300× Very slow Fast Medium
AV1 Video (open, modern) 200-400× Very slow Fast Large
Speed(c) and Speed(d) are compression and decompression throughput. Speed values are order-of-magnitude estimates on commodity hardware; actual speeds vary significantly by data type, CPU, and implementation.
The Compression Ratio vs. Speed Tradeoff Curve
One of the most useful mental models: every algorithm lives somewhere on a ratio-vs-speed tradeoff curve. Faster algorithms sacrifice ratio; better-ratio algorithms sacrifice speed.
Compression ratio vs. speed landscape:
Ratio
9× │ ·xz-9
│ ·xz-6
7× │ ·Zstd-19
│ ·Brotli-11
5× │ ·Brotli-6 ·Zstd-9
│ ·gzip-6
3× │ ·Zstd-3 (Modern sweet spot:
│ ·Zstd-1 ·Zstd-3)
2× │·Snappy
│·LZ4
1× │·None
└────────────────────────────────────────── Speed
0.1 MB/s 10 MB/s 100 MB/s 1000 MB/s
(xz-9) (Brotli) (gzip) (LZ4)
The "efficient frontier": no algorithm is simultaneously better
at ratio AND speed than the frontier algorithms.
Choosing an algorithm is choosing where on this curve you live.
Zstd's genius is that it offers points along this curve with a single tool: -1 for LZ4-comparable speed, -19 for xz-comparable ratio, and everything in between with the same binary and API.
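You can watch the curve emerge on your own machine. A sketch using zlib's levels (a stdlib stand-in for Zstd's level range; the payload is synthetic log-like text):

```python
import time
import zlib

# Synthetic log-like payload: repetitive structure, varying values.
data = "".join(
    f'{{"ts":{1700000000 + i},"level":"INFO","msg":"request {i % 50} ok"}}\n'
    for i in range(50_000)
).encode()

def measure(level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds) for one zlib level."""
    start = time.perf_counter()
    out = zlib.compress(data, level)
    return len(data) / len(out), time.perf_counter() - start

for level in (1, 6, 9):
    r, t = measure(level)
    print(f"level {level}: {r:5.1f}x in {t * 1000:6.1f} ms")
```

Higher levels buy ratio with time, never the reverse: that's the curve in miniature. Run the same loop over Zstd's levels 1 through 19 (via the zstandard package) and the spread is far wider.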
A Worked Example: Designing a Log Storage System
Problem: Store application logs. 50GB/day written, kept for 90 days (4.5TB). Queries: recent logs accessed frequently, old logs accessed rarely. Must compress to keep storage costs under $50/month.
Step 1: Characterize the data
Logs are structured text (JSON or key=value), with high repetition of field names and common values. Very compressible.
Step 2: Define the constraints
- Write speed: must handle ~600KB/s sustained write rate
- Read speed: recent logs must be queryable with <100ms latency
- Cost: must fit in ~5TB storage (4.5TB uncompressed → compress to < 1.5TB)
- Compatibility: in-house tools, no compatibility constraint
Step 3: Tiered approach
Log storage tiered compression:
Hot tier (< 24 hours): Zstd level 1
- 600KB/s write → easily within Zstd-1 speed (500+ MB/s)
- Ratio: ~4× → 12.5GB/day stored
- Fast decompression for recent queries
Warm tier (1-7 days): Zstd level 6
- Recompressed during off-peak (batch job)
- Better ratio: ~5× → 10GB/day stored
- Still fast decompress for query latency
Cold tier (> 7 days): Zstd level 19
- Recompressed during off-peak weekly
- Best ratio: ~6× → 8.3GB/day stored
- Slower decompress acceptable for historical queries
Total storage:
Day 1: 12.5GB (hot)
Days 2-7: 10GB/day × 6 = 60GB (warm)
Days 8-90: 8.3GB/day × 83 = 689GB (cold)
Total: ~762GB for 90 days (vs 4.5TB uncompressed = 83% savings)
Cost at $0.02/GB/month: ~$15/month. ✓
Step 4: Validate on real data
Run lzbench on a sample of actual log files. Adjust tier boundaries based on actual ratios and speeds.
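The tier arithmetic from Step 3 is worth scripting, so it can be re-run whenever measured ratios replace the assumed ones (the ratios and tier boundaries below are the assumptions from above):

```python
DAILY_GB = 50.0                # raw log volume per day
PRICE_PER_GB_MONTH = 0.02      # assumed storage price

# (days spent in tier, assumed compression ratio)
tiers = {
    "hot  (Zstd-1) ": (1, 4.0),
    "warm (Zstd-6) ": (6, 5.0),
    "cold (Zstd-19)": (83, 6.0),
}

total_gb = 0.0
for name, (days, r) in tiers.items():
    stored = days * DAILY_GB / r
    total_gb += stored
    print(f"{name}: {stored:6.1f} GB")

savings = 1 - total_gb / (90 * DAILY_GB)
print(f"total: {total_gb:.0f} GB ({savings:.0%} savings), "
      f"~${total_gb * PRICE_PER_GB_MONTH:.2f}/month")
```

Plug in the ratios lzbench reports on your real logs and the script tells you immediately whether the cost target still holds.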
Summary
- The four dimensions of every compression decision: speed, ratio, compatibility, and data characteristics.
- Measure on your actual data. Benchmarks on synthetic data are guides, not guarantees.
- Compress once, decompress many: justify expensive compression for frequently-accessed data.
- Pre-compress when possible: build-time compression beats run-time compression in almost every case.
- Never compress already-compressed formats: waste of CPU, rarely any size savings.
- Zstd is the modern general-purpose default: one tool, configurable from LZ4-speed to xz-ratio.
- Brotli for static web assets, gzip for universal compatibility, LZ4/Snappy for in-memory.
- LZMA/7-Zip when maximum compression ratio is the only goal and time doesn't matter.
- Tiered compression (different algorithms for hot/warm/cold data) is the practical answer for large-scale systems.
The next (and final) chapter is something different: a look at the argument that large language models are, at their core, a very sophisticated form of lossy compression.
Chapter 14: LLMs as Lossy Compressors
This chapter is different. It's speculative, philosophical, and only partially serious. Take it as a frame for thinking about LLMs, not as a rigorous technical argument. The frame is useful even if imperfect.
Having spent thirteen chapters understanding how compressors work — what they keep, what they throw away, how they model data, and what the theoretical limits are — we're in a surprisingly good position to think about large language models. Because the argument that LLMs are compressors is not just a metaphor. It holds up under scrutiny, and it explains several things about LLM behavior that are otherwise puzzling.
The Compressor Framing
Here's the core claim: a language model's training process is an act of lossy compression, squeezing a large corpus of text into a set of floating-point weights.
Let's break this down piece by piece.
Training = Compression
A large language model is trained on something like the entire public internet — trillions of tokens of text. The model itself contains some number of parameters: GPT-3 had 175 billion parameters; Llama 3 405B has 405 billion; the trajectory is toward hundreds of billions, maybe trillions.
The raw training data is vastly larger than the model's weights. The pre-training corpus for a frontier model is on the order of 10–100 trillion tokens; at roughly two bytes per token ID, that's tens of terabytes. The weights are a fixed-size array of floats.
LLM as a lossy compressor:
Training corpus: ~10 trillion tokens × 2 bytes = ~20 TB
GPT-4 weights (estimated ~1.7T params at fp8): ~1.7TB
Llama 3 70B (fp16): ~140GB
Llama 3 405B (fp8): ~400GB
Gemma 2 9B (fp16): ~18GB
GPT-4 estimated compression ratio: ~12:1 minimum
But the ratio understates the compression. The model doesn't just store the text — it captures something more like the structure of the text. It learns that code tends to have matching brackets, that English sentences tend to have subject-verb agreement, that numbers that appear in certain financial contexts tend to follow certain distributions.
This is better than the training data itself for a compressor's purposes — a good model of the structure enables encoding at near-entropy.
Actually, this exact idea has been tested, most rigorously in DeepMind's 2023 paper Language Modeling Is Compression (see Further Reading): use the model's probability estimate for each token to arithmetic-code that token. A model that assigns high probability to the actual next token spends few bits encoding it; a perfect model achieves the Shannon entropy.
The result: large language models, used as arithmetic coders, beat every other general-purpose compressor on text.
LLM-as-compressor results (approximate, on enwik8 — 100MB of Wikipedia):
Compressor Bits per byte
─────────────────────────────────
gzip-6 3.20
bzip2 2.43
xz (LZMA) 2.19
Brotli-11 2.11
PPM 2.04
LSTM language model 1.77 ← early result (2019)
GPT-3 (large) ~1.30 ← estimated
GPT-4 (estimated) ~1.10 ← approaching Shannon limit for English
Shannon limit English ~1.0 ← theoretical minimum
An LLM beats every non-LLM compressor on English text,
by a wide margin, because it models the language so much better.
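The mechanism is easy to sketch. Below, a toy adaptive order-0 byte model stands in for the language model; under an ideal arithmetic coder, a symbol with probability p costs -log2(p) bits, so a better model means fewer bits:

```python
import math

def bits_to_encode(text: bytes) -> float:
    """Total arithmetic-coding cost under an adaptive order-0 byte model."""
    counts = {b: 1 for b in range(256)}    # Laplace-smoothed counts
    total = 256
    bits = 0.0
    for byte in text:
        p = counts[byte] / total           # model's probability for this byte
        bits += -math.log2(p)              # ideal arithmetic-coding cost
        counts[byte] += 1                  # update the model as we go
        total += 1
    return bits

sample = b"the cat sat on the mat. the cat sat on the mat." * 20
bpb = bits_to_encode(sample) / len(sample)
print(f"{bpb:.2f} bits/byte")
```

Swap the order-0 counts for a transformer's conditional next-token distribution and the same loop is, in essence, how the LLM-as-compressor numbers are obtained.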
Inference = Decompression
If training is compression, then inference is decompression — roughly. The model "stored" the patterns from the training data in its weights. When you prompt it, you're asking it to "decompress" the relevant portion: generate text that conforms to those stored patterns.
Classical lossy compression:
Input data Lossy encoder Compressed bits
(photograph) ─────────────────▶ (JPEG bytes)
│
│ Decoder
▼
Approximate reconstruction
(slight artifacts, close to original)
LLM "compression":
Training corpus Gradient descent Model weights
(internet text) ─────────────────▶ (fp16/fp8 params)
│
│ Inference
▼
Approximate "reconstruction"
(statistically similar to training)
This framing explains something curious: why language models produce text that sounds like their training data. They don't retrieve it — they generate text that's statistically consistent with what was stored. The compression preserved the structure, the patterns, the correlations — not the verbatim content.
What Gets Lost
Every lossy compressor decides what to discard. JPEG discards high-frequency spatial information. MP3 discards sounds masked by louder sounds. A video codec discards temporal detail where motion compensation fails.
What does an LLM discard?
Verbatim Text
The model almost certainly cannot reproduce most of the training data verbatim. It stores patterns, not photographic reproductions. Ask an LLM for the 73rd word of a specific Wikipedia article from 2022, and it will either make something up or fail gracefully.
Exceptions: text that appears many times, or that was highly prominent in training, may be "memorized" — stored in the weights in a way that allows near-verbatim reproduction. This is how LLMs can reproduce famous poems, common code snippets, and heavily cited passages.
Precise Numbers
Numbers present a specific challenge. "The speed of light is approximately 300,000 km/s" is a pattern that appears everywhere and is reliably stored. "The GDP of Paraguay in Q3 2021 was $9.73 billion" is a specific fact that may or may not have appeared often enough to be reliably stored.
LLMs frequently hallucinate specific numbers. This is the compressor discarding precision: the "shape" of the fact (GDP of Paraguay is in the range $8–12 billion) is stored; the exact value is not.
Types of information and LLM retention:
Highly retained:
Common patterns of language and grammar
Widely repeated facts (speed of light, capital cities)
Code patterns and API signatures (if in training data)
General reasoning structures
Poorly retained (hallucination risk):
Precise numerical values
Specific dates and version numbers
Exact quotations and attributions
Events after training cutoff
Obscure facts that appeared rarely in training
Provenance and Attribution
JPEG doesn't know which photographer took the photo it's compressing. It just stores the pixels. Similarly, an LLM during training doesn't (in base form) learn who said what — it learns what kinds of things tend to be said. The author, the source, the publication date, the context — these are not inherently preserved.
This is why LLMs confidently produce text that sounds authoritative but isn't: they're generating according to the statistical pattern "authoritative-sounding text about topic X," without any ground truth to constrain them.
The Lossy Compressor and Confabulation
The JPEG analogy is illuminating here. When you compress a JPEG heavily and then ask "what color is this pixel?", the JPEG decoder gives you an answer — but it might be wrong, because heavy quantization changed the color. It doesn't say "I'm not sure"; it says "blue."
LLMs behave similarly. When prompted about something that was poorly preserved in their weights, they don't (by default) say "I'm not sure about this." They generate text that conforms to the pattern "confident answer to a question about X" — because that's what was in the training data. People asking questions and confident people answering them.
This is confabulation — not deliberate lying, but generating text that's statistically plausible without a mechanism for checking whether it's factually correct.
The Model as a "Compressed Civilization"
There's a more optimistic way to view the same facts. The LLM hasn't just compressed text — it's compressed something like human knowledge and reasoning patterns. When you ask an LLM how to debug a TypeScript error, you're accessing the compressed form of thousands of Stack Overflow threads, documentation pages, blog posts, and GitHub issues about TypeScript debugging. The model synthesizes these into a response that no single document contained.
This is the sense in which LLMs are more useful than retrieval systems for many tasks: they've not just indexed the data, they've compressed it into a form where patterns combine. A user who asks "write a Python function that does X" gets something assembled from many training examples, not retrieved from any one of them.
Retrieval vs. Generation:
Retrieval (search engine / RAG):
Query → find relevant documents → return verbatim excerpts
Precise for specific known facts.
Can't synthesize across sources.
Generation (LLM inference / "decompression"):
Query → generate statistically appropriate response
Synthesizes across everything in training.
Hallucinations possible for specific facts.
Better for synthesis, explanation, reasoning.
The tradeoff maps neatly to lossless vs. lossy compression:
Retrieval = lossless (return the original)
Generation = lossy (return a reconstruction, possibly imperfect)
Why "Really Expensive"
If LLMs are lossy compressors, they are spectacularly expensive ones. Let's be specific about the costs.
Training costs: Frontier LLMs cost hundreds of millions to billions of dollars to train, requiring thousands of specialized accelerators (H100 GPUs, Google TPUs) running for months.
Storage costs: Model weights are large — 140GB for Llama 3 70B, estimated ~1TB+ for GPT-4 — requiring specialized storage and memory hierarchies.
Inference costs: Each forward pass requires enormous compute. Generating 1000 tokens from a 70B model requires approximately 70 billion multiply-accumulate operations per token — 70 trillion operations per request. At scale, this is the dominant cost for AI companies.
Memory bandwidth costs: The weights must be read from memory for every inference. For large models, memory bandwidth (how fast you can move data from GPU memory to compute units) is often the bottleneck. This is the "memory wall" problem.
Cost comparison to classical compression:
Decompress a 100MB gzip file:
CPU: 1 second on a single core
Memory: 32KB
Power: ~10 watt-seconds
"Decompress" (inference) from a 70B LLM, 500 tokens output:
Hardware: 4× A100 GPUs (required for model to fit in memory)
Time: ~30 seconds
Memory: 140GB GPU memory
Power: ~50,000 watt-seconds (4 GPUs at ~400W for 30 seconds, about 0.014 kWh)
The cost gap is enormous. But the "compression ratio" is too:
An LLM can synthesize knowledge from 20TB of training data
in response to a 100-token query.
No classical compressor does this.
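The arithmetic behind those figures, as a back-of-envelope script (parameter and token counts are from the text; the GPU wattage and duration are my rough assumptions):

```python
# Back-of-envelope inference cost; every number is a rough assumption.
params = 70e9            # 70B-parameter model
tokens = 1000            # output length
macs = params * tokens   # ~1 multiply-accumulate per parameter per token
print(f"{macs:.0e} MACs per request")

# Energy: assume 4 GPUs drawing ~400 W each over ~30 s of generation.
joules = 4 * 400 * 30
print(f"~{joules // 1000} kJ per request, vs ~0.01 kJ to gunzip 100 MB")
```

Even with generous assumptions about hardware efficiency, the gap to a classical decompressor is five to six orders of magnitude.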
Compression and Intelligence
The compression framing suggests something interesting about what intelligence might be. Marcus Hutter's "AIXI" framework and related theoretical work have long noted a connection between intelligence and compression: an agent that can compress data well implicitly has a model of the world that generated the data.
Hutter went further: he established the Hutter Prize for Lossless Compression of Human Knowledge, a prize for compressing 100MB (expanded to 1GB in 2020) of Wikipedia as small as possible. The premise is that better compression implies better understanding, so the prize winner effectively holds a more concise model of human knowledge than anyone else.
LLMs have effectively won this argument by a wide margin — they're the best known compressors of human-generated text, by virtue of having (arguably) the best model of how humans generate text.
The compression-intelligence connection:
Better world model → Better prediction → Better compression
An agent that understands physics will predict the trajectory
of a ball better than one that doesn't, and therefore can
compress a video of a thrown ball more effectively.
An agent that understands English grammar will predict the
next word in a sentence better than one that doesn't.
The model quality IS the compression quality.
Evaluating a compressor IS (partially) evaluating the model.
What This Frame Gets Wrong
Every analogy breaks down somewhere. The LLM-as-compressor frame has real limits:
Training objectives differ: A lossy image compressor explicitly optimizes for minimum distortion at a given bit rate. An LLM optimizes for next-token prediction. These are related but not identical. An LLM trained to minimize perplexity is not the same as an LLM trained to be a good assistant.
Inference is not simply decoding: JPEG decoding is a deterministic algorithm. LLM inference is sampling from a distribution — the same weights can produce different outputs. This is more like a probabilistic generative model than a traditional decompressor.
LLMs can reason in ways compressors can't: A JPEG decoder cannot modify the compression algorithm based on what it decompresses. An LLM, through chain-of-thought reasoning and tool use, can in some sense "look up" information it's uncertain about. The analogy breaks when you extend LLMs with retrieval, tool use, and agentic capabilities.
Fine-tuning adds information: RLHF (Reinforcement Learning from Human Feedback) and fine-tuning modify the weights after pretraining. This is like post-processing a compressed file — not decompression, but further transformation. The compression analogy is cleanest for base models without fine-tuning.
Practical Implications
Whether or not you find the compression frame intellectually satisfying, it generates useful heuristics:
Use LLMs for synthesis, not lookup: Where the compression shines is combining patterns from many sources. Where it fails is verbatim recall of specific facts.
Retrieval Augmented Generation (RAG): RAG combines an LLM's synthesis capabilities with verbatim retrieval from a document store. This is like having both a compressor (for synthesis) and a lossless store (for facts). The combination is more powerful than either alone.
Hallucinations are a compression artifact, not a bug: They're the model generating statistically-plausible content where specific information was lost in compression. Treating them as bugs to be "fixed" misses the point — the solution is architecture (add retrieval), not just training.
Context is decompression guidance: When you provide context in a prompt ("here is the actual text of the relevant document..."), you're giving the "decompressor" ground truth to work from, reducing the chance it generates a plausible-but-wrong reconstruction.
Model size and data affect what's retained: Larger models (more parameters) have more capacity to retain detail. More training data on a topic increases the "effective compression ratio" for that topic — more training signal means the pattern is more reliably stored.
The Deeper Question
The compression frame raises a question that's genuinely unresolved: is a model that has compressed the world's text into weights understanding the world, or just very efficiently predicting patterns?
For classical compressors, we'd never ask this. A gzip stream doesn't understand the text it compressed. But language models do things that look like reasoning, analogy-making, and generalization — behaviors we don't see from gzip.
One answer: the difference is quantitative, not qualitative. gzip's model of the world (LZ77 with a 32KB window) is just too simple to do anything that looks like reasoning. A transformer with 400 billion parameters and 20 trillion training tokens has a rich enough model that the representations it builds internally support what looks like reasoning.
Another answer: "understanding" requires something that statistical compression doesn't provide — grounding in the world, embodied experience, intentionality. In this view, an LLM is forever just a very sophisticated pattern-matcher, not an understander.
This debate isn't resolved. The compression frame doesn't resolve it either. But it does clarify the terms: whatever LLMs are doing, they're doing it by building an exceptionally good model of the statistical structure of human-generated text. Whether that model captures "meaning" is a philosophical question that compression theory can't answer.
Summary
- Language model training is a form of lossy compression: the training corpus (trillions of tokens) is compressed into model weights (billions of parameters).
- LLMs, used as arithmetic coders, beat every classical lossless compressor on English text — they have a better probability model.
- Inference is approximately decompression: generating text that conforms to the stored statistical patterns.
- What gets lost: verbatim text, precise numbers, provenance, rare facts. Hallucinations are the model generating plausible-but-imprecise reconstructions of poorly-compressed information.
- LLMs are expensive because the "compressor" and "decompressor" are enormous neural networks, not simple algorithms.
- Better compression implies a better model of the data; this connects compression quality to something like intelligence.
- The frame breaks down for fine-tuned models, reasoning capabilities, and tool use — but it's a useful lens for understanding base model behavior and limitations.
The link between compression and understanding runs throughout the history of information theory. Shannon's work was not just about sending bits efficiently — it was about the nature of information itself. The LLM is, in some sense, the culmination of that thread: a machine that has learned to model the information in human communication well enough to generate new instances of it.
Whether that constitutes understanding, or merely very sophisticated pattern-matching, is left as an exercise for the reader.
Further Reading
Foundational Papers
Shannon, C.E. (1948) A Mathematical Theory of Communication Bell System Technical Journal, 27(3), 379–423. The paper that started everything. Defines entropy, channel capacity, and the source coding theorem. Still readable and worth it.
Huffman, D.A. (1952) A Method for the Construction of Minimum-Redundancy Codes Proceedings of the IRE, 40(9), 1098–1101. The original paper. Three pages. Elegant.
Ziv, J. & Lempel, A. (1977) A Universal Algorithm for Sequential Data Compression IEEE Transactions on Information Theory, 23(3), 337–343. LZ77. The sliding window.
Ziv, J. & Lempel, A. (1978) Compression of Individual Sequences via Variable-Rate Coding IEEE Transactions on Information Theory, 24(5), 530–536. LZ78. The explicit dictionary.
Burrows, M. & Wheeler, D.J. (1994) A Block-sorting Lossless Data Compression Algorithm Digital SRC Research Report 124. The BWT. Makes bzip2 work. Clever sorting trick.
Books
Sayood, K. (2017) Introduction to Data Compression, 5th edition. Morgan Kaufmann. The textbook. Comprehensive, mathematically rigorous. Good for going deep on entropy coding.
Salomon, D. & Motta, G. (2010) Handbook of Data Compression, 5th edition. Springer. Reference encyclopedia. Almost everything is in here somewhere.
Blelloch, G. (2001) Introduction to Data Compression Carnegie Mellon University lecture notes. Free, concise, excellent. https://www.cs.cmu.edu/~guyb/realworld/compression.pdf
Nelson, M. & Gailly, J.L. (1995) The Data Compression Book, 2nd edition. M&T Books. Older but has detailed DEFLATE and LZW implementation walkthroughs.
Modern Algorithm Papers
Deutsch, P. (1996) DEFLATE Compressed Data Format Specification (RFC 1951). The standard. Short, precise, implementable from this document alone.
Collet, Y. & Kucherawy, M. (2021) Zstandard Compression and the 'application/zstd' Media Type (RFC 8878). Zstandard specification with good explanation of design decisions.
Alakuijala, J. et al. (2016) Brotli: A General-Purpose Data Compressor Google open-source documentation and original paper.
Duda, J. (2009, 2013) Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding arXiv:1311.2540. The ANS paper. Dense but important.
Image and Media
Wallace, G.K. (1991) The JPEG Still Picture Compression Standard IEEE Transactions on Consumer Electronics. How JPEG actually works, from the people who designed it.
Richardson, I. (2003) H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia Wiley. The most readable treatment of H.264 internals.
Alliance for Open Media AV1 Bitstream & Decoding Process Specification https://aomediacodec.github.io/av1-spec/ The full AV1 spec. Long.
Tools and Benchmarks
lzbench — Benchmark 70+ compressors on your own data https://github.com/inikep/lzbench
Squash compression benchmark — Web-based benchmark with many algorithms https://quixdb.github.io/squash-benchmark/
Silesia corpus — Standard benchmark dataset (various file types) https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
Canterbury corpus — Classic benchmark for text compression https://corpus.canterbury.ac.nz/
Hutter Prize — Compressing 100MB of Wikipedia http://prize.hutter1.net/
LLMs and Compression
Hutter, M. (2005) Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability Springer. The theoretical framework connecting compression and intelligence.
Deletang, G. et al. (2023) Language Modeling Is Compression arXiv:2309.10668. The DeepMind paper that made this argument rigorous: LLMs as compressors and the connection to intelligence.
Ram, O. et al. (2023) In-Context Retrieval-Augmented Language Models arXiv:2302.00083. One of the early RAG papers, addressing the "what gets lost" problem.
Implementation References
zlib source code https://github.com/madler/zlib Jean-loup Gailly and Mark Adler's reference implementation. Readable, annotated, definitive.
zstd source code https://github.com/facebook/zstd Facebook's implementation. Fast, well-commented, with extensive documentation.
The reference bzip2 implementation https://sourceware.org/bzip2/ Small, clear C code. Good way to understand BWT in practice.
FFmpeg — The reference for audio and video codecs https://ffmpeg.org/ Contains implementations of nearly every audio/video codec. Source is large but well-organized.
libvpx / libaom — VP9 and AV1 reference encoders/decoders https://chromium.googlesource.com/webm/libvpx https://aomedia.googlesource.com/aom/
Going Further
Matt Mahoney's data compression FAQ http://mattmahoney.net/dc/dce.html Matt Mahoney runs the Hutter Prize and has an encyclopedic knowledge of data compression. His FAQ is one of the most detailed resources available.
Fabian Giesen's blog https://fgiesen.wordpress.com/ Deep technical posts on entropy coding, ANS, and low-level compression implementation. His ANS posts are the best English-language explanation of the algorithm.
Charles Bloom's blog https://cbloomrants.blogspot.com/ Compression engineering from someone who has built production codecs (Oodle, used in many game engines). Practical, opinionated, excellent.