Chapter 2: Huffman Coding
In 1952, MIT student David Huffman was taking an information theory course and was given an assignment: prove which of two methods for constructing optimal prefix codes was better. He couldn't prove either was optimal — so he invented a third method that was demonstrably optimal, and published it. His professor, Robert Fano, had been working on the problem for years. Huffman's algorithm was simpler and better.
The core idea: give shorter codes to more frequent symbols, longer codes to rarer ones. This is the same intuition behind Morse code (E is ·, Z is −−··). Huffman's contribution was an algorithm that produces provably optimal variable-length prefix codes.
Prefix Codes and the No-Prefix Property
Before the algorithm, let's understand what we're building. A prefix code (also called a prefix-free code) is a set of codewords where no codeword is a prefix of another.
Why does this matter? Because without it, decoding is ambiguous:
Bad code (not prefix-free):
Symbol Code
A 0
B 01
C 10
D 11
Receiving "010":
- Could be A, B (0, 10... wait, is it A then B starting with 0? or 01 then?)
- Ambiguous!
A prefix-free code can be decoded greedily: read bits until you match a codeword, output the symbol, repeat. No backtracking needed.
Good code (prefix-free):
Symbol Code
A 00
B 01
C 10
D 11
Receiving "010011":
- 01 → B
- 00 → A
- 11 → D
- Clear!
Building the Huffman Tree
Huffman codes are built from a binary tree. Leaves are symbols; the path from root to leaf defines the codeword (left branch = 0, right branch = 1).
The algorithm:
- Count the frequency of each symbol.
- Create a leaf node for each symbol, with its frequency as its weight.
- Insert all nodes into a min-priority queue.
- While more than one node remains: a. Extract the two nodes with lowest weight. b. Create a new internal node with these two as children, weight = sum of children's weights. c. Insert the new node back into the queue.
- The remaining node is the root.
Step-by-Step Example
Input: "ABRACADABRA" (11 characters)
Step 1: Count frequencies
Symbol Count Probability
A 5 0.455
B 2 0.182
R 2 0.182
C 1 0.091
D 1 0.091
Step 2: Build initial priority queue (lowest weight first)
[C:1] [D:1] [B:2] [R:2] [A:5]
Step 3: Combine C and D
Extract C:1 and D:1 → create internal node CD:2
CD:2
/ \
C:1 D:1
Queue: [B:2] [R:2] [CD:2] [A:5]
Step 4: Combine B and R
Extract B:2 and R:2 → create internal node BR:4
BR:4
/ \
B:2 R:2
Queue: [CD:2] [A:5] [BR:4]
→ sorted: [CD:2] [BR:4] [A:5]
Step 5: Combine CD and BR
Extract CD:2 and BR:4 → create internal node CDBR:6
CDBR:6
/ \
CD:2 BR:4
/ \ / \
C:1 D:1 B:2 R:2
Queue: [A:5] [CDBR:6]
Step 6: Combine A and CDBR
Extract A:5 and CDBR:6 → create root node ALL:11
ALL:11
/ \
A:5 CDBR:6
/ \
CD:2 BR:4
/ \ / \
C:1 D:1 B:2 R:2
Step 7: Read off codewords (left=0, right=1)
Symbol Path Code Length
A left 0 1
C right,left,left 100 3
D right,left,right 101 3
B right,right,left 110 3
R right,right,right 111 3
Verify the compression:
Original (fixed 3-bit codes, since 5 symbols needs 3 bits):
11 chars × 3 bits = 33 bits
Huffman:
A appears 5 times × 1 bit = 5 bits
B appears 2 times × 3 bits = 6 bits
R appears 2 times × 3 bits = 6 bits
C appears 1 time × 3 bits = 3 bits
D appears 1 time × 3 bits = 3 bits
Total = 23 bits
Savings: 33 - 23 = 10 bits (30% reduction)
How close are we to Shannon's entropy limit?
H = -(5/11)log₂(5/11) - 2×(2/11)log₂(2/11) - 2×(1/11)log₂(1/11)
≈ 0.455×1.14 + 2×0.182×2.46 + 2×0.091×3.46
≈ 0.518 + 0.895 + 0.630
≈ 2.043 bits per symbol
Entropy for 11 symbols: 11 × 2.043 ≈ 22.5 bits
Huffman achieved: 23 bits ✓ (within 2.2% of theoretical minimum)
Python Implementation
Here's a clean, readable implementation of Huffman coding:
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional
@dataclass(order=True)
class HuffNode:
weight: int
symbol: Optional[str] = field(default=None, compare=False)
left: Optional['HuffNode'] = field(default=None, compare=False)
right: Optional['HuffNode'] = field(default=None, compare=False)
def build_huffman_tree(text: str) -> HuffNode:
"""Build a Huffman tree from input text."""
freq = Counter(text)
# Initialize priority queue with leaf nodes
heap = [HuffNode(weight=count, symbol=char)
for char, count in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
internal = HuffNode(
weight=left.weight + right.weight,
left=left,
right=right
)
heapq.heappush(heap, internal)
return heap[0]
def build_codebook(root: HuffNode) -> dict[str, str]:
"""Extract symbol → bitstring mapping from tree."""
codebook = {}
def traverse(node: HuffNode, path: str):
if node.symbol is not None: # leaf node
codebook[node.symbol] = path or '0' # single-symbol edge case
else:
traverse(node.left, path + '0')
traverse(node.right, path + '1')
traverse(root, '')
return codebook
def huffman_encode(text: str) -> tuple[str, dict]:
"""Encode text, return bitstring and codebook."""
if not text:
return '', {}
tree = build_huffman_tree(text)
codebook = build_codebook(tree)
encoded = ''.join(codebook[char] for char in text)
return encoded, codebook
def huffman_decode(bits: str, codebook: dict) -> str:
"""Decode bitstring using codebook."""
# Invert codebook: bitstring → symbol
reverse = {v: k for k, v in codebook.items()}
result = []
current = ''
for bit in bits:
current += bit
if current in reverse:
result.append(reverse[current])
current = ''
return ''.join(result)
# Demo
text = "ABRACADABRA"
encoded, codebook = huffman_encode(text)
print(f"Original: {len(text) * 8} bits (naive 8-bit ASCII)")
print(f"Encoded: {len(encoded)} bits")
print(f"Codebook: {codebook}")
print(f"Decoded: {huffman_decode(encoded, codebook)}")
Output:
Original: 88 bits (naive 8-bit ASCII)
Encoded: 23 bits
Codebook: {'A': '0', 'C': '100', 'D': '101', 'B': '110', 'R': '111'}
Decoded: ABRACADABRA
The Optimality Guarantee
Huffman coding is provably optimal among all prefix codes. The proof is an exchange argument: if you have an optimal code that doesn't satisfy Huffman's tree properties, you can always swap two codewords to make it "more Huffman" without increasing average code length.
Key properties of an optimal Huffman tree:
- More frequent symbols have shorter (or equal) codes.
- The two least frequent symbols have the same length.
- The two least frequent symbols differ only in the last bit.
The algorithm constructs a tree satisfying all three properties by construction.
One caveat: Huffman is optimal among prefix codes, but prefix codes are not always optimal encoders. Arithmetic coding (Chapter 3) can beat Huffman by encoding fractional bits per symbol — relevant when probabilities aren't nice powers of 1/2.
Canonical Huffman Codes
There's a problem with Huffman: the decompressor needs the codebook to decode. Sending the full tree with the compressed data adds overhead, and trees can be large.
Canonical Huffman solves this by defining a standard way to assign codewords given only the code lengths. The decompressor only needs to know the length of each symbol's codeword — not the codeword itself — to reconstruct the same canonical codebook.
The canonical assignment rule:
- Sort symbols by code length, then alphabetically within same length.
- Assign the numerically smallest codeword to the first symbol.
- Each subsequent symbol at the same length gets the next integer.
- When moving to a longer length, left-shift and continue.
From our ABRACADABRA example:
Sorted by length:
A: length 1
B: length 3
C: length 3
D: length 3
R: length 3
Canonical assignment:
A: 0 (0 in binary)
B: 100 (shift 0 to 000, +1 → 001... wait, jump from length 1 to 3)
(last code at length 1 was 0, shift left twice → 000, +1 → 001... no)
Correct algorithm:
Start code = 0, length = 1
A (length 1): code = 0b0 = "0"
Move to length 3: code = (0 + 1) << (3 - 1) = 1 << 2 = 4 = 0b100
B (length 3): code = 4 = "100"
C (length 3): code = 5 = "101"
D (length 3): code = 6 = "110"
R (length 3): code = 7 = "111"
Canonical codebook:
A → "0"
B → "100"
C → "101"
D → "110"
R → "111"
The decompressor only needs to store: the list of code lengths [1, 3, 3, 3, 3] (one per symbol, in sorted order). From this alone, it can reconstruct the entire codebook. This is how DEFLATE (the algorithm inside gzip and PNG) stores its Huffman tables.
def canonical_codebook(lengths: dict[str, int]) -> dict[str, str]:
"""Generate canonical Huffman codes from code lengths."""
# Sort by length, then by symbol
sorted_syms = sorted(lengths.items(), key=lambda x: (x[1], x[0]))
codebook = {}
code = 0
prev_len = 0
for symbol, length in sorted_syms:
if length > prev_len:
code <<= (length - prev_len)
codebook[symbol] = format(code, f'0{length}b')
code += 1
prev_len = length
return codebook
Where Huffman Is Used Today
Huffman coding is not usually used alone — it's a building block inside larger formats:
- DEFLATE (gzip, zlib, PNG, zip): Two Huffman trees, one for literals/lengths, one for distances.
- JPEG: Each coefficient's magnitude is Huffman coded.
- MP3: Scale factors and Huffman tables for spectral data.
- HTTP/2 HPACK: A static Huffman table for header value compression.
Pure Huffman is rare in modern systems because it requires two passes (one to count frequencies, one to encode) and doesn't handle adaptive probability models well. But it remains foundational: almost every format that does entropy coding has Huffman somewhere inside it.
Huffman's Limitations
Two-pass requirement: You need to know all frequencies before building the tree, which means reading the data twice or buffering it all. Adaptive Huffman (not covered here) addresses this but is more complex.
Integer code lengths: Each symbol gets a whole number of bits. If a symbol has probability 0.4, it ideally gets 1.32 bits — but Huffman must assign it 1 or 2. This integer rounding is where arithmetic coding gains its advantage.
Block-based: Huffman builds one tree for the whole block. If the data's statistics change mid-block (common in real files), the single tree is suboptimal for parts of the data.
Despite these limitations, Huffman coding's simplicity, efficiency, and optimality guarantee have kept it relevant for over 70 years. It's one of those rare algorithms that was essentially perfect from day one.
Summary
- Huffman coding assigns shorter binary codewords to more frequent symbols.
- The algorithm builds a binary tree bottom-up using a min-priority queue.
- The resulting prefix-free code is provably optimal among all prefix codes.
- Canonical Huffman reduces decompressor overhead to just storing code lengths.
- Huffman is a building block inside DEFLATE, JPEG, MP3, and dozens of other formats.
Next, we'll look at arithmetic coding — which breaks the "integer bits" barrier and can approach Shannon's limit even more closely.