Further Reading
This book is a field guide. To go deeper, the following resources are the ones we recommend — with brief commentary on what each is best at.
The Unicode Standard itself
The Unicode Standard, Version 16.0. Published by the Unicode Consortium; freely available at unicode.org/versions/Unicode16.0.0/.
The core standard is about 1,100 pages; the full set with annexes is thousands more. You will not sit down and read it cover to cover. But when you have a specific question — "what does UAX #14 say about line-break opportunities around hyphens?" — this is the authoritative source, and it is extremely well-written for a technical standard. Read it like a reference.
The key Unicode Technical Reports:
- UAX #9 — Bidirectional Algorithm: how text with mixed left-to-right and right-to-left scripts is laid out.
- UAX #14 — Line Breaking: where a line can be broken for wrapping.
- UAX #15 — Normalization Forms: NFC, NFD, NFKC, NFKD in precise detail.
- UAX #24 — Script Property: the Script property and its values.
- UAX #29 — Text Segmentation: grapheme clusters, word boundaries, sentence boundaries.
- UAX #31 — Unicode Identifier and Pattern Syntax: identifier rules.
- UAX #38 — Unicode Han Database: CJK-specific properties.
- UAX #44 — Unicode Character Database: the UCD files we discussed in Chapter 13.
- UTS #10 — Unicode Collation Algorithm: sorting rules.
- UTS #39 — Unicode Security Mechanisms: confusables, restriction levels, mixed-script detection.
- UTS #46 — Unicode IDNA Compatibility Processing: the modern IDN algorithm.
- UTS #51 — Unicode Emoji: the emoji-specific rules, including ZWJ sequences and Fitzpatrick modifiers.
The bracketing UAX ("Unicode Annex"), UTS ("Unicode Technical Standard"), and UTR ("Unicode Technical Report") are three different levels of normativity. UAX is part of the standard proper; UTS is an independently-normative standard; UTR is informational. In practice, conforming implementations treat all three as must-follow references.
Introductory essays
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel Spolsky, 2003. joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.
The foundational essay on this topic. Twenty-plus years old and still correct in most of its core claims. Spolsky's "plain text does not exist" framing is the seed that this entire book grew from.
Breaking our Latin-1 assumptions, Manish Goregaokar. A blog series walking through a selection of the hardest assumptions programmers make about text and where they break. A good post-Spolsky read for people who have moved past "encoding matters."
What every programmer absolutely, positively needs to know about encodings and character sets to work with text, David C. Zentgraf, kunststube.net/encoding/. A very practical companion piece to Spolsky.
Programming with Unicode, Victor Stinner. Detailed Python-centric treatment. Exists at unicodebook.readthedocs.io.
The W3C internationalization articles
The W3C's i18n working group publishes a large set of short, focused articles on Unicode as it meets the web. Start at w3.org/International/articles/ and browse. Topics we recommend:
- Character encodings for beginners.
- Setting the HTTP charset parameter.
- Questions and answers about character sets.
- Personal names around the world (a humbling read for anyone who has a
first_name/last_nameschema).
These are short. You can read most of them in under ten minutes each.
Essential libraries
ICU — International Components for Unicode. icu.unicode.org. The industrial-strength Unicode library. C/C++ core with bindings for most languages. If you need serious Unicode, you eventually end up here.
ICU4X. github.com/unicode-org/icu4x. The modern Rust-first rewrite of ICU, designed for smaller footprint and WASM use. Will eventually replace ICU in many contexts. Already usable.
CLDR — Common Locale Data Repository. cldr.unicode.org. Not a library but a data repository: the locale tailorings for collation, date/time formatting, number formatting, plural rules, and so on. Shipped with ICU; also available as raw JSON.
Per language:
- Python:
regex(grapheme + property regex),icu(PyICU),confusables,idna,grapheme. - JavaScript: built-in
Intlis surprisingly capable; fill in gaps withemoji-regex,grapheme-splitter,punycode. - Java: built-in
java.text.*, plus ICU4J for anything serious. - Go:
golang.org/x/textforunicode/norm,collate,language, and others;rivo/unisegfor graphemes. - Rust:
unicode-normalization,unicode-segmentation,unicode-properties,icu(ICU4X),idna,confusable_detection. - Swift: built-in.
- C/C++: ICU.
Data tools
The Unicode Utilities at util.unicode.org/UnicodeJsps/. We mentioned these in Chapter 13; bookmark them.
Glyph browsers:
- graphemica.com: search, properties, look-alikes.
- unicode-table.com: visual browsing by block or script.
- emojipedia.org: the emoji-specific reference.
- compart.com/en/unicode: a searchable character browser with full metadata.
CLI tools:
uni(arp242/uni): fast CLI for character lookup.hexyl: colored hex dumps that show UTF-8 structure.rg(ripgrep) with--pcre2for Unicode-property regex.
History and context
Unicode Explained, Jukka Korpela, O'Reilly, 2006. A dense, thorough guide from one of the most careful Unicode writers. Its core content has aged well, though the emoji chapter is missing (pre-2010).
Fonts & Encodings, Yannis Haralambous, O'Reilly, 2007. The big book on fonts and their relationship with character encodings. More than you need, but if you need it, it's the best thing.
Strange Code: Esoteric Languages That Make Programming Fun Again, Ronald T. Kneusel — includes a chapter on Unicode-based languages like Whitespace, which will make you think about what characters really "are" in a programming language.
The short, pointed reads
If you only ever read three more things after this book, let them be:
- Spolsky's essay, for the vibe.
- UAX #29 (Text Segmentation), for the grapheme cluster rules you will encounter more than any other.
- The Unicode Character Code Charts at unicode.org/charts/, to browse human writing the way you once browsed an atlas.
Closing
Unicode is a living project. It is one of the most successful and least-celebrated engineering achievements of the last fifty years. It turned the world's writing systems into something computers can carry, and it did so in a way that preserved history, handled politics, and survived contact with JavaScript.
If you close this book with the four distinctions — grapheme cluster, code point, code unit, byte — and the habit of asking which one is this function counting? every time you touch a string, you will already write more correct Unicode code than most of your profession.
Thank you for reading.