Introduction: The Book That Would Have Saved You Three Bugs

You have written code that worked fine in testing and broke the moment a user pasted in their name.

You have seen the little box with the question mark — □ — where a letter should have been.

You have written if (s.length > 10) on a string and felt a quiet unease, because you half-remembered that some other language did this differently, and you were not sure which one was right.

You have stared at the bytes C3 A9 in a hex dump and wondered what the é they were supposed to be had to do with C3 or A9.

You have read three separate tutorials on UTF-8 and come away feeling like you understood it right now, in this sentence, but could not have explained it out loud five minutes later.

This book is for you.

What this book is

A patient, honest tour of Unicode — what it is, what it isn't, how it is encoded into bytes, how those bytes travel between programs, how they are stored in files, how they are compared, how they are normalized, how they are sorted, how they can be used against you, and how your programming language treats them.

We will show you actual bytes. We will use real code. We will define each term precisely the first time we use it, and we will stick to those definitions. We will not tell you that Unicode is a mess, because it isn't. It is a careful, thoughtful standard that solved a genuinely hard problem. The confusion that surrounds it is mostly not its fault.

What this book is not

It is not a replacement for the Unicode Standard itself. The standard is roughly 1,100 pages and encyclopedic; this book is a field guide. When you need to know the exact combining class of U+0308, you will still open UnicodeData.txt. But after reading this book, you will know what "combining class of U+0308" means, why you would care, and where that file is.

It is not a rant about JavaScript. (JavaScript does have some distinctive Unicode choices. We will explain them, with sympathy for the historical constraints that produced them.)

It is not a victory lap for languages that "got it right." No language has gotten all of Unicode right, because Unicode is still evolving and some of its answers are not universal.

A small promise about vocabulary

The biggest single reason programmers get Unicode wrong is that four different things are all casually called the character. They are not the same thing. In order of abstraction, from most abstract to most concrete:

  1. A grapheme cluster — what a human reader would call a "character." é, 😀, 👨‍👩‍👧‍👦. A user-perceived unit.
  2. A code point — an integer assigned by the Unicode Standard. U+0041 (A). U+1F600 (😀). An abstract identity.
  3. A code unit — the atomic piece of an encoding. For UTF-8 it is a byte. For UTF-16 it is a 16-bit value.
  4. A byte — eight bits. What your disk actually stores.

We will draw those lines early and keep them drawn. When a language's standard library is counting something and calling it length, the first question is always: length in what unit?

How to read this book

Front-to-back works, and the chapters are ordered so that each one uses vocabulary defined earlier. But the book is designed so you can also open it to a chapter in the middle, get a specific question answered, and close it again. Every chapter cross-references the ones it depends on.

If you remember only one thing from the whole book, let it be this:

There are four different things called character. Every Unicode bug begins with someone mixing two of them up.

Everything else follows from that.

Let's begin with the history, because the most confusing parts of Unicode are the parts it inherited.

A Brief, Honest History

To understand Unicode, you have to understand what came before it, because Unicode's shape is a direct response to the problems that preceded it. This chapter is not nostalgic. It is diagnostic: each old encoding is a wound, and Unicode is the stitches.

ASCII: 128 characters, 7 bits, 1963

ASCII (the American Standard Code for Information Interchange) is a table that maps the integers 0 through 127 to a set of 128 characters: the uppercase and lowercase Latin letters, the digits, a handful of punctuation marks, and a number of control characters like newline (10), tab (9), and carriage return (13). Each integer fits in seven bits. The eighth bit of a byte, on the hardware of the early 1960s, was typically used for parity.

ASCII is small, regular, and entirely adequate for written English, which is what its designers cared about. Here are the printable ASCII letters:

0x41  A        0x61  a
0x42  B        0x62  b
0x43  C        0x63  c
...            ...
0x5A  Z        0x7A  z

The uppercase and lowercase letters differ by exactly one bit (0x20). This is a design choice, not a coincidence; it made case folding cheap in hardware. You will see it pay off later in this book.
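
You can see the trick in any Python REPL: flipping the 0x20 bit toggles case for ASCII letters.

>>> hex(ord("A")), hex(ord("a"))
('0x41', '0x61')
>>> chr(ord("A") | 0x20)    # set the 0x20 bit: uppercase to lowercase
'a'
>>> chr(ord("a") & ~0x20)   # clear it: lowercase to uppercase
'A'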

You already knew ASCII existed. The important thing to notice is what it does not contain. No é. No ñ. No ß. No Cyrillic, no Greek, no Arabic, no Hebrew, no Devanagari, no CJK. ASCII describes English and only English.

The 8-bit wild west

Throughout the 1980s and 1990s, the eighth bit of the byte — no longer needed for parity — became an opportunity. You could assign meanings to the 128 values from 128 to 255 and double the size of your character set. Everyone did.

  • ISO 8859-1, also called Latin-1, filled 128–255 with accented Latin letters for Western European languages: é, ñ, ü, ø, and so on.
  • ISO 8859-2 (Latin-2) did the same for Central and Eastern European languages: č, ł, ő, ř.
  • ISO 8859-5 was Cyrillic.
  • ISO 8859-6 was Arabic.
  • KOI8-R was a different arrangement of Cyrillic, popular in Russia, designed so that if you stripped the high bit you got approximate Latin transliterations of the letters. Д (position 228) stripped to d (position 100).
  • Windows-1252 was Microsoft's extension of Latin-1 that stuffed additional characters (curly quotes, em and en dashes, €, ™) into the 0x80–0x9F range, which ISO 8859-1 had left for control codes. This is the encoding Windows calls ANSI, a name that has nothing to do with ANSI.
  • Shift-JIS (Japanese), Big5 (Traditional Chinese), and GB2312 (Simplified Chinese) were multi-byte encodings — they used either one or two bytes per character, with a lead-byte convention to distinguish. These had to exist because Japanese and Chinese have tens of thousands of characters and cannot possibly fit in 256 slots.

The result, by 1995, was a Babel of character sets: the same bytes meant different things depending on which encoding the software had in mind, and there was no reliable way to find out which one that was.

Example: the byte E9 decoded as Latin-1 is é. Decoded as Windows-1252 it is also é (these two encodings agree on this byte). Decoded as KOI8-R it is И. Decoded as Shift-JIS it is either an error or, depending on context, part of a two-byte sequence. One byte, four encodings, three different readings, no in-band way to tell which.
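
A quick way to experience the ambiguity first-hand is to decode the same byte with Python's legacy codecs (the Shift-JIS error message is abbreviated here):

>>> b"\xe9".decode("latin-1")
'é'
>>> b"\xe9".decode("cp1252")    # Windows-1252
'é'
>>> b"\xe9".decode("koi8-r")
'И'
>>> b"\xe9".decode("shift-jis")
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xe9 ...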

When your data was born inside one encoding and consumed as another, the result had a name: mojibake — text turned to garbled characters. 文字化け. You have seen it.

Unicode's goal

The founders of Unicode looked at this and asked a radical question. What if there were one character set, large enough to contain every writing system ever used by humans, and every symbol anyone ever wanted to standardize — and what if every encoding from then on was merely a way to serialize that single set to bytes?

Then the mojibake problem becomes a transport problem, not a meaning problem. The text "the letter é" would have one globally agreed-upon identity. If different programs serialized it into different bytes, they could still agree on what the underlying text was.

That single character set is what Unicode is. Specifically, Unicode assigns an integer (a code point) to every character it standardizes. Currently, there are 1,114,112 possible code points (the numbers 0 through 1,114,111), of which about 155,000 are assigned as of Unicode 16.0. The rest are reserved for future use.

Code points are written in hexadecimal with a U+ prefix, zero-padded to at least four digits: U+0041 is A, U+00E9 is é, U+1F600 is 😀, U+2603 is ☃.
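
In Python, ord and chr convert between a one-character string and its code point, so the U+ numbers are easy to inspect:

>>> hex(ord("A")), hex(ord("é")), hex(ord("😀"))
('0x41', '0xe9', '0x1f600')
>>> chr(0x2603)
'☃'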

That is the whole big idea. One integer per character. The "character set" problem is solved by fiat: we all use the same set.

The UCS-2 mistake

Unicode was originally designed around a different assumption. In its first version (1991), Unicode believed it could fit every character it would ever need into 65,536 slots — the range of a 16-bit integer. Two bytes per character, always, forever. The encoding that served this assumption was called UCS-2: pack each 16-bit code point into two bytes and you're done. Strings became arrays of 16-bit units and each unit was a character.

This was a mistake, but an honest one: nobody had yet counted how many distinct CJK characters actually existed in real historical use, and the first estimates were wildly low.

By 1996, Unicode had to concede that 65,536 code points were not enough. The character space was extended to 1,114,112 slots. But by then, several large systems had already committed to two bytes per character, always:

  • Windows NT used UCS-2 internally, and its file APIs (CreateFileW, the W family) took 16-bit units.
  • Java, launched in 1995, defined char as a 16-bit unit and String as an array of them.
  • JavaScript, specified in 1997, defined string indexing in terms of 16-bit units.
  • Objective-C's NSString used 16-bit units.

All of these languages and runtimes had to retrofit a way to represent the supplementary code points (the ones above U+FFFF) using pairs of 16-bit units. That retrofit is called surrogate pairs, and it is the reason "😀".length is 2 in JavaScript. We will cover surrogate pairs in detail in Chapter 3.

The lesson is this: UCS-2 is not the same as UTF-16. UCS-2 was the original fixed-width two-bytes-per-code-point encoding, and it cannot represent anything above U+FFFF. UTF-16 is its successor — a variable-width encoding that uses one 16-bit unit for code points in the Basic Multilingual Plane (U+0000–U+FFFF) and a pair of 16-bit units for everything else. Languages that still call their string unit a char (Java, JavaScript, C#) are living in the space where this retrofit happened.

Where we are now

Today:

  • UTF-8 is dominant on the web (around 98% of pages, by most counts), on Linux filesystems, in most network protocols, and in most modern programming languages' default I/O.
  • UTF-16 persists inside Windows, Java, JavaScript, and anywhere else that made the UCS-2 bet in the 1990s. It is still the internal string representation of those systems even when their I/O is UTF-8.
  • UTF-32 exists, is occasionally useful for internal work, and is almost never used for interchange.
  • The legacy encodings (Latin-1, Windows-1252, Shift-JIS, KOI8-R, the whole ISO 8859 series) are a dwindling but persistent minority. They live in legacy databases, old email archives, filesystems with historical data, and a truly surprising number of CSV exports from enterprise software.

Unicode is, at this point, not one of several character sets. It is the character set, and the other things people once called character sets are now best understood as alternative ways of not quite encoding Unicode.

With that in mind, we are ready to draw the distinctions that matter.

Code Points, Code Units, Bytes

This is the most important chapter in the book. It is also the shortest, because the ideas are small; they are only confusing when they are left implicit. Once you name them, Unicode becomes tractable.

Three levels of abstraction

Think of text as sitting on three layers:

┌────────────────────────────────────────────────────┐
│  The abstract character the Unicode Standard       │  <- code point
│  has assigned a number to.                         │
│      U+0041 = A   U+00E9 = é   U+1F600 = 😀        │
└────────────────────────────────────────────────────┘
                        │  (encoded by UTF-8, UTF-16, ...)
                        ▼
┌────────────────────────────────────────────────────┐
│  A sequence of atomic units of some encoding.      │  <- code units
│  UTF-8: 8-bit units.  UTF-16: 16-bit units.        │
└────────────────────────────────────────────────────┘
                        │  (serialized to memory/disk)
                        ▼
┌────────────────────────────────────────────────────┐
│  The bytes on the wire or on disk. Endianness may  │  <- bytes
│  matter for multi-byte code units.                 │
└────────────────────────────────────────────────────┘

You need names for all three of these layers, and you need to not confuse them.

Code point

A code point is an integer assigned by the Unicode Standard to a specific abstract character. There are 1,114,112 possible code points, in the range U+0000 through U+10FFFF. About 155,000 are currently assigned; the rest are reserved.

A code point is written in hexadecimal with a U+ prefix and at least four digits: U+0041, U+00E9, U+1F600. There are no leading zeros beyond the fourth digit, but the four-digit minimum is conventional.

A code point is not a number of bytes. U+1F600 is the integer 128,512. Whether that integer takes 1, 2, 3, or 4 bytes to store depends on the encoding, which we are about to discuss. The code point itself is indifferent to storage.

Code points are conceptually grouped into 17 planes of 65,536 code points each:

  • Plane 0, the Basic Multilingual Plane (BMP): U+0000–U+FFFF. Contains nearly every character used in modern living languages.
  • Plane 1, the Supplementary Multilingual Plane (SMP): U+10000–U+1FFFF. Emoji, historical scripts, musical notation.
  • Plane 2, the Supplementary Ideographic Plane (SIP): U+20000–U+2FFFF. Additional CJK ideographs.
  • Planes 3–13: mostly unassigned.
  • Plane 14: special-purpose characters, including tags and variation selectors.
  • Planes 15–16: private use areas.

Why does the total max out at U+10FFFF and not something neater, like U+FFFFFF? Because of UTF-16. U+10FFFF is the largest code point that UTF-16 can represent with a surrogate pair. Unicode's upper bound is literally set by the encoding capacity of one of its encodings — a retrofit.

Code unit

A code unit is the atomic piece of whatever encoding you are using. It is a fixed-size integer, and the size depends on the encoding:

  • UTF-8 has 8-bit code units (so each code unit is also a byte).
  • UTF-16 has 16-bit code units.
  • UTF-32 has 32-bit code units.

A single code point may be represented by one or more code units, depending on the encoding and the code point. In UTF-8, code points above U+007F take multiple code units. In UTF-16, code points above U+FFFF take two code units. In UTF-32, every code point is exactly one code unit.

When a string API tells you its length, you need to know what unit it is counting. JavaScript counts UTF-16 code units. Python counts code points. Go's len() counts bytes. Rust's str::len() also counts bytes. Swift's String.count counts grapheme clusters (we will get to those in Chapter 4).

All of these are legitimate answers to "how long is this string?" depending on what you mean by "long." None of them is "the number of characters," because character has at least four meanings.

Byte

A byte is eight bits. What ends up in a file or on a network socket. Code units that are one byte wide (UTF-8) are directly bytes. Code units that are wider than a byte (UTF-16, UTF-32) have to be serialized to bytes, which means choosing an endianness — most significant byte first, or least significant byte first. More on that when we cover UTF-16 in Chapter 3.

A worked example: "é😀" in three encodings

Take a two-character string, é😀. In code points:

U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+1F600 GRINNING FACE

Two code points. That's the abstract level; every encoding has to represent these two code points somehow.

UTF-8

UTF-8 is a variable-width encoding with 8-bit code units. Here are the rules, which we will justify in Chapter 3:

  • Code points U+0000–U+007F: 1 code unit. 0xxxxxxx.
  • Code points U+0080–U+07FF: 2 code units. 110xxxxx 10xxxxxx.
  • Code points U+0800–U+FFFF: 3 code units. 1110xxxx 10xxxxxx 10xxxxxx.
  • Code points U+10000–U+10FFFF: 4 code units. 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

U+00E9 = 0xE9 = decimal 233 falls in the second range. The binary is 11101001. Padded to the 2-code-unit pattern:

110xxxxx 10xxxxxx
   00011    101001

which becomes 11000011 10101001 = C3 A9.

U+1F600 = decimal 128512. Binary 0001 1111 0110 0000 0000 (20 bits). Fits the 4-code-unit pattern:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     000   011111   011000   000000

which becomes 11110000 10011111 10011000 10000000 = F0 9F 98 80.

So the string é😀 in UTF-8 is 6 bytes:

C3 A9 F0 9F 98 80
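
You can verify the derivation in Python (bytes.hex takes a separator argument since Python 3.8):

>>> "é😀".encode("utf-8").hex(" ")
'c3 a9 f0 9f 98 80'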

UTF-16

UTF-16 uses 16-bit code units. For code points in the BMP (U+0000–U+FFFF), a single code unit holds the code point directly. For supplementary code points (U+10000–U+10FFFF), two code units called a surrogate pair are used.

U+00E9 is in the BMP, so it becomes one code unit: 0x00E9.

U+1F600 is supplementary. The surrogate pair rule:

  1. Subtract 0x10000: 0x1F600 − 0x10000 = 0x0F600.
  2. Split the resulting 20-bit number into two 10-bit halves: high = 0x03D, low = 0x200.
  3. High surrogate: 0xD800 + 0x03D = 0xD83D.
  4. Low surrogate: 0xDC00 + 0x200 = 0xDE00.

So U+1F600 is the UTF-16 code unit pair D83D DE00.

The string é😀 in UTF-16 is three code units:

00E9 D83D DE00

In bytes, we also have to choose endianness. Big-endian (UTF-16BE):

00 E9 D8 3D DE 00    (6 bytes)

Little-endian (UTF-16LE):

E9 00 3D D8 00 DE    (6 bytes)
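
Python's utf-16-be and utf-16-le codecs (which emit no BOM) confirm both serializations:

>>> "é😀".encode("utf-16-be").hex(" ")
'00 e9 d8 3d de 00'
>>> "é😀".encode("utf-16-le").hex(" ")
'e9 00 3d d8 00 de'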

(It is a coincidence that UTF-8 and UTF-16 land on the same byte count here. In general they do not. ASCII text is much smaller in UTF-8; CJK text is smaller in UTF-16.)

UTF-32

UTF-32 uses one 32-bit code unit per code point. No surrogates, no variable width.

000000E9 0001F600     (2 code units, 8 bytes)

In big-endian bytes:

00 00 00 E9 00 01 F6 00

Most of those bytes are zeroes. That is why UTF-32 is rarely used in transit or on disk: it's fixed-width but space-inefficient.

The same string, five "lengths"

Given é😀, here are five numbers, each of which could be called the length:

What you count                       Value
Code points                          2
UTF-8 code units (= bytes)           6
UTF-16 code units                    3
UTF-32 code units (= bytes / 4)      2
Grapheme clusters                    2

Here is what various languages say:

# Python 3: len() counts code points
>>> len("é😀")
2
>>> len("é😀".encode("utf-8"))
6

// JavaScript: .length counts UTF-16 code units
> "é😀".length
3
> [..."é😀"].length   // iterator splits by code point
2

// Go: len() on a string counts bytes (Go strings are UTF-8)
len("é😀")                             // 6
utf8.RuneCountInString("é😀")          // 2  (code points)

// Rust: str::len() counts bytes
"é😀".len()              // 6
"é😀".chars().count()    // 2  (code points)

// Swift: String.count counts grapheme clusters
"é😀".count              // 2  (grapheme clusters, same as code points here)
"é😀".utf16.count        // 3
"é😀".utf8.count         // 6

None of these languages is wrong. They are each counting a specific, well-defined thing. The confusion comes from calling them all "length."

In Chapter 4 we will look at strings where the code-point count and the grapheme-cluster count disagree. You will like those examples. They are where the bugs live.

What to remember

  • A code point is a number that Unicode assigned to a character. It is encoding-independent.
  • A code unit is the atomic piece of some encoding. Its size depends on the encoding.
  • A byte is what you store. It equals a code unit only in UTF-8.
  • When a function returns the length of a string, find out which of these it is counting.

Next, we'll look at the encodings themselves — how they work, why UTF-8 is the shape it is, and how to read its bytes with your bare eyes.

The Encodings

Three encodings of Unicode matter in practice: UTF-8, UTF-16, and UTF-32. This chapter explains what each is doing mechanically, why it is shaped the way it is, and what pain each one can cause when you meet it in the wild.

UTF-8

UTF-8 is a variable-width encoding with 8-bit code units. It has four critical properties, and if you understand the properties, you understand the encoding.

Property 1: ASCII-compatible

Any ASCII file (every byte < 0x80) is a valid UTF-8 file, and means exactly the same thing in UTF-8 as it did in ASCII. This is enormously important for backward compatibility. Every program that handled ASCII continues to handle the ASCII portions of UTF-8 without modification; it just will not understand the non-ASCII parts.

Property 2: Self-synchronizing

Given an arbitrary byte in a UTF-8 stream, you can tell immediately whether it is a start byte (the first byte of a code point's encoding) or a continuation byte (a non-first byte):

Byte pattern   Meaning
0xxxxxxx       1-byte sequence (ASCII, U+0000–U+007F)
110xxxxx       start of 2-byte sequence
1110xxxx       start of 3-byte sequence
11110xxx       start of 4-byte sequence
10xxxxxx       continuation byte

If you land in the middle of a stream and see 10xxxxxx, you know you are mid-character; back up until you see something that isn't a continuation byte. This is what self-synchronizing means. It is why dropping a byte from a UTF-8 stream corrupts exactly one character, not the entire rest of the stream.

Property 3: Variable-width, but bounded

Code points use 1, 2, 3, or 4 bytes in UTF-8. The number of bytes is determined by the code point's value:

Code point range     Bytes   Pattern
U+0000 – U+007F      1       0xxxxxxx
U+0080 – U+07FF      2       110xxxxx 10xxxxxx
U+0800 – U+FFFF      3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF   4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bits are taken, in order, from the code point's binary representation (most-significant bit first) and packed into the x slots.

Property 4: Lexicographic byte order matches code point order

If you compare two UTF-8-encoded strings byte by byte (as raw bytes, unsigned), the result is the same as comparing them code point by code point. This is a very convenient property: databases, filesystems, and network protocols that sort by bytes end up sorting Unicode strings by code point "for free." (This is not the same as sorting them correctly for humans, which we will cover in Chapter 6. But it is stable and predictable.)
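
A small Python demonstration: sorting by code point and sorting by raw UTF-8 bytes produce the same order.

>>> words = ["é", "A", "z", "😀"]
>>> sorted(words)                                   # code point order
['A', 'z', 'é', '😀']
>>> sorted(words, key=lambda w: w.encode("utf-8"))  # raw byte order
['A', 'z', 'é', '😀']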

Reading a UTF-8 byte sequence with your bare eyes

Here are the bytes of our example string é😀:

C3 A9 F0 9F 98 80
  • C3 = 11000011. Top bits 110 → start of 2-byte sequence. The 5 payload bits are 00011.
  • A9 = 10101001. Top bits 10 → continuation byte. The 6 payload bits are 101001.
  • Combine the payload bits: 00011 101001 = 00011101001 = 0xE9 = U+00E9 = é. ✓
  • F0 = 11110000. Top bits 11110 → start of 4-byte sequence. Payload bits: 000.
  • 9F = 10011111. Continuation. Payload: 011111.
  • 98 = 10011000. Continuation. Payload: 011000.
  • 80 = 10000000. Continuation. Payload: 000000.
  • Combine: 000 011111 011000 000000 = 11111011000000000 = 0x1F600 = U+1F600 = 😀. ✓

You can do this in your head with practice. It will pay off the first time you debug a mojibake problem at the byte level.

Overlong encodings and why they're forbidden

The UTF-8 rules technically allow the same code point to be encoded in more than one way: you could encode U+0041 (A) in 1 byte (0x41) or "pad" it into a 2-byte sequence (0xC1 0x81) or a 3-byte sequence. The multi-byte forms are called overlong encodings, and they are forbidden by the UTF-8 spec.

The reason matters for security: if overlong encodings were legal, a filter that blocked / (0x2F) by looking only for the byte 0x2F could be bypassed by sending the overlong 2-byte encoding 0xC0 0xAF or the 3-byte encoding 0xE0 0x80 0xAF, both of which "mean" / but contain no 0x2F byte. Several real path-traversal vulnerabilities worked this way in the early 2000s, most famously in IIS. Modern UTF-8 decoders reject overlong forms.

Valid code point range limits

A well-formed UTF-8 decoder rejects:

  • Overlong encodings (as above).
  • Bytes 0xC0, 0xC1 (can only appear as overlong starts).
  • Bytes 0xF5–0xFF (would encode code points above U+10FFFF, which don't exist).
  • UTF-16 surrogates (U+D800–U+DFFF) encoded in UTF-8. These code points are reserved for the UTF-16 surrogate mechanism and are not legal stand-alone characters. UTF-8 that contains them is CESU-8 or WTF-8, not proper UTF-8.
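
You can watch a strict decoder enforce these rules; Python's built-in UTF-8 codec rejects both an overlong form and an encoded surrogate:

>>> b"\xc0\xaf".decode("utf-8")          # overlong encoding of '/'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
>>> b"\xed\xa0\xbd".decode("utf-8")      # surrogate U+D83D encoded in UTF-8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte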

UTF-16

UTF-16 is a variable-width encoding with 16-bit code units. It is the encoding that most UCS-2 systems retrofitted to when it became clear that 16 bits were not enough.

BMP code points

For a code point in the range U+0000 – U+FFFF (the Basic Multilingual Plane), UTF-16 uses a single 16-bit code unit whose value equals the code point. This is the part that UCS-2 already did, and it is why UCS-2 and UTF-16 produce the same bytes for BMP text.

There is one wrinkle: the range U+D800 – U+DFFF is reserved in the Unicode standard — no characters are ever assigned there. That range is reserved specifically for UTF-16's surrogate mechanism.

Supplementary code points: surrogate pairs

For a code point in U+10000 – U+10FFFF (supplementary), UTF-16 uses two 16-bit code units called a surrogate pair. The encoding:

  1. Let cp be the code point. Subtract 0x10000 to get a 20-bit value v.
  2. The high surrogate is 0xD800 | (v >> 10). This is in the range 0xD800 – 0xDBFF.
  3. The low surrogate is 0xDC00 | (v & 0x3FF). This is in the range 0xDC00 – 0xDFFF.

Both halves are always 16-bit code units, and the high/low ranges don't overlap, so decoders can always tell which half of a pair they are looking at. A high surrogate without a following low surrogate (or vice versa) is a lone surrogate — malformed UTF-16.

Example: U+1F600 (😀).

cp  = 0x1F600
v   = cp - 0x10000 = 0x0F600
high = 0xD800 | (0x0F600 >> 10) = 0xD800 | 0x03D = 0xD83D
low  = 0xDC00 | (0x0F600 & 0x3FF) = 0xDC00 | 0x200 = 0xDE00

So 😀 in UTF-16 is the pair D83D DE00. You saw this in the previous chapter; now you know where it came from.
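
The whole algorithm fits in a few lines. Here is a minimal Python sketch of both directions (the function names are ours, not a standard API):

def to_surrogate_pair(cp):
    # Split a supplementary code point (U+10000..U+10FFFF)
    # into a UTF-16 high/low surrogate pair.
    v = cp - 0x10000                                  # 20-bit value
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def from_surrogate_pair(high, low):
    # Reassemble the original code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

assert to_surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
assert from_surrogate_pair(0xD83D, 0xDE00) == 0x1F600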

Endianness and UTF-16

Because UTF-16 code units are 16 bits wide, serializing them to bytes requires picking an endianness. There are three related encodings:

  • UTF-16BE — big-endian, most significant byte first. D83D → D8 3D.
  • UTF-16LE — little-endian, least significant byte first. D83D → 3D D8.
  • UTF-16 — the abstract encoding, which in practice is either BE or LE with an optional Byte Order Mark (BOM) at the start to tell you which.

The BOM is the code point U+FEFF (ZERO WIDTH NO-BREAK SPACE), encoded per the chosen endianness. In UTF-16BE the BOM is the bytes FE FF; in UTF-16LE it is FF FE. Some files have a BOM; some don't. Some specs require one; others forbid it. This is an endless source of pain.
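
Python's plain utf-16 codec demonstrates the BOM doing its job: the same character, two byte orders, one decoder.

>>> b"\xfe\xff\x00\xe9".decode("utf-16")   # BOM FE FF: what follows is BE
'é'
>>> b"\xff\xfe\xe9\x00".decode("utf-16")   # BOM FF FE: what follows is LE
'é'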

Lone surrogates, WTF-8, and JavaScript's leaky abstraction

JavaScript strings are sequences of 16-bit code units, not sequences of code points. You can create a JavaScript string containing a lone surrogate — a high surrogate not followed by a low surrogate, or vice versa. That string is not valid UTF-16 in the strict sense, but it is a valid JavaScript string.

const s = "\uD83D";    // a lone high surrogate
s.length;              // 1
s.charCodeAt(0);       // 55357

If you try to JSON-encode this string, you get the escape "\ud83d", which some JSON parsers tolerate and some reject. If you try to encode it as UTF-8, well-behaved encoders will refuse or replace it with U+FFFD (REPLACEMENT CHARACTER). WHATWG specs have a concept called WTF-8, which is UTF-8 that also encodes surrogates — a pragmatic compromise for dealing with JavaScript strings that contain them.

This is the tax you pay for UCS-2's ghost. In Python 3, you can technically build a string containing a lone surrogate, but you cannot encode it to UTF-8; the encoder refuses (surrogateescape is a specific escape hatch that exploits this for filesystem paths). In JavaScript, lone surrogates pass silently through ordinary string operations, and so the language's String type is not quite "a sequence of Unicode code points."
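
Here is the contrast in a Python REPL: the string exists, but the UTF-8 encoder draws the line.

>>> s = "\ud83d"            # a lone high surrogate; Python builds the string
>>> len(s)
1
>>> s.encode("utf-8")       # ...but refuses to serialize it
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed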

UTF-32

UTF-32 uses one 32-bit code unit per code point, directly. No surrogates, no variable width, no rules to remember. The code point 0x1F600 is the code unit 0x0001F600.

This is conceptually the simplest encoding and is occasionally useful in-memory for algorithms that want random access to code points. But:

  • It quadruples the size of ASCII text.
  • It doubles the size of BMP text vs. UTF-16.
  • It is almost never used on the wire or on disk.

In practice, UTF-32 shows up as an internal representation in some C libraries (wchar_t is 32 bits on Linux, 16 on Windows — we'll revisit in Chapter 7), and it is the implicit representation when a language like Python 3 gives you a string whose indexing counts code points. Python 3 CPython actually uses a flexible internal representation (1, 2, or 4 bytes per code unit depending on the string's largest code point), but semantically it behaves like UTF-32.

Byte Order Marks, in detail

The Byte Order Mark (BOM) is the Unicode code point U+FEFF at the start of a file or stream. Its purpose depends on the encoding:

  • In UTF-16 and UTF-32, the BOM genuinely carries information: it tells the reader whether the stream is big- or little-endian.
  • In UTF-8, the BOM carries no byte-order information (UTF-8 has no byte-order choice). It exists only as a signature — a marker that the file is UTF-8. Its bytes are EF BB BF.

The UTF-8 BOM is controversial. Some tools expect it (Windows Notepad used to write it by default); some tools treat it as a literal character at the start of the file, breaking things. Unix tools tend not to expect it, so a UTF-8 BOM at the start of a shell script will cause the shebang line to not be recognized; a UTF-8 BOM at the start of a CSV will show up as a weird ï»¿ before your first column header.

When to use a BOM:

  • UTF-16 and UTF-32 files: yes, if you have any reason to think the consumer might not know the endianness.
  • UTF-8 files: prefer not to. If the file's encoding is documented out-of-band (HTTP Content-Type, filename convention, project convention), don't add a BOM. Add one only if consumers explicitly require it.
  • Never include a BOM in protocols that forbid it — JSON (RFC 8259), for instance, is required to be UTF-8 without a BOM.

When to strip a BOM:

  • When reading input of unknown provenance, be liberal: if you see EF BB BF at the start of a UTF-8 stream, eat it. Python, for example, ships a codec called utf-8-sig that does exactly this.
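
In Python the two codecs look like this:

>>> data = b"\xef\xbb\xbfname,age"
>>> data.decode("utf-8")        # the BOM survives as a U+FEFF character
'\ufeffname,age'
>>> data.decode("utf-8-sig")    # the BOM is eaten
'name,age'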

Choosing an encoding today

For almost every new use case, the answer is UTF-8, for the following reasons:

  1. It is ASCII-compatible. Any ASCII tooling works on the ASCII portion.
  2. It is the default encoding of the web, Unix filesystems, most databases, and most network protocols.
  3. It is endian-free. No BOM needed, no byte-swapping at ingress.
  4. It degrades gracefully: a stray byte corrupts exactly one character, not the rest of the stream.
  5. For text with a lot of ASCII (source code, HTML, JSON), it is the most compact option.

Reasons to use something else:

  • You are working inside a system whose internal string type is UTF-16 (JavaScript, JVM, Windows). Then the trade-off is at the boundary: your storage and I/O should still typically be UTF-8.
  • You are working with almost entirely CJK text, where UTF-16 is slightly more compact than UTF-8.
  • You need fixed-width random access to code points for an algorithm, and you cannot afford to iterate. Then UTF-32 in memory, UTF-8 everywhere else.

With the encodings in hand, we are ready to confront the most commonly misunderstood concept in all of Unicode: what a "character" actually is.

Grapheme Clusters: Why "Character" Is a Trap

When you read the word character, you think of a thing on a page. A letter. A digit. A punctuation mark. One slot in a monospace font. That mental model served you fine in ASCII. In Unicode, it quietly stops working.

This chapter names what you were actually thinking of — a grapheme cluster — and shows you why a grapheme cluster is not the same as a code point, and why mistaking one for the other is the single most common source of Unicode bugs.

What a user means by "character"

The informal definition that matters: a character is whatever a human reader would point to and call a character. It is the unit the cursor moves over when you press the right arrow key. It is what the backspace key deletes.

Unicode has a technical name for that thing: an extended grapheme cluster, defined by Unicode Annex #29 (UAX #29). We will shorten it to grapheme cluster — sometimes just grapheme — and use it precisely from now on.

A grapheme cluster is a sequence of one or more code points that is treated as a single unit by a reader. Most of the time it is exactly one code point. Sometimes it is more.

Combining marks

Consider the letter é — a lowercase Latin e with an acute accent. Unicode can represent this in two different ways:

  • Precomposed: a single code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
  • Decomposed: two code points, U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT).

Both render as é on your screen. Both are legitimate Unicode. They are not the same sequence of code points, but they are the same grapheme cluster: one visual character, one cursor slot, one thing to backspace over.

>>> a = "\u00e9"           # precomposed
>>> b = "\u0065\u0301"     # decomposed
>>> a == b                  # compares code-point sequences
False
>>> len(a), len(b)
(1, 2)
>>> print(a, b)
é é

The code-point counts disagree. Visually they look identical. Neither is wrong.

U+0301 is a combining mark — a code point that, on its own, doesn't render as a standalone character. It attaches to the preceding base character. Unicode contains hundreds of combining marks for accents, tone marks, stacking diacritics, and so on. You can stack them:

e + ́ + ̂ + ̈  →  é̂̈        # e with acute, circumflex, and diaeresis

The grapheme cluster here is still one cluster; it just contains four code points.
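
Counting confirms it, using the third-party regex module (which appears again later in this chapter):

>>> s = "e\u0301\u0302\u0308"      # e + acute + circumflex + diaeresis
>>> len(s)                          # code points
4
>>> import regex
>>> len(regex.findall(r"\X", s))   # grapheme clusters
1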

People use combining marks intentionally to produce creations like "Zalgo text," which stacks many combining marks on a single base character. Zalgo text is mostly silly, but it is a useful reminder that a grapheme cluster can be arbitrarily many code points long.

Why both forms exist

Why not just pick one? Historical reasons, of course. Precomposed forms for common accented letters (é, ü, ñ, and so on) existed in legacy encodings like Latin-1, and Unicode preserved them for round-tripping. Combining marks exist because they generalize: you cannot precompose every letter–accent combination that a linguist might need, but you can combine them.

The result is that for many characters, there are multiple legitimate code-point sequences that render identically. In Chapter 5 we will cover normalization, the official way to turn any of these equivalent sequences into a canonical form, so that string comparison can be made reliable.

Emoji and their modifiers

Modern emoji turn the grapheme-cluster problem from an edge case into a headline feature.

Skin tone modifiers (Fitzpatrick)

Consider the waving hand emoji 👋. On its own, it is U+1F44B — one code point, one grapheme cluster.

Now consider 👋🏽 — a waving hand with medium skin tone. This is two code points:

U+1F44B WAVING HAND SIGN
U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4

But it is one grapheme cluster. Your cursor moves over it as one character. Your backspace deletes both code points at once.

Unicode defines five skin tone modifiers (U+1F3FB through U+1F3FF), corresponding to the Fitzpatrick scale's types 1–2, 3, 4, 5, and 6. A supported base emoji followed by a modifier renders as a single tinted emoji, a single grapheme cluster, two code points, and — in UTF-8 — eight bytes.
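
In Python:

>>> wave = "\U0001F44B\U0001F3FD"   # waving hand + Fitzpatrick type-4
>>> print(wave)
👋🏽
>>> len(wave)                        # code points
2
>>> len(wave.encode("utf-8"))        # UTF-8 bytes
8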

Variation selectors

Some symbols can be rendered as either a "text" glyph (monochrome, aligned to the baseline) or an "emoji" glyph (colored, possibly animated). The default differs by symbol and platform. To force one or the other, Unicode provides two variation selectors:

  • U+FE0E (VARIATION SELECTOR-15) forces text presentation.
  • U+FE0F (VARIATION SELECTOR-16) forces emoji presentation.

Consider the heart symbol:

  • ♥ alone: U+2665. Renders as text or emoji depending on context.
  • ♥︎ = U+2665 U+FE0E. Forced text presentation.
  • ♥️ = U+2665 U+FE0F. Forced emoji presentation.

All three are one grapheme cluster. The second and third are two code points each.

ZWJ sequences

The zero-width joiner (U+200D, or ZWJ) tells a renderer "treat the characters on either side of me as a single ligature if you can." It is the mechanism behind the most elaborate emoji sequences in Unicode.

Consider the family emoji 👨‍👩‍👧‍👦 — man, woman, girl, boy. As a sequence of code points:

U+1F468 MAN
U+200D  ZERO WIDTH JOINER
U+1F469 WOMAN
U+200D  ZERO WIDTH JOINER
U+1F467 GIRL
U+200D  ZERO WIDTH JOINER
U+1F466 BOY

Seven code points. One grapheme cluster. In UTF-8, that is 25 bytes. In UTF-16, 11 code units (four supplementary code points × 2 + three ZWJs × 1). The renderer — if its font supports this particular family configuration — draws a single glyph.

Not all ZWJ sequences have defined glyphs. An unrecognized ZWJ sequence typically falls back to rendering the individual emoji side by side, possibly with the ZWJ as a visible gap. This is why family emoji look different across platforms, and why composing arbitrary ZWJ combinations may or may not produce meaningful pictures.

A worked example

Let us take the string "Hi 👨‍👩‍👧‍👦!" and count it four different ways.

>>> s = "Hi 👨\u200d👩\u200d👧\u200d👦!"
>>> len(s)                              # code points
11
>>> len(s.encode("utf-8"))              # UTF-8 bytes
29
>>> len(s.encode("utf-16-le")) // 2     # UTF-16 code units
15

Grapheme clusters: 5 (H, i, the space, 👨‍👩‍👧‍👦, !). Python's standard library does not count these directly; you need the regex module (not re) or the grapheme package:

>>> import regex
>>> len(regex.findall(r"\X", s))        # \X matches one grapheme cluster
5

Five is the answer a user would give if you asked "how many characters is this?" None of the other counts match it.

How to count grapheme clusters correctly, by language

This is the operation most languages make awkward. Here is where to reach for in each:

  • Python: the third-party regex module, regex.findall(r"\X", s), or the grapheme package.
  • JavaScript: Intl.Segmenter (ECMA-402, available in all modern browsers and Node ≥ 16). [...new Intl.Segmenter().segment(s)].length.
  • Java: java.text.BreakIterator.getCharacterInstance().
  • Go: the github.com/rivo/uniseg package (golang.org/x/text has normalization and collation, but not UAX #29 grapheme segmentation).
  • Rust: the unicode-segmentation crate. s.graphemes(true).count().
  • Swift: s.count. This is the default behavior of String.count — it counts grapheme clusters, not code points. Swift is the one major exception.
  • C/C++: the ICU library. icu::BreakIterator::createCharacterInstance.

If you are writing user-facing code — rendering a text field, truncating a tweet, counting remaining characters in a form — you want grapheme clusters. Counting code points will miscount any string with combining marks, emoji with modifiers, or ZWJ sequences. Counting code units is even more wrong. Counting bytes is most wrong of all.

Grapheme cluster boundaries: the rough rules

UAX #29 defines grapheme cluster breaks by a complex table of code-point categories. At a high level, a break is allowed between two code points unless the code points are in one of several non-breaking combinations:

  • No break after a CR followed by LF. (Treats \r\n as one cluster.)
  • No break between a base character and a combining mark.
  • No break between emoji components connected by ZWJs.
  • No break within a regional indicator pair (flag emoji).
  • No break between an emoji and a Fitzpatrick modifier.
  • No break between Hangul L, V, T, LV, and LVT syllable components.

The rules have exceptions and are revised with each Unicode version; if you need them in production, use an implementation, not your own code. But knowing the shape of the rules lets you predict what will happen to a weird string.

Regional indicators and flags

Flag emoji have a particularly elegant encoding: each country flag is a pair of regional indicator code points (U+1F1E6 through U+1F1FF), where each regional indicator corresponds to a Latin letter A–Z. Two regional indicators form an ISO 3166-1 alpha-2 country code.

🇺 + 🇸  =  🇺🇸          U+1F1FA U+1F1F8 ("US")
🇯 + 🇵  =  🇯🇵          U+1F1EF U+1F1F5 ("JP")
🇬 + 🇧  =  🇬🇧          U+1F1EC U+1F1E7 ("GB")

A pair of regional indicators is one grapheme cluster. An odd number of consecutive regional indicators is ambiguous and typically breaks after every pair.
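
The regex module's \X sees a regional-indicator pair as one cluster:

>>> flag = "\U0001F1FA\U0001F1F8"     # REGIONAL INDICATOR U + S
>>> print(flag)
🇺🇸
>>> len(flag)                          # code points
2
>>> import regex
>>> len(regex.findall(r"\X", flag))   # grapheme clusters
1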

Subnational flags (Scotland, Wales, England, Texas) use a different mechanism involving tag characters — U+E0000 through U+E007F — which we will revisit in Chapter 14. For now, observe the elegance: Unicode did not assign a code point per country flag. It defined a combinatorial rule, and the world's ~250 country flags fall out of it automatically.

The key takeaway

When you write length(s) or s.length or len(s), you have to ask two questions:

  1. What unit is this counting? Code points, code units, bytes, or grapheme clusters?
  2. Is that the unit I actually want?

For user-facing purposes — truncation, cursor movement, character count — the answer is almost always grapheme clusters. For wire-format purposes — buffer size, database column width — the answer is usually bytes (or code units, if your storage is 16-bit). For internal string manipulation where you need to reason about abstract text identity, code points are the right unit.

In the next chapter, we will tackle the follow-up problem: when the same cluster can be written two different ways (precomposed vs. decomposed), how do we reliably compare them?

Normalization

At the end of the previous chapter, we met a problem. The letter é can be encoded as one code point (U+00E9) or two (U+0065 U+0301). Visually identical; byte-sequentially distinct. If you naively compare two strings for equality, you will sometimes say they are different when a human would say they are the same.

Normalization is Unicode's answer: a deterministic procedure that rewrites a string into one of four canonical forms, so that any two strings representing the same text end up with the same bytes.

The four forms

Unicode defines four normalization forms, from UAX #15. Their names are almost self-explanatory once you know the two axes they vary on:

                Composed   Decomposed
Canonical       NFC        NFD
Compatibility   NFKC       NFKD

The two axes:

  • Canonical vs. Compatibility: how aggressively to treat two code points as "the same."
  • Composed vs. Decomposed: whether to prefer single precomposed code points or base+combining-marks sequences.

You will meet all four in your career. Here is when each matters.

Canonical equivalence: NFC and NFD

Two code-point sequences are canonically equivalent if they represent the same abstract character. U+00E9 (precomposed é) and U+0065 U+0301 (e + combining acute) are canonically equivalent by definition.

NFC (Normalization Form C) converts a string to its canonically composed form: wherever a base + combining sequence has a precomposed equivalent, use the precomposed one.

NFD (Normalization Form D) does the opposite: decomposes every precomposed character into its base + combining sequence.

NFC(NFC(s)) = NFC(s); NFD(NFD(s)) = NFD(s); NFC(NFD(s)) = NFC(s). Both forms are idempotent, and they compose predictably.

>>> import unicodedata as ud
>>> a = "café"             # 4 code points, precomposed
>>> b = "cafe\u0301"       # 5 code points, decomposed
>>> a == b
False
>>> ud.normalize("NFC", a) == ud.normalize("NFC", b)
True
>>> ud.normalize("NFD", a) == ud.normalize("NFD", b)
True

You will often see NFC described as "the default" because it is what most text on the Web uses, what most users type, and what most fonts render most efficiently. When in doubt, normalize to NFC.

Compatibility equivalence: NFKC and NFKD

Canonical equivalence preserves meaning exactly. Compatibility equivalence is looser: it will also fold together characters that have the same "underlying identity" but differ in formatting or presentation. For example:

  • ﬁ (U+FB01, LATIN SMALL LIGATURE FI) is compatibility-equivalent to fi (U+0066 U+0069).
  • ² (U+00B2, SUPERSCRIPT TWO) is compatibility-equivalent to 2 (U+0032).
  • ① (U+2460, CIRCLED DIGIT ONE) is compatibility-equivalent to 1.
  • Half-width and full-width Latin letters fold together: ＡＢＣ → ABC.
  • Some Arabic presentation forms fold to their base letters.

NFKC and NFKD apply these folds in addition to the canonical folds of NFC and NFD respectively. The "K" stands for Kompatibility — spelled with a K to distinguish it from C (which already meant Composed).

>>> ud.normalize("NFC", "ﬁ")
'ﬁ'
>>> ud.normalize("NFKC", "ﬁ")
'fi'
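
The other folds from the list above behave the same way:

>>> ud.normalize("NFKC", "²")
'2'
>>> ud.normalize("NFKC", "ＡＢＣ")
'ABC'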

NFKC and NFKD are lossy transformations: the formatting distinctions they erase are not recoverable. You should not normalize your data to NFKC and then store it unless you are sure you never wanted to preserve those distinctions. That said, NFKC is enormously useful for search and for identifier comparison, which we will cover in Chapter 12.

When to normalize

The practical rule: normalize at input boundaries, not at every comparison.

  • When text enters your system — form submission, API payload, database import, file read — decide on a normalization form and apply it once.
  • Store normalized data.
  • Then comparison, searching, and indexing can use fast byte-level operations.

The alternative — storing text unnormalized and normalizing on every comparison — is correct but slow, and invites subtle bugs where some comparison paths normalize and others don't.

Choose NFC for most user text. Choose NFD if you are doing linguistic analysis that cares about individual combining marks. Choose NFKC for case-insensitive searching, for username normalization, and for user-visible identifier comparison. Never choose NFKC if you need to preserve the distinction between, say, ﬁ and fi in stored user content.

Canonical ordering

Here is a detail that bites in practice. Suppose a single base character has multiple combining marks:

a + acute + cedilla       U+0061 U+0301 U+0327
a + cedilla + acute       U+0061 U+0327 U+0301

Both render identically (both say "a with acute and cedilla"). Are they canonically equivalent?

Yes — but only because normalization reorders them. Unicode assigns each combining mark a Combining Class (CCC): a positive integer for non-spacing marks, zero for base characters and spacing marks. Normalization to NFC or NFD reorders combining marks so that marks with a lower combining class come first.

This means that after normalization, there is a single canonical ordering for any stack of combining marks, and byte equality corresponds to meaning equality.
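
You can watch the reordering happen with unicodedata (acute is CCC 230, cedilla is CCC 202, so cedilla moves first):

>>> import unicodedata as ud
>>> ud.combining("\u0301"), ud.combining("\u0327")
(230, 202)
>>> ud.normalize("NFD", "a\u0301\u0327") == "a\u0327\u0301"
True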

Filesystems and normalization: a real-world mess

Filesystems have had to answer the question "are these two filenames the same?" and they have given different answers.

macOS / HFS+ / APFS

Historical HFS+ normalized filenames to a variant of NFD on write. APFS, which replaced HFS+ in 2017, is normalization-insensitive by default: it accepts filenames in any form, stores them as-given, but compares them case- and normalization-insensitively. On macOS, you can create a file named café (NFC) and open it as café (NFD), and the OS will treat them as the same file. The actual bytes stored in the directory depend on the filesystem version and the creator.

Linux / ext4 / btrfs / xfs

Standard Linux filesystems are byte-literal: a filename is a sequence of bytes, end of story. Two filenames that differ only in normalization form are different filenames. You can have café and café in the same directory and the OS is happy about it.

This causes real problems when a macOS user sends a git repo to a Linux user and the two end up with filename variants that conflict.

Windows / NTFS

NTFS preserves the filename's original case and normalization. Comparison is case-insensitive by default (as on Windows), but the comparison is otherwise byte-literal at the normalization level. Two files differing only by normalization form can coexist, though most Windows tools treat this as a surprise.

The portable rule: pick one normalization form for your project (NFC is the best default) and stick to it. If your build system cares about cross-platform filename compatibility, normalize on commit.

The \r\n footnote

One last canonical-equivalence footnote. Normalization does not touch line endings. \r\n, \n, and \r remain distinct after normalization. If you need line endings normalized, do that separately. If you need BOMs normalized (stripped), do that separately too.

Worked example: a search bar that finds café when the user types cafe

This is where the pieces come together.

import unicodedata as ud

def normalize_for_search(s):
    # Decompose, drop combining marks, then lowercase.
    s = ud.normalize("NFD", s)
    s = "".join(ch for ch in s if not ud.combining(ch))
    s = s.casefold()
    return s

normalize_for_search("café")   # 'cafe'
normalize_for_search("CAFÉ")   # 'cafe'
normalize_for_search("cafe")   # 'cafe'

What we did:

  • Decomposed: é → e + acute.
  • Removed all combining marks (ud.combining returns non-zero for combining code points).
  • Case-folded (more on that in Chapter 6; it is not the same as lowercasing).

This is a classic "fuzzy match" for user-facing search. It throws away accents, which is desirable in Chrome's address bar and disastrous for a dictionary that distinguishes resume from résumé. Know which one you want.

In the next chapter, we tackle the sibling topic: once strings are canonically comparable, how do we sort them? The answer is more than you might expect.

Comparison and Collation

There are two kinds of string comparison and they are almost never interchangeable.

  • Byte comparison asks: are these two sequences of bytes identical? It is fast, deterministic, transitive, and useless for most user-facing work.
  • Collation asks: in a human-readable sort order — specifically, the order a fluent reader of this language expects — which string comes first? It is slow, locale-dependent, full of exceptions, and absolutely necessary for user-facing lists.

This chapter covers both, plus case folding, which is subtler than it looks.

Byte comparison and its pitfalls

If you want two strings to be "the same," normalize them (Chapter 5), encode them to the same encoding, and compare the bytes. This answers the question "are these bytes bit-for-bit the same?" Good for primary keys, cache keys, anything mechanical.

It is a bad answer to does the user think these are the same? A user typing "CAFÉ" and searching for "café" expects a hit. A user visiting /User/23 expects it to match /user/23. A user sorting ["Äpfel", "Apfel"] in German will be unhappy with the byte order.

Byte comparison is a useful primitive, but it is not what "compare two strings" usually means in an interface.

Case folding vs. lowercase

You might be about to say, "I will just s.lower() everything and compare." That will mostly work. Here is why it is not quite right.

Lowercase is a transformation that maps a string to its lowercase form, intending to produce text that looks correct when displayed. It is a display operation. It is locale-sensitive. In Turkish, lowercasing I (U+0049) yields ı (U+0131, dotless i), not i. İ (U+0130, capital I with dot above) lowercases to i (U+0069). Turkish distinguishes the dotted and dotless i, and its casing rules differ from English.
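
Python's str.lower() is locale-independent, so it cannot produce the Turkish mapping at all; and even the default mapping of İ holds a surprise: the result is two code points.

>>> "İ".lower()                          # U+0130, capital I with dot above
'i̇'
>>> [hex(ord(c)) for c in "İ".lower()]   # i + combining dot above
['0x69', '0x307']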

Case folding is a transformation that maps a string to a comparison-safe form. Its output may not be "correct lowercase"; it is merely a form where two strings that should compare equal under case-insensitive comparison produce the same bytes. It is locale-insensitive by default (you can ask for a Turkic-specific variant explicitly).

>>> "ß".lower()
'ß'
>>> "ß".casefold()
'ss'

German ß (the eszett) uppercases to SS, but historically there was no uppercase form of ß itself, so the mapping cannot be round-tripped. Case folding turns ß into ss so that "Straße" and "STRASSE" compare equal when you case-fold both. Plain lowercasing doesn't do this.

The rule of thumb: for case-insensitive comparison, use casefold() (Python), toLocaleLowerCase + caution (JavaScript), or a collator with sensitivity: "base" (better). For case-changed display, use lower() / toLowerCase(), and pass a locale if you have one.

Collation

Collation is the process of producing an order over strings that matches human expectations. It depends on a locale: the same sort call, run against the same list, can produce different results for a Swedish user and a German user — and that is correct.

Why sort order differs by locale

In Swedish, the letters å, ä, and ö come at the end of the alphabet, after z. In German, ä, ö, ü are treated as variants of a, o, u and intercalate among them. In Spanish (older conventions), ll and ñ used to be their own letters; modern Spanish treats ll as two characters but ñ is still a separate letter after n.

So the list ["Apfel", "Äpfel", "Zebra"] sorts as:

  • German (phonebook variant): Apfel, Äpfel, Zebra — ä sorts right next to a.
  • Swedish: Apfel, Zebra, Äpfel — ä sorts after z.
  • Byte order (UTF-8 or code point): Apfel, Zebra, Äpfel — coincidentally matches Swedish here, but only because Ä (U+00C4) has a higher code point than Z. It is not a general truth.

You cannot sort a mixed list of names "correctly" in a single global order. You can only sort correctly for a given locale, and the locale has to come from somewhere — user preference, document language, URL parameter.

The Unicode Collation Algorithm

UCA (Unicode Collation Algorithm, UAX #10) is the framework. It assigns each code point a sequence of collation weights, typically at three levels:

  • Primary (letter identity: A vs. B).
  • Secondary (diacritics: A vs. Á).
  • Tertiary (case: A vs. a).

To compare two strings, the algorithm compares their level-1 weights first; if equal, their level-2 weights; if equal, their level-3. This naturally expresses "A and Á are the same letter but the accented one sorts after the unaccented one, and lowercase sorts after uppercase at the last tiebreak."

The default weights come from the Default Unicode Collation Element Table (DUCET). To get a locale-specific order, you overlay that locale's tailoring: a set of changes to the weights that encodes the conventions of that language. The Common Locale Data Repository (CLDR) publishes tailorings for hundreds of locales; every serious Unicode library ships a copy.

You almost never implement UCA yourself. You call:

  • Python: pyuca package or icu.Collator from PyICU.
  • JavaScript: Intl.Collator. Built in.
  • Java: java.text.Collator.
  • Go: golang.org/x/text/collate.
  • Rust: the icu crate.
  • Swift: String.localizedStandardCompare(_:).
  • C/C++: ICU.

Intl.Collator, a practical tour

Intl.Collator is the single most ergonomic UCA implementation on any platform. If you have a browser or Node, you have it.

const en = new Intl.Collator("en");
const de = new Intl.Collator("de");
const sv = new Intl.Collator("sv");

const names = ["Zebra", "Äpfel", "Apfel"];

names.slice().sort(en.compare);   // ["Apfel", "Äpfel", "Zebra"]
names.slice().sort(de.compare);   // ["Apfel", "Äpfel", "Zebra"]  — same in German
names.slice().sort(sv.compare);   // ["Apfel", "Zebra", "Äpfel"]  — Swedish puts ä after z

// Case-insensitive, accent-insensitive:
const base = new Intl.Collator("en", { sensitivity: "base" });
base.compare("Café", "cafe");     // 0 — equal at the primary level

// Natural numeric sort:
const nat = new Intl.Collator("en", { numeric: true });
["file10", "file2", "file1"].sort(nat.compare);  // ["file1", "file2", "file10"]

Memorize this one: for user-facing sorts in JavaScript, Intl.Collator is almost always the right answer. Do not write your own comparator.

Case folding vs. collation level

Earlier we said case folding is for comparison. That is true, and collation gives you a different, finer-grained way to do the same thing: set the sensitivity to base (primary-level only) or accent (primary and secondary). For equality checks — "does the user's input match this list entry, ignoring case and accents?" — a base-sensitivity Intl.Collator.compare(a, b) === 0 is generally the best answer, because it uses the locale's rules rather than a locale-free case-fold.

Sorting a list of user names correctly

Here is the pragmatic recipe.

  1. Determine the locale. Per-user preference if you have it; document language otherwise; fall back to und (undetermined) if nothing.
  2. Use a proper collator (Intl.Collator, java.text.Collator, etc.).
  3. Pass options you care about: numeric sort, case sensitivity, accent sensitivity.
  4. Let the collator sort.
function sortUsers(users, locale = "en") {
  const coll = new Intl.Collator(locale, {
    sensitivity: "base",
    numeric: true,
    usage: "sort",
  });
  return users.slice().sort((a, b) => coll.compare(a.name, b.name));
}

What not to do: compare normalized-lowercased strings with < / >. That will mostly work for ASCII names and fail in interesting ways on the 200th user whose name starts with Ñ.

Stable and deterministic sorts

One last note. UCA's default behavior is that strings with the same primary weight are "equal" at that level; tie-breaking may descend to secondary, tertiary, and sometimes quaternary weights, and after that may fall back to byte order. This means UCA-sorted lists are not quite a strict total order in all cases; different implementations may disagree about exactly how ties break.

Most real sort algorithms today are stable (they preserve the order of equal elements), which is what you want: if two items are equal at the collation level you care about, their original order is preserved. JavaScript's Array.prototype.sort has been required to be stable since ES2019. Python's sorted and list.sort have always been stable. Java's Collections.sort is stable. Go's sort.SliceStable is stable, but sort.Slice is not.
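For example, in Python, whose sorts have always been stable:

words = ["Bart", "bart", "Ada"]
sorted(words, key=str.casefold)   # ['Ada', 'Bart', 'bart']: 'Bart' keeps its place before 'bart'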

With comparison and sorting in hand, we are ready to look at how real languages — each with its own internal string model — actually expose all of this to you.

The Working Programmer's Cheat Sheet, Per Language

Every language picked a model for how its string type works, and those models are not the same. When you move between languages, you are moving between models, and the bug that bites you is usually at that seam. This chapter goes through the main languages you are likely to use, and for each one tells you: what a string is, what length counts, and where to reach when the built-ins aren't enough.

We will use a single reference string throughout: "Hi 👋🏽" (H, i, space, and a waving hand with medium skin tone). That last grapheme cluster is two code points: U+1F44B (waving hand) + U+1F3FD (Fitzpatrick modifier type 4).

  • Grapheme clusters: 4.
  • Code points: 5.
  • UTF-16 code units: 7 (the two supplementary code points are surrogate pairs).
  • UTF-8 bytes: 11 (H=1, i=1, space=1, waving hand=4, modifier=4).

Python 3

String type: str. A sequence of Unicode code points. Since Python 3.3 (PEP 393), CPython uses a flexible internal representation — 1, 2, or 4 bytes per code unit depending on the largest code point in the string — but the semantics are uniform: indexing and len() count code points.

Bytes type: bytes. A sequence of bytes, unrelated to text semantically. str.encode(...) goes from str to bytes; bytes.decode(...) goes the other way. You must always specify an encoding at the boundary.

>>> s = "Hi 👋🏽"
>>> len(s)                 # code points
5
>>> s[3]                    # one code point — the waving hand, without skin tone
'👋'
>>> len(s.encode("utf-8"))
11

What len() counts: code points. Not bytes, not grapheme clusters.

Where the built-ins are enough:

  • str.encode / bytes.decode for encoding conversion.
  • unicodedata stdlib module: normalization, category lookup, combining class.
  • casefold() for case-insensitive comparison.

Where you need a third-party library:

  • Grapheme clustering: the regex module (not re) supports \X, or use the grapheme package.
  • Full Unicode collation: PyICU (icu.Collator) or pyuca.
  • Locale-aware operations: PyICU.

Gotchas:

  • sys.getfilesystemencoding() governs filenames on disk. Linux is "utf-8" on all modern systems; macOS is "utf-8"; Windows is "utf-8" as of Python 3.6+ on Windows 10+, was "mbcs" historically.
  • Python 3 introduced the surrogateescape error handler specifically so that filesystem paths containing bytes that aren't valid UTF-8 can round-trip.
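A minimal demonstration of that round-trip:

raw = b"caf\xe9"                                      # Latin-1 bytes, not valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")     # 'caf\udce9': the bad byte becomes a lone surrogate
s.encode("utf-8", errors="surrogateescape") == raw    # True: the original bytes come back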

JavaScript

String type: a sequence of 16-bit UTF-16 code units. This is not exactly the same as a sequence of code points: supplementary code points appear as pairs of code units. Strings can contain lone surrogates, which are code unit values that don't pair properly — these are not valid Unicode, but they are valid JavaScript strings.

Bytes type: Uint8Array or ArrayBuffer. You go between a string and bytes via TextEncoder / TextDecoder.

const s = "Hi 👋🏽";
s.length;                   // 7 — UTF-16 code units
[...s].length;              // 5 — iterator yields code points; our string has 5 of them

The string iterator yields code points, not grapheme clusters, so [...s].length gives 5 for our example, not 4. To get 4, use Intl.Segmenter:

const seg = new Intl.Segmenter();
[...seg.segment(s)].length;   // 4 — grapheme clusters

What .length counts: UTF-16 code units. This is the infamous gotcha: "😀".length is 2, not 1.

Where the built-ins are enough:

  • String.prototype[Symbol.iterator] for code-point iteration ([...s], for..of).
  • String.prototype.codePointAt(index) for reading a full code point (returns the whole supplementary code point even when index lands on a high surrogate).
  • String.fromCodePoint(cp) for constructing.
  • TextEncoder / TextDecoder for encoding to/from UTF-8 (WHATWG standard).
  • Intl.Collator for locale-aware comparison and sorting.
  • Intl.Segmenter for grapheme, word, and sentence segmentation. (ECMA-402 2022; shipping in all modern browsers and Node ≥ 16.)
  • String.prototype.normalize("NFC" | "NFD" | "NFKC" | "NFKD") for normalization.

Gotchas:

  • "😀".length === 2. If a user sees one character, your tweet-length counter shouldn't say two.
  • "😀"[0] is the unpaired high surrogate, not the emoji. Indexing by UTF-16 is almost never what you want for display.
  • Regular expressions with the u flag iterate code points; without it, they iterate code units and [\uD83D\uDE00] is a character class of two code units, not the emoji.
  • JSON: JSON.stringify("\uD83D") returns '"\\ud83d"', a lone surrogate embedded as an escape. Strict JSON parsers may reject this.

Java

String type: java.lang.String, backed by a char[] where each char is a 16-bit UTF-16 code unit. Since Java 9 (JEP 254), the JVM may store strings whose characters all fit in one byte (Latin-1) in a byte[] internally (compact strings), but the API semantics are unchanged.

Bytes type: byte[]. You go via String::getBytes(Charset) and new String(byte[], Charset).

String s = "Hi 👋🏽";
s.length();                  // 7 — UTF-16 code units
s.codePointCount(0, s.length()); // 5 — code points
s.getBytes(StandardCharsets.UTF_8).length;  // 11 — bytes in UTF-8

What length() counts: UTF-16 code units.

Where the built-ins are enough:

  • String::codePoints() (since Java 8) gives an IntStream of code points.
  • String::getBytes(Charset) for encoding to bytes. Always pass a Charset; the no-arg version uses the platform default, which is a latent bug.
  • java.text.Normalizer for NFC/NFD/NFKC/NFKD.
  • java.text.Collator for locale-aware comparison.
  • java.text.BreakIterator.getCharacterInstance() for grapheme clustering.

Gotchas:

  • s.charAt(i) returns a char, which is a UTF-16 code unit — not a code point, not even a full Unicode character in the surrogate case. Use s.codePointAt(i) if you need a code point.
  • String::equalsIgnoreCase uses the Unicode "simple" folding rules, which is better than toLowerCase().equals(...) but still not the same as a full Unicode case-insensitive collator.
  • Charset.defaultCharset() is UTF-8 from Java 18 onward (JEP 400); before that it depended on the platform.

Go

String type: string, which is an immutable sequence of bytes. Go strings are conventionally UTF-8 — the compiler encodes string literals as UTF-8, and the standard library assumes UTF-8 almost everywhere — but a string can legally contain any bytes.

Rune type: rune is an alias for int32 and is used to represent a single code point. for _, r := range s iterates over code points (runes); invalid UTF-8 bytes yield utf8.RuneError (U+FFFD).

Bytes: []byte. Convertible to/from string directly.

s := "Hi 👋🏽"
len(s)                           // 11 — bytes
utf8.RuneCountInString(s)        // 5 — code points

for i, r := range s {
    fmt.Printf("%d %U %q\n", i, r, r)
}
// 0 U+0048 'H'
// 1 U+0069 'i'
// 2 U+0020 ' '
// 3 U+1F44B '👋'
// 7 U+1F3FD '🏽'

What len() counts: bytes.

Where the built-ins are enough:

  • range s for code-point iteration.
  • unicode/utf8 for counting and validation.
  • strings.ToValidUTF8 for sanitizing.

Where you need golang.org/x/text:

  • Normalization: golang.org/x/text/unicode/norm.
  • Collation: golang.org/x/text/collate.
  • Grapheme clustering: github.com/rivo/uniseg (x/text does not cover grapheme segmentation).
  • Case folding: golang.org/x/text/cases.

Gotchas:

  • s[i] gives a byte, not a rune. Indexing into a UTF-8 string by byte is almost never what you want.
  • len(s) gives bytes.
  • The gap in the iteration above (index 3 to index 7) reflects the 4-byte UTF-8 encoding of U+1F44B.

Rust

String type: String (owned, growable) and &str (borrowed slice). Both are guaranteed valid UTF-8. Invalid UTF-8 cannot exist in a &str — if you need possibly-invalid bytes, use &[u8] or Vec<u8>.

Char type: char is a 32-bit Unicode scalar value (a code point, excluding surrogates). s.chars() iterates over them.

Bytes: &[u8] / Vec<u8>. as_bytes() on &str is free (it is literally a &[u8]).

let s = "Hi 👋🏽";
s.len();                    // 11 — bytes
s.chars().count();          // 5 — code points

for c in s.chars() {
    println!("{:?}", c);
}

What len() counts: bytes.

Where the built-ins are enough:

  • chars() for code-point iteration.
  • char::from_u32, u32::from(char) for conversions.
  • to_lowercase(), to_uppercase() — return iterators because one code point can map to many (e.g., ß uppercases to SS).

Where you need crates:

  • Normalization: unicode-normalization.
  • Grapheme clustering: unicode-segmentation.
  • Full ICU behavior: the icu crate (ICU4X).
  • Case folding: caseless or unicase.

Gotchas:

  • &s[0..1] slices by byte index and panics if the boundary falls in the middle of a code point. This is safer than silent corruption, but it surprises newcomers.
  • Rust cannot build a char from a surrogate code point. This is by design.

Swift

String type: String, whose primary view is over grapheme clusters (Swift calls them Characters — note the capital letter and the distinction from C's char).

Bytes / code units: s.utf8, s.utf16, s.unicodeScalars give views over bytes, UTF-16 code units, and code points respectively.

let s = "Hi 👋🏽"
s.count                      // 4 — grapheme clusters
s.unicodeScalars.count       // 5 — code points
s.utf16.count                // 7 — UTF-16 code units
s.utf8.count                 // 11 — UTF-8 bytes

What .count counts: grapheme clusters. Swift is the one mainstream language whose default "character count" matches what users expect.

The tradeoff: s.count is O(n) — it has to walk the string and apply grapheme-break rules. Byte-length operations in Swift require you to explicitly ask for the utf8 or utf16 view, which is O(1).

Gotchas:

  • Random access by integer index is not directly supported. Use String.Index, which String provides methods for computing.
  • Different views over the same string are not interchangeable.

C and C++

This is the hardest section to write because "C" and "C++" span decades of assumptions and platforms.

C: char * and wchar_t *

A C char * is a pointer to bytes. Whether those bytes represent ASCII, Latin-1, UTF-8, or something else depends on your program's conventions and the current locale. The C standard library's string functions (strlen, strcmp, strchr) work byte-by-byte, which is correct for UTF-8 as byte operations but doesn't give you any Unicode-aware semantics.

wchar_t is an implementation-defined wide-character type. Its width is:

  • 16 bits on Windows (because Windows committed to UCS-2/UTF-16 early).
  • 32 bits on Linux, macOS, and most other Unixes.

This means C code using wchar_t is not portable in the way you might assume: a wchar_t * means UTF-16 code units on Windows and UTF-32 code units on Linux.

C11: char16_t and char32_t

C11 added char16_t and char32_t types, intended for UTF-16 and UTF-32 respectively, to get past the wchar_t portability mess. They are rarely used in practice; most C code that cares about Unicode has by now migrated to UTF-8 in char * and uses ICU for the hard operations.

The pragmatic recipe

For new C/C++ code that needs to deal with text:

  1. Internal encoding: UTF-8 in char * (or std::string, std::u8string in C++20).
  2. Windows interop: convert to UTF-16 at the system call boundary. There is no shortcut; Windows's Unicode APIs take UTF-16.
  3. Heavy lifting: ICU. Not C++'s <locale>, which is a quiet disaster.
  4. String length: either strlen (bytes) or an explicit helper that counts code points / grapheme clusters depending on your need. Never trust that "character" is well-defined in a C context without looking up how it was counted.

C++20 and beyond

C++20 has std::u8string (a std::basic_string<char8_t>) which is a UTF-8 string at the type level, and std::format has some Unicode awareness. These are improvements, but for anything real, still reach for ICU or {fmt}.

A compact matrix

Language   | length counts                          | String type            | Grapheme support
Python 3   | code points                            | str (code points)      | third-party (regex, grapheme)
JavaScript | UTF-16 code units                      | UTF-16                 | built-in (Intl.Segmenter)
Java       | UTF-16 code units                      | UTF-16                 | built-in (BreakIterator)
Go         | bytes                                  | UTF-8 bytes            | third-party (rivo/uniseg)
Rust       | bytes                                  | UTF-8 bytes            | crate (unicode-segmentation)
Swift      | grapheme clusters                      | grapheme clusters      | built-in, default
C wchar_t  | code units (platform-dependent width)  | pointer to wide chars  | ICU

The pattern: modern languages have converged on either UTF-8 (Go, Rust, C++20 u8string) or UTF-16 (Java, JavaScript, C# — mostly UCS-2 inheritance) as the internal representation. Python 3 went its own way (code-point-indexed, variable internal width). Swift went its own way on the other axis (grapheme-cluster-indexed). Both approaches are legitimate; each has its cost model.

Whatever language you are in, start by asking: what unit am I currently looking at? If you can answer that accurately, the rest follows.

Regular Expressions and Unicode

Regular expressions are where Unicode assumptions become visible. A regex of [a-z] looks innocent until a user with an accented name shows up. This chapter walks through what your regex engines actually do with Unicode text, and how to tell them what you mean.

Why [a-z] doesn't match á

A character class like [a-z] is a range over code points (or code units, depending on the engine). The range [a-z] is the code points U+0061 through U+007A. That's it. Those are the lowercase ASCII letters; it says nothing about á (U+00E1), č (U+010D), or ω (U+03C9).

If you write ^[a-zA-Z]+$ as your "name must be alphabetic" check, you are saying "only ASCII English letters are acceptable." That is a specific business decision, not a neutral default. If you didn't mean to make it, you have a Unicode bug.

The fix depends on what you actually meant. If you meant "any letter," the fix is Unicode character properties.

Unicode character properties

Every code point has a set of properties assigned by the Unicode Standard. The most important for regex work are:

  • General Category — a two-letter classification. Ll (lowercase letter), Lu (uppercase letter), Lo (other letter — used for scripts without case, like Chinese), Nd (decimal digit), Pc (connector punctuation), etc. The one-letter prefixes group them: L is all letters, N is all numbers, P is all punctuation, Z is all separators, C is all control/unassigned/private-use.
  • Script — which writing system the code point belongs to. Latin, Cyrillic, Greek, Han, Arabic, Devanagari, etc.
  • Block — where in the Unicode code point space the character sits. Rarely what you want (see below).
  • Derived properties like Alphabetic, White_Space, Emoji.

In Unicode-aware regex, you access properties with \p{…}:

  • \p{L} matches any letter (any L* category).
  • \p{Ll} matches any lowercase letter.
  • \p{N} matches any numeric character.
  • \p{Script=Latin} or \p{Latin} matches any code point in the Latin script.
  • \p{Emoji} matches any emoji (derived property).
  • \P{…} is the negation.

So the "any letter" regex is \p{L}, not [a-zA-Z].

Block vs. Script

A common mistake is to write \p{InGreek} (the block) when you mean \p{Script=Greek}. Blocks are ranges of code points and have no semantic meaning beyond their position: the Greek block (formally "Greek and Coptic") contains Greek letters, some unassigned slots, and some specifically Coptic letters, and does not contain the Greek Extended block. The Script property is what you almost always want.

Worse, in some engines [\u0370-\u03FF] is the Greek block's code point range and looks correct, but it misses the Greek Extended block (U+1F00–U+1FFF, where polytonic Greek lives). \p{Script=Greek} includes both.
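You can watch this happen with Python's third-party regex module (covered in the next section):

import regex

# U+1FB4 is polytonic Greek and lives in the Greek Extended block:
bool(regex.search(r"[\u0370-\u03FF]", "\u1FB4"))    # False: the block range misses it
bool(regex.search(r"\p{Script=Greek}", "\u1FB4"))   # True: the Script property covers it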

Unicode-aware vs. code-unit-aware regex

Most regex engines have a flag or mode that controls how they interpret the pattern against the input.

JavaScript

Two relevant flags: u (Unicode) and v (Unicode sets, ES2024).

"😀".match(/./);       // matches U+D83D (the high surrogate); .length === 1
"😀".match(/./u);      // matches U+1F600 (the whole code point)
"😀".match(/\p{Emoji}/u);  // works with /u

Without the u flag, JavaScript regexes operate on UTF-16 code units. . matches one code unit. [\u{1F600}] is not a code point escape; legacy parsing sees separate characters rather than U+1F600. \p{…} is not recognized.

With the u flag: . matches one code point. [\u{1F600}] is allowed. \p{…} is enabled.

With the v flag (superset of u): string character classes, set operations (intersection, subtraction), and more powerful property escapes. [\p{Script=Greek}--\p{Letter}] (code points in the Greek script that aren't letters).

Python

  • The re module is Unicode-aware by default in Python 3. \w, \d, \s match all Unicode word characters, digits, and whitespace. Use the re.ASCII flag to revert to ASCII-only.
  • re does not support \p{…} or grapheme clusters. For those, use the third-party regex module.
  • regex (not re) supports \p{L}, \X for grapheme clusters, \N{name} for code points by name, possessive quantifiers, and more.
import regex
regex.findall(r"\X", "Hi 👋🏽")     # ['H', 'i', ' ', '👋🏽']
regex.findall(r"\p{L}", "café 1")  # ['c', 'a', 'f', 'é']

Java

java.util.regex.Pattern supports \p{L}, scripts as \p{IsGreek}, and category escapes like \p{Ll}. To enable Unicode-aware \w, \d, \s, use the Pattern.UNICODE_CHARACTER_CLASS flag or the (?U) embedded flag.

Java does not have a grapheme cluster regex construct. Use java.text.BreakIterator.getCharacterInstance().

Go

regexp has limited Unicode property support: \p{L}, \p{Greek}, etc. work. The engine (RE2) is designed for guaranteed linear-time matching and intentionally excludes some features like backreferences. No grapheme cluster escape; reach for rivo/uniseg.

Rust

The regex crate supports Unicode character classes, \p{L}, \p{Script=…}, case-insensitive Unicode matching. No grapheme cluster escape; use the unicode-segmentation crate.

PCRE / PCRE2 / Perl-style

Under the u modifier (or UTF8 / UCP options), PCRE treats the input as UTF-8, . matches one code point, \p{L} is recognized, \X matches a grapheme cluster.

Case-insensitive matching under Unicode

A naive CaseInsensitive regex flag should match k to K. Under Unicode, it should also match:

  • k to K (U+212A, KELVIN SIGN, a Latin K in compatibility form).
  • ß to SS — the German eszett's uppercase is two letters, so ß/i should match ss.
  • i to I to İ (Turkish dotted capital I), depending on locale.

Whether your engine does any of this varies. java.util.regex.Pattern.CASE_INSENSITIVE on its own is ASCII-only; pass UNICODE_CASE as well. JavaScript's i flag under u handles simple folds (ſ to s) but not ß to ss. Python's re with re.IGNORECASE folds under Unicode, but ß to ss is not handled (the engine uses simple case folding).

If you need robust case-insensitive matching, case-fold both the pattern and the input up front, then match with a case-sensitive regex. This works around most of the engine-specific gaps.
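For example, in Python, where str.casefold applies full case folding:

import re

pattern = "straße"
text = "Die STRASSE ist leer."
re.search(re.escape(pattern.casefold()), text.casefold())   # matches: ß folds to ss on both sides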

Grapheme-aware regex

Some engines have a \X escape meaning "one grapheme cluster."

  • regex (Python third-party): yes.
  • PCRE: yes.
  • Perl: yes.
  • Go: no.
  • Java's java.util.regex: no.
  • Rust's regex crate: no.
  • JavaScript: no.

If you need grapheme-aware matching in a language without \X, segment the string into grapheme clusters first (Chapter 4) and then apply per-cluster logic.

Practical recipes

Validating usernames

"Letters, digits, underscores, 3–20 characters."

// Unicode-letter aware:
const valid = /^[\p{L}\p{N}_]{3,20}$/u;
valid.test("user_01");      // true
valid.test("用户_1");         // true
valid.test("user name");    // false (space)

Remember to also apply a confusables check (Chapter 10) if this username is user-visible. Allowing any letter means allowing mixed-script usernames, which is a vector for homograph attacks.

Stripping accents

import unicodedata as ud
s = "Café résumé naïve"
decomposed = ud.normalize("NFD", s)
stripped = "".join(ch for ch in decomposed if not ud.combining(ch))
# 'Cafe resume naive'

Regex isn't the right tool here; normalization is.

Counting words

import regex
text = "Rouge et noir — 三国演义"
regex.findall(r"\p{L}+", text)
# ['Rouge', 'et', 'noir', '三国演义']

\p{L}+ is the closest you can get to a language-neutral word boundary. For real word segmentation — which matters in Chinese, Japanese, and Thai, where there are no spaces between words — you need java.text.BreakIterator.getWordInstance() or ICU's equivalent.
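A sketch of dictionary-based word segmentation with PyICU (assuming the icu package is installed; the exact boundaries depend on ICU's dictionaries):

import icu

text = "今日は良い天気"
bi = icu.BreakIterator.createWordInstance(icu.Locale("ja"))
bi.setText(text)

words, start = [], 0
for end in bi:                    # iterating yields successive boundary offsets
    words.append(text[start:end])
    start = end
# words is roughly ['今日', 'は', '良い', '天気']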

The regex tax

Regular expressions are a terrific pattern-matching tool. They are also one of the places where "it worked on my English test input" most reliably fails in production. Before you ship a regex, always ask:

  • Is . matching a byte, a code unit, or a code point?
  • Are [a-z], [A-Z], \w, \d, \s doing what I want for non-English input?
  • Does case-insensitive matching handle the fold pairs I care about?
  • Am I matching user-visible characters (grapheme clusters) or code points?

If any of those answers is "I don't know," find out before the regex goes near live data.

Next we look at input and output: what actually happens when your bytes move between programs.

Input, Output, and Interchange

Text does not live inside your program. It arrives from somewhere — a file, a socket, a form submission — and it leaves for somewhere. Every boundary between "in-memory strings" and "bytes on the wire" is a chance to decode or encode incorrectly, and most real-world Unicode bugs happen at these boundaries. This chapter maps the boundaries and the conventions that govern them.

The single most important principle: encoding is a transport concern, not an identity concern. At any boundary, you need to know which encoding is in use. The encoding is metadata, not data — it is information about the bytes, and it must be carried alongside the bytes (or established by convention between sender and receiver).

HTTP

HTTP carries encoding information in the Content-Type header for text bodies:

Content-Type: text/html; charset=utf-8
Content-Type: application/json
Content-Type: text/plain; charset=iso-8859-1
  • text/* types may specify a charset parameter. If absent, the historical default (RFC 2616) was ISO-8859-1, but modern practice and most tools default to UTF-8 or infer from the body.
  • application/json has a special rule: RFC 8259 requires JSON to be UTF-8 on the wire, and a charset parameter is not standard. Don't add one; some clients will reject it.
  • application/xml and text/xml are ambiguous; the XML declaration inside the body is more authoritative than the Content-Type.

For requests (forms, uploads):

  • application/x-www-form-urlencoded: form fields are percent-encoded. The encoding of the underlying bytes before percent-encoding is implicit; browsers use UTF-8 for forms on UTF-8 pages.
  • multipart/form-data: each part can have its own Content-Type with a charset parameter, but most senders don't set it.

The lesson: for APIs you design, document the encoding, use UTF-8, and never rely on inference.

HTML

HTML's encoding rules are layered, and browsers consult them in a specific order:

  1. The byte-order-mark, if present (UTF-8 BOM EF BB BF, UTF-16 BOM FE FF or FF FE).
  2. The HTTP Content-Type header's charset parameter.
  3. The <meta charset="utf-8"> tag (or old-style <meta http-equiv="Content-Type" content="text/html; charset=utf-8">), if it appears in the first 1024 bytes.
  4. Browser's character-set-sniffing heuristics.
  5. The user's locale default.

A well-formed HTML file in 2026 has:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  ...
</head>

The <meta charset> tag must appear early enough that the encoding is known before any non-ASCII text is parsed. Browsers spec-mandate that it be in the first 1024 bytes.

XML

XML's encoding is declared in the XML declaration at the top of the file:

<?xml version="1.0" encoding="utf-8"?>

The declaration is ASCII-only (required to be, so that any ASCII-compatible encoding will allow the declaration itself to be parsed). If no declaration is present, the spec says:

  • If there is a BOM, use the corresponding encoding.
  • Otherwise assume UTF-8.

XML processors are required to support at least UTF-8 and UTF-16.

JSON

JSON is required to be Unicode. RFC 8259 (the current JSON spec) says JSON must be UTF-8 when exchanged between systems. The older RFC 7159 allowed UTF-8, UTF-16, or UTF-32; RFC 8259 tightened to UTF-8 only.

Inside a JSON string, non-ASCII characters can be escaped as \uXXXX (four hex digits = one UTF-16 code unit), or written literally. Supplementary code points must be written as surrogate pairs in escape form:

{"face": "\uD83D\uDE00"}      // valid, 😀 encoded as surrogate pair
{"face": "😀"}                 // also valid, literal emoji in UTF-8

This surrogate-pair escape is a leaky abstraction from JavaScript — the only reason JSON expresses supplementary code points this way is that its lineage runs through JavaScript, which expresses strings in UTF-16. A strict JSON producer can choose which style to use; most modern producers write the literal UTF-8 bytes and only escape what must be escaped (quote, backslash, control characters).
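Python's json module can produce either style:

import json

json.dumps("😀")                       # '"\\ud83d\\ude00"': escaped surrogate pair (ensure_ascii default)
json.dumps("😀", ensure_ascii=False)   # '"😀"': literal UTF-8, only mandatory escapes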

A subtle pitfall: lone surrogates. JSON does not forbid them at the grammar level, so a JavaScript string containing a lone surrogate can be round-tripped through JSON.stringify and JSON.parse. But that JSON is not valid UTF-8 when the lone surrogate is written literally, and many parsers will reject escape-form lone surrogates too. This is one of the real sources of interoperability failure.

File I/O

The cardinal sin of open() without an encoding

# WRONG in the general case:
with open("data.txt") as f:
    text = f.read()

In Python 3, open() in text mode uses a default encoding, which depends on the platform and environment:

  • Since Python 3.15 (and opt-in since 3.7 via PYTHONUTF8=1 or -X utf8), the default is UTF-8 on all platforms.
  • Before that, the default was locale.getpreferredencoding(), which could be cp1252 on Windows, UTF-8 on Linux, and so on. Same code, different platforms, different results.

Always specify:

with open("data.txt", encoding="utf-8") as f:
    text = f.read()

And if you're reading bytes of uncertain provenance (a UTF-8 file that may have a BOM), use encoding="utf-8-sig", which strips a leading BOM if present.
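A quick demonstration of the difference:

data = b"\xef\xbb\xbfhello"    # UTF-8 BOM, then ASCII
data.decode("utf-8")           # '\ufeffhello': the BOM survives as U+FEFF
data.decode("utf-8-sig")       # 'hello': the BOM is stripped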

Binary mode

Bytes are unambiguous. open("data.bin", "rb") returns bytes, no encoding involved. If you are dealing with structured binary or with data of unknown encoding, open in binary mode and decode explicitly.

Line endings

Every platform has made its own peace with line endings: \n on Unix, \r\n on Windows, \r on old Macs. Text-mode I/O in Python 3 translates by default (reads: \r\n → \n; writes: \n → platform's line separator). Pass newline="" to disable translation.

This matters for encoding because CSV files and other line-based formats are sensitive to line terminators: a Windows-authored CSV read in text mode on Unix will have a trailing \r on every field if you don't use the csv module or specify newline="".

Environment and locale

LANG, LC_ALL, LC_CTYPE, and friends are Unix environment variables that together determine the locale — a triple of (language, territory, codeset). LC_CTYPE=en_US.UTF-8 says "American English, UTF-8 codeset."

Programs that do any locale-sensitive operation consult these. The catch: which locale variable wins depends on the program and the operation. LC_ALL overrides everything. LC_CTYPE governs character classification and conversion. LANG is the default for anything unset.

If your terminal is showing ? or mojibake:

  1. Check LANG and LC_ALL. If they are empty, C, or POSIX, you are in an 8-bit, non-Unicode locale.
  2. Set LANG=en_US.UTF-8 (or your preferred locale). Run locale -a to see what's installed.
  3. Make sure your terminal emulator is configured for UTF-8.

Python before 3.15 honored these via locale.getpreferredencoding(). Go and Rust essentially ignore them for program logic (they use UTF-8 internally regardless) but rely on the terminal's interpretation for output.

Databases

MySQL and the utf8mb4 footgun

The most famous database Unicode gotcha: MySQL's utf8 character set is not UTF-8. It is a three-byte-maximum subset of UTF-8 that does not support code points above U+FFFF. So it handles the BMP but fails on emoji (all of which are supplementary code points) and on supplementary CJK.

The correct encoding is called utf8mb4 ("UTF-8, maximum 4 bytes per character") and it is what you want. MySQL 8.0 finally changed the default character set to utf8mb4; earlier versions defaulted to utf8.

If you have a pre-8.0 MySQL database, migrate to utf8mb4 at the database level, the table level, the column level, and the connection level. All four. Missing one causes silent truncation on emoji.

PostgreSQL

PostgreSQL supports UTF-8 as a first-class server encoding, UTF8. (The other notable option, SQL_ASCII, is really "no encoding enforcement.") Set the encoding at database creation time; it cannot be changed afterward.

PostgreSQL's text, varchar, char types all store the same bytes; the difference is only length enforcement. No nvarchar needed.

Collation is per-column (or per-database). Choose C collation for ASCII-byte-order, a locale like en_US.UTF-8 for locale-aware ordering, or "und-x-icu" (or any ICU locale) for true UCA behavior if your build has --with-icu.

SQL Server

SQL Server's native varchar is single-byte (a codepage determined by the database's collation); nvarchar is UCS-2 (not UTF-16 — meaning supplementary code points are stored as surrogate pairs but not handled as single characters by all functions).

Since SQL Server 2019, varchar columns can have a UTF-8 collation (Latin1_General_100_CI_AS_SC_UTF8), which stores data as UTF-8 bytes. nvarchar with a collation ending in _SC ("supplementary character") handles supplementary code points correctly.

The TL;DR for SQL Server in 2026: use varchar with a UTF-8 collation for new databases.

SQLite

SQLite is UTF-8 by default and has been for a long time. text is UTF-8. There are also UTF-16 variants of some APIs, which nobody uses.

Email (MIME)

Email is the oldest Unicode-bearing protocol, and it shows. MIME allows any encoding in message bodies via Content-Transfer-Encoding: base64 or quoted-printable with a Content-Type: text/plain; charset=utf-8 header. But email headers (From, To, Subject) are ASCII-by-default and use RFC 2047 "encoded-word" syntax for non-ASCII content:

Subject: =?UTF-8?B?SGVsbG8g8J+RiyE=?=

That is "Hello 👋!" in UTF-8, base64-encoded, in an email header. Parsers are expected to decode this transparently.

Internationalized email addresses (IDN in the local part and in the domain) are supported via the SMTPUTF8 extension (RFC 6531), but support is uneven, so most email infrastructure still converts domains to Punycode (ASCII-safe mojibake for the transport).

URLs and IDNs

URLs are defined by RFC 3986 to contain only ASCII characters. Non-ASCII text in a URL must be percent-encoded after first being encoded as UTF-8:

https://example.com/résumé
→ https://example.com/r%C3%A9sum%C3%A9

Domain names (the host portion) are different: they use IDN (Internationalized Domain Names) with Punycode encoding:

http://café.com
→ http://xn--caf-dma.com   (IDN, ASCII-Compatible Encoding)

We will return to IDN in Chapter 10, because it is a rich vein of security problems.

The consistent rule

At every boundary where bytes cross into or out of your program, know the encoding. Write it down. Put it in the Content-Type, the <meta charset>, the open(..., encoding="utf-8"), the database connection string, the MIME header.

The bug is almost always somebody — a framework, a legacy tool, a junior developer — who thought the encoding "would be obvious." It isn't, ever. It is metadata, and metadata has to be written down.

Next: the security side of Unicode. Characters that look like other characters, and what to do about them.

Unicode Security

Unicode is a universal character set, which means it contains every character that looks like every other character — and, in the hands of an attacker, that is a security issue.

This chapter is organized around the techniques: homograph attacks, Trojan Source, normalization mismatches, and invisible characters. For each, we cover the mechanism and the mitigations. The authoritative reference is UTS #39: Unicode Security Mechanisms.

Homograph attacks

A homograph is a character that looks identical (or indistinguishable) to another character but is a distinct code point. The Cyrillic lowercase а (U+0430) and the Latin lowercase a (U+0061) render identically in nearly every font. A reader cannot tell them apart. But as bytes, as code points, as URL characters, they are distinct.

An attacker who registers раypal.com — spelled with a Cyrillic р and a Cyrillic а — and sets up a fake login page has an asset that looks to a user like paypal.com but points to an attacker-controlled host. This is the classic IDN homograph attack. (IDN stands for Internationalized Domain Name, the mechanism that allows non-ASCII characters in URLs.)

Mitigations in browsers

Modern browsers deploy IDN display policies to reduce this risk:

  • Chrome's policy combines whole-script confusables detection, script-mixing detection, and a large blocklist of known confusable patterns. A domain that triggers the policy is displayed in Punycode (the ASCII-safe encoding) in the address bar: xn--ypal-uye.com instead of раypal.com.
  • Firefox has a similar policy, configurable via network.IDN_show_punycode and related preferences.
  • Safari has its own heuristics, generally aggressive.

The policies are not identical across browsers, and an attacker who finds a gap in one can sometimes exploit it. Mixing-script detection (a URL whose host contains both Latin and Cyrillic characters) catches the most common attacks but not single-script attacks where the attacker uses a wholly non-Latin script that contains visual look-alikes of Latin letters.

UTS #39 "confusables"

Unicode publishes a confusables.txt file that lists known visual confusables. You can use it to build your own checks: "does this username, when reduced to its confusable skeleton, collide with an existing user?" The skeleton is a deterministic reduction: map every character to its "representative" form, so that а (Cyrillic) and a (Latin) both map to the same skeleton character.

# Using the third-party `confusables` package (PyICU exposes similar checks).
import confusables

confusables.is_confusable("раypal", "paypal")       # True
confusables.skeleton("раypal") == confusables.skeleton("paypal")   # True

For any identifier your system treats as unique to a user (username, project name, organization name), always check for confusable collisions with existing values.

Restricted identifier profiles

UTS #39 defines restriction levels for identifiers:

  • ASCII-Only: only ASCII characters.
  • Single Script: only characters from one script.
  • Highly Restrictive: single script, or one of a small set of common script combinations (Latin+Han+Hiragana+Katakana for Japanese, Latin+Han for traditional Chinese, etc.).
  • Moderately Restrictive: same as above, plus Latin added to any single non-Latin script.
  • Minimally Restrictive: allows broader mixing with some exclusions.
  • Unrestricted: anything.

The right level for usernames is Moderately Restrictive at most. Anything below that permits mixing arbitrary scripts, and that is where homograph attacks come from.

Trojan Source

In 2021, security researchers Nicholas Boucher and Ross Anderson published Trojan Source: a family of attacks that use Unicode's bidirectional override characters to make source code look different to a human reader than it does to a compiler.

The mechanism

Unicode has bidirectional formatting control characters that affect how text is rendered, without changing what the text actually is:

  • U+202A LEFT-TO-RIGHT EMBEDDING (LRE)
  • U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
  • U+202D LEFT-TO-RIGHT OVERRIDE (LRO)
  • U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
  • U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
  • U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
  • U+2068 FIRST STRONG ISOLATE (FSI)
  • U+202C POP DIRECTIONAL FORMATTING
  • U+2069 POP DIRECTIONAL ISOLATE

These exist for a legitimate reason: Arabic and Hebrew are right-to-left scripts, and mixed-direction text requires a grammar for how directionality flows. Unicode's Bidirectional Algorithm (UBA, UAX #9) uses these overrides to specify exceptions to the default.

When an attacker places these overrides inside source code — in a comment, in a string literal — they can cause the code to render in a deceptive order while being parsed in its actual order. Consider this C snippet:

access_level = "user";
if (access_level != "user‮ ⁦// Check if admin⁩ ⁦") {
    // grant admin
}

What you see on screen is roughly "if (access_level != "user" // Check if admin) { grant admin }" — looks like a comment, and the string comparison reads as "user". But the bytes the compiler sees contain the full string "user <RLO> <LRI>// Check if admin<PDI> <LRI>", and the actual comparison is against a much weirder string. The visible and the parsed interpretations diverge.

This is Trojan Source. The attack works against any language that allows these characters in comments or string literals — which is almost all of them.

Mitigations

  • Lint rules: modern linters flag bidirectional controls in source code. Run rg '[\x{202A}-\x{202E}\x{2066}-\x{2069}]' over a codebase to find them; a small Python scanner is sketched after this list. Rust's compiler emits a warning since 1.56. GCC has -Wbidi-chars. Git since 2.35 warns when bidi controls appear in diffs.
  • Editor rendering: VS Code and most modern editors display a warning marker for files containing bidi controls in source text.
  • Code review: be suspicious of large innocuous-looking diffs that come from unfamiliar contributors. Trojan Source is not common in normal attacker traffic, but the technique is well-documented, and a curious reviewer can save their team.
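The scanner promised above, as a minimal sketch (walk it over file paths passed on the command line):

import sys

BIDI = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

for path in sys.argv[1:]:
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        for lineno, line in enumerate(f, 1):
            if BIDI & set(line):
                print(f"{path}:{lineno}: bidirectional control character")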

Invisible and zero-width characters

Unicode has a number of characters that are, or render as, nothing:

  • U+200B ZERO WIDTH SPACE
  • U+200C ZERO WIDTH NON-JOINER
  • U+200D ZERO WIDTH JOINER
  • U+2060 WORD JOINER
  • U+FEFF ZERO WIDTH NO-BREAK SPACE (also the BOM; there is no way to tell which role it plays from the bytes alone)
  • U+00AD SOFT HYPHEN
  • U+2028 LINE SEPARATOR
  • U+2029 PARAGRAPH SEPARATOR
  • U+E0000 – U+E007F (tag characters, covered in Chapter 14)

If you allow these in identifiers, you allow usernames that look identical to existing ones. admin and admin\u200b render the same and collide in many UIs but differ byte-for-byte. An account with the latter name can log in; to a moderator scrolling a table, it looks like the former.

Mitigation: normalize inputs (NFKC or similar), strip Default_Ignorable_Code_Point characters, apply a restriction level, and case-fold.

Normalization mismatches

If part of your system normalizes a string and another part doesn't, attackers can exploit the difference.

Classic example: a URL normalizer maps example.com/café (with é written as e plus combining acute) to its NFC form. An authentication filter looks for the literal NFC path /café, sees the user's decomposed version, says "this isn't the protected path," and waves the request through. The application then receives the normalized path and serves the protected resource. Normalization mismatch.

Where to watch for this

  • URL path handling vs. ACL matching.
  • Filename comparison in security-sensitive code.
  • Username comparison vs. username storage.
  • Header name matching (HTTP is usually ASCII, but email headers can contain non-ASCII).

The rule

Pick one normalization form. Apply it at the boundary. Store, index, and compare in that form. Never compare an unnormalized input against a normalized stored value.
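A minimal sketch of the rule in Python:

import unicodedata as ud

stored = ud.normalize("NFC", "café")       # normalized once, at the boundary
incoming = "cafe\u0301"                    # decomposed form straight from the client
incoming == stored                         # False: raw comparison is unsafe
ud.normalize("NFC", incoming) == stored    # True: compare in one agreed form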

Punycode, IDNs, and the xn-- story

A Punycode string is the ASCII-compatible encoding of an IDN label. Domain names in DNS are restricted to ASCII plus hyphens and digits, so a non-ASCII label like café is encoded as xn--caf-dma, where xn-- is the IDN prefix and caf-dma is a Punycode-encoded representation.

Important for security:

  • IDNA 2008 (the current IDN standard, RFC 5890-5894) specifies which code points are permitted in IDN labels. The set is restrictive and designed to reduce homograph risk.
  • Browsers convert IDNs back to Unicode for display only if the domain meets the display policy. Otherwise they show the Punycode form.
  • Applications that generate or parse URLs should use a proper IDN library (Python's idna package, for instance — not the built-in encodings.idna, which implements the older, less safe IDNA 2003).
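For example, with the idna package:

import idna   # third-party IDNA 2008 implementation

idna.encode("café.com")          # b'xn--caf-dma.com'
idna.decode("xn--caf-dma.com")   # 'café.com'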

Defensive code patterns

A safe username validator

import unicodedata as ud
import regex  # third-party; the stdlib re module does not support \p{…}

# Allow letters, digits, and underscore (script restriction is applied separately).
ALLOWED = regex.compile(r"^[\p{L}\p{N}_]{3,20}$")

def safe_username(name: str) -> str | None:
    n = ud.normalize("NFKC", name)
    if any(0x200B <= ord(c) <= 0x200F for c in n):  # zero-width and direction marks
        return None
    if any(ud.category(c) == "Cf" for c in n):      # format controls
        return None
    if any(0x202A <= ord(c) <= 0x202E for c in n):  # bidi embeddings and overrides
        return None
    if not ALLOWED.match(n):
        return None
    # Enforce a moderately-restrictive script profile (example; production code
    # should use UTS #39 data via ICU or the confusables package).
    return n

This rejects bidi overrides, zero-width characters, format controls, and anything outside a basic letter/digit/underscore set after NFKC normalization. A production implementation would also compute the confusable skeleton and check it against existing usernames.

Displaying untrusted text

  • Strip or replace bidirectional control characters before displaying untrusted text in a terminal or web UI.
  • Render zero-width characters visibly (many editors do this; [ZWSP]).
  • When displaying a URL, display its Punycode form if the IDN display policy fails.

What this chapter is not

We have not covered:

  • Buffer overflows from miscalculated character counts (Chapter 7 is the prevention).
  • Injection attacks (SQL, HTML, shell) where Unicode can sometimes bypass filters — the mitigation is always "filter after decoding, decode to a canonical form, and use parameterized queries / escape libraries," not regex-on-bytes.
  • SSL certificate impersonation using IDNs, which is largely a certificate authority policy problem.

With the security tour done, the next chapter goes back to something much more fun: emoji.

Emoji, Properly

Emoji are the most visible success story of Unicode, and the most concentrated source of Unicode-related confusion in practice. A single emoji can involve multiple code points, a variation selector, a modifier, a font lookup, and a fallback cascade — all to render one tiny picture.

This chapter takes them seriously. If you understand how emoji work in Unicode, you understand most of Unicode's advanced text machinery.

How emoji got into Unicode

Emoji (絵文字, "picture characters") originated in Japan in the 1990s. Japanese mobile carriers — DoCoMo, KDDI, SoftBank — each had their own proprietary encoding for a few hundred pictographs used in messages. When iPhone launched in Japan in 2008, Apple had to interoperate with these encodings, and Google followed for Android.

In 2010, with Unicode 6.0, emoji entered the standard. A core set of about 700 emoji was assigned code points, most of them in the Supplementary Multilingual Plane (the new emoji blocks start at U+1F000) precisely because the Basic Multilingual Plane was running out of room.

This was controversial. Some Unicode Consortium members argued that pictographs were not text in the sense that letters and numbers are text, and didn't belong in a character set. The counterargument was that people were already using these characters in text on their phones, and the choice was whether to standardize or watch proprietary encodings proliferate. Standardization won, and Unicode gained — and continues to gain — hundreds of emoji per year.

Variation selectors

Some symbols can be rendered as either a text glyph (monochrome, aligned to the baseline) or an emoji glyph (colored, possibly more stylized). The default depends on the platform and the symbol.

Unicode defines two variation selectors that force one or the other:

  • U+FE0E VARIATION SELECTOR-15 (VS15): forces text presentation.
  • U+FE0F VARIATION SELECTOR-16 (VS16): forces emoji presentation.

These are invisible code points; they attach to the preceding character and change how that character is rendered.

U+2764 HEAVY BLACK HEART: ❤
U+2764 U+FE0E: ❤︎   (text presentation)
U+2764 U+FE0F: ❤️   (emoji presentation)

In practice, you mostly see VS16 in the wild, appended to symbols that Unicode considers borderline (text-by-default but often wanted as emoji). You will sometimes copy a "heart emoji" from a web page and get U+2764 U+FE0F — two code points — and wonder why your length count is off. Now you know.
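You can see the selector in Python:

heart = "❤️"                           # as copied: U+2764 followed by U+FE0F
[f"U+{ord(c):04X}" for c in heart]     # ['U+2764', 'U+FE0F']
len(heart)                             # 2: the invisible selector is a real code point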

Fitzpatrick modifiers (skin tone)

In 2015, Unicode 8.0 added five emoji modifiers corresponding to skin tones from the Fitzpatrick scale, a dermatological classification:

  • U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2 (light skin tone)
  • U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 (medium-light skin tone)
  • U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 (medium skin tone)
  • U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 (medium-dark skin tone)
  • U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 (dark skin tone)

An emoji that supports skin tone (not all do) can be followed by a modifier to produce a tinted version. The base emoji without a modifier is conceptually the "default" (typically yellow in most font designs, specifically to avoid implying a particular skin tone).

👋         U+1F44B
👋🏻         U+1F44B U+1F3FB   (light skin)
👋🏼         U+1F44B U+1F3FC
👋🏽         U+1F44B U+1F3FD
👋🏾         U+1F44B U+1F3FE
👋🏿         U+1F44B U+1F3FF   (dark skin)

A modifier applied to an emoji that doesn't support it will render as the base emoji followed by a colored square, or as just the base emoji with the modifier ignored, depending on the font.

Zero-Width Joiner sequences

The real combinatorial explosion lives in ZWJ sequences. The zero-width joiner (U+200D, ZWJ) joins adjacent emoji into a ligature if the font defines one.

Consider the family emoji 👨‍👩‍👧‍👦. Its code points:

U+1F468 MAN
U+200D  ZWJ
U+1F469 WOMAN
U+200D  ZWJ
U+1F467 GIRL
U+200D  ZWJ
U+1F466 BOY

A font that "knows" this sequence renders it as a single family glyph. A font that doesn't renders it as 👨 👩 👧 👦 — the individual emoji, possibly with visible gaps where the ZWJ is.

Other common ZWJ sequences:

  • Profession: 👩‍⚕️ = WOMAN + ZWJ + U+2695 (MEDICAL SYMBOL) + VS16 = "woman health worker."
  • Relationships: 👩‍❤️‍👩 = WOMAN + ZWJ + HEAVY BLACK HEART + VS16 + ZWJ + WOMAN.
  • Hair color: 👨‍🦰 = MAN + ZWJ + U+1F9B0 (EMOJI COMPONENT RED HAIR).
  • Flag of Scotland: 🏴󠁧󠁢󠁳󠁣󠁴󠁿 (not actually a ZWJ sequence; it uses tag characters, Chapter 14).

Each of these is one grapheme cluster, rendered as one glyph (if supported), spanning many code points.

Regional indicator flags

Flag emoji for sovereign countries work differently from everything above, and have a uniquely elegant design.

Unicode did not assign one code point per country flag. That would have been inflexible and politically fraught. Instead, Unicode defined 26 regional indicator symbols — U+1F1E6 (🇦) through U+1F1FF (🇿) — corresponding to the 26 Latin letters. A pair of consecutive regional indicators that spells an ISO 3166-1 alpha-2 country code renders as that country's flag.

🇺 + 🇸  (U+1F1FA + U+1F1F8)  → 🇺🇸
🇯 + 🇵  (U+1F1EF + U+1F1F5)  → 🇯🇵
🇪 + 🇺  (U+1F1EA + U+1F1FA)  → 🇪🇺

Each flag is one grapheme cluster, two code points.

This design means Unicode doesn't have to take a position on "is this a country?" for every disputed territory — it just provides the mechanism, and fonts decide what to render. For codes that don't correspond to a recognized country, the fonts typically render the letters as individual regional indicators without flag treatment.
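The arithmetic is simple enough to do by hand. Here is a hypothetical helper in Python (flag is our name for it, not a standard API):

def flag(cc: str) -> str:
    """Map an ISO 3166-1 alpha-2 code onto the regional indicator block."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in cc.upper())

flag("US")   # '🇺🇸'
flag("JP")   # '🇯🇵'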

Subnational flags (Scotland, Wales, England) use a different mechanism: tag characters (Chapter 14).

Rendering is the font's job

A critical clarification: Unicode defines what the code points are. It does not define what they look like. Every emoji on your screen is a font glyph, chosen by your operating system or browser.

This is why the "same" emoji looks so different across platforms:

  • Apple's 🦒 is drawn one way; Google's is drawn another; Microsoft's is different again.
  • The pistol emoji (U+1F52B) was redesigned by most vendors around 2017 from a realistic handgun to a water pistol. The code point did not change; the glyphs did.
  • Some vendors have experimented with not including certain emoji in their fonts (Facebook's Messenger initially omitted some characters); the code point exists and is interchanged, the rendered image is platform-dependent.

If your user pastes an emoji that your font doesn't know, it will render as a "tofu" — a box — or as a fallback symbol. The bytes are still correct; the font is incomplete. This is why emoji interop problems are usually font problems, not encoding problems.

The consequence for layout

Because emoji are font glyphs, they have a width only at render time. In terminals — which assume monospace with a fixed column width — emoji can be 1 column wide, 2 columns wide, or "does your terminal handle this?" wide. The Unicode East Asian Width property (UAX #11) gives a hint, but it is a hint. Robust CLI tools that display emoji must query the terminal.

Counting emoji correctly

You know this by now. If you need the "tweet length" of text that contains emoji, count grapheme clusters, not code points. A skin-tone family emoji (MAN-LIGHT + ZWJ + WOMAN-MEDIUM + ZWJ + BOY-DARK) is eight code points (three people, three modifiers, two joiners) but one grapheme cluster, one user-perceived "character."

const seg = new Intl.Segmenter();
const msg = "Celebrating 🎉👨🏻‍👩🏾‍👧🏼";
[...seg.segment(msg)].length;    // the answer a user expects

Detecting emoji

Unicode derives an Emoji property (and several related properties: Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component, Extended_Pictographic). You can use these in regex:

/\p{Emoji}/u.test("hello");    // false
/\p{Emoji}/u.test("🎉");        // true
/\p{Emoji}/u.test("1");         // true — digits have the Emoji property

Note that surprising result: the digit 1 has the Emoji property set, because 1️⃣ (keycap one: 1 + VS16 + U+20E3) is a valid emoji sequence. If you just want "pictograph-like emoji" — not digits, not #, not * — use \p{Extended_Pictographic}.

For production emoji detection, use a well-maintained library (emoji-regex in JavaScript, the emoji package in Python). Unicode adds emoji every year, and a hard-coded regex goes stale.

A worked example: counting a tweet

Here is what a real "tweet length" counter should do.

function tweetLength(text) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(text)].length;
}

tweetLength("Hello");                    // 5
tweetLength("Hello 👋");                  // 7
tweetLength("Hello 👋🏽");                  // 7 (same — skin tone doesn't add a cluster)
tweetLength("Hello 👨‍👩‍👧‍👦");                  // 7 (family is one cluster)
tweetLength("Hello café");                // 10 (é is one cluster, precomposed or not)

Twitter's actual length counter is not a plain grapheme count: it weights certain ranges of code points differently. But the baseline — grapheme clusters as the unit of "characters" — is correct.

Emoji are text now

The reason to take emoji seriously is that they are text. They are searched, they are typed, they are pasted, they are stored in databases, they are included in usernames, they are part of identifiers in some contexts. Every abstraction that works for letters must work for emoji too.

This is, in some sense, the final test of your Unicode code. If it works on Hello world, it works on ASCII. If it works on Café résumé, it works on Latin-script languages with diacritics. If it works on 👨‍👩‍👧‍👦, it works on Unicode.

Next, we look at Unicode in programming language identifiers — and why, even when your language lets you name variables with emoji, you probably shouldn't.

Identifier Characters and Programming Languages

Programming languages have identifier rules: what characters can appear in a variable name, a function name, a class name. ASCII-only is the old default; most modern languages allow at least some Unicode. This chapter covers what is allowed where, what the Unicode Standard recommends, and why your codebase probably shouldn't use non-ASCII identifiers even when it can.

Unicode Annex #31

UAX #31 is the Unicode annex that defines a recommended grammar for identifiers. It has two central properties:

  • ID_Start: the set of code points that can begin an identifier. Roughly: letters (general category L*) plus letter-numbers (Nl).
  • ID_Continue: the set of code points that can appear after the start. ID_Start plus digits (Nd), connector punctuation (Pc, which includes _), and a selection of combining marks (Mn, Mc).

A language that conforms to UAX #31 allows identifiers of the form ID_Start ID_Continue*.

UAX #31 also defines stricter variants:

  • XID_Start / XID_Continue: slightly adjusted sets that guarantee stability under NFKC normalization (NFKC(ident) has the same ID structure as ident).
  • Pattern_Syntax and Pattern_White_Space: code points reserved for use as syntax in patterns (not allowed in identifiers).

The main guidance of UAX #31: use XID_Start / XID_Continue, normalize identifiers to NFC (or NFKC), and apply a profile from UTS #39 if security matters.

Language-by-language

Python

Python 3 follows UAX #31 closely. Specifically:

  • Identifier = XID_Start XID_Continue*.
  • Identifiers are compared after NFKC normalization. This means café and café refer to the same variable regardless of precomposed vs. decomposed form; it also means ﬁnalize (with the U+FB01 fi ligature) and finalize are the same name.
>>> π = 3.14159
>>> print(π)
3.14159
>>> café = "latte"
>>> café == café     # same name, different spelling — identical after NFKC
True

Python rejects emoji as identifiers (they are not in XID_Start).

JavaScript

ES2015 onward: identifiers use ID_Start and ID_Continue (not XID_*). JavaScript does not normalize identifiers — café (NFC) and cafe\u0301 (NFD) are different variables, both valid.

const café = 1;         // precomposed
const cafe\u0301 = 2;   // decomposed — different variable

This is a footgun. Some style guides recommend against non-ASCII identifiers in JavaScript for precisely this reason.

JavaScript also allows Unicode escape sequences in identifiers: \u00e9 is equivalent to a literal é in an identifier.

Emoji are in neither ID_Start nor ID_Continue, so you cannot name a variable 🙂 or x🙂; the spec allows only specific code points, not arbitrary symbols.

Java

Java identifiers use Character.isJavaIdentifierStart / isJavaIdentifierPart, which are based on but not identical to UAX #31. They allow all letters (including all scripts), digits, underscore, and currency symbols.

int π = 3;
String $ = "dollar";    // valid in Java

Java does not normalize identifiers; café and cafe\u0301 are different variables.

Go

Go identifiers are a letter (Unicode general category Lu, Ll, Lt, Lm, or Lo) followed by any mix of letters and digits (Nd). Underscore is treated as a letter.

var π = 3.14
func σ(x float64) float64 { return x * x }

Go does not normalize identifiers either. And Go has a visibility rule tied to the identifier: exported names must start with an uppercase letter. This is computed by Unicode's case property: π (lowercase pi) is unexported; Π (uppercase pi) is exported. The rule applies across every script that has case.

Rust

Rust follows UAX #31 strictly. Identifiers are XID_Start XID_Continue*. Rust normalizes to NFC for identifier comparison.

Since Rust 1.53 (the "non-ASCII identifier" RFC), you can write:

#![allow(unused)]
fn main() {
let π = 3.14;
let café = "double espresso";
}

Rust also guards against confusable identifiers: the non_ascii_idents lint family (uncommon_codepoints, confusable_idents, mixed_script_confusables) warns when an identifier is likely to be misread or to collide visually with an existing name.

Swift

Swift identifiers are extremely permissive. The start set includes letters, most symbols (including emoji), and some others; the continue set adds digits and combining marks.

let 🎉 = "party"         // valid
let 🐕 = "dog"

Swift's permissiveness has produced the most photogenic "look how quirky our language is" code samples on social media. It has not produced a lot of real production code using emoji as identifiers.

C and C++

C99 and C++98 allowed limited Unicode in identifiers via \u / \U escapes. C++23 and C23 adopted UAX #31. Compiler support varies; GCC and Clang largely conform.

Normalization and identifiers

The identifier equality question has two answers:

  1. Byte-equal: two identifiers are the same iff their code points are identical. (JavaScript, Java, Go.)
  2. Normalization-equal: two identifiers are the same iff their NFC (or NFKC) normalizations are identical. (Python, Rust.)

Normalization-equal is safer, because it prevents a category of confusable identifiers from coexisting. It also costs a little: the compiler must normalize every identifier before comparison. Languages that adopted Unicode identifiers deliberately (Python 3, Rust) chose normalization-equal; the others stuck with byte-equal because that was how their string tables already worked.
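
Normalization-equal comparison is a one-liner with the standard library. Here is a sketch of what a normalizing compiler effectively does, using Python's unicodedata and NFKC as the folding form:

import unicodedata

def same_identifier(a: str, b: str) -> bool:
    # Normalization-equal: fold both spellings, then compare code points.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(same_identifier("café", "cafe\u0301"))    # True: NFC vs. NFD spelling
print(same_identifier("ﬁnalize", "finalize"))   # True: U+FB01 folds to "fi"
print(same_identifier("Admin", "admin"))        # False: normalization is not case folding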

Mixed-script identifiers as a security concern

Consider a Python codebase with a variable named admin. An attacker contributing a PR introduces a function using аdmin — where the first letter is the Cyrillic а (U+0430). In a code review, the two look identical. Python's NFKC normalization does not fold Cyrillic to Latin, so admin and аdmin are distinct variables.

The attacker can now define аdmin = True in a module, and the reviewer who reads it as admin = True has no way to tell from the visible source that this is a different variable. Later code that references admin will use the real Latin one; the attacker's definition has no effect, but a clever variant of this attack can introduce bugs, dead code, or subtle vulnerabilities.
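
The forgery is invisible on screen but trivial to expose in code — a quick check with the standard unicodedata module:

import unicodedata

latin = "admin"
spoofed = "\u0430dmin"                  # first letter is Cyrillic

print(latin == spoofed)                 # False: different code points
print(unicodedata.name(latin[0]))       # LATIN SMALL LETTER A
print(unicodedata.name(spoofed[0]))     # CYRILLIC SMALL LETTER A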

The mitigations are the same as for usernames (Chapter 10):

  • Restrict identifiers to a limited set of scripts (UTS #39 profiles).
  • Warn on mixed-script identifiers.
  • Lint for confusables with existing names.

Rust's non_ascii_idents lint family includes confusable_idents, mixed_script_confusables, and uncommon_codepoints. Python has PEP 672 discussing the risks but does not yet enforce a restrictive profile. Most other languages leave this to external tools.

Why your codebase shouldn't use non-ASCII identifiers

Even when your language allows it, the pragmatic recommendation is: don't.

  • Tooling: grep, diff, and many legacy tools assume ASCII. They will often still work on UTF-8 identifiers, but weirdly.
  • Keyboards: not every developer has every character on their keyboard. Typing π requires a Compose sequence or a Unicode input method, which slows code contribution.
  • Merge conflicts: NFC vs. NFD differences that aren't distinguishable on screen can produce git conflicts that look like phantoms.
  • Editors: older editors or corporate-mandated IDEs may not handle complex scripts (right-to-left, combining marks) correctly.
  • Search: a code search for cafe will not match café, so accented identifiers silently escape grep-style queries and refactoring tools.
  • Contributor inclusion: if your project welcomes non-English-speaking contributors, having ASCII identifiers is the lowest-friction common denominator.
  • Security: mixed-script attacks are possible, as above.

There are counter-arguments for mathematical or scientific code that specifically benefits from symbolic names (π, σ). Those use cases are narrow and usually restricted to one clearly-scoped module.

The default for a new codebase: ASCII identifiers, with UTF-8-aware tooling for everything else.

String literals are different

None of the above applies to string literals and comments. Your string literals should absolutely be able to contain any Unicode your users will produce: "Hello, 世界", "¿Qué tal?", "🎉". The rules in this chapter are only about identifiers — the names of variables, functions, types, and other things your compiler tracks.

A good default:

  • Identifiers: ASCII.
  • String literals: UTF-8, including whatever Unicode the application needs.
  • Source file encoding: UTF-8, with no BOM.

That set of choices avoids every Unicode-identifier hazard while losing nothing about your ability to handle Unicode data.

Next we look at the Unicode database itself — the data file that backs every property we have discussed.

The Unicode Database

Everything we've discussed — normalization, case folding, collation, scripts, emoji properties, grapheme break rules — is ultimately table-driven. The tables are the Unicode Character Database (UCD), a bundle of text files published by the Unicode Consortium with every release of the standard. This chapter shows you what's in there, how to read it, and how to query it from code.

Knowing the UCD matters for two reasons. First, when your language's standard library lets you down, you can fall back to the raw data. Second, when someone asks a weird question — "what is the name of U+1F9E6?" — the UCD gives the definitive answer.

What ships in the UCD

The Unicode Database is a directory of files, downloadable from https://www.unicode.org/Public/ under the current version directory (e.g., 16.0.0/ucd/). The core files:

  • UnicodeData.txt — the most important file. One line per assigned code point with canonical properties.
  • PropList.txt — additional simple properties (Alphabetic, White_Space, Bidi_Control, etc.).
  • DerivedCoreProperties.txt — derived properties, including ID_Start, ID_Continue, and many more.
  • Scripts.txt — the Script property for every code point.
  • Blocks.txt — which block each code point belongs to.
  • CaseFolding.txt — case-folding mappings.
  • SpecialCasing.txt — language-specific casing rules (Turkish I, etc.).
  • CompositionExclusions.txt — code points that cannot be composed even if they look decomposable.
  • NormalizationTest.txt — test cases for normalization implementations.
  • auxiliary/GraphemeBreakProperty.txt — the Grapheme_Cluster_Break property.
  • emoji/emoji-data.txt — the emoji properties (Emoji, Emoji_Presentation, etc.).
  • confusables.txt — the confusable characters table (part of UTS #39; published in the security/ directory alongside the UCD).

The files are plain text — almost entirely ASCII — semicolon-separated, with # introducing comments. The format is designed to be parseable by a tiny script; you don't need a library.

Reading UnicodeData.txt

UnicodeData.txt is the canonical per-code-point file. Each line has 15 fields separated by semicolons:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

The fields, in order:

  1. Code Point (hex).
  2. Name.
  3. General Category (Lu, Ll, Lt, Mn, Nd, etc.).
  4. Canonical Combining Class (integer; non-zero for combining marks).
  5. Bidi Class (L, R, AL, EN, ES, etc.).
  6. Decomposition Mapping (e.g., 0065 0301 for é).
  7. Numeric Type/Value 1 (for decimal digits).
  8. Numeric Type/Value 2 (for digits more broadly).
  9. Numeric Type/Value 3 (for any character with a numeric value, e.g., Roman numerals).
  10. Bidi Mirrored (Y/N).
  11. Unicode 1 Name (historical).
  12. ISO Comment (obsolete).
  13. Simple Uppercase Mapping.
  14. Simple Lowercase Mapping.
  15. Simple Titlecase Mapping.

Most of the time you care about fields 1, 2, 3, 6, 13, and 14. The others are important for specific tasks — bidi algorithm implementations, numeric parsing — but not for everyday use.

Range compression

UnicodeData.txt doesn't list every assigned code point. Large contiguous ranges (like the CJK ideographs at U+4E00–U+9FFF) are represented as paired First / Last lines:

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

Every code point in the closed range has the properties shown. A parser must expand these.
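
A minimal parser sketch in Python: it builds a code point → (name, category) map and expands the First/Last pairs, assuming UnicodeData.txt sits in the working directory.

def load_unicode_data(path="UnicodeData.txt"):
    table = {}                     # code point -> (name, general category)
    pending_first = None
    with open(path, encoding="ascii") as f:
        for line in f:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split(";")
            cp, name, category = int(fields[0], 16), fields[1], fields[2]
            if name.endswith(", First>"):
                pending_first = (cp, name)
            elif name.endswith(", Last>"):
                first_cp, first_name = pending_first
                base = first_name[1:].rsplit(",", 1)[0]    # e.g. "CJK Ideograph"
                for c in range(first_cp, cp + 1):
                    table[c] = (base, category)
                pending_first = None
            else:
                table[cp] = (name, category)
    return table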

Looking up properties in code

Every modern language has some way to query the UCD. Here are the most useful.

Python: the unicodedata module

import unicodedata as ud

ud.name("é")              # 'LATIN SMALL LETTER E WITH ACUTE'
ud.category("é")          # 'Ll'
ud.combining("̈")          # 230  (the combining diaeresis)
ud.decomposition("é")     # '0065 0301'
ud.normalize("NFD", "é")  # 'e\u0301'
ud.numeric("½")           # 0.5
ud.bidirectional("א")     # 'R' (right-to-left)

Python's unicodedata is bundled with the interpreter and refreshed with each Python release to match a specific Unicode version. It handles the common properties; for less common ones (Script, Grapheme_Cluster_Break), use PyICU or the third-party regex module, which ships its own property tables.

JavaScript: Intl and limited String methods

JavaScript's built-in Unicode queries are narrower. String.prototype.normalize is the main one. For properties, use \p{…} in Unicode-aware regex:

/\p{Script=Greek}/u.test("α");     // true
/\p{General_Category=Lowercase_Letter}/u.test("a");  // true

For programmatic lookup (by code point → property), there is no built-in. Use the unicode-properties package or similar.

C/C++: ICU

ICU's u_charType(cp), u_charName(cp, ...), u_getIntPropertyValue(cp, ...) are the low-level queries. They are fast (ICU ships compiled property tables) and comprehensive.

The command line

Two very useful command-line options:

  • uni (github.com/arp242/uni): a standalone Go tool for looking up code points by name, identifier, or literal character.
  • A Perl one-liner over the name database: perl -Mcharnames=:full -CS -E 'for (0..0x10FFFF) { my $n = charnames::viacode($_) // next; printf "%04X %s\n", $_, $n if $n =~ /GRINNING/ }'.

And of course, python3 -c 'import unicodedata; print(unicodedata.name(chr(0x1F600)))'.

Unicode Utilities (unicode.org)

The Unicode Consortium publishes a web-based set of Unicode Utilities at util.unicode.org. The useful ones:

  • Character Properties: type any character or code point, see all its properties.
  • List Unicode Characters: regex-style queries over the character set. \p{Sc} to list currency symbols.
  • Unicode Converter: convert between UTF-8, UTF-16, UTF-32, code points, HTML escapes, etc.
  • Transform: apply normalization forms, case folding, transliteration.

Bookmark these. They are faster than writing a script for a one-off question.

Staying current

Each Unicode release ships new assigned code points, new property values, and occasionally new properties. The cadence is roughly one release per year. The UCD files have the version number in their directory path.

When you're using a language's built-in unicodedata (or equivalent), you are using whatever Unicode version that language was built against. If you need the latest version — for a new emoji, a new script addition, a recently added property — you may need to install a more current ICU or a third-party library that tracks upstream.
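
In Python, for instance, the bundled tables report their own version:

>>> import unicodedata
>>> unicodedata.unidata_version    # depends on your Python build
'15.1.0'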

For most production code, being one Unicode version behind is fine. The standard is designed so that older tables never become wrong; they only become incomplete.

The file that will teach you the most

If you want to understand Unicode at a deep level, spend an hour reading UnicodeData.txt. Start from U+0000 and scroll. Notice:

  • The gaps where unassigned code points sit.
  • The long runs of CJK ideographs (represented only as First/Last pairs).
  • The combining marks cluster in the U+0300s and U+0800s.
  • The mathematical operators in U+2200s.
  • The emoji starting at U+1F300.
  • The Private Use Area at U+E000–U+F8FF (6,400 code points reserved for non-standard use).
  • The supplementary planes starting at U+10000.

It is a map of human writing, as of the current Unicode version. It is also a record of every committee decision the Unicode Consortium has made. You will come to appreciate that Unicode is not a mess — it is a negotiated peace.

Where Unicode Is Still Evolving

The Unicode Standard is not finished. It gets a new major version roughly every year, and each version adds code points, adjusts properties, and occasionally extends the algorithm specs. This chapter covers what changes between versions, how the Consortium decides, and some of the weirder corners where the standard is still being written.

The release cadence

Unicode 1.0 appeared in 1991. The major-version history:

  • 1.0 (1991), 1.1 (1993): early small set.
  • 2.0 (1996): the expansion beyond 16 bits — surrogate pairs introduced.
  • 3.0–4.1 (1999–2005): steady growth, first Plane 1 assignments.
  • 5.0–5.2 (2006–2009): more scripts — Avestan, Bamum, Egyptian Hieroglyphs.
  • 6.0 (2010): the first official emoji, which settled the "maybe emoji don't belong in Unicode" debate.
  • 7.0 onward: roughly annual releases, adding anywhere from a few hundred to several thousand code points each.
  • 16.0 (2024): current version as of this writing. ~155,000 assigned code points.

Every release has its own UAX revisions (Unicode Standard Annexes) — UAX #14 (line breaking), UAX #29 (text segmentation), UAX #31 (identifiers), and others get updated together.

What gets added

Three categories of code point additions dominate:

Historical scripts

New writing systems — usually historical or minority scripts — are added every version. Recent examples: Garay (Unicode 16.0), Gurung Khema, Kirat Rai, Ol Onal, Sunuwar. These additions are driven by scholars and native speakers petitioning for encoding. Once added, they give digital existence to scripts that might otherwise be untypeable and unsearchable.

CJK ideograph extensions

Chinese characters are added in batches called CJK Unified Ideographs Extensions. The current extensions are A through I, with further extensions proposed. Extension G (2020, Unicode 13.0) added ~4,900 characters; Extension H (2022) added ~4,200 more. The need is real: classical Chinese texts, historical personal names, and regional variants all turn up characters that the previous Unicode version didn't cover.

The ideograph extensions tend to live in high supplementary planes — Plane 2 (U+20000–U+2FFFF) and Plane 3 (U+30000–U+3FFFF) — where there's still plenty of space.

Emoji

Each year's Emoji Update adds a set of new emoji. They come from the Unicode Emoji Subcommittee process: anyone can submit a proposal, which is evaluated for compatibility, distinctiveness, and expected usage. Emoji are sometimes added as bare code points (new picture characters) and sometimes as new ZWJ sequences using existing components.

Every year's emoji release also adjusts a handful of existing emoji — adding skin-tone support, adding gender variants, clarifying rendering expectations.

Tag characters

A strange corner of Unicode: tag characters (U+E0000 – U+E007F, 128 code points in Plane 14).

These were originally defined for language tagging — embedding language markers in plain text, like inline HTML lang attributes. Introduced in 2001, the mechanism was deprecated by Unicode 5.1 (2008); the code points remained but were almost unused.

Then, in 2017, the emoji subcommittee revived them for a different purpose: subnational flag encoding. The flags of Scotland, Wales, and England are encoded as:

🏴 (U+1F3F4 BLACK FLAG)
+ tag sequence encoding the ISO 3166-2 subdivision code
+ U+E007F CANCEL TAG

The flag of Scotland:

U+1F3F4 BLACK FLAG
U+E0067 TAG LATIN SMALL LETTER G
U+E0062 TAG LATIN SMALL LETTER B
U+E0073 TAG LATIN SMALL LETTER S
U+E0063 TAG LATIN SMALL LETTER C
U+E0074 TAG LATIN SMALL LETTER T
U+E007F CANCEL TAG

"gb-sct" — GB subdivision SCT — enclosed in tag characters and terminated by CANCEL TAG. That's seven code points, each of them 4 bytes in UTF-8, for a total of 28 bytes to render a single flag grapheme cluster. The encoding is wildly space-inefficient, but it is compositional: any subdivision code can be encoded, and fonts only need to ship glyphs for the subdivisions they support.

Tag characters are Default_Ignorable_Code_Point — they should not be visible to a user when they appear in text (they are expected to be consumed by the rendering process). If your filter doesn't strip them, though, they can be a vector for invisible-content attacks, similar to zero-width characters. In 2024, several phishing kits were observed exploiting tag characters in URLs to evade visual inspection.
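
If you sanitize user-visible strings, dropping the whole tag block is cheap insurance — a minimal filter sketch (note that it also removes the tag sequences of legitimate subdivision flags):

def strip_tag_characters(s: str) -> str:
    # Remove everything in the Plane 14 tag block, U+E0000..U+E007F.
    return "".join(ch for ch in s if not 0xE0000 <= ord(ch) <= 0xE007F)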

Script property additions

When a new script is encoded, every code point in it gets a Script property. Related data (like script-specific collation tailorings in CLDR) may need updating.

The Unicode Consortium is currently working on several scripts in various stages:

  • Proto-Sinaitic, one of the earliest known alphabetic scripts.
  • Linear Elamite, recently deciphered.
  • Indus Valley Script, still undeciphered but with enough attested characters for proposal.
  • Various constructed scripts (Tolkien's Tengwar has a long-standing proposal and a tentative slot in the Supplementary Multilingual Plane roadmap, but is not yet encoded).

For languages that are alive, encoding can transform digital life: speakers can finally type in their native script, use it in search engines, and preserve their literature digitally.

Backward compatibility

One of the strongest guarantees of the Unicode Standard is stability. Once a code point is assigned:

  • Its code point number never changes.
  • Its name never changes (rarely, a typo is corrected via an alias).
  • Its General Category, Canonical Combining Class, and Decomposition Mapping are essentially frozen.

This means that UTF-8 files you wrote in 2005 decode identically in 2026. A string normalized to NFC in 2010 is still normalized in NFC terms as of the current version. Emoji from Unicode 6.0 are still valid.

The price of this guarantee is that mistakes don't get corrected. U+FB01 (the ﬁ ligature) would probably not be added today, but it exists and is locked in. U+200B (zero-width space) continues to be a security hazard nobody can remove.

The Consortium process

Unicode is decided by the Unicode Consortium, a nonprofit whose full members include Apple, Google, Microsoft, Meta, Netflix, and several national governments. Proposals for new characters, new properties, or standard changes go through:

  1. Submission to the relevant subcommittee (UTC, Emoji Subcommittee, CJK Ideograph Working Group).
  2. Review and revision, often across multiple quarterly meetings.
  3. Adoption into a specific Unicode version.
  4. Publication with that version's release.

The process is open: anyone can submit a proposal (unicode.org/pending/proposals.html), and technical discussion is largely public. If you care strongly about some corner of Unicode, you can participate.

What's not going to change

Some things are architecturally fixed:

  • The code point range: U+0000–U+10FFFF. (Determined by UTF-16's capacity.)
  • The encoding triad: UTF-8, UTF-16, UTF-32.
  • The surrogate pair mechanism in UTF-16 (because removing it would break every UCS-2/UTF-16 system).
  • The reserved range U+D800–U+DFFF remaining unassigned (used by surrogates).

There is no credible path toward a Unicode beyond U+10FFFF — nor is one needed. Of the roughly 1.1 million code points, only ~155,000 are assigned; about 955,000 remain available, comfortably enough for every plausible future writing-system addition.

Keeping up with new versions

Practically: you probably don't need to. If you're using a modern language whose standard library tracks Unicode, you get upgrades for free when you upgrade the runtime. Python 3.15 ships with Unicode 16.0 data; Node.js tracks the latest ICU.

If you care about emoji specifically, a library like emoji-regex publishes updates within weeks of each Unicode release.

If you care about less-common properties (new scripts, new CJK extensions), you may need to compile against the latest ICU. ICU's release cadence lags Unicode by a few months.

The one place you do need to keep up: your font stack. New emoji added in Unicode 16.0 look like tofu until Apple, Google, Microsoft, and the free font projects (Noto Emoji, Twemoji) ship their glyphs; the timing varies by vendor, typically 6–18 months after the Unicode release.

The tension

Unicode's evolution has a built-in tension: it aims to be universal (every writing system, every symbol people want) while also being stable (no breaking changes, ever). As the standard grows, maintaining stability requires compromises — keeping ligature code points, keeping cruft, keeping security-hostile characters.

Every programmer who works with Unicode long enough starts to see it not as a character set but as a negotiated settlement — an international treaty with an API. That's exactly what it is. Remarkably, it works.

Next, we point you at the best further reading, tooling, and references for continuing the journey.

Further Reading

This book is a field guide. To go deeper, the following resources are the ones we recommend — with brief commentary on what each is best at.

The Unicode Standard itself

The Unicode Standard, Version 16.0. Published by the Unicode Consortium; freely available at unicode.org/versions/Unicode16.0.0/.

The core standard is about 1,100 pages; the full set with annexes is thousands more. You will not sit down and read it cover to cover. But when you have a specific question — "what does UAX #14 say about line-break opportunities around hyphens?" — this is the authoritative source, and it is extremely well-written for a technical standard. Read it like a reference.

The key Unicode Technical Reports:

  • UAX #9 — Bidirectional Algorithm: how text with mixed left-to-right and right-to-left scripts is laid out.
  • UAX #14 — Line Breaking: where a line can be broken for wrapping.
  • UAX #15 — Normalization Forms: NFC, NFD, NFKC, NFKD in precise detail.
  • UAX #24 — Script Property: the Script property and its values.
  • UAX #29 — Text Segmentation: grapheme clusters, word boundaries, sentence boundaries.
  • UAX #31 — Unicode Identifier and Pattern Syntax: identifier rules.
  • UAX #38 — Unicode Han Database: CJK-specific properties.
  • UAX #44 — Unicode Character Database: the UCD files we discussed in Chapter 13.
  • UTS #10 — Unicode Collation Algorithm: sorting rules.
  • UTS #39 — Unicode Security Mechanisms: confusables, restriction levels, mixed-script detection.
  • UTS #46 — Unicode IDNA Compatibility Processing: the modern IDN algorithm.
  • UTS #51 — Unicode Emoji: the emoji-specific rules, including ZWJ sequences and Fitzpatrick modifiers.

The labels UAX ("Unicode Standard Annex"), UTS ("Unicode Technical Standard"), and UTR ("Unicode Technical Report") mark three different levels of normativity. A UAX is part of the standard proper; a UTS is an independently normative standard; a UTR is informational. In practice, serious implementations treat all three as must-read references.

Introductory essays

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel Spolsky, 2003. joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.

The foundational essay on this topic. Twenty-plus years old and still correct in most of its core claims. Spolsky's "plain text does not exist" framing is the seed that this entire book grew from.

Breaking our Latin-1 assumptions, Manish Goregaokar. A blog series walking through a selection of the hardest assumptions programmers make about text and where they break. A good post-Spolsky read for people who have moved past "encoding matters."

What every programmer absolutely, positively needs to know about encodings and character sets to work with text, David C. Zentgraf, kunststube.net/encoding/. A very practical companion piece to Spolsky.

Programming with Unicode, Victor Stinner. Detailed Python-centric treatment. Exists at unicodebook.readthedocs.io.

The W3C internationalization articles

The W3C's i18n working group publishes a large set of short, focused articles on Unicode as it meets the web. Start at w3.org/International/articles/ and browse. Topics we recommend:

  • Character encodings for beginners.
  • Setting the HTTP charset parameter.
  • Questions and answers about character sets.
  • Personal names around the world (a humbling read for anyone who has a first_name / last_name schema).

These are short. You can read most of them in under ten minutes each.

Essential libraries

ICU — International Components for Unicode. icu.unicode.org. The industrial-strength Unicode library. C/C++ core with bindings for most languages. If you need serious Unicode, you eventually end up here.

ICU4X. github.com/unicode-org/icu4x. The modern Rust-first rewrite of ICU, designed for smaller footprint and WASM use. Will eventually replace ICU in many contexts. Already usable.

CLDR — Common Locale Data Repository. cldr.unicode.org. Not a library but a data repository: the locale tailorings for collation, date/time formatting, number formatting, plural rules, and so on. Shipped with ICU; also available as raw JSON.

Per language:

  • Python: regex (grapheme + property regex), icu (PyICU), confusables, idna, grapheme.
  • JavaScript: built-in Intl is surprisingly capable; fill in gaps with emoji-regex, grapheme-splitter, punycode.
  • Java: built-in java.text.*, plus ICU4J for anything serious.
  • Go: golang.org/x/text for unicode/norm, collate, language, and others; rivo/uniseg for graphemes.
  • Rust: unicode-normalization, unicode-segmentation, unicode-properties, icu (ICU4X), idna, confusable_detection.
  • Swift: built-in.
  • C/C++: ICU.

Data tools

The Unicode Utilities at util.unicode.org/UnicodeJsps/. We mentioned these in Chapter 13; bookmark them.

Glyph browsers:

  • codepoints.net: every code point, with all of its properties, cross-linked and searchable.
  • shapecatcher.com: draw the glyph you're looking for; it suggests matching characters.

CLI tools:

  • uni (arp242/uni): fast CLI for character lookup.
  • hexyl: colored hex dumps that show UTF-8 structure.
  • rg (ripgrep) with --pcre2 for Unicode-property regex.

History and context

Unicode Explained, Jukka Korpela, O'Reilly, 2006. A dense, thorough guide from one of the most careful Unicode writers. Its core content has aged well, though it predates emoji entirely (pre-2010).

Fonts & Encodings, Yannis Haralambous, O'Reilly, 2007. The big book on fonts and their relationship with character encodings. More than you need, but if you need it, it's the best thing.

Strange Code: Esoteric Languages That Make Programming Fun Again, Ronald T. Kneusel — includes a chapter on languages built from unusual character choices, like Whitespace, which will make you think about what characters really "are" in a programming language.

The short, pointed reads

If you only ever read three more things after this book, let them be:

  1. Spolsky's essay, for the vibe.
  2. UAX #29 (Text Segmentation), for the grapheme cluster rules you will encounter more than any other.
  3. The Unicode Character Code Charts at unicode.org/charts/, to browse human writing the way you once browsed an atlas.

Closing

Unicode is a living project. It is one of the most successful and least-celebrated engineering achievements of the last fifty years. It turned the world's writing systems into something computers can carry, and it did so in a way that preserved history, handled politics, and survived contact with JavaScript.

If you close this book with the four-way distinction — grapheme cluster, code point, code unit, byte — and the habit of asking which one is this function counting? every time you touch a string, you will already write more correct Unicode code than most of your profession.

Thank you for reading.

Acknowledgments

This book owes its existence to Georgiy Treyvus, Product Manager at CloudStreet, who proposed it. His original scoping note — that the goal was "truly grokking what the hell is even going on with Unicode these days" — set the tone of the entire project and saved it from being either a tutorial or a rant.

Thanks also to every developer who ever filed a polite bug report about emoji rendering, mojibake in CSV exports, or a search function that couldn't find café when the user typed cafe. This book is, in the end, for you.