The Person Record
The exemplar used for the rest of the book is a single record describing a person. The values are fixed; every chapter that follows encodes exactly these values. The choice of values is not aesthetic. Each one was selected to land at a particular point in the encoding space of the formats under discussion, so that the differences between formats become visible in the bytes rather than hidden behind padding.
This chapter exists to fix the record once, in format-neutral terms, so that subsequent chapters can refer back to it without re-introduction. There is no wire tour here — this is the wire tour's input.
The fields
The record has six fields, declared in this order:
id          uint64            42
name        string            "Ada Lovelace"
email       optional string   "ada@analytical.engine"
birth_year  int32             1815
tags        list<string>      ["mathematician", "programmer"]
active      bool              true
id is a 64-bit unsigned integer. The value 42 was chosen because it
is small enough to fit in a single byte under any reasonable variable-
length encoding (Protobuf varint, MessagePack fixint, CBOR immediate),
and yet large enough that the formats which encode integers in fixed
8-byte words (BSON, XDR, Borsh, fixed-width bincode) will pay the full
eight-byte tax. Comparing the encoding of id across formats is the
clearest single demonstration that variable-length encoding is a real
trade and not an academic one.
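The contrast can be sketched in a few lines of Python. The varint helper below follows Protobuf's unsigned varint rule (7 bits per byte, high bit as continuation flag); the function name is mine:

```python
import struct

def varint(n: int) -> bytes:
    """Protobuf-style unsigned varint: 7 data bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

# 42 under a variable-length encoding: a single byte.
assert varint(42) == b"\x2a"

# 42 as a fixed little-endian 8-byte word: the full eight-byte tax.
assert struct.pack("<Q", 42) == b"\x2a\x00\x00\x00\x00\x00\x00\x00"
```

One byte versus eight, for the same value: the trade the paragraph above describes, made visible.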
name is the twelve-byte UTF-8 string Ada Lovelace. Twelve bytes is
short enough that length-prefixed formats can encode the prefix in a
single byte (or smaller), and long enough that the bytes themselves
dominate the per-field overhead. Every byte of the string is in the
ASCII range 0x20–0x7e; there are no multi-byte code points, no
ambiguous normalization questions, no zero-width joiners, no
right-to-left marks. This is deliberate. The book is not about
Unicode pitfalls; if it were, the exemplar would include them.
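A concrete instance of the single-byte prefix, built by hand from the MessagePack fixstr rule (strings up to 31 bytes take one header byte, 0xa0 | length):

```python
name = "Ada Lovelace".encode("utf-8")
assert len(name) == 12                              # twelve ASCII bytes
assert all(0x20 <= b <= 0x7E for b in name)         # no Unicode pitfalls

# MessagePack fixstr: a single header byte encodes type and length together.
msgpack_str = bytes([0xA0 | len(name)]) + name
assert msgpack_str[0] == 0xAC                       # 0xa0 | 12
assert len(msgpack_str) == 13                       # the payload dominates
```

One byte of overhead on twelve bytes of payload: the prefix is real but the string's own bytes dominate, as intended.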
email is an optional string. Its value, when present, is the
twenty-one-byte UTF-8 string ada@analytical.engine. The optionality
is the load-bearing part of the field. Formats differ enormously in
how they represent "this field is present" versus "this field is
absent" versus "this field has its default value." Some formats
(MessagePack, CBOR, BSON, JSON-shaped formats) make the question
trivial: present means present, absent means the key isn't in the
map. Some (Protobuf 3) make it surprisingly hard, because the wire
format originally collapsed "absent" and "default" into the same
encoding for scalar types (strings included), and only later restored
explicit presence tracking via the optional keyword. Some (Avro)
require an
explicit union with null. Some (Borsh, SCALE, Postcard) have an
Option type with a discriminant byte. Some (NBT, the Apache
Cassandra family) cannot natively express "field is absent" at all
and require an in-band convention (empty string, sentinel value).
Each chapter encodes Person twice when the format's behavior
differs based on email's presence: once with the email and once
without.
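The discriminant-byte family is easy to sketch. The helper below follows the Borsh convention as I understand the spec (one tag byte, 0 for None and 1 for Some, then a little-endian u32 length and the UTF-8 bytes); the function name is mine:

```python
import struct

def borsh_option_string(value):
    """Borsh-style Option<String>: tag byte, then u32 LE length + UTF-8 payload."""
    if value is None:
        return b"\x00"
    data = value.encode("utf-8")
    return b"\x01" + struct.pack("<I", len(data)) + data

present = borsh_option_string("ada@analytical.engine")
absent = borsh_option_string(None)

assert present[:5] == b"\x01\x15\x00\x00\x00"   # Some, then length 21
assert len(present) == 1 + 4 + 21
assert absent == b"\x00"                        # None is one byte, full stop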
birth_year is a 32-bit signed integer with the value 1815. The
choice of 1815 over, say, 1985 is not about size: both fit in eleven
bits and therefore in two bytes of a varint encoding. The reason for
picking a historical year is to show that the variable-length
encoders treat it the same way as any other small positive integer;
nothing about being a date affects the bytes. Date semantics live in
a higher layer, and
the formats that have a date type (Ion, BSON, CBOR via tag) are
explicit about the choice; for a plain integer, the encoding is just
an integer. Birth year is stored as int32 rather than uint32
because the formats that distinguish signed and unsigned do so in
ways that matter (zigzag versus straight varint, in particular), and
making the canonical type signed exercises the more interesting code
path.
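The zigzag-versus-straight distinction can be made concrete. A sketch following the Protobuf sint32 mapping (function names are mine):

```python
def varint(n):
    """Unsigned varint: 7 data bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def zigzag32(n):
    """Protobuf sint32 mapping: small magnitudes of either sign stay small."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

# Straight varint of 1815: two bytes.
assert varint(1815) == b"\x97\x0e"
# Zigzag maps 1815 -> 3630, which still fits in two bytes.
assert zigzag32(1815) == 3630
assert varint(zigzag32(1815)) == b"\xae\x1c"
# The payoff is negative values: -1815 zigzags to 3629 instead of a huge varint.
assert zigzag32(-1815) == 3629
```

For 1815 the two code paths cost the same; the signed declaration matters because it selects which path the format takes at all.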
tags is an ordered list of two short strings: mathematician (13
bytes) and programmer (10 bytes). The list has two elements, which
is small enough that all surveyed formats encode the count in a single
byte. The two elements have different lengths, which means the
flat "concatenation of equally-sized values" trick that some formats
support for scalar arrays is not applicable; the format must encode
each element's length individually. Strings rather than integers were
chosen for the list elements because string-of-strings is the case
that exercises the most format machinery: every variable-length
encoding rule applies, the encoder has to decide whether to share
length-prefix overhead, and the columnar formats (Parquet, ORC, Arrow)
have to encode dictionaries, definition levels, and offsets all at
once. An array of integers would have demonstrated less.
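The per-element length encoding is visible in a hand-built CBOR sketch (definite-length array header 0x80 | count, text-string header 0x60 | length for short strings; the helper name is mine):

```python
def cbor_text(s):
    """CBOR text string, short form only: major type 3 with immediate length."""
    data = s.encode("utf-8")
    assert len(data) < 24, "sketch handles only immediate-length strings"
    return bytes([0x60 | len(data)]) + data

tags = ["mathematician", "programmer"]
encoded = bytes([0x80 | len(tags)]) + b"".join(cbor_text(t) for t in tags)

assert encoded[0] == 0x82                        # array of 2: count in one byte
assert encoded[1] == 0x6D                        # text string, 13 bytes
assert encoded[2 + 13] == 0x6A                   # text string, 10 bytes
assert len(encoded) == 1 + (1 + 13) + (1 + 10)   # 26 bytes total
```

Because the elements have different lengths, each one carries its own header; no shared-size shortcut applies.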
active is a boolean with the value true. Booleans are deceptively
varied across formats. Some encode them as a single dedicated byte
(MessagePack 0xc3, CBOR 0xf5, BSON 0x01, JSON token true). Some
fold them into the tag of an enclosing structure (Thrift Compact uses
type code 1 for true and 2 for false, with no payload byte at all).
Some encode them as a single bit in a packed bitmap (Apache Arrow's
validity bitmap is the obvious case, but boolean arrays use the same
structure). Some require explicit DER-style encoding where 0xff is
canonical for true (ASN.1). Comparing the boolean encoding across
formats is a microcosm of the format's overall philosophy.
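The single-byte cases above can be checked directly, and the JSON token falls out of the standard library:

```python
import json

# Single dedicated bytes, per the MessagePack and CBOR specs.
msgpack_true = b"\xc3"
cbor_true = b"\xf5"

# JSON spells the value out as a four-byte token.
assert json.dumps(True) == "true"
assert len(json.dumps(True).encode("utf-8")) == 4

# One value, three wire costs: 1, 1, and 4 bytes.
assert (len(msgpack_true), len(cbor_true)) == (1, 1)
```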
What the record is not
It is worth being explicit about what the exemplar omits, because several common features have been left out deliberately.
There are no nested records. A Person does not contain an Address, which does not contain a list of GeoCoordinates. The reason is not that nested records are uninteresting — they are very interesting, especially for the columnar formats — but that adding nesting would make the wire tours twice as long without illuminating axes that the flat record fails to illuminate. Where a format's nesting story is a material part of how it works (Cap'n Proto's pointer arithmetic, Parquet's repetition and definition levels, Arrow's struct columns), the chapter for that format includes a sidebar with a nested example.
There are no floating point fields. Floating point is its own subject: NaN bit patterns, denormals, signaling vs. quiet, IEEE 754 vs. implementation-specific extensions, and the question of whether the format preserves the exact bit pattern or only the value. Including a float field in the exemplar would force every chapter to address the question, and the question is mostly the same across formats (they all use IEEE 754 double-precision, with minor differences in how NaN is handled). Where a format does something interesting with floats, the chapter mentions it.
There are no maps with arbitrary keys. A tags list exercises
arrays; a metadata map with string keys would exercise the map
encoding. Most formats handle maps fine, but a few (Protobuf 2,
FlatBuffers without scaffolding) do not have native maps and require
a list-of-pairs convention. Including a map field would force a
discussion that is mostly orthogonal to the format's core design.
Each chapter notes the format's map story without encoding one.
There are no binary blob fields. Bytes-as-bytes is a fine field type in most formats, and most formats handle it sensibly. Including a blob would inflate the wire tours without showing anything not already shown by the string field.
There are no enums. Enums are interesting precisely because formats disagree about whether unknown values are forward-compatible (Protobuf 3 says yes; Thrift's older code generators said no), but the disagreement is small and well-understood. Discussion goes in the schema evolution chapter, not in every individual chapter.
There are no recursive types. A Person does not contain a list of Persons. Recursive types are beyond the scope of the exemplar; they appear in the FlatBuffers and Cap'n Proto chapters in passing.
There is no field marked deprecated, reserved, or removed. The field
list is what it is; the chapter on schema evolution simulates removing
email and adding a country field, but the canonical encoding shown
in each format chapter uses the current schema only.
Two encodings, when relevant
For formats whose behavior differs materially based on email being
present versus absent, both encodings are shown. The difference is
usually small — a missing tag, an absent map key, a null branch tag —
but the difference is exactly the part of the format that handles
optionality, and it is worth showing. For formats where the
difference is uninteresting (a key that's just not in the map; a
list element omitted), only the present-email case is shown and the
absent case is described in prose.
The full email value is ada@analytical.engine. The domain is
spurious; Ada Lovelace did not have an email address. The value was
chosen because it is exactly twenty-one bytes long, which produces
encodings that are short enough to fit on a single line in most
hex viewers and long enough to demonstrate the length-prefix
mechanism in formats that vary their prefix size based on the
encoded length.
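CBOR is one format with such a threshold: text-string lengths below 24 fit in the header byte itself, while 24 and above spill into a following length byte. A sketch (the helper name is mine, and it stops at one-byte lengths):

```python
def cbor_text_header(length):
    """CBOR text-string header: immediate below 24, one extra length byte to 255."""
    if length < 24:
        return bytes([0x60 | length])
    assert length < 256, "sketch stops at one-byte lengths"
    return bytes([0x78, length])

email = "ada@analytical.engine".encode("utf-8")
assert len(email) == 21
assert cbor_text_header(len(email)) == b"\x75"   # 21 fits in the header byte
assert cbor_text_header(24) == b"\x78\x18"       # 24 needs a second byte
```

At 21 bytes the email sits comfortably below the threshold, so the prefix stays at one byte while remaining long enough to be worth prefixing.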
Why this exemplar will sometimes feel cramped
A single Person record is small. Some of the formats in this book — Parquet, ORC, Arrow IPC, the streaming column formats — are designed to encode billions of records, and showing a single record's encoding in those formats is, frankly, ridiculous. The bytes you get are dominated by header overhead and metadata that would amortize away across a real workload.
Where this is the case, the chapter shows the encoding of a single record honestly — including all the overhead — and then, in a separate sidebar, shows the per-record cost projected over a million records. The single-record encoding is the apples-to-apples comparison for small payloads; the projected cost is the apples-to-apples comparison for the workloads the format was actually designed for. Both numbers are useful. Reporting only one of them is dishonest.
A note on byte order, alignment, and word size
Throughout the book, hex bytes are shown in network order — that is,
the order they appear in the wire — even when the format is internally
little-endian. This means that for a little-endian format like BSON,
the bytes 94 00 00 00 shown at the start of the document are the
literal bytes you would see in a hex viewer, and the value they
encode is the 32-bit little-endian integer 148. Where this might
surprise the reader, a parenthetical follows the hex.
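The BSON example can be checked with the standard library's struct module:

```python
import struct

# The first four bytes of a BSON document are its total length, little-endian.
wire = bytes.fromhex("94000000")
assert struct.unpack("<I", wire)[0] == 148

# Read big-endian, the same four bytes would mean something else entirely.
assert struct.unpack(">I", wire)[0] == 0x94000000
```

The bytes on the page are the bytes on the wire; the interpretation, not the order of display, is what the endianness changes.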
Alignment is annotated where it matters. FlatBuffers, Cap'n Proto, Apache Arrow, and SBE all have hard alignment requirements, and the hex tours for those formats include the padding bytes explicitly so that the alignment is visible.
Word size assumptions in this book follow the formats themselves: a "word" in Cap'n Proto is 8 bytes; CBOR has no word concept at all (its initial byte splits into a 3-bit major type and 5 bits of additional information, and otherwise bytes are bytes); and a "word" in conventional CPU parlance is ignored. When precision matters, the chapter specifies.
That is the record. Every byte tour in the rest of the book is an encoding of these six fields, in this order, with these values. When the bytes diverge across formats — and they diverge enormously — the divergence is not because the input changed. The input is fixed. The divergence is the design.