
Foreword

There is a recurring scene in the working life of a software engineer. Two systems need to exchange data. Someone has chosen a format. The choice is defended by a sentence, sometimes a phrase: we use Protobuf, it's all JSON, the analytics team wanted Parquet, Kafka uses Avro. The sentence is then treated as if it had ended the conversation. It has not ended the conversation. It has only deferred the parts of the conversation that turn out, six months or six years later, to be load-bearing.

This is a book about those parts.

The thesis is straightforward and, I think, slightly subversive. Binary serialization formats are not interchangeable, and the differences between them are not stylistic. Each format is a coherent answer to a question of the form: what do you want to optimize for, and what are you willing to give up in exchange? The formats look similar from a distance — they all turn structured values into bytes and back — but the moment you ask anything about cost, evolution, ambiguity, or correctness, they diverge violently. A choice that is fine for one system is malpractice for another. A property that one format treats as foundational, another treats as a bug. None of them is wrong about this. They are answering different questions.

You can read every binary format's specification as a contract. Like any contract, the interesting parts are not the parts that read smoothly. They are the clauses you wouldn't notice unless you were specifically looking for them. Field tags are stable across versions. Map ordering is unspecified. Float NaN bit patterns are not preserved. Unknown fields are dropped silently. The schema must be available to the reader at decode time. Any one of these clauses, taken alone, sounds reasonable. Each one is also a hand grenade with a fifteen-year fuse. Reading the spec is the difference between knowing this and being surprised by it.

The aim of this book is to teach that kind of reading. By the end, the chapter on Protobuf shouldn't make you a Protobuf expert. It should leave you able to pick up the spec for a format I haven't covered — there are always more formats — and locate, within an hour, the four or five clauses that determine what you would and would not bet a system on.

A word on what this book is not. It is not a benchmark shootout. There is an appendix on benchmark methodology, but its main argument is that almost every published benchmark is misleading, often through no fault of its authors, and that the speed of a serialization format is one of the least interesting things about it once you've narrowed the field to a handful of plausible candidates. Speed is a property of an implementation, not a format. Two CBOR libraries can differ from each other by a factor of ten. A bad Protobuf implementation will lose to a good MessagePack implementation even though Protobuf, in principle, produces smaller encodings. The interesting questions — will this format hurt me when I add a field, when I remove a field, when a producer and a consumer skew, when a language we hadn't planned for shows up, when the data must be read on disk thirty years from now — are not benchmark questions.

Nor is this book a recommendation engine. There is no chapter that ends with and so you should use X. There is one chapter on decision frameworks, which is mostly a list of questions to ask yourself in the right order, and a chapter on anti-patterns, which is mostly a list of decisions whose costs are more often paid than acknowledged. The honest answer to which format should I use is almost always that depends on six things you haven't told me about your system yet, and the goal here is to give you enough vocabulary to know what those six things are.

Finally, this book is not exhaustive, and I want to apologize in advance to the people whose favorite format I have either skipped or treated briefly. The formats that get full chapters were chosen because each one illustrates a distinct point on the design space. MessagePack and CBOR are both self-describing schemaless binary formats, and so they share a chapter worth of commentary, but they get separate chapters because the small differences between them are exactly the kind of small differences this book is meant to sensitize you to. FlatBuffers and Cap'n Proto are both zero-copy, but their answers to what zero-copy actually means are incompatible enough to be worth contrasting at length. ASN.1 is older than most readers and most authors, and is still the format that runs the cellular network you are reading this on, and so it gets a chapter even though almost nobody chooses it for new work. The omissions are defensible; the inclusions are deliberate.

The book is organized along seven axes that, between them, separate any two binary formats you are likely to compare. They are introduced properly in Chapter 2. Briefly: whether the format requires a schema; whether the encoded bytes describe themselves; whether records are laid out in rows or columns; whether the encoding can be read without parsing; whether bindings are generated at build time or constructed at runtime; whether the same value always encodes to the same bytes; and what kind of compatibility — forward, backward, both, neither — the format guarantees when the schema changes. Most formats take a clear position on each axis. The formats that try to take both positions on an axis at once tend to be the formats with the most exciting failure modes.

A note on the exemplar. Every format chapter encodes the same record: a person named Ada Lovelace, with an integer ID, an optional email, a birth year, a couple of tags, and an active flag. The values were chosen to be small enough to walk through byte by byte and varied enough to exercise the parts of each format that differ. The same record encoded in MessagePack is a hundred and four bytes; in BSON, about a hundred and fifty; in Protobuf, around seventy; in Avro with the schema available, under seventy; in CBOR, similar to MessagePack; in SBE, fixed and surprising. Lining these up, side by side, in the appendix is more pedagogically useful than any table of microbenchmark numbers. You can see the design in the bytes.

A note on voice. I have opinions, and rather than launder them through the passive voice I have stated them. Where I think a format's design is an excellent solution to its actual problem, I say so. Where I think a format is being used outside the problem it was designed to solve, I say that too. The opinions are mine; the formats are not on trial, and their authors have generally done excellent work. The point is to help you think about your own system, not to litigate someone else's fifteen-year-old design decisions.

A note on what to expect from each chapter. There is a fixed shape: a short history of who built the format and what they were trying to fix; an explanation of how the format thinks about the world, in its own terms, before any comparison is made; a wire tour, where the exemplar is encoded and every byte is annotated; the rules for schema evolution and version skew, because that is where the format's real costs hide; the state of the ecosystem, including the languages and tools and the gotchas that don't appear in the spec; my honest view of when to choose this format and when not to; and a single sentence at the end summarizing the whole thing, because I have found over the years that a one-sentence summary is the part you actually remember six months later when you need it.

The book is released into the public domain under CC0. If parts of it are useful to you in writing your own documentation, your own training material, or your own decision memos, copy them. If you find an error, the repository is on GitHub and it accepts pull requests; serializers have surface area, and surface area means I have certainly gotten something wrong.

Read the table of contents. Pick a part. The first three chapters build the vocabulary; after that, you can read in any order. The formats are old; the systems built on them will outlast most careers; the questions they answer are not going away. The least we can do is read the contracts before we sign them.

— D.L.

What Serialization Is, and Why "Binary" Is a Category

Serialization is the process of turning a value that exists in a program's memory — a struct, an object, a record, a tree — into a sequence of bytes, in such a way that the same value can later be reconstructed from those bytes by some other program, or by the same program at a later time, or even by the same program in the same instant after the bytes have made an unsafe trip through a buffer somewhere. Deserialization is the reverse. The pair is sometimes called marshaling and unmarshaling, sometimes encoding and decoding, sometimes pickling and unpickling, sometimes something language-specific and embarrassing. The names don't matter. What matters is that for the duration of the byte sequence's existence — its trip across a wire, its sojourn on a disk, its passage through a queue — the bytes are the value. There is no other authoritative representation. If they are wrong, the value is wrong. If they are ambiguous, the value is whatever the next reader decides it is.
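
The round trip in miniature, in Python's pickle, the module the word "pickling" comes from. A sketch of the two operations, nothing more:

    import pickle

    value = {"id": 42, "name": "Ada Lovelace"}
    wire = pickle.dumps(value)        # serialization: the value becomes bytes
    restored = pickle.loads(wire)     # deserialization: the bytes become a value again
    assert restored == value          # equal in content; a distinct object in memory

Between dumps and loads, wire is the only representation there is.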

This is the part that often gets glossed over. A serialization format is a contract between writers and readers about what bytes mean. Treat it as anything less and you will eventually pay for the misunderstanding.

Three operations, not two

Encoding and decoding are the two operations everyone thinks of, but there is a third that earns its keep: identification. Given an arbitrary sequence of bytes — say, the contents of a file, or a frame off a socket — is this even something I should attempt to decode? In practice the answer comes from outside the bytes (the file extension, the MIME type, the channel the bytes arrived on). But many formats also build in their own answer, in the form of a magic number, a version byte, a framing header. A four-byte PAR1 at the start of a Parquet file. A two-byte PK at the start of a ZIP archive. The three bytes 0xd9 0xd9 0xf7 introducing a self-described CBOR item. These bytes do no useful information-carrying work. They exist solely so that the wrong reader, encountering the wrong bytes, fails fast and visibly instead of producing garbage.
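
A sniffer for the file magics just mentioned is a few lines of Python; the MAGICS table and the sniff function are invented for illustration, not any library's API:

    MAGICS = {
        b"PAR1": "parquet",       # Parquet also ends with the same four bytes
        b"PK\x03\x04": "zip",     # the ZIP local-file-header signature
    }

    def sniff(head: bytes):
        for magic, name in MAGICS.items():
            if head.startswith(magic):
                return name
        return None   # unknown: fail fast instead of handing bytes to the wrong parser

    assert sniff(b"PAR1" + b"\x00" * 12) == "parquet"
    assert sniff(b'{"id": 42}') is None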

Most binary formats include some identification mechanism. Most text formats do not, because text is identified by the fact that it is text, which is itself a kind of identification — not a precise one, but enough to disqualify a binary blob without needing to read its first sixteen bytes. The asymmetry is one of the reasons "binary" is a useful category in the first place.

Three boundaries, not one

The same word — serialization — covers three different problems, and formats that excel at one of them often perform badly at the others.

The first is the process boundary. Two parts of the same running program, or two cooperating processes on the same machine, exchange a value. The bytes never leave the machine, often never leave a single shared memory segment. The relevant cost is encode and decode time, not size. The relevant question is whether the format can be made nearly free in the common case. Zero-copy formats (FlatBuffers, Cap'n Proto, rkyv) are designed for this boundary. So is the in-memory layout of Apache Arrow.

The second is the machine boundary. Bytes leave one host and arrive at another, possibly running on a different architecture, a different operating system, a different language runtime, or a different version of the schema. Now byte order matters. Floating point representation matters. The relative cost of a microsecond of CPU and a microsecond of network changes. The format must be portable. It usually must also be versionable, because the two ends are deployed on different schedules. This is the boundary that Protobuf, Thrift, Avro, MessagePack, CBOR, and most of the formats you have heard of were designed for. The language is "wire format" because the wire is the metaphor.

The third is the time boundary. The bytes are written today and read in five years. The reader is not just on a different machine — it is a program that has not yet been written, by a person who has not yet been hired, in a language that may not yet exist. This is the boundary where archival formats (Parquet, ORC, Avro Object Container Files, ASN.1 BER) earn their pedigree. It is also the boundary where formats without a strong evolution story tend to fail catastrophically without warning. A bytes-on-disk format that does not embed enough information to be decoded twenty years later by a stranger is a time bomb. The question to ask is not can I read this today but who can read this when I cannot.

A format can be excellent for one boundary and disqualifying for another. FlatBuffers is exquisite for the process boundary and serviceable for the machine boundary, but I would not choose it for the time boundary unless I was prepared to also commit to archiving the schema and a working compiler for the rest of my life. CBOR is fine for any of the three but optimal for none. Knowing which boundary you are actually crossing is the first decision; most arguments about formats are people debating across different boundaries without realizing it.

What "binary" actually means

The word binary is doing a lot of work, and it does not always do it honestly. Let's get the easy mistakes out of the way.

Binary does not mean compressed. Many binary formats are not compressed in any nontrivial sense. Protobuf is dense but not compressed; you can usually shrink a Protobuf message further by gzipping it. FlatBuffers is barely compressed at all — it intentionally includes padding so that fields can be addressed by offset. Conversely, text formats can be compressed: gzipped JSON is smaller than ungzipped CBOR for almost any realistic payload. Compression is orthogonal to the binary/text distinction.

Binary does not mean fast. A poorly written binary parser will lose to a well-tuned JSON parser. The fastest JSON parser I am aware of can exceed a gigabyte per second on commodity hardware; few binary parsers clear that bar. Speed is a property of an implementation, not a format. What a binary format gives you is a higher ceiling on speed, because decoding binary numbers is faster than parsing decimal strings, and because following a length prefix is faster than hunting for a delimiter. The ceiling is real. Reaching it is work.

Binary does not mean opaque. The bytes of a Protobuf message are fully understandable to anyone with the schema. Even without the schema, the wire format is regular enough that you can usually reconstruct most of it. Security through obscurity and binary encoding are not the same thing, and conflating them has produced many bad architectural decisions.

Binary does not mean typed. Some binary formats (FlatBuffers, SBE) encode strict types in the bytes. Some binary formats (MessagePack, CBOR) encode type tags but are otherwise schemaless. Some binary formats (XDR, raw Go gob) lean on an external schema. JSON, despite being text, has a richer type system than some self-describing binary formats — though a more limited one than many.

So what does binary mean? Operationally, it means: the format is not designed to be readable by a human without tools. Bytes outside the printable ASCII range appear deliberately. The output of cat is gibberish. grep and sed cannot be trusted on the payload. A junior engineer cannot tell at a glance whether the encoded value is right. To inspect or modify the bytes, you reach for a parser, a hex viewer, or a format-specific debugging tool. That is the category. Everything else — speed, density, schema requirements, evolution strategy — varies enormously within the category.

This framing makes the choice between binary and text more interesting, because it foregrounds the actual cost. Choosing binary means choosing to always carry the parser around. Always: at debug time, at log inspection time, at "we have a production incident at 3 a.m. and a junior engineer needs to understand this" time. The parser has to exist for every language and tool that needs to read the bytes. The parser has to be available offline, in environments without dependencies, on hardware that may not have the bandwidth to download a 30-megabyte SDK. These costs are real. Sometimes they are worth paying. Sometimes they are not. The point is that they are paid, and a format that does not acknowledge them is selling you only half of the trade.

Where binary earns its place

Binary formats are not a default. They earn their place in a system, and the cases where they earn it are surprisingly narrow once you write them out.

The clearest case is density at scale. If you are storing a trillion records, the difference between forty bytes per record and two hundred bytes per record is the difference between a $X storage bill and a $5X storage bill, and the cost of the parser is amortized to zero. Most analytical workloads — data lakes, time series databases, message queues with retention — are here. Parquet won this category not because it is fast (it is, but that is incidental) but because column-oriented binary encoding makes the storage bill go down by an order of magnitude.

The second case is type fidelity. A 64-bit unsigned integer is a distinct value from the string "18446744073709551615", and many text formats lose this distinction in transit. JSON has a famous problem with integers above 2^53. CSV has no concept of types at all. If your data includes nontrivial numeric values, NaN-bearing floats, binary blobs, or precise timestamps, a text format will, at some point, quietly mangle one of them. A binary format with a type system encodes the value in its full precision. This is not a speed argument. It is a correctness argument.
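
A concrete case in Python, with the msgpack package standing in for the binary side; JSON's data model simply has no bytes type:

    import json
    import msgpack

    payload = b"\x00\xff\x10"             # three raw bytes, not text
    try:
        json.dumps(payload)               # JSON cannot say "these are bytes"
    except TypeError:
        pass                              # the usual workaround, base64, is a different value
    assert msgpack.unpackb(msgpack.packb(payload)) == payload   # exact round-trip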

The third case is bandwidth-constrained or latency-constrained links. Embedded systems, satellite links, mobile radio protocols, financial market feeds. Here the cost of a byte is high, and the cost of CPU is fixed in advance because the hardware is fixed. ASN.1 PER, SBE, FlatBuffers, and the various format families used in telecom and finance live in this case. Most readers of this book will never work in this case. The handful who do tend to know it already.

The fourth case is read-heavy in-memory access. If a value will be read many times for every time it is written, and if reads happen in performance-sensitive code, a format that allows access without parsing is a real win. This is the FlatBuffers and Cap'n Proto and rkyv pitch. The bytes on disk and the bytes the program reads are the same bytes; there is no decode step. This is not free — it constrains the format heavily — but where it applies, it applies decisively.

What happens when the bytes are wrong

A surprising amount of a format's character is revealed by what it does when given bytes that aren't quite right. Pure speed and pure size are easy to compare. Failure semantics are not, and they are where the formats differ most consequentially in production.

Consider four kinds of "wrong." There are bytes that are truncated: the writer was interrupted, the network dropped a packet, the file was copied incompletely. There are bytes that are corrupted: a single bit flipped on disk, a buffer was overwritten, a protocol stack stripped or added something. There are bytes that are valid encodings of unexpected schemas: a producer was upgraded before a consumer, an unknown field appeared, an enum value turned up that nobody had heard of. And there are bytes that are adversarial: a malicious or careless party constructed a payload designed to confuse the parser, allocate too much memory, recurse too deeply, or trigger a parser bug.

Different formats handle these four cases very differently, and the choice is rarely advertised on the front page of the spec. Length-prefixed formats fail loudly on truncation: the prefix says ten bytes, you got six, the parser stops. Delimited formats may not notice truncation at all if the truncation happens at a record boundary; streaming Protobuf-over-HTTP has had years of subtle bugs caused by exactly this. Self-describing formats with type tags catch some kinds of corruption — the tag byte says "string" but what follows isn't valid UTF-8, so something is wrong — but they catch single-bit flips in the middle of a string only if a higher layer is doing checksumming. Schema-required formats can fail in surprising ways when schemas drift: Protobuf 2 required you to declare which fields were required, and decoders were obligated to fail if they were missing, which is why Protobuf 3 removed the concept; in Avro, a field the reader expects but the writer never wrote fails decoding outright unless the reader's schema declares a default; in MessagePack, an unknown field is just a key in a map and is silently passed through.
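
To make the truncation case concrete, here is a toy length-prefixed frame in Python; the framing is invented for illustration, not any real format's layout:

    import struct

    def read_frame(buf):
        (length,) = struct.unpack_from("<I", buf, 0)   # four-byte length prefix
        body = buf[4 : 4 + length]
        if len(body) != length:
            raise ValueError(f"truncated: prefix promises {length} bytes, got {len(body)}")
        return body

    frame = struct.pack("<I", 10) + b"0123456789"
    assert read_frame(frame) == b"0123456789"
    try:
        read_frame(frame[:-4])        # the prefix still promises ten bytes
        assert False, "should have failed loudly"
    except ValueError:
        pass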

Each of these is a defensible position. Each is also a position: your application is going to inherit it whether you noticed or not. A format that silently drops fields it doesn't recognize is one that will allow your producers to add fields without coordination, and one in which a typo in a field name will silently throw your data away. A format that hard-fails on unknown fields will catch the typo, and will also turn every schema deployment into a coordinated dance. Neither is wrong; neither is free.

Adversarial input deserves special mention. Most binary formats were designed in environments where the writer and reader were trusted peers, and as a result almost every format has had a memory exhaustion bug — a length prefix claiming a four-gigabyte string, a recursion limit easy to defeat, a tag that triggers an unbounded loop in a specific implementation. The formats that have been deployed in hostile environments for decades (TLS's wire format, ASN.1 in the cellular network, the various IETF protocols) have hardened over time, but every format that grows a public attack surface for the first time grows a CVE list along with it. If your binary format is about to be exposed to untrusted input for the first time, treat the parser as security-critical code and budget accordingly.
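
The adversarial case, continuing the same toy: an eight-byte payload whose prefix claims four gigabytes. The defense is to cap the claim and allocate only what actually arrived; the limit below is an application policy invented for this sketch, not part of any spec:

    import struct

    MAX_FRAME = 1 << 20   # 1 MiB cap, chosen by the application

    def read_frame_defensively(buf):
        (length,) = struct.unpack_from("<I", buf, 0)
        if length > MAX_FRAME:
            raise ValueError(f"refusing frame with claimed length {length}")
        body = buf[4 : 4 + length]
        if len(body) != length:
            raise ValueError("truncated")
        return body

    hostile = struct.pack("<I", 4 * 1024**3 - 1) + b"tiny"   # claims ~4 GiB, carries 4 bytes
    try:
        read_frame_defensively(hostile)
        assert False
    except ValueError:
        pass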

Where binary is a mistake

The mistakes are also worth saying out loud.

Configuration files, debugging interchange, logs, and any payload likely to be read by a human in the course of operating the system should not be binary. Yes, you could encode your config in CBOR and write a tool to render it. No, you should not. The ability to cat a config file and read it unmodified is enormously valuable and irreplaceable, and any "savings" from encoding it densely are imaginary because configs are not on the hot path of anything.

Public APIs to undifferentiated clients should generally not be binary, even when the request volume is high enough to justify it. The cost imposed on every consumer of writing a parser will, on average, exceed the savings to the server. Public APIs that already have a sophisticated client ecosystem (gRPC) are an exception, and they exist because the ecosystem solved the parser-distribution problem. Without that ecosystem, demanding that every caller speak your binary protocol is a tax on the world to subsidize your bandwidth bill.

Anywhere a junior engineer needs to be productive without ramping up, text is usually the right default, and the right binary format is one that has a good text projection (Protobuf's text format, Avro's JSON encoding, MessagePack's relationship to JSON) so that the binary representation is visible when wanted and dense when shipped.

The point of this chapter is not to argue you out of using binary formats. The book would be very short. The point is that binary is a category with costs you can see if you bother to look, and the formats inside the category differ from each other along enough axes that the choice between two binary formats can be larger than the choice between binary and text. The next chapter is the map.

The Axes

You cannot make a useful decision between two formats without first agreeing on the dimensions along which they differ. The mistake almost everyone makes the first time they have to choose a format is to pick one or two dimensions — usually speed and size — and try to rank candidates on those alone. The choice that results is usually defensible enough to ship, and usually wrong in some way that takes two years to surface.

This chapter is a map of the dimensions that actually matter. There are seven of them. Each is more or less independent of the others, which means the space of meaningfully distinct formats is large, and also that a format which takes a position on one axis is not constrained on any other. A schemaless format can be deterministic or not. A schema-first format can be row-oriented or columnar. A zero-copy format can use codegen or runtime reflection (though in practice they all use codegen, for reasons we'll get to). The seven axes are not orthogonal in the strict mathematical sense — some combinations are easier to engineer than others — but they are independent enough that thinking of them separately is more useful than not.

For each axis, the question to ask of a format is the same: what position does this format take, and what are the consequences?

Axis 1: schema-required vs. schemaless

A schema is a description, external to the bytes, of what structure those bytes are supposed to have. Schemaless formats do not require one. The bytes carry enough type and structural information to be decoded into a generic representation — a map, a list, a number, a string — without knowing anything in advance.

JSON is the textbook schemaless format, and the binary formats most similar to it in spirit are MessagePack, CBOR, BSON, and Smile. You can decode any well-formed payload in any of these into a tree of generic values, and only then ask: does this look like the thing I expected? The schema, if there is one, lives in your application code and is enforced at the boundary between the generic tree and your domain types. The format itself does not care.

Schema-required formats are the opposite. The bytes do not contain enough structural information to be decoded without consulting the schema. Protobuf, Thrift, FlatBuffers, Cap'n Proto, SBE, and Avro are all schema-required. Without the .proto file, a Protobuf message is a sequence of integer-tagged fields whose meaning — is field 5 a string or a uint32? — cannot be determined from the bytes alone. The wire format encodes a hint at the type (length-delimited, varint, fixed64), but multiple types share each hint, so the schema is not optional.

Avro is the interesting hybrid. Its wire encoding is so compact that, taken in isolation, an Avro payload is meaningless. Field ordering, length, and type are all dictated by the schema. But Avro is almost always paired with a mechanism for distributing the schema alongside the bytes — Object Container Files include the schema in their header, and Confluent Schema Registry indexes the schema by ID and ships the ID inline with each message. Avro is therefore schema-required from the bytes perspective and self-describing from the protocol perspective. The distinction comes up later when we discuss the second axis.

The cost of schema-required formats is operational: the schema must be present everywhere it is needed. The benefit is density and type fidelity. The cost of schemaless formats is wire size and ambiguity: every payload re-states its own structure, and the application is responsible for catching the cases where producer and consumer disagree about what that structure should mean. The benefit is flexibility, especially in environments where the schema is hard to ship in advance — public APIs, polyglot systems, exploratory work.

A common confusion is to claim a particular format can be used "without a schema" by relying on the runtime's reflection facilities. Protobuf has DynamicMessage; Thrift has its Protocol interface; FlatBuffers has reflection via .bfbs. These are not schemaless modes. They are schema-required modes in which the schema is loaded at runtime instead of compile time. The schema is still mandatory. The difference is whether you build it into your binary or ship it alongside.

Axis 2: self-describing vs. external schema

This axis is closely related to the first but not identical, and the distinction is worth pulling apart because confusing them is a common source of architectural error.

A format is self-describing if a payload, taken alone, contains enough information to be decoded. JSON, MessagePack, CBOR, and BSON are all self-describing: every value is preceded by a tag declaring its type. A format requires external schema if you need information that is not in the bytes to decode them. Protobuf, FlatBuffers, Cap'n Proto, SBE, and naked Avro are all external-schema formats.

The interesting cell of the two-by-two table is self-describing, schema-required. This is what an Avro Object Container File is: the file requires a schema to interpret, and it includes that schema in its own header. The bytes of the file are self-contained in the sense that an Avro reader can decode them without consulting any external resource. They are schema-required in the sense that the schema must be present for anything to happen.

This combination is important for the time boundary. A format that is schema-required and not self-describing produces bytes that are only as durable as the schema's availability. If the schema is in a registry that goes dark in 2030, the bytes written in 2024 become uninterpretable. If the schema is in a Git repository that gets deleted, same problem. Avro's container format and Parquet's embedded schema metadata both exist because the people who designed those formats understood that the schema must be archived with the bytes if the bytes are to outlive the producer system.

Self-describing formats also pay a cost: the type tags inflate the wire size, and the receiver has to check at every step whether the type it got is the type it expected. External-schema formats can elide all of that, because the schema has already told them what to expect. The cost is paid in operational complexity instead.

The decision rule worth internalizing: self-describing formats are the right default unless you have a specific reason to pay for the density of an external-schema format. The reasons are real — mostly network bandwidth at scale, or zero-copy access — but they are fewer than people think.

Axis 3: row-oriented vs. columnar

The values of a record can be laid out in bytes in two fundamentally different ways. Row-oriented formats keep all of a record's fields together. To find the next record's email, you skip past the current record's id, name, email, birth_year, tags, and active. Columnar formats keep all values of one field together. To read every record's email, you scan a contiguous block of strings; to read every record's id, you scan a contiguous block of integers.
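
Sketched as plain Python structures, not as any format's actual encoding, the two layouts of the same three records:

    rows = [
        {"id": 1, "birth_year": 1815},
        {"id": 2, "birth_year": 1912},
        {"id": 3, "birth_year": 1906},
    ]
    columns = {
        "id": [1, 2, 3],
        "birth_year": [1815, 1912, 1906],
    }

    # Reading every birth_year from the row layout touches every record:
    years_from_rows = [r["birth_year"] for r in rows]
    # The columnar layout hands you the whole column as one contiguous run,
    # which is what makes delta and dictionary encoding practical:
    years_from_columns = columns["birth_year"]
    assert years_from_rows == years_from_columns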

Almost every format in this book is row-oriented. The columnar exceptions — Apache Arrow IPC, Parquet, ORC, Feather — exist because for analytical workloads, columnar layout is not slightly better but dramatically better.

The reason is twofold. First, columnar layouts compress spectacularly. A column of timestamps has enormous redundancy that disappears when you delta-encode adjacent values; a column of booleans is a bitmap; a column of small integers can be dictionary-encoded down to a few bits per value. Mixing those columns into rows ruins all of these compression opportunities. Second, analytical queries usually touch a few columns out of many, and a columnar layout lets you read only the bytes you care about. A query that selects id and birth_year from a billion-record dataset reads two columns; the row-oriented equivalent reads the whole table.

Columnar formats lose, badly, on transactional access patterns. Reading a single record means scattered reads to every column. Writing a single record means appending to every column buffer. The CRUD-style access pattern that operational systems live on is exactly the pattern columnar formats are bad at.

Most systems that mix the two end up with a row-oriented operational store and a columnar analytical store, with a transcoding pipeline between them. Arrow's specific contribution is to be an in-memory representation that shares its columnar layout with the analytical store, so the transcoding can be elided in many cases.

Axis 4: zero-copy vs. parse

Most formats require a parse step: bytes go in, an in-memory object graph comes out. The parse copies, allocates, validates, and constructs. For most workloads this is fine, and the cost is dominated by other things.

A few formats are designed so that the in-memory representation is the byte representation. The bytes are aligned, padded, and laid out so that the program can read fields directly from the buffer with offset arithmetic, without any decode step. FlatBuffers, Cap'n Proto, and rkyv are the canonical examples. Apache Arrow IPC is in this family for analytical workloads.
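
The mechanism, reduced to a Python sketch with an invented layout; the real zero-copy formats add vtables or pointers on top of exactly this move:

    import struct

    # Invented layout: id (uint64) at offset 0, birth_year (uint32) at offset 8,
    # four bytes of tail padding to keep the record 8-byte aligned.
    buf = struct.pack("<QI4x", 42, 1815)

    (record_id,) = struct.unpack_from("<Q", buf, 0)    # no parse, no allocation:
    (birth_year,) = struct.unpack_from("<I", buf, 8)   # just offset arithmetic
    assert (record_id, birth_year) == (42, 1815)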

The benefits are substantial. There is no decode time and no allocation; an mmap'd file becomes an addressable data structure; random access into a large message reads only the touched bytes.

The costs are also substantial. The format must use fixed-size fields or pointers, which means padding and alignment requirements that inflate the encoded size. Variable-length data sits at the end of the buffer and is referenced by offsets. The schema cannot freely change the layout of existing types without breaking compatibility — adding a field is fine, reordering fields is not. The format constrains how fields can refer to each other (no cycles, typically). And the abstractions in your language tend to be generated accessor objects rather than native structs, because the fields you read are not necessarily where the language would put them.

Zero-copy is the right answer when reads dominate writes by orders of magnitude, when latency matters per-read, and when the data is large enough that copying it would itself be the bottleneck. It is the wrong answer when the format will be read once and discarded, when wire size matters more than read latency, or when ergonomics in the host language matter more than microsecond-scale access time.

Axis 5: codegen vs. runtime

Schema-required formats need bindings — the code that reads and writes the schema's types in your language. The bindings can be generated at build time or constructed at runtime. The choice has big consequences for ergonomics and build complexity, and it is often invisible until you try to do something the format authors didn't anticipate.

Codegen produces source files (or class files, or whatever your language uses) that are compiled into your program. The generated types know their schema at compile time, can be checked by the type system, and are usually fast because the access paths are specialized. The cost is a build step. Adding a field means re-running the codegen and recompiling. Cross-language workflows mean running the codegen for every language. CI pipelines acquire opinions about which version of the codegen tool is the canonical one, and disagreements between developers' local installs become mysterious diff churn.

Runtime bindings parse the schema at startup, build an in-memory representation of it, and then either expose a generic accessor (record.get("name")) or use the host language's reflection to build typed wrappers on the fly. The build is simpler. Adding a field means updating the schema and restarting. But the access path goes through a hash lookup or a dispatch table, the type system cannot help you, and certain optimizations the codegen path enjoys are unreachable.

Many formats support both modes. Protobuf has DynamicMessage, Thrift has dynamic protocols, Avro has GenericRecord, JSON Schema validators are by definition runtime. FlatBuffers and Cap'n Proto strongly favor codegen because the zero-copy invariants are easier to enforce when the accessors are generated.

The interesting case is the tooling that lives on top of a format. A schema registry, a wire-format viewer, or a CDC pipeline almost always needs the runtime mode, because those tools want to handle arbitrary schemas they didn't know about at build time. If you are writing such a tool, the format's runtime story matters more than its codegen story. If you are writing an application server, the codegen story usually wins.

Axis 6: determinism

Given the same value, will the format always produce the same bytes? Sometimes yes. Often no. Surprisingly often, the answer is yes for most values, no for some, and the exceptions are the cases where your hash function will quietly start producing different digests for objects you considered equal.

Three positions exist on this axis.

Deterministic by spec. Encoding the same value always produces the same bytes, and the spec mandates this. ASN.1 DER is the most prominent example; it was designed for cryptographic uses where deterministic encoding is essential. Borsh, used in the Solana ecosystem, is deterministic by construction. SBE is deterministic because every field has a fixed offset.

Canonical form available, non-canonical forms also valid. The format permits multiple encodings of the same value, but defines a canonical subset that is deterministic. CBOR has a canonical encoding (deterministic encoding rules in the spec); Protobuf has a "deterministic serialization" mode that is a best-effort, not a guarantee, and explicitly not canonical across implementations or versions. JSON has a long, sad history of canonicalization proposals.

Non-deterministic. The format makes no guarantees, and in practice the encoding will differ between languages, library versions, and sometimes between runs of the same program. Map ordering varies, varint widths can differ, optional padding may or may not appear. Most schemaless formats fall here unless you opt into a canonical mode.

The reason determinism matters is that equality of bytes is a useful primitive. Hashing requires it. Digital signatures require it. Content-addressable storage requires it. Deduplication requires it. Caching keyed on serialized values requires it. If your system will ever do any of these things, the format's stance on determinism is a question you have to answer up front. Deciding to retrofit canonical encoding onto a system that has been writing non-canonical bytes for two years is, generally, a project.
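
A demonstration with Python's json, since the problem is format-agnostic: same value, different bytes, different digests, until you opt into a canonical form:

    import hashlib
    import json

    a = json.dumps({"id": 42, "name": "Ada"})
    b = json.dumps({"name": "Ada", "id": 42})
    assert a != b                         # same value, different bytes
    assert hashlib.sha256(a.encode()).digest() != hashlib.sha256(b.encode()).digest()

    def canonical(value):
        # One canonicalization choice among many: sorted keys, no whitespace.
        return json.dumps(value, sort_keys=True, separators=(",", ":"))

    assert canonical({"id": 42, "name": "Ada"}) == canonical({"name": "Ada", "id": 42})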

Axis 7: evolution strategy

How does the format handle the situation where the schema changes, producers and consumers run different versions, and you do not get to upgrade them all at once? This is the axis that, in my experience, kills more formats in the wild than any other.

The major strategies are:

Tagged fields. Each field has a stable identifier (a number, sometimes a string) that is independent of its position or name. Adding a new field uses a new tag. Removing an old field leaves the tag retired but reserved. Producers and consumers that disagree on the schema can still parse the bytes, ignoring tags they don't recognize. Protobuf and Thrift work this way. The cost is a small per-field overhead and the discipline of never reusing a tag number. (A sketch of the skip mechanics follows this list.)

Schema resolution by reader/writer. Both the reader and writer have schemas, and the format defines rules for resolving differences between them. Avro is the canonical example: a reader's schema and a writer's schema are reconciled at decode time, with rules for field promotion, default values for missing fields, and aliases for renamed fields. The bytes themselves carry no per-field metadata — they rely on the writer's schema being available — but the resolution rules let producers and consumers diverge gracefully. The cost is that the writer's schema must travel with the bytes, or be available through some side channel like a registry.

Position-only. Fields are identified by their position in the record. Adding a field at the end is safe; adding one in the middle is not; reordering is a breaking change. XDR works this way. Borsh works this way. Most "just write the struct's bytes" formats work this way. The cost is that schema evolution is heavily constrained and easy to get wrong.

Hash-based. Each field is identified by a hash of its name and type. Renaming a field is a breaking change; changing its type is a breaking change. This is rare in mainstream formats but appears in some content-addressed systems.

None. Schemaless formats often have no evolution strategy because they have no schema; the application is responsible for handling unknown fields, missing fields, and type drift on its own. The format gives you maps and lists; what they mean is your problem.
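
The tagged-field sketch promised above, in Python. The framing of one tag byte and one length byte is invented; Protobuf's real wire format is varint-based, but the skip move is the same idea:

    def decode(buf, known_tags):
        fields, i = {}, 0
        while i < len(buf):
            tag, length = buf[i], buf[i + 1]
            payload = buf[i + 2 : i + 2 + length]
            if tag in known_tags:
                fields[known_tags[tag]] = payload
            # Unknown tag: the length lets us skip it without understanding it.
            i += 2 + length
        return fields

    writer_bytes = bytes([1, 1, 42]) + bytes([7, 2]) + b"hi"   # tag 7 is new
    old_reader = decode(writer_bytes, {1: "id"})               # knows only tag 1
    assert old_reader == {"id": bytes([42])}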

Evolution intersects with two further questions: what kinds of compatibility does the format guarantee, and what does it do when bytes don't match the expected schema. The first question yields a four-way classification — backward compatible, forward compatible, both, neither — which Avro tooling has made standard vocabulary even though the underlying ideas predate Avro by decades. The second question is the decode-failure question from Chapter 1, and the answer is again format-specific: silently drop, surface as an unknown, hard fail.

If a single axis deserves disproportionate attention when choosing a format, it is this one. Performance differences between mainstream formats are usually within an order of magnitude, and within an order of magnitude rarely decides anything. Schema evolution differences between mainstream formats can be the difference between a system that survives ten years of organic change and a system that becomes unmaintainable after eighteen months.

Reading the rest of the book through this lens

Each format chapter that follows takes a position on each of these seven axes. Sometimes the position is explicit and well-documented. Sometimes it is implicit, and the format's authors have not written it down because to them it was so obvious as to need no statement. Part of the job of those chapters is to make the position explicit, so that two formats can be compared on a level field.

When you finish a chapter and don't remember anything else, remember where on these seven axes the format sits. That is enough to know, in any new situation, whether the format is even a candidate.

The Person Record

The exemplar used for the rest of the book is a single record describing a person. The values are fixed; every chapter that follows encodes exactly these values. The choice of values is not aesthetic. Each one was selected to land at a particular point in the encoding space of the formats under discussion, so that the differences between formats become visible in the bytes rather than hidden behind padding.

This chapter exists to fix the record once, in format-neutral terms, so that subsequent chapters can refer back to it without re-introduction. There is no wire tour here — this is the wire tour's input.

The fields

The record has six fields, declared in this order:

id          uint64            42
name        string            "Ada Lovelace"
email       optional string   "ada@analytical.engine"
birth_year  int32             1815
tags        list<string>      ["mathematician", "programmer"]
active      bool              true

id is a 64-bit unsigned integer. The value 42 was chosen because it is small enough to fit in a single byte under any reasonable variable-length encoding (Protobuf varint, MessagePack fixint, CBOR immediate), while the formats that encode integers in fixed 8-byte words (BSON, XDR, Borsh, fixed-width bincode) still pay the full eight-byte tax for it. Comparing the encoding of id across formats is the clearest single demonstration that variable-length encoding is a real trade and not an academic one.
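
The trade in three assertions, with a hand-rolled varint of the base-128 family Protobuf uses; a sketch, not Protobuf's library code:

    import struct

    def varint(n):
        # Unsigned base-128 varint: seven payload bits per byte, high bit = "more".
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    assert varint(42) == b"\x2a"                            # one byte
    assert struct.pack("<Q", 42) == b"\x2a" + b"\x00" * 7   # eight bytes, seven of them zero
    assert varint(1815) == b"\x97\x0e"                      # birth_year, two bytes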

name is the twelve-byte UTF-8 string Ada Lovelace. Twelve bytes is short enough that length-prefixed formats can encode the prefix in a single byte (or fold it into the type byte entirely, as MessagePack's fixstr does), and long enough that the bytes themselves dominate the per-field overhead. Every byte of the string is in the ASCII range 0x20–0x7e; there are no multi-byte code points, no ambiguous normalization questions, no zero-width joiners, no right-to-left marks. This is deliberate. The book is not about Unicode pitfalls; if it were, the exemplar would include them.

email is an optional string. Its value, when present, is the twenty-one-byte UTF-8 string ada@analytical.engine. The optionality is the load-bearing part of the field. Formats differ enormously in how they represent "this field is present" versus "this field is absent" versus "this field has its default value." Some formats (MessagePack, CBOR, BSON, JSON-shaped formats) make the question trivial: present means present, absent means the key isn't in the map. Some (Protobuf 3) make it surprisingly hard, because the wire format originally collapsed "absent" and "default" into the same encoding for scalar types, and only later restored the distinction for strings via the optional keyword. Some (Avro) require an explicit union with null. Some (Borsh, SCALE, Postcard) have an Option type with a discriminant byte. Some (NBT, the Apache Cassandra family) cannot natively express "field is absent" at all and require an in-band convention (empty string, sentinel value). Each chapter encodes Person twice when the format's behavior differs based on email's presence: once with the email and once without.
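
With the msgpack package, the three states of email look like this; the same three shapes recur in every map-shaped format:

    import msgpack

    present = msgpack.packb({"email": "ada@analytical.engine"})
    explicit_null = msgpack.packb({"email": None})   # key present, value is nil (0xc0)
    absent = msgpack.packb({})                       # no key at all
    assert explicit_null != absent                   # the wire can tell these apart
    assert explicit_null == b"\x81\xa5email\xc0"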

birth_year is a 32-bit signed integer with the value 1815. The choice of 1815 over, say, 1985 matters less than it looks: both fit in eleven bits and therefore in two bytes of a varint encoding. The real point of picking a historical year is that the variable-length encoders treat it the same way as any other small positive integer; nothing about being a date affects the bytes. Date semantics live in a higher layer, and the formats that have a date type (Ion, BSON, CBOR via tag) are explicit about the choice; for a plain integer, the encoding is just an integer. Birth year is stored as int32 rather than uint32 because the formats that distinguish signed and unsigned do so in ways that matter (zigzag versus straight varint, in particular), and making the canonical type signed exercises the more interesting code path.

tags is an ordered list of two short strings: mathematician (13 bytes) and programmer (10 bytes). The list has two elements, which is small enough that all surveyed formats encode the count in a single byte. The two elements have different lengths, which means the flat "concatenation of equally-sized values" trick that some formats support for scalar arrays is not applicable; the format must encode each element's length individually. Strings rather than integers were chosen for the list elements because string-of-strings is the case that exercises the most format machinery: every variable-length encoding rule applies, the encoder has to decide whether to share length-prefix overhead, and the columnar formats (Parquet, ORC, Arrow) have to encode dictionaries, definition levels, and offsets all at once. An array of integers would have demonstrated less.

active is a boolean with the value true. Booleans are deceptively varied across formats. Some encode them as a single dedicated byte (MessagePack 0xc3, CBOR 0xf5, BSON 0x01, JSON token true). Some fold them into the tag of an enclosing structure (Thrift Compact uses type code 1 for true and 2 for false, with no payload byte at all). Some encode them as a single bit in a packed bitmap (Apache Arrow's validity bitmap is the obvious case, but boolean arrays use the same structure). Some require explicit DER-style encoding where 0xff is canonical for true (ASN.1). Comparing the boolean encoding across formats is a microcosm of the format's overall philosophy.

What the record is not

It is worth being explicit about what the exemplar omits, because several common features have been left out deliberately.

There are no nested records. A Person does not contain an Address, which does not contain a list of GeoCoordinates. The reason is not that nested records are uninteresting — they are very interesting, especially for the columnar formats — but that adding nesting would make the wire tours twice as long without illuminating axes that the flat record fails to illuminate. Where a format's nesting story is a material part of how it works (Cap'n Proto's pointer arithmetic, Parquet's repetition and definition levels, Arrow's struct columns), the chapter for that format includes a sidebar with a nested example.

There are no floating point fields. Floating point is its own subject: NaN bit patterns, denormals, signaling vs. quiet, IEEE 754 vs. implementation-specific extensions, and the question of whether the format preserves the exact bit pattern or only the value. Including a float field in the exemplar would force every chapter to address the question, and the question is mostly the same across formats (they all use IEEE 754 double-precision, with minor differences in how NaN is handled). Where a format does something interesting with floats, the chapter mentions it.

There are no maps with arbitrary keys. A tags list exercises arrays; a metadata map with string keys would exercise the map encoding. Most formats handle maps fine, but a few (Protobuf 2, FlatBuffers without scaffolding) do not have native maps and require a list-of-pairs convention. Including a map field would force a discussion that is mostly orthogonal to the format's core design. Each chapter notes the format's map story without encoding one.

There are no binary blob fields. Bytes-as-bytes is a fine field type in most formats, and most formats handle it sensibly. Including a blob would inflate the wire tours without showing anything not already shown by the string field.

There are no enums. Enums are interesting precisely because formats disagree about whether unknown values are forward-compatible (Protobuf 3 says yes; Thrift's older code generators said no), but the disagreement is small and well-understood. Discussion goes in the schema evolution chapter, not in every individual chapter.

There are no recursive types. A Person does not contain a list of Persons. Recursive types are beyond the scope of the exemplar; they appear in the FlatBuffers and Cap'n Proto chapters in passing.

There is no field marked deprecated, reserved, or removed. The field list is what it is; the chapter on schema evolution simulates removing email and adding a country field, but the canonical encoding shown in each format chapter uses the current schema only.

Two encodings, when relevant

For formats whose behavior differs materially based on email being present versus absent, both encodings are shown. The difference is usually small — a missing tag, an absent map key, a null branch tag — but the difference is exactly the part of the format that handles optionality, and it is worth showing. For formats where the difference is uninteresting (a key that's just not in the map; a list element omitted), only the present-email case is shown and the absent case is described in prose.

The full email value is ada@analytical.engine. The domain is spurious; Ada Lovelace did not have an email address. The value was chosen because it is exactly twenty-one bytes long, which produces encodings that are short enough to fit on a single line in most hex viewers and long enough to demonstrate the length-prefix mechanism in formats that vary their prefix size based on the encoded length.

Why this exemplar will sometimes feel cramped

A single Person record is small. Some of the formats in this book — Parquet, ORC, Arrow IPC, the streaming column formats — are designed to encode billions of records, and showing a single record's encoding in those formats is, frankly, ridiculous. The bytes you get are dominated by header overhead and metadata that would amortize away across a real workload.

Where this is the case, the chapter shows the encoding of a single record honestly — including all the overhead — and then, in a separate sidebar, shows the per-record cost projected over a million records. The single-record encoding is the apples-to-apples comparison for small payloads; the projected cost is the apples-to-apples comparison for the workloads the format was actually designed for. Both numbers are useful. Reporting only one of them is dishonest.

A note on byte order, alignment, and word size

Throughout the book, hex bytes are shown in wire order — that is, the order in which they appear on the wire or on disk — even when the format is internally little-endian. This means that for a little-endian format like BSON, the bytes 94 00 00 00 shown at the start of the document are the literal bytes you would see in a hex viewer, and the value they encode is the 32-bit little-endian integer 148. Where this might surprise the reader, a parenthetical follows the hex.
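
The decode of those four bytes, in Python:

    import struct

    (length,) = struct.unpack("<i", bytes.fromhex("94000000"))
    assert length == 148   # the BSON document's total length, including these four bytes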

Alignment is annotated where it matters. FlatBuffers, Cap'n Proto, Apache Arrow, and SBE all have hard alignment requirements, and the hex tours for those formats include the padding bytes explicitly so that the alignment is visible.

Word size assumptions in this book follow the formats themselves: a "word" in Cap'n Proto is 8 bytes, a "word" in CBOR's spec is 4 bits (but bytes are bytes), and a "word" in conventional CPU parlance is ignored. When precision matters, the chapter specifies.

That is the record. Every byte tour in the rest of the book is an encoding of these six fields, in this order, with these values. When the bytes diverge across formats — and they diverge enormously — the divergence is not because the input changed. The input is fixed. The divergence is the design.

MessagePack

MessagePack is the format you would design if someone asked you to make JSON smaller without changing what JSON is. That is approximately what Sadayuki Furuhashi did in 2008, and the result has had the kind of quiet, unglamorous success that the better engineering choices usually have: shipped widely, criticized rarely, replaced by nothing.

Origin

MessagePack came out of Furuhashi's infrastructure work: it was designed as the serialization format for kumofs, the distributed key-value store he was building at the time, and it later became the wire format of fluentd, his log aggregation daemon. The format needed to be smaller and faster to parse than JSON — data volumes were already large enough that JSON's overhead mattered — but the data model needed to remain compatible with JSON, because the upstream and downstream tools all spoke JSON and were not going to change. The design constraint was therefore not what is the best binary format but what is the smallest binary format that round-trips through JSON without losing anything. The answer, after some iteration, was MessagePack.

The format was published under an MIT-style license and a website at msgpack.org collected implementations contributed by the community. Implementations now exist in every language that anyone has bothered to write a serializer in: C, C++, Java, JavaScript, Python, Ruby, Go, Rust, Swift, Erlang, Elixir, Crystal, Zig, and roughly thirty others. The format is unchanged in any way that matters since the 2013 specification revision; the small flurry of activity around the "new spec vs. old spec" disagreement that year is now a footnote.

The format on its own terms

MessagePack has nine families of value, corresponding closely to JSON's six (null, bool, number, string, array, object) plus three additions (integer separated from float, raw binary, and an extension mechanism for application-defined types like timestamps). Every value on the wire is preceded by a single tag byte that identifies its type and, for short values, encodes the value or its length directly into the low bits of the tag.

The high-density tag space is the trick. Small integers from 0 to 127 encode as a single byte where the byte is the integer (positive fixint). Negative integers from -32 to -1 encode as a single byte where the low five bits are the value (negative fixint). Strings up to 31 bytes encode their length in the low five bits of a single prefix byte (fixstr). Arrays up to 15 elements use a single prefix byte for the count (fixarray). Maps up to 15 keys use a single prefix byte for the count (fixmap). For anything larger than these inline forms, MessagePack escalates to a multi-byte prefix that explicitly declares the type and width: 0xcc through 0xcf for unsigned integers of 8, 16, 32, 64 bits; 0xd0 through 0xd3 for signed; 0xd9 through 0xdb for strings whose length needs 8, 16, or 32 bits; 0xc4 through 0xc6 for raw binary; 0xdc and 0xdd for arrays; 0xde and 0xdf for maps; 0xca and 0xcb for IEEE 754 floats and doubles.
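
The tag space is easy to poke at from Python with the msgpack package; each assertion below is a spec fact, not an implementation quirk:

    import msgpack

    assert msgpack.packb(42) == b"\x2a"              # positive fixint: the tag is the value
    assert msgpack.packb(-5) == b"\xfb"              # negative fixint
    assert msgpack.packb("hi") == b"\xa2hi"          # fixstr: length in the low five bits
    assert msgpack.packb([1, 2]) == b"\x92\x01\x02"  # fixarray: count in the low four bits
    assert msgpack.packb(1000) == b"\xcd\x03\xe8"    # escalation to uint16 (0xcd)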

The result is that small values pay almost no overhead — a byte for the type and the value combined — and large values pay a few bytes of length prefix. A typical mixed payload encodes to around half the size of the equivalent JSON, sometimes less, depending on how white-space-heavy the JSON was. Compression on top of MessagePack yields further gains, but the format alone closes most of the gap.

The data model deliberately does not include Avro-style records, Protobuf-style fields-by-tag, or anything that requires a schema. A MessagePack map is a JSON object: keys are values (almost always strings, though MessagePack permits any value as a key), the order is not significant, and no field is "missing" — it is either present in the map or it is not. This is the same data model as JSON. The format is intentionally not richer than JSON.

There is one exception: the Extension type. An extension is a type-tagged binary blob, where the type code (0 to 127 for application-defined, -128 to -1 for spec-defined) tells the receiver how to interpret the bytes. The spec defines a single extension type itself, the timestamp (type code -1), which encodes 32 to 96 bits of seconds and nanoseconds. Everything else is left to the application. This is the seam through which higher-level type systems are bolted on, and for the most part the seam holds.

Wire tour

Encoding our Person record:

86                                           map of 6 entries
  a2 69 64                                   key "id"
  2a                                           value 42 (positive fixint)
  a4 6e 61 6d 65                             key "name"
  ac 41 64 61 20 4c 6f 76 65 6c 61 63 65     value "Ada Lovelace" (fixstr 12)
  a5 65 6d 61 69 6c                          key "email"
  b5 61 64 61 40 61 6e 61 6c 79 74 69 63
     61 6c 2e 65 6e 67 69 6e 65              value "ada@analytical.engine" (fixstr 21)
  aa 62 69 72 74 68 5f 79 65 61 72           key "birth_year"
  cd 07 17                                   value 1815 (uint16, big-endian)
  a4 74 61 67 73                             key "tags"
  92                                           array of 2 elements
    ad 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician" (fixstr 13)
    aa 70 72 6f 67 72 61 6d 6d 65 72            "programmer" (fixstr 10)
  a6 61 63 74 69 76 65                       key "active"
  c3                                           value true

The total is 104 bytes. The equivalent JSON, minified, is 133 bytes. The difference is not large, and that is the point: MessagePack does not save dramatic amounts of space on small payloads, because the overhead of length tags and key strings dominates the encoding of the values themselves. The wins come at scale, on payloads with many small numbers or boolean flags, where each value's byte cost is roughly halved.

Two observations from the bytes. First, the integer tag byte 0x2a is the value 42 itself; no prefix, no length. This is the positive fixint family at work. Second, the strings are length-prefixed but not null-terminated, which is the consistent choice across all binary formats with length prefixes; null-termination is a feature of C strings and adds nothing once you have a length.

If email were absent, the encoding would simply omit both the key bytes and the value bytes, and the map prefix would be 0x85 (fixmap of 5) instead of 0x86. There is no marker for absence. The map either contains the key or it does not. This is the same model JSON uses, and it has the same consequence: distinguishing absent from null requires the application to encode null explicitly when that distinction matters.
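
Both the byte count and the absent-key behavior are easy to confirm with the reference Python library (assuming msgpack-python 1.x, whose dicts pack in insertion order):

import msgpack

person = {"id": 42, "name": "Ada Lovelace", "email": "ada@analytical.engine",
          "birth_year": 1815, "tags": ["mathematician", "programmer"], "active": True}
buf = msgpack.packb(person)
assert len(buf) == 104 and buf[0] == 0x86        # fixmap of 6, as in the tour above
without_email = {k: v for k, v in person.items() if k != "email"}
assert msgpack.packb(without_email)[0] == 0x85   # fixmap of 5; no absence marker anywhere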

Evolution and compatibility

MessagePack inherits JSON's evolution story, which is to say it has none, and the lack of one is the design. Adding a field means emitting one more key in the map. Removing a field means not emitting it. Renaming a field is a breaking change, the same as renaming a JSON key. There are no field tags, no positions, no schema-resolved defaults. If you want any of those, you build them at a higher layer.

In practice this works because the consumers of MessagePack messages are written with the same casualness JSON consumers are written with: pull the keys you care about, ignore the rest, treat missing keys as missing. The cost is that no automated tool can tell you whether two versions of a producer remain compatible with two versions of a consumer; you have to read the code. The benefit is that nothing in the format requires coordination between producer and consumer beyond the keys themselves.

Schema validation, if you want it, lives in libraries on top: msgpack-schema for Python, various JSON Schema validators that support MessagePack as an alternative input format, and the typed deserializers in language ecosystems that have them (serde for Rust, encoding/json-style decoders for Go, Jackson with the msgpack-jackson module for Java). None of this is in the format.

The format itself has had two backward-compatible changes since the original release. The 2013 spec revision added the str family (distinct from raw bytes) and the bin family. Some early implementations conflated strings and bytes; the new tags disambiguated them. There is a "compat mode" flag in most implementations that lets you produce either the old or the new encoding, and this is occasionally still required when interoperating with very old MessagePack readers. The 2017 timestamp extension added a standardized timestamp encoding; it is namespaced under the extension mechanism, so old readers see it as an unknown extension and either skip it or fail loudly, depending on configuration. No version of the spec has removed anything.

Ecosystem reality

MessagePack's ecosystem is sprawling, mostly high-quality, and occasionally surprising. The reference C implementation (msgpack-c) is reasonably fast and reasonably featureful, but the bindings built on it are not always the state of the art in any given language. Modern competition includes ormsgpack in Python (which is dramatically faster than the canonical msgpack library by leaning on Rust), msgpack-cli in .NET, and the msgpack-rust crates (rmp for the core encoding, rmpv for dynamic values). In JavaScript, @msgpack/msgpack is the most common choice and is broadly fine. In Java, both Jackson's msgpack module and the standalone msgpack-java package are widely deployed; they produce identical bytes but expose different APIs.

Notable adoptions: Redis used MessagePack internally for its script-replication mechanism for years (the bytes of script arguments crossed a MessagePack-shaped pipe before reaching the Lua VM). Pinterest's engineering blog described using MessagePack extensively in their search indexing pipeline. InfluxDB's protocol options at one point included MessagePack alongside line protocol. Many internal RPC systems at companies that rejected gRPC for some reason settled on MessagePack-over-HTTP as the next-most-reasonable choice. None of these uses are particularly visible because MessagePack does not advertise itself the way Protobuf does; it is a piece of plumbing that ends up in stacks because someone benchmarked it once and was satisfied.

The interoperability story is good but not perfect. A specific gotcha: integer types. MessagePack's wire format encodes integers in the smallest representation that fits, which means the same value can encode in different widths from different producers. A naive Python decoder will produce a Python int regardless of the wire width, which is fine. A typed decoder in a language that distinguishes uint32 from uint64 will sometimes have to decide which to produce, and the heuristics differ between libraries. For most workloads this does not matter; for a few it produces silently wrong types. The fix is to use a typed schema layer on top, at which point you have built half of Avro and may want to reconsider your format choice.
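
The ambiguity is easy to see from the decoder's side: the three byte sequences below are all legal MessagePack encodings of 42, and nothing in the stream tells a typed decoder which width the producer meant (Python's decoder simply returns int for all three).

import msgpack

# positive fixint, uint8, and uint16 encodings of the same value
for wire in (b"\x2a", b"\xcc\x2a", b"\xcd\x00\x2a"):
    assert msgpack.unpackb(wire) == 42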

A second gotcha: map key ordering. MessagePack does not specify that maps preserve insertion order, and most libraries do not. For deterministic output, sort the keys before encoding, or use a library that supports a deterministic-encoding mode. Several do not. This is the source of the small but persistent stream of bug reports against msgpack-related signing and content-addressing schemes.
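
The usual workaround is a recursive sort-before-encode pass. A minimal sketch, assuming string keys and relying on msgpack-python packing dicts in insertion order:

import msgpack

def canonical(value):
    # Rebuild maps with sorted keys so equal values encode to equal bytes.
    if isinstance(value, dict):
        return {k: canonical(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [canonical(v) for v in value]
    return value

def pack_deterministic(value) -> bytes:
    return msgpack.packb(canonical(value))

assert pack_deterministic({"b": 1, "a": 2}) == pack_deterministic({"a": 2, "b": 1})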

A third gotcha: the str-vs-bin distinction. Some older code, and some newer code that imitates older code, treats the str and bin families as interchangeable. They are not, and decoders that reject the wrong family will break ingestion. The compat-mode flag is worth knowing about specifically because of this.
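
In msgpack-python, for example, the encoder's use_bin_type flag and the decoder's raw flag exist specifically for this compatibility dance:

import msgpack

assert msgpack.packb("hi")[0] == 0xa2              # str family (fixstr of 2)
assert msgpack.packb(b"hi")[0] == 0xc4             # bin family (bin8)
old = msgpack.packb(b"hi", use_bin_type=False)     # compat mode: bytes emitted as the old raw family
assert old[0] == 0xa2
assert msgpack.unpackb(old, raw=True) == b"hi"     # decode the raw/str family back as bytes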

When to reach for it

MessagePack is the right choice when you would otherwise use JSON and you care about size or parse speed enough that the cost is worth a binary format. The classic case: high-volume internal RPC in a polyglot environment where you control both sides, the schema changes constantly, and the operational ergonomics of "use a typed schema language" are not worth the cost. MessagePack gives you JSON's schemaless flexibility, JSON's data model, and roughly half the bytes.

It is also the right choice for binary blobs in places where you want the field-level structure to remain inspectable: a Redis value, a Kafka message body, a queue payload. Tools exist that can pretty-print MessagePack from any of those, and the inspectability is approximately as good as JSON's once you have the tool installed.

When not to

When you have a stable schema that you control on both ends, the density advantage MessagePack offers over JSON is smaller than the density advantage Protobuf or Cap'n Proto offer over MessagePack. At that point you are paying for schemalessness you do not benefit from. Switch to a schema-first format.

When you need any of: deterministic encoding without library-specific support, formal schema evolution, language-level type safety, zero-copy access, columnar layout, or guarantees about what unknown fields do, MessagePack does not give them to you, and the workarounds (sort keys yourself, validate at the boundary, use a typed wrapper, accept full parses, use rows, hope for the best) are all things you have to remember. The format will not remind you.

When the consumers are public clients and you do not control them, MessagePack imposes the parser-distribution problem that all binary formats impose. Public APIs almost always end up with JSON for this reason. MessagePack-over-the-public-internet exists, and some popular APIs (some game backends, some IoT services) use it, but the choice is an organizational one, not a technical one.

Position on the seven axes

Schemaless. Self-describing. Row-oriented. Parse rather than zero-copy. Runtime bindings, with no codegen step (and no codegen tools, because there is nothing to generate from). Non-deterministic by default; a canonical encoding is possible if you sort keys and use minimum-width integers, but no library guarantees this without configuration. No evolution strategy: keys are strings, types are tags, the application deals with the rest.

This stance is unusual for a binary format only in the sense that most binary formats have abandoned the schemaless side of the schema axis. Picking MessagePack is a deliberate decision to remain on that side while still spending bytes more efficiently than JSON does. Picking MessagePack and then trying to bolt schemas on top produces something worse than picking Avro or Protobuf to begin with — the bolt-on lacks the formal evolution rules, the wire-level optimizations, and the codegen ergonomics, while paying most of the operational cost of having a schema at all.

The corollary is that the cases where MessagePack is the right choice are exactly the cases where its stance on the seven axes matches what your system actually needs. If you find yourself wishing for any of schemas, deterministic encoding, zero-copy access, columnar layout, or compatibility checking, MessagePack is not what you want, and the version of MessagePack you build by adding those things one at a time will be worse than picking the right format from the start. This is, in my experience, the single most common way to misuse MessagePack: choosing it because it sounds like "a better JSON," and then discovering, two years in, that what you actually wanted was Protobuf.

A note on the timestamp extension

A footnote on the timestamp extension is worth including, because it is the one place where MessagePack's spec actually defines a non-trivial type beyond the JSON model. The extension uses extension type code -1 and supports three widths: 32 bits (seconds since the epoch, no nanoseconds, valid through 2106), 64 bits (30 bits of nanoseconds plus 34 bits of seconds, valid past 2514), and 96 bits (32 bits of nanoseconds plus 64 bits of signed seconds, valid for any timestamp a two's-complement int64 can express). The encoding is straightforward and is supported by most of the major libraries, but not all of them, and not always by default. If your data includes timestamps and you want them to round-trip across language boundaries, check that both sides understand the extension. If they do not, the fallback is usually to encode timestamps as ISO-8601 strings or as Unix epoch integers — both of which work, both of which lose some fidelity, and both of which sidestep the extension mechanism entirely.
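
msgpack-python (1.x) exposes the extension through its Timestamp type; a quick round-trip check of the kind the paragraph recommends:

import msgpack

ts = msgpack.Timestamp(seconds=1700000000, nanoseconds=500)
buf = msgpack.packb(ts)
assert buf[:2] == b"\xd7\xff"          # fixext8 with extension type -1: the 64-bit form
assert msgpack.unpackb(buf) == ts      # round-trips only if both sides speak the extension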

The existence of the timestamp extension is the strongest evidence that MessagePack's authors knew the format would be used for more than just JSON-equivalent payloads, and the deliberate restraint in not adding more extensions (no UUID, no decimal, no big-int) is the strongest evidence that they wanted the format to remain small in surface area. The trade has held up; no widely-deployed system has, to my knowledge, complained that MessagePack's type system is too small to be useful, and the few that need richer types reach for CBOR or Ion instead.

Epitaph

MessagePack is JSON's binary doppelgänger: same data model, half the bytes, none of the schema. Reach for it when the only thing wrong with JSON is the size.

CBOR

If MessagePack is the format engineers chose, CBOR is the format the IETF chose. The two are close enough that any serious user of one knows the other exists; the differences between them are small in cardinality but large in temperament. CBOR was designed in a standards body, by people whose day jobs involve writing protocols that must be implemented correctly by hundreds of independent vendors over decades, and the format reflects that audience and that lifespan.

Origin

CBOR was published as RFC 7049 in 2013, authored by Carsten Bormann and Paul Hoffman. The acronym expands to Concise Binary Object Representation, and the title of the RFC is honest about the design goals: small enough to be useful in constrained environments, simple enough to be implemented from scratch, extensible without breaking existing decoders, and standardized through the IETF process so that protocol authors could rely on the format being available, frozen, and fully specified.

The immediate motivator was CoAP, the Constrained Application Protocol, which was being designed for IoT-scale devices that could not afford the overhead of HTTP, JSON, or, in some cases, even TCP. CoAP needed a binary payload format that would survive on devices with kilobytes of RAM, and JSON was not going to. MessagePack was an obvious starting point, but its specification at the time was a community artifact, not an IETF document, and the IETF had a process for producing the latter. RFC 7049 was the result.

The format was revised in 2020 as RFC 8949, which is the current authoritative document. RFC 8949 added several clarifications, defined a deterministic encoding subset, and incorporated the experience of seven years of deployment. The wire format itself did not change in any incompatible way; existing CBOR encoders and decoders continue to work without modification.

CBOR is now an obligate dependency of a small fleet of higher-level protocols. COSE (CBOR Object Signing and Encryption, RFC 8152) is the CBOR analogue of JOSE/JWT and is used by FIDO2, WebAuthn, and the IETF's ACE family of authorization protocols. CWT (CBOR Web Token, RFC 8392) is the CBOR analogue of JWT and is used wherever the bearer token needs to be small. The EU Digital COVID Certificate uses CWT inside QR codes; the bytes of those QR codes are CBOR. Matter, the smart-home interop protocol led by Apple, Google, Amazon, and Samsung, uses CBOR throughout. So does the IETF's Software Updates for Internet of Things working group. CBOR is the format you reach for when you are designing a protocol that needs to be small, durable, and fully specified, and you do not control the implementations.

The format on its own terms

CBOR is built from a single, ruthlessly regular encoding rule. Every value begins with one byte. The high three bits of that byte encode the major type (one of eight). The low five bits encode either the value directly (if the value fits) or the size of a follow-on field that contains the value. The eight major types are: unsigned integer, negative integer, byte string, text string, array, map, semantic tag, and a catch-all called simple values and floats.

The five-bit additional information field follows a uniform convention. Values 0 through 23 are the value itself (when the major type is an integer, this means the integer is 0-23 directly; when it is a string, the length is 0-23 directly). Values 24, 25, 26, and 27 mean "the value or length is in the next 1, 2, 4, or 8 bytes, respectively, big-endian." Values 28, 29, and 30 are reserved. Value 31 means "indefinite length," which is CBOR's mechanism for streaming: the length is unknown at the time of encoding, and a "break" byte (0xff) terminates the value when it has been fully emitted.

This regularity is the design's central virtue. A CBOR decoder is about a hundred lines of C. There is one parsing routine that handles all eight major types, because the framing is identical across them. The dispatch on major type happens after the framing is parsed, which means a partial parser can skip a value it does not understand without knowing what the value is. Skipping unknown values is exactly the property that makes a format extensible without breaking decoders, and CBOR is structurally equipped to do it; the application layer gets this for free, with no per-format machinery.
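
That uniform framing fits in a dozen lines of Python. A sketch of the head-parsing routine (a hypothetical helper, not from any library), returning the major type, the argument, and the new offset:

def read_head(buf: bytes, i: int):
    # Parse one CBOR head: 3 bits of major type, 5 bits of additional info.
    b = buf[i]
    mt, ai = b >> 5, b & 0x1f
    if ai < 24:
        return mt, ai, i + 1                    # argument is immediate
    if ai in (24, 25, 26, 27):
        n = 1 << (ai - 24)                      # 1, 2, 4, or 8 follow bytes, big-endian
        return mt, int.from_bytes(buf[i + 1:i + 1 + n], "big"), i + 1 + n
    if ai == 31:
        return mt, None, i + 1                  # indefinite length; a 0xff byte breaks it
    raise ValueError("reserved additional-info value")

assert read_head(b"\x18\x2a", 0) == (0, 42, 2)  # uint 42, as in the wire tour below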

The semantic tag major type — major type 6 — is the format's extensibility hinge. A tag is a small integer that wraps a single following value, telling the decoder how that value should be interpreted. Tag 0 means "the following text string is an ISO 8601 date"; tag 1 means "the following number is a Unix epoch timestamp"; tag 32 means "the following text string is a URI"; tag 64 means "the following byte string is an array of unsigned 8-bit integers." IANA maintains a registry of assigned tag values; new tags are allocated through a lightweight first-come-first-served process. Tags do not change the wire format; a decoder that does not recognize a tag returns the inner value untagged, and the application either copes or rejects.

The simple values family (major type 7) is the home of false (0xf4), true (0xf5), null (0xf6), undefined (0xf7), and the floating-point types: half-precision (0xf9 + 2 bytes), single-precision (0xfa + 4 bytes), and double-precision (0xfb + 8 bytes). The break byte (0xff) lives in this family too. The space is otherwise reserved for spec-defined or application-defined simple values that need to fit in a single byte.

CBOR's data model is richer than JSON's in small but real ways: CBOR has byte strings (JSON does not, except through base64), CBOR has integers up to 64 bits (JSON's spec is silent past 53 bits), CBOR has tagged semantic values (JSON has no analogue), CBOR distinguishes null from undefined, and CBOR's map keys can be any value (JSON's must be strings). The consequence is an asymmetry: JSON → CBOR → JSON round-trips cleanly for typical payloads, but CBOR → JSON → CBOR is lossy whenever any of the richer features is used. RFC 8949 devotes a section to rules for converting between the two models, and it bridges the gap with mixed success.

Wire tour

Encoding our Person record:

a6                                           map of 6 entries
  62 69 64                                   key "id" (text string len 2)
  18 2a                                      value 42 (uint, 1-byte follow)
  64 6e 61 6d 65                             key "name" (text string len 4)
  6c 41 64 61 20 4c 6f 76 65 6c 61 63 65     value "Ada Lovelace" (text string len 12)
  65 65 6d 61 69 6c                          key "email" (text string len 5)
  75 61 64 61 40 61 6e 61 6c 79 74 69 63
     61 6c 2e 65 6e 67 69 6e 65              value "ada@analytical.engine" (text string len 21)
  6a 62 69 72 74 68 5f 79 65 61 72           key "birth_year" (text string len 10)
  19 07 17                                   value 1815 (uint, 2-byte follow, big-endian)
  64 74 61 67 73                             key "tags" (text string len 4)
  82                                           array of 2 elements
    6d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician" (text string len 13)
    6a 70 72 6f 67 72 61 6d 6d 65 72            "programmer" (text string len 10)
  66 61 63 74 69 76 65                       key "active" (text string len 6)
  f5                                           value true

Total: 105 bytes — one more than MessagePack. The single byte of overhead is in the encoding of id: 42 is greater than 23 and so does not fit in CBOR's immediate-value range, requiring the byte 0x18 (additional info 24, meaning "1-byte follow") plus the value 0x2a. MessagePack's positive fixint encoding goes up to 127 in a single byte and so encodes 42 in one byte total.

This single-byte difference is essentially the entire density gap between MessagePack and CBOR for typical payloads. It applies to integers from 24 through 127, where MessagePack uses one byte and CBOR uses two, and to negative integers from -32 down to -25, where the gap is the same. Outside those ranges, and for every string, array, and map in this record, the two formats encode in identical byte counts. They really are extraordinarily close.
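
The gap is visible in two lines, assuming the msgpack and cbor2 packages:

import msgpack, cbor2

assert msgpack.packb(42) == b"\x2a"       # one byte
assert cbor2.dumps(42) == b"\x18\x2a"     # two bytes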

The byte-level difference that does have practical consequences is the prefix encoding of containers and strings. MessagePack's fixarray (low nibble = count) covers arrays up to 15 elements; CBOR's immediate form (low five bits = count) covers arrays up to 23, and the same applies to maps, so containers of 16 to 23 entries are two bytes cheaper in CBOR. For strings the advantage runs the other way: MessagePack's fixstr covers lengths up to 31, where CBOR's immediate form stops at 23. In typical telemetry data, all of these ranges are large enough that the difference rarely appears.

If email were absent, the encoding would shrink by 28 bytes (the key plus the value), and the map prefix would change to 0xa5 (map of 5). As with MessagePack, absence is encoded by omission, not by a sentinel.

Evolution and compatibility

CBOR's evolution story is the same as MessagePack's at the wire-format level — keys are strings, the application treats unknowns as it sees fit — but with one substantial addition: the semantic tag. A tag is a versioning hook by design. A producer that wants to introduce a new interpretation of an existing value type can wrap that value in a new tag; consumers that recognize the tag interpret the value specially, and consumers that do not recognize the tag fall through to the underlying value unchanged. This is a forward-compatibility mechanism baked into the wire format, and it is genuinely useful.

The deterministic encoding subset defined in RFC 8949 is the second addition. Deterministic encoding means: integers use the smallest representation that fits; floats use the smallest representation that exactly preserves the value; map keys are sorted by their byte encoding; indefinite-length encodings are not used; tags use minimal encoding. A CBOR producer that emits deterministic encoding produces the same bytes for the same value every time, and two producers emitting deterministic encoding produce the same bytes as each other. This makes CBOR usable for content addressing, signing, and deduplication — all the use cases where MessagePack's lack of a canonical form forces extra coordination.
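
Mature libraries expose the deterministic rules behind a flag rather than by default; in cbor2 it is canonical=True:

import cbor2

a = cbor2.dumps({"b": 1, "a": 2}, canonical=True)
b = cbor2.dumps({"a": 2, "b": 1}, canonical=True)
assert a == b      # same value, same bytes, whatever order the producer held the keys in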

The COSE family of protocols depends on deterministic encoding for its signature schemes. The IPLD ecosystem (the content-addressed data layer underlying IPFS and Filecoin) uses dag-cbor, a slight restriction of CBOR that mandates deterministic encoding plus a few additional rules. ACE/OAuth tokens lean on the same property. Without deterministic encoding, none of these protocols would work.

The CDDL specification language (RFC 8610, Concise Data Definition Language) gives CBOR a formal schema language and the ability to validate payloads against schemas. CDDL is not required to use CBOR, and most uses of CBOR do not involve CDDL, but it exists for the cases where you want it. CDDL is broadly comparable to JSON Schema in scope, with type rules better suited to CBOR's richer type system.
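
For a flavor of CDDL, here is one plausible way to describe this chapter's Person record; the rule name is mine, not from any published spec:

person = {
  id: uint,
  name: tstr,
  email: tstr,
  birth_year: uint,
  tags: [* tstr],
  active: bool,
}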

Ecosystem reality

CBOR's ecosystem is structurally different from MessagePack's. The implementations are fewer in number and more uniformly high-quality, because the format's primary user base is in standards-driven ecosystems where wire-level interoperability is non-negotiable. The canonical C implementation is libcbor, and there are mature implementations in Rust (ciborium, serde_cbor), Go (fxamacker/cbor — the de facto Go implementation, with explicit support for COSE and deterministic encoding), Java (jackson-dataformat-cbor and co.nstant.in.cbor), Python (cbor2), and JavaScript (the cbor and cbor-x packages). Most of these libraries support deterministic encoding via configuration, and the ones that do not usually turn out to be the inferior choice anyway.

The deployments that pull CBOR in are concentrated in standards-driven domains. WebAuthn assertions are CBOR; FIDO2 attestations are CBOR; the EU's COVID certificate format is CBOR; ACE tokens are CBOR; Matter messages are CBOR; the protocols coming out of the IETF's CoRE working group, CoAP payloads among them, are CBOR; OpenID Connect's CIBA flow can use CBOR. None of these deployments would have considered MessagePack, because none of them would have accepted a format without an RFC. CBOR's existence is the reason none of them had to invent their own format.

The format also shows up in a few less-standards-driven places. LightStep used CBOR in some of its tracing libraries. Rust's serde ecosystem uses CBOR through ciborium and serde_cbor for general-purpose binary serialization. The Cap'n Proto JSON encoding is, in some configurations, CBOR-shaped. None of these are dominant uses; the standards-driven cases are.

Two ecosystem gotchas worth noting. First, indefinite-length encoding is permitted by the spec but disallowed by deterministic encoding, and a surprising number of CBOR producers use it for streaming output. Decoders that work in protocols requiring deterministic encoding must reject indefinite-length values; some libraries do not by default. Configure carefully if you are implementing one of the COSE-family protocols.

Second, the half-precision float type is real and is occasionally emitted by libraries that aggressively shrink. Some decoders do not handle half-precision floats correctly (or at all). If your data includes floats, test the round-trip; do not assume.

When to reach for it

CBOR is the right choice when MessagePack would be a candidate but the deployment context is one of the following: a public protocol where the spec must be referenceable and immutable; a deterministic encoding requirement (signing, content addressing, deduplication); a constrained-device deployment where spec stability matters more than saving a few bytes; or a deployment that intersects with one of the existing CBOR-using protocols (COSE, CWT, FIDO, Matter), where piggy-backing on CBOR avoids a format-translation layer.

CBOR is also a defensible default for new internal binary protocols, even when none of the above apply, on the principle that a format governed by an IETF RFC is harder to accidentally diverge from than a format governed by a community-maintained website. The handful of extra bytes per integer is rarely the bottleneck.

When not to

When you have a stable schema you control on both sides, the schemaless self-describing nature of CBOR is paying for flexibility you do not need; switch to a schema-required format. When you need zero-copy access, CBOR's length-prefixed, variable-width encoding disqualifies it. When you need columnar layout, CBOR's row-oriented structure disqualifies it. When the payload is going to be edited by humans, CBOR's binary encoding disqualifies it (use JSON, or CBOR's diagnostic notation, which is a debugging aid rather than an interchange format). When you simply want the densest possible schemaless binary encoding, MessagePack will be one byte per integer shorter on small values, which over a billion records is a real size; consider both.

Position on the seven axes

Schemaless. Self-describing. Row-oriented. Parse rather than zero-copy. Runtime bindings; CDDL is available for those who want a schema, but the format does not require one. Deterministic encoding subset specified, used widely in protocol contexts, opt-in elsewhere. Evolution by application convention, with semantic tags as a forward-compatibility hook.

CBOR's stance on the seven axes is identical to MessagePack's on six of them and stronger on the seventh: where MessagePack has no canonical form by spec, CBOR does. This single difference is the reason CBOR exists and the reason MessagePack does not show up in COSE, WebAuthn, FIDO2, or Matter. For protocols where bytes are signed or hashed or addressed by content, the lack of a canonical form is a deal-breaker; for protocols where they are not, the difference is academic. The choice between MessagePack and CBOR is therefore mostly the choice of which side of that line your system sits on.

A note on the JSON-equivalence question

CBOR is sometimes pitched as "JSON in binary" the way MessagePack sometimes is, and the pitch is approximately as honest in either case. Most JSON values round-trip through CBOR fine. The exceptions are: integers larger than 2^53 (CBOR preserves them, JSON does not specify how), byte strings (CBOR has them, JSON does not), and date semantics (CBOR has tags for dates, JSON has only strings). Going the other direction — CBOR to JSON — is lossier still: tagged values cannot be represented in JSON without a convention, byte strings have to be base64-encoded, and floats with half-precision have to be promoted.

Practically, this means the MessagePack and CBOR communities have settled on similar mental models for their formats — like JSON, but — with the understanding that the but clause is doing a lot of work. The CBOR community has been more honest about this than the MessagePack community, perhaps because the IETF-style specification process forces a level of precision that a community-maintained website does not.

Epitaph

CBOR is MessagePack with standards-body discipline: same data model, same density, but with a normative spec, a deterministic encoding mode, and a thriving graft of higher-level protocols. Reach for it when the protocol must outlive its authors.

BSON

BSON is the easiest format in this book to misunderstand. The name suggests binary JSON, the spec lives at bsonspec.org, and the wire format encodes JSON-shaped values into bytes that are denser than JSON and smaller than nothing else in particular. Read at face value, BSON sounds like a competitor to MessagePack and CBOR. It is not. BSON was designed to solve a different problem, and judging it as a binary JSON on size alone produces the entirely correct conclusion that it is inferior to MessagePack and CBOR at the job. Understanding BSON requires understanding what it was actually built to do, which is to serve as a storage and indexing format for a document database whose queries needed to skip through large documents quickly.

Origin

BSON was designed at 10gen, the company that became MongoDB Inc., in 2009. MongoDB was built around the idea that the database's storage format and the application's data format should be the same: documents are JSON-shaped objects, on disk and in memory, and queries operate on those documents directly without translation through a relational schema. This required a binary serialization for the documents on disk. The serialization had to be writable, readable, and traversable by query engines that wanted to skip past fields they did not care about.

The constraints that emerged from those requirements turned out to be quite different from MessagePack's. MessagePack optimizes for size: small values get small encodings, length prefixes are minimal, integer widths shrink to fit. BSON optimizes for traversal: every value is preceded by a length-or-equivalent that lets a reader skip the value without parsing it; integers are fixed-width so a comparison can happen in place; field names are null-terminated strings so they can be matched against query predicates byte-for-byte.

The result is a format that is consistently larger than MessagePack or CBOR, sometimes by a factor of two or more, and consistently faster to traverse for the access patterns MongoDB's query engine cares about. If you are not running a document database, you are not paying for BSON's strengths and are paying full price for its weaknesses. This is why BSON is not common outside MongoDB.

The format on its own terms

A BSON document is a contiguous block of bytes with the following shape: a 4-byte little-endian integer giving the total document size (including the size field itself); a sequence of elements; a single trailing byte 0x00 marking the end of the document. The document size field exists so that any reader, given a pointer to a document, can allocate exactly the right buffer or skip to the byte after the document without examining its contents. Embedded documents (which appear in the wire format as ordinary values) carry their own size field for the same reason.

An element has three parts: a single type byte, a null-terminated field name, and a value. The type byte is one of about twenty defined codes: 0x01 for double, 0x02 for UTF-8 string, 0x03 for embedded document, 0x04 for array, 0x05 for binary, 0x07 for ObjectId (a MongoDB-specific 12-byte identifier), 0x08 for bool, 0x09 for UTC datetime (a 64-bit milliseconds-since-epoch), 0x0a for null, 0x0b for regex, 0x10 for int32, 0x11 for timestamp (a MongoDB internal type, not a generic timestamp), 0x12 for int64, 0x13 for the 128-bit decimal type added later, and a handful of deprecated codes that earlier versions used and that current encoders avoid.

The field name is a C-style null-terminated string. The value's encoding depends on the type byte. Strings are length-prefixed (little-endian 32-bit length, including the null terminator), followed by UTF-8 bytes, followed by an explicit null. Arrays are encoded as embedded documents whose field names are the strings "0", "1", "2", and so on; this means an array of one million elements has one million field names, all of which are integers formatted as decimal strings, all of which are null-terminated. This is one of the most striking design choices in the format and the one that produces most of BSON's size overhead. Integers are fixed-width little-endian. Doubles are IEEE 754 little-endian. Booleans are a single byte: 0x00 for false, 0x01 for true.

Several of the type codes encode types that have no analogue in JSON at all. ObjectId is twelve bytes structured as four bytes of seconds-since-epoch, five bytes of a per-process random value, and three bytes of a sequential counter; the structure is documented because tools want to extract the timestamp portion without consulting MongoDB. UTC datetime is a single 64-bit integer giving milliseconds since the epoch; it is signed, which means negative values are valid and represent dates before 1970. Decimal128 is the IEEE 754-2008 128-bit decimal type, designed for financial data where binary floats lose pennies. The type set is wider than MessagePack's and CBOR's because BSON is the storage format for a database that has to preserve types its applications care about even when JSON does not.

Every element is implicitly tagged with both type and length. The length is implicit for fixed-width values and explicit for variable-width ones. The trailing null on strings is redundant given the length prefix; the spec keeps it for compatibility with code that wants to treat strings as null-terminated C strings, which is in fact what some of MongoDB's older internals did.

Field name ordering in a BSON document is preserved on the wire and by most consumers, which is a consequential detail: BSON documents that were emitted in different orders are different bytes, even if they are equal as JSON objects. Some MongoDB query operations rely on this ordering, and the canonical encoding rules for digital signatures over BSON depend on it.

Wire tour

Encoding our Person record:

94 00 00 00                                  document length 148 (LE)
12 69 64 00                                  type 0x12 (int64), key "id"
  2a 00 00 00 00 00 00 00                    value 42 (LE int64)
02 6e 61 6d 65 00                            type 0x02 (string), key "name"
  0d 00 00 00                                  string length 13 (12 + null, LE)
  41 64 61 20 4c 6f 76 65 6c 61 63 65 00       "Ada Lovelace\0"
02 65 6d 61 69 6c 00                         type 0x02 (string), key "email"
  16 00 00 00                                  string length 22 (21 + null, LE)
  61 64 61 40 61 6e 61 6c 79 74 69 63 61
     6c 2e 65 6e 67 69 6e 65 00              "ada@analytical.engine\0"
10 62 69 72 74 68 5f 79 65 61 72 00          type 0x10 (int32), key "birth_year"
  17 07 00 00                                  value 1815 (LE int32)
04 74 61 67 73 00                            type 0x04 (array), key "tags"
  2c 00 00 00                                  inner doc length 44 (LE)
  02 30 00                                     type 0x02 (string), key "0"
    0e 00 00 00                                  length 14
    6d 61 74 68 65 6d 61 74 69 63 69 61 6e 00    "mathematician\0"
  02 31 00                                     type 0x02 (string), key "1"
    0b 00 00 00                                  length 11
    70 72 6f 67 72 61 6d 6d 65 72 00             "programmer\0"
  00                                           inner doc terminator
08 61 63 74 69 76 65 00                      type 0x08 (bool), key "active"
  01                                           value true
00                                           document terminator

148 bytes — about 40% larger than MessagePack and CBOR. The overhead is in the places you would expect: a 64-bit integer for id, fixed 32-bit length prefixes on every string (even when one byte would suffice), explicit null terminators on every key and string, an explicit type byte even where the value byte already implies the type, and the array-as-document encoding that names the elements "0" and "1". The array keys alone (30 00 and 31 00, plus the type bytes that precede them) spend six bytes saying what MessagePack's implicit element positions say in zero.

The id field is also worth noting. BSON has both 32-bit and 64-bit integer types, but most BSON producers default to 64-bit when the source language's integer is 64-bit, regardless of whether the value fits in 32 bits. This is the conservative choice; it preserves type fidelity at the cost of bytes. The MongoDB shell will emit id as int64 by default. To force int32 you have to construct a NumberInt() value, which is rarely done. The result is that BSON in practice spends 8 bytes on most numeric IDs, where MessagePack or CBOR would spend one or two.

The trade is what the format is for. In MongoDB, when a query selects records where id == 42, the query engine reads the document length, jumps to the elements, scans field names looking for id\0, and once found compares 42 against the eight bytes that follow. The comparison is a single integer load. No parsing, no allocation, no type-aware traversal. BSON's structure makes this access pattern fast, and that access pattern is the entire point.
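
The access pattern is concrete enough to sketch. The following hypothetical Python function scans a BSON document for a top-level int64 field, skipping every other value by width or length without parsing it; only the handful of types in our Person record are handled.

import struct

def find_int64(doc: bytes, name: bytes):
    i, end = 4, len(doc) - 1                       # skip the length header; stop before the trailing 0x00
    while i < end:
        t = doc[i]; i += 1                         # type byte
        j = doc.index(0, i)                        # field name is null-terminated
        key, i = doc[i:j], j + 1
        if t == 0x12 and key == name:
            return struct.unpack_from("<q", doc, i)[0]   # a single little-endian load
        if t in (0x01, 0x09, 0x12):                # double, datetime, int64: fixed 8 bytes
            i += 8
        elif t == 0x10:                            # int32
            i += 4
        elif t == 0x08:                            # bool
            i += 1
        elif t == 0x02:                            # string: 4-byte length includes the terminator
            i += 4 + struct.unpack_from("<i", doc, i)[0]
        elif t in (0x03, 0x04):                    # embedded document or array: length includes itself
            i += struct.unpack_from("<i", doc, i)[0]
        else:
            raise ValueError(f"type 0x{t:02x} not handled in this sketch")
    return None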

Evolution and compatibility

BSON has the same evolution story MessagePack and CBOR have at the field level: keys are strings, the application decides what to do with unknown keys, removing a field means it is no longer in the document. There is no formal schema, no version field, no forward/backward compatibility policy enshrined in the format. The tooling on top — MongoDB's schema validation, the various ODM libraries — provides whatever evolution discipline a particular deployment uses.

The format itself has evolved in three ways since 2009. Decimal128 was added (type 0x13). Several deprecated types from very early versions were removed from the canonical encoding rules but remained in the type byte registry to preserve decoder compatibility; producers no longer emit them, but decoders still handle them. The canonical-extended-JSON specification was introduced as a textual representation of BSON values, used in shell output and tools, distinct from both JSON and BSON. None of these changes are wire-incompatible.

The deterministic encoding question for BSON is a live one. There is no spec-defined canonical encoding, and BSON producers do preserve field order, which makes byte-equality possible if the producer is careful — but only if the producer is careful. MongoDB's internal wire protocol does not require canonical encoding because its uses do not need it; signing-over-BSON schemes (some of which exist in the audit-log space) define their own canonicalization rules.

Ecosystem reality

Outside MongoDB, BSON is rare. Inside MongoDB, it is everything. The driver libraries are mature in every language MongoDB supports officially: C, C++, Go, Java, JavaScript/TypeScript, Python, Ruby, Rust, Scala, Swift, .NET, PHP, and a few others. The Go driver's BSON encoder/decoder is the implementation most often used outside the MongoDB context, partly because it is well-documented and partly because Go's encoding/json shape generalizes naturally. Several non-MongoDB projects use the Go driver's BSON package as a general binary format on the strength of "we already have a dependency on it"; this is a defensible choice, but it is the long way around to MessagePack.

The notable non-MongoDB uses are: GridFS, MongoDB's file-storage abstraction, which is technically a separate format but uses BSON as its metadata representation; some logging and audit systems that wanted BSON's typed values; and a handful of replication and backup tools that interoperate with MongoDB and chose BSON for that reason. None of these are general-purpose binary serialization deployments. None of them suggest that BSON is the right format for a new system unrelated to MongoDB.

The shell and the various MongoDB tools speak a JSON dialect called Extended JSON that round-trips BSON values, including the types that have no JSON equivalent. Extended JSON has two modes: relaxed (which uses regular JSON for round-trippable values and falls back to type-tagged objects for the rest) and canonical (which always uses type-tagged objects, even for values that have JSON equivalents). The mongoexport tool emits relaxed extended JSON by default; many of the tools downstream of mongoexport have to handle the type tags, and this is a common source of pipeline bugs.
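
The two modes diverge on even the smallest values; the int64 id from our record renders as:

{"id": 42}                        relaxed mode
{"id": {"$numberLong": "42"}}     canonical mode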

When to reach for it

If you are using MongoDB, you are using BSON, and the question does not arise. If you are interoperating with MongoDB at the protocol level — building a tool that reads MongoDB's oplog, implementing a replica of MongoDB's wire protocol, processing MongoDB backup files — BSON is the right format because it is the only format that works.

The question does arise if you are building a document database of your own, and asking whether to use BSON as the storage format because MongoDB did. The answer is usually no: the design tradeoffs that work for MongoDB's specific query engine may not be the right ones for yours. CouchDB, RethinkDB, and a few others made different choices for good reasons.

When not to

Outside the MongoDB context, BSON is the wrong format for almost any new system. It is larger than MessagePack and CBOR. It has no schema language. It has no canonical encoding. It has no formal evolution rules. It has type codes for things you do not need (ObjectId, MongoDB Timestamp, deprecated types) and lacks features you might want (efficient integer encoding, a tagged extension mechanism, columnar layout). It exists to serve a database, and away from that database it is an inferior solution to a problem several other formats solve better.

The temptation to use BSON arises specifically because some teams already have it as a dependency, and the temptation should be resisted. Switching to MessagePack or CBOR is not a large project in any language with mature support for both, and the bytes-on-disk savings will pay back the migration in any deployment large enough to make the question matter.

Position on the seven axes

Schemaless. Self-describing. Row-oriented. Parse rather than zero-copy, except in the in-document-traversal sense MongoDB's query engine uses. Runtime bindings. Non-deterministic by spec, but field order is preserved on the wire so deterministic-with-care is achievable. Evolution by application convention.

The unusual cell BSON occupies on this map is self-describing, schemaless, and yet structurally optimized for in-place traversal. MessagePack and CBOR make the same axis choices but optimize for size, not traversal. The result is that BSON looks like the worst of both worlds when judged on size and the best of both worlds when judged on the access patterns MongoDB cares about. Outside that context, the access-pattern advantage evaporates and only the size disadvantage remains.

A note on the array-as-document encoding

The single most-criticized design choice in BSON is the array encoding. Arrays are documents whose keys are decimal-formatted indices: "0", "1", "2", and so on, each null-terminated, each preceded by its element's type byte. For a thousand-element array the keys alone occupy nearly four kilobytes, none of which carry information. The motivation for this encoding is consistency: arrays are documents, documents have keyed elements, and the recursive structure means a generic document parser handles arrays without special cases. The cost is paid at every BSON-encoded array, in every database, every day.

Several proposals to add a more compact array encoding have been made over the years and rejected. The reason for rejection is backward compatibility: every BSON parser ever written assumes that arrays are documents with stringified-integer keys. Changing the encoding would either require all parsers to be updated simultaneously (impossible) or introduce a separate type code for the new encoding (creating two ways to do the same thing, which is worse). The cost of fixing the design exceeds the cost of leaving it alone, and so it has been left alone for fifteen years and counting. This is the canonical example of a wire format whose mistakes are permanent.

Epitaph

BSON is MongoDB's storage format wearing JSON's vocabulary; competent inside that database, awkward outside it.

Smile, UBJSON, Amazon Ion

The previous three chapters covered the canonical members of the self-describing binary family: MessagePack, CBOR, BSON. There are several others. They have not won, but they exist for reasons, and each one illuminates a corner of the design space the canonical three do not. This chapter covers Smile, UBJSON, and Amazon Ion together, not because they are equivalent but because none of them is sufficiently common to deserve its own chapter and the design choices they make are interesting precisely as variations on the same theme.

Smile

Smile is the binary JSON format that ships with Jackson, the JVM's canonical JSON library. Tatu Saloranta, Jackson's primary author, designed Smile in 2010 to occupy the same niche MessagePack does: a schemaless binary format that is smaller and faster than JSON, data-model-equivalent, and conveniently embedded in an existing serialization library that most JVM applications already depended on.

Smile's central design choice is key sharing. JSON-shaped data is typically dominated by repeated keys: every record in an array of records repeats the same field names, and the bytes of those names are most of the encoding. Smile maintains a table of recently-seen keys during encoding and emits a back-reference when a key reappears. For a sequence of fifty user records, the keys id, name, email, birth_year, tags, and active are emitted in full once and then referenced as one-byte tokens for the next forty-nine records. The savings are substantial, and proportional to the repetition. Single-record payloads see no benefit, and Smile's encoded size for our Person record is approximately the same as MessagePack's — but for the 50-record case, Smile is roughly 60% of MessagePack's size.

A Smile stream begins with a four-byte header: the characters :)\n followed by a single byte carrying a version number and feature flags. The header announces the format and lets a reader detect a Smile stream without further context. After the header come ordinary tokens. The token encoding is similar to MessagePack's in spirit — variable-length tags, some inline-value forms, length-prefixed strings — but with several JVM-specific touches (a back-reference space for short shared strings as well as keys, a distinct VLQ-style integer encoding, explicit tokens for true, false, and null rather than packing them into a fixed table).

Encoding our Person record:

3a 29 0a 03                                  Smile header ":)\x0a\x03"
fa                                           start object
80 69 64                                       key "id" (1-byte length-encoded short string)
24 54                                          value 42 (small integer encoding)
... (remaining fields encode similarly)
fb                                           end object

The full byte tour is around 90 bytes for a single record, with the key-sharing benefits not yet engaged. For a stream of records the tour would diverge from MessagePack's exactly at the second record, where the keys would compress to back-references.

Smile has not spread beyond the JVM. Jackson supports it through jackson-dataformat-smile, and a handful of JVM-native systems consume it (Elasticsearch historically serialized some of its on-disk representations as Smile; some Hadoop-adjacent tools use it), but no major non-JVM ecosystem has a production-quality Smile implementation. The format is a perfectly reasonable design and a perfectly reasonable choice for JVM-only systems with high payload repetition; outside that context it is hard to recommend.

The interesting analytical point about Smile is that its key-sharing optimization is the inverse of the schema strategy. Schema-required formats (Protobuf, Avro, Cap'n Proto) eliminate key bytes entirely by replacing them with field tags or positions. Smile keeps the keys but compresses them with a context-sensitive back-reference. The compression is less effective than full elimination but does not require a schema. This is a genuine point in the design space, and Smile's small constituency does not mean the choice was wrong — only that the gains are not large enough to overcome the operational cost of being a Java-only format in a polyglot world.

UBJSON

UBJSON, Universal Binary JSON, is the format you would design if you cared more about being mistakable for JSON than about being competitive on size. The encoding is a one-to-one binary mapping of JSON values, with each value preceded by a single ASCII character that denotes its type: i for int8, U for uint8, I for int16, l for int32, L for int64, d for float32, D for float64, S for string, T for true, F for false, Z for null, [ for array, { for object, ] and } for the matching closers, and a few others. The single-character type prefixes mean a UBJSON encoder is almost trivial to write and a UBJSON dump is almost human-readable when viewed in a hex editor: you can see the brace characters, the field name strings, and the type codes without a parser.

Strings are length-prefixed using a nested type-prefixed integer: S i 0c Ada Lovelace says "string, int8 length, value 12, twelve bytes of UTF-8". This recursive type-prefixed encoding generalizes to lengths of any size, but it imposes overhead: every length spends two bytes (the length type tag plus the length value) where MessagePack and CBOR pack the length into a single byte. The uniform syntax simplifies parsers; the cost in bytes is real.

Encoding our Person record (partial):

{                                            object open
i 02 i d                                       key "id" (string, int8 len 2, "id")
U 2a                                           value 42 (uint8)
i 04 n a m e                                   key "name"
S i 0c A d a 20 L o v e l a c e                value "Ada Lovelace"
... (remaining fields)
}                                            object close

The object has no count prefix — the close brace marks the end — which differs from MessagePack and CBOR (both of which prefix the count). UBJSON's syntactic close-brace makes streaming decoders slightly simpler at the cost of disabling some quick-skip optimizations. There is an optional "container with count" form that prefixes the count after the open brace, but it is not universally supported.

The ecosystem is small: a reference C library, a Python module, a few JavaScript implementations, and not much else. UBJSON's constituency is mostly hobby projects and a handful of game-engine asset pipelines that want a simple, easy-to-debug binary format and do not mind paying the bytes. There is no significant production deployment that distinguishes UBJSON from its competitors, and the size disadvantage is real enough that recommending UBJSON over MessagePack or CBOR is hard outside of the specific case where hex-readability matters more than density.

The interesting point UBJSON occupies is that single-character type codes are an honest design choice when the system is small and single-team-owned. They are easy to remember, easy to write a parser for, and easy to read in a dump. The drawbacks compound at scale: every value pays a byte for its type even when context would have made the type obvious, every length is a recursive type-prefixed integer, and there is no compression mechanism to recover the bytes. UBJSON is a teaching example of the cost of being too literal a mapping of JSON.

Amazon Ion

Ion is the binary format Amazon designed to be the long-haul serialization for systems that need typed values, schema flexibility, and round-tripping between binary and human-readable forms. It was released as open source in 2016 after being used internally for years. The format has both a binary and a textual representation; the two are equivalent, both are spec-defined, and any conforming implementation must support both. This is the feature that distinguishes Ion from MessagePack, CBOR, and Smile: a single canonical text representation for every value, designed for human inspection and editing.

Ion's data model is JSON-shaped but extended: integers of arbitrary precision, decimals (not floats — decimals, with exact precision), timestamps with timezone and arbitrary precision, symbols (interned strings, encoded as integer references to a symbol table), blobs (byte strings), clobs (text strings with non-Unicode encoding, preserved verbatim), and annotations (one or more symbols prepended to a value to label its semantic meaning). The annotation mechanism is unique among self-describing formats and is used for versioning, type hints, and schema attachment. Ion calls a value with annotations a typed value; the annotations are part of the value's identity, but a reader that does not understand them can ignore them.

The binary encoding uses a one-byte type-and-length descriptor for each value, where the high four bits encode the type (15 type codes plus a few reserved) and the low four bits encode the length or a length descriptor. Lengths beyond 13 are emitted as a separate varint. Symbol tables are emitted at the start of a binary stream, mapping symbol IDs to their string values; subsequent encoded values use symbol IDs in place of the string keys, dramatically reducing the bytes spent on field names. The symbol table mechanism is conceptually similar to Smile's key sharing but more explicit: the table is a first-class object in the stream, and shared symbol tables can be referenced across multiple streams.

Encoding our Person record:

e0 01 00 ea                                  binary version marker (Ion 1.0)
ee 95 81 83 de 91                            local symbol table header
   87 be 8e                                  symbol list of 6
   83 69 64                                  symbol "id"
   84 6e 61 6d 65                            symbol "name"
   ...                                       (remaining symbols)
de 8e                                        struct with 14 bytes of content
   8a 21 2a                                  field 10 ("id"): int 42
   8b 8c 41 64 61 ...                        field 11 ("name"): string "Ada Lovelace"
   ...

The binary encoding is in the same density range as MessagePack when symbol tables are not amortized, and substantially denser when they are: a stream of a thousand records pays for the field names once, in the symbol table, and then references them by integer ID in every subsequent record.

The text representation of the same record is:

{
  id: 42,
  name: "Ada Lovelace",
  email: "ada@analytical.engine",
  birth_year: 1815,
  tags: ["mathematician", "programmer"],
  active: true,
}

This looks like JSON, with three differences: Ion's text uses unquoted symbols for field names (quoted strings are still allowed), Ion's text supports type annotations (person::{id: 42, ...}), and Ion's text supports the extended types directly (timestamps as 2023-01-01T00:00:00Z, decimals as 1.5d0, blobs as {{base64...}}). The textual representation is what most Ion users see; the binary representation is what gets stored.

Ion's adoption inside Amazon is broad. DynamoDB Streams, the Amazon QLDB ledger, and several internal Amazon services use Ion in their storage and inter-service formats. Outside Amazon, adoption is modest: a handful of analytics tools, the occasional financial-data service, and the obvious alignment with AWS-using teams that want Ion's typed values without choosing a heavier format like Avro. The reference implementations live at amzn/ion-* on GitHub and cover Java, C, C#, Python, JavaScript, and Rust; the implementations are mature and the wire format is stable.

The aspect of Ion most worth understanding, even if you never use the format, is the symbol table mechanism. Ion separates the universe of strings used as identifiers from the encoding of those identifiers in any particular value, and that separation is genuinely useful. Schemas in other formats (Protobuf, Avro) achieve something similar by mapping names to integer tags or positions, but they require a schema artifact to interpret. Ion does it within the stream itself: the symbol table is part of the stream, the stream is self-contained, and the trade between density and self-description is settled by the symbol table's presence rather than by an external schema's presence.
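
A toy illustration of the amortization, in Python. This is the concept, not Ion's actual encoding: repeated field names are written once, into a table at the head of the stream, and every record thereafter refers to them by integer ID.

def encode_stream(records):
    symbols = []                    # table: ID -> field name
    ids = {}                        # field name -> ID
    encoded = []
    for record in records:
        row = []
        for key, value in record.items():
            if key not in ids:
                ids[key] = len(symbols)
                symbols.append(key)
            row.append((ids[key], value))   # key replaced by integer ID
        encoded.append(row)
    return symbols, encoded         # the table travels with the stream

# A thousand records pay for "id" and "name" once, in the table.
table, rows = encode_stream([{"id": 42, "name": "Ada Lovelace"}] * 1000)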

Comparing the three

The three formats in this chapter share a stance on most of the seven axes — schemaless, self-describing, row-oriented, parse, runtime, no formal evolution rules — and differ on the seventh, determinism, as well as on the richness of their type models. Smile is JVM-only and optimizes for repeated keys via back-references; its determinism story is ad-hoc. UBJSON optimizes for syntactic simplicity at the cost of bytes; it has no determinism story. Ion optimizes for type fidelity and human-text round-tripping, with a symbol table mechanism that gives it MessagePack-class density when the table amortizes; it has a deterministic encoding subset.

For a JVM-only system that processes large arrays of similar records, Smile is a defensible choice but a niche one. For a system that wants the simplest possible JSON-shaped binary format, UBJSON is honest but not competitive. For a system that needs typed values, deterministic encoding, and a human-readable text form for debugging, Ion is the strongest choice in this group, and arguably the strongest choice in the schemaless self-describing family overall — but its weight (the symbol table machinery, the broader type system, the dual text/binary spec) is enough that for most deployments MessagePack or CBOR remain easier to live with.

When to reach for any of them

Smile: if you are JVM-only, Jackson is already a dependency, and your data has high key repetition. Otherwise no.

UBJSON: hobby projects, learning exercises, deployments where "a binary format I can read in a hex editor without a parser" is a hard requirement. Otherwise no.

Ion: when you need rich types (timestamps, decimals, blobs), a text representation that round-trips, and an annotation mechanism for in-band type metadata. Strong fit for AWS-using teams and for financial data systems where decimal exactness matters. Defensible choice for general-purpose typed serialization where the alternative would otherwise be Avro or Protobuf and the team does not want to adopt a schema-required format.

When not to

Smile, outside the JVM. UBJSON, when bytes matter. Ion, when the broader ecosystem (libraries in your language, debugging tools, operational comfort) is not strong enough; in non-Java/Python/Rust languages, Ion's tooling is real but thin.

Position on the seven axes

All three: schemaless, self-describing, row-oriented, parse, runtime, evolution by application convention. They differ on determinism (Ion has a canonical encoding; Smile and UBJSON do not specify one) and on whether they have any in-band typing beyond the JSON model (Ion has rich types and annotations; Smile and UBJSON are JSON-equivalent).

A note on the also-rans we left out

Several other formats deserve mention without deserving a chapter. BSON-CXX and EJSON exist as MongoDB-adjacent variants and are covered under BSON. Concise Binary Encoding (CBE) was a 2018 effort by Karl Stenerud to design a successor to MessagePack and CBOR that preserved their best properties while fixing perceived defects; it has not gained adoption. MUMPS-style globals are a structured binary representation with their own forty-year history; they are genuinely interesting and genuinely outside the scope of any survey-of-modern-formats chapter. Lasso and PSON exist as more obscure JVM-adjacent formats with no significant deployment. FastJSON's binary mode exists in some Alibaba-adjacent stacks. None of these would change the analysis if included, and including all of them would make the chapter exhausting without making it more useful.

The point is that the self-describing schemaless binary format space is crowded. The reason MessagePack and CBOR have absorbed most of the oxygen is that they are good enough at the job, with ecosystems that are large enough to mean someone has already debugged your edge case. The variations covered in this chapter exist because good enough did not satisfy somebody, and the dimension on which it failed to satisfy is each format's organizing principle. None of them is wrong. Most of them lost.

Epitaph

Smile is Jackson's binary mode; UBJSON is JSON traced over with a hex marker; Ion is the format Amazon uses when it wants typed values to outlive the system that wrote them.

Protobuf

Protobuf is the schema-first wire format that won. Won in the sense that more services exchange more bytes per second in Protobuf than in any other format whose name appears in this book; won in the sense that it is the default reach for any new system that needs a typed binary protocol; won in the sense that the surrounding ecosystem (gRPC, Buf, the Confluent Schema Registry's Protobuf support, the half-dozen high-quality codegen tools) has so much momentum that even its deficiencies are not enough to dislodge it. Understanding Protobuf is mandatory background for anyone working in modern distributed systems, and the parts of it that are interesting are not always the parts that get the most attention.

Origin

Protocol Buffers was built at Google around 2001 to replace an older internal serialization format whose author had moved on and whose maintenance had become a liability. The new format had three goals: to support schema evolution, because Google's index-build pipeline involved hundreds of binaries running on different release schedules and the format had to tolerate version skew; to be fast and small, because the bytes it carried were the inter-service traffic of a search engine; and to be language-agnostic, because Google had at least three production languages at the time (C++, Java, Python) and needed all three to read and write the same bytes.

Protobuf 1 was internal to Google. Protobuf 2 was the first version made widely available outside Google, in 2008, after several years of internal use refined the format and its codegen pipeline. The 2 → 3 transition, completed in 2016, was incompatible at the schema level — Protobuf 3 removed the required keyword, changed the default value semantics for scalars, and dropped support for extensions in favor of a more constrained Any type — but the wire format itself was unchanged, which meant Protobuf 2 binaries and Protobuf 3 binaries could continue to talk to each other if their schemas remained compatible.

The wire format has been frozen since 2008. The schema language (.proto files) has gained features, the runtime libraries have churned, the surrounding ecosystem has accreted layers, but the bytes have not changed. This is a remarkable property and one of the principal reasons Protobuf has held up under decades of use.

Protobuf's open-source release in 2008 coincided with the rise of microservices, the broader movement away from REST-with-JSON for internal traffic, and the appetite for a typed schema language that could be checked at build time. gRPC, released in 2015, made Protobuf the default wire format for an HTTP/2-based RPC framework that was easy to consume. Once gRPC took hold, Protobuf's market position became roughly unassailable.

The format on its own terms

A Protobuf message is a sequence of fields. Each field has a number (assigned in the schema, never reused), a wire type (one of six: varint, fixed-64, length-delimited, start-group, end-group, fixed-32), and a value. The wire encoding of a field is a tag byte (or bytes) — a varint that packs the field number and wire type together — followed by the value, encoded according to its wire type.

The wire type and field number are packed into a single varint, with the wire type in the low three bits and the field number in the remaining bits. For field numbers 1 through 15 this is a single byte; for field numbers 16 through 2047 it is two bytes; and so on. The guideline of "use field numbers 1-15 for the most common fields" is not stylistic; it is a real one-byte saving every time the field appears on the wire.

The varint encoding deserves explanation. A varint is a sequence of seven-bit groups, little-endian, with the high bit of each byte indicating whether more bytes follow. Values 0-127 take one byte; 128-16383 take two; and so on. Negative values, when encoded as varints without zigzag transformation, take ten bytes — the value is sign-extended to 64 bits before encoding, so every varint byte is needed. For signed integer fields, Protobuf offers the sint32 and sint64 types, which use zigzag encoding (mapping signed integers onto unsigned ones such that small absolute values produce small encodings, regardless of sign), which is dramatically smaller for negative values.
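
These three mechanisms (tag packing, varints, zigzag) are small enough to sketch in Python. The helper names are mine; a production encoder would also mask signed values to 64 bits before the varint loop:

def encode_varint(value):
    # value must be non-negative here; real encoders mask to 64 bits first
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)

def zigzag(value):
    # Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... (64-bit signed).
    return (value << 1) ^ (value >> 63)

def encode_tag(field_number, wire_type):
    return encode_varint((field_number << 3) | wire_type)

encode_tag(1, 0)              # b'\x08': the id field's tag byte below
encode_varint(1815)           # b'\x97\x0e': birth_year's value below
encode_varint(zigzag(-1))     # b'\x01': one byte as sint64, not ten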

There are six wire types but only four widely used. Wire type 0 is varint, used for integers, enums, and booleans. Wire type 2 is length-delimited, used for strings, byte arrays, embedded messages, and packed repeated fields. Wire types 1 and 5 are fixed-64 and fixed-32, used for the fixed64, fixed32, double, and float schema types where the schema explicitly opts into a fixed-width encoding. Wire types 3 and 4 (start-group, end-group) served the group syntax of Protobuf 1 and 2, which was deprecated within Protobuf 2, dropped from Protobuf 3, and is now effectively dead.

The structural property to internalize is that every field is self-describing on the wire. Every field's bytes begin with the tag, which identifies the field number and tells the reader how many bytes the value occupies (directly, for fixed-width types; via the length prefix, for length-delimited types; via the high-bit-terminated varint, for varint types). A reader that does not recognize a field number can still skip the field cleanly. This is the wire-level mechanism that makes Protobuf's schema evolution work.
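
That skip logic is worth seeing concretely. A sketch in Python, using only the wire rules above; buf is any bytes-like object, pos is the offset of a field's tag, and the return value is the offset just past the field:

def read_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        result |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:           # high bit clear: last byte
            return result, pos
        shift += 7

def skip_field(buf, pos):
    tag, pos = read_varint(buf, pos)
    wire_type = tag & 0x07            # field number is tag >> 3
    if wire_type == 0:                # varint: consume to its end byte
        _, pos = read_varint(buf, pos)
    elif wire_type == 1:              # fixed-64
        pos += 8
    elif wire_type == 2:              # length-delimited
        length, pos = read_varint(buf, pos)
        pos += length
    elif wire_type == 5:              # fixed-32
        pos += 4
    else:                             # start/end group: legacy, reject
        raise ValueError("unsupported wire type %d" % wire_type)
    return pos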

Optional fields, repeated fields, and message-typed fields all use existing wire types — there is no special "optional" or "repeated" marker. An optional field is simply a field that may or may not appear. A repeated field is one that appears zero or more times. A message-typed field is a length-delimited field whose bytes are themselves a Protobuf-encoded message. The recursion is uniform.

There is no map type at the wire level. Map fields in .proto files are syntactic sugar for repeated of an embedded message with key and value fields. The map type is a feature of the schema language and the generated code, not the wire format.

Wire tour

Schema:

syntax = "proto3";

message Person {
  uint64 id = 1;
  string name = 2;
  optional string email = 3;
  int32 birth_year = 4;
  repeated string tags = 5;
  bool active = 6;
}

Encoded:

08 2a                                        field 1 (id), varint, value 42
12 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65    field 2 (name), len 12, "Ada Lovelace"
1a 15 61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                field 3 (email), len 21, "ada@analytical.engine"
20 97 0e                                     field 4 (birth_year), varint, value 1815
2a 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e field 5 (tags), len 13, "mathematician"
2a 0a 70 72 6f 67 72 61 6d 6d 65 72          field 5 (tags), len 10, "programmer"
30 01                                        field 6 (active), varint, value 1

71 bytes — about a third smaller than MessagePack and more than half the size of BSON. The wins come from three places. First, field numbers replace field names; id (3 bytes in MessagePack) becomes the tag byte 0x08, which packs both the field number 1 and the wire type 0 (varint) into a single byte. Second, repeated string fields share the same tag — 2a appears twice for the two tags entries, not once with an array prefix; this loses to MessagePack for very short repeated strings but wins as soon as the strings are long enough that the per-element tag byte amortizes. Third, integers use varint encoding; 42 is one byte, 1815 is two bytes.

The bytes show several Protobuf-specific design choices. The order of fields in the wire format matches the order they were emitted by the encoder, not the order they were declared in the schema. The schema-declared field number is what matters; field order is not significant on the wire and decoders must accept fields in any order. The repeated string field for tags is not packed — packed encoding is permitted only for repeated fields of primitive numeric types, where the elements are concatenated within a single length-delimited field. Strings cannot be packed, which is why the two tag entries each get their own tag byte.

If email were absent, the encoding would simply omit the 1a 15 ... portion, dropping 23 bytes. There is no marker for absence; the field is either in the bytes or it is not. The generated decoder reports has_email() == false for the absent case, distinguishing it from email == "". This is the presence mechanism that Protobuf 3.0 originally removed and that release 3.15 restored: for several years Protobuf 3 conflated absent with default-valued for scalar fields, and the resulting bugs were widespread enough that the optional keyword was added back.
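
In the generated Python API this surfaces as HasField. A quick illustration, assuming person_pb2 was generated by protoc from the schema above:

from person_pb2 import Person    # hypothetical module name from protoc

p = Person.FromString(bytes.fromhex("082a"))   # field 1 only: id = 42
p.id                     # 42
p.email                  # "": reading an absent optional scalar
                         #     yields the default value
p.HasField("email")      # False: presence is the real test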

Evolution and compatibility

The evolution model is the single feature that justifies most of Protobuf's design choices. The model in plain terms:

Adding a new field is safe in both directions. New producers emit the new field; old consumers see an unknown field number and skip it (the wire type tells them how). Old producers do not emit the field; new consumers see no field with that number and treat the field as absent (or, for scalar types in early Protobuf 3, as defaulted to zero/empty/false).

Removing a field is safe with a procedure. The schema-level rule is to mark the field as reserved so that no future schema reuses the field number for a different type. Old producers may continue to emit the field; new consumers will skip it as unknown. There is no wire-level effect of removal; the bytes are unaffected.

Renaming a field is safe at the wire level (the field number is what matters; names are decorative). Renaming is unsafe at the source level if any code references the old name, which all of it usually does, and so renaming requires a careful migration of generated code.

Changing a field's type is mostly unsafe but has a small set of exceptions. int32 and int64 are wire-compatible (varint of the same value produces the same bytes). int32 and uint32 are wire-compatible for non-negative values but not for negative. int32 and sint32 are not wire-compatible because sint32 uses zigzag and int32 does not. String and bytes are wire-compatible because both use length-delimited wire type and the bytes themselves are arbitrary. Most other type changes are wire-incompatible and amount to schema-level deletions followed by additions.

Reusing a field number is the cardinal sin of Protobuf schema evolution. If field 5 was a string name and is now a repeated int32 ids, old producers will emit a length-delimited string under tag 0x2a, and new consumers will attempt to decode the string bytes as a packed array of int32s. The result will be silent garbage, occasional crashes when the bytes happen to be malformed varints, and hours of debugging. The reserved keyword exists to make this mistake impossible, and using it religiously is the single most important Protobuf hygiene rule.
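
To see the failure concretely: if field 5's number were reused for a packed repeated int32, a new consumer would interpret the old string payload as a run of varints. A sketch in Python; the decoder loop is mine, but the arithmetic is the wire rule:

def read_varints(payload):
    values, pos = [], 0
    while pos < len(payload):
        result = shift = 0
        while True:
            byte = payload[pos]
            result |= (byte & 0x7F) << shift
            pos += 1
            if not byte & 0x80:
                break
            shift += 7
        values.append(result)
    return values

# The old "mathematician" string bytes from the dump above, read as a
# packed int32 array: every letter becomes a plausible-looking integer.
read_varints(b"mathematician")   # [109, 97, 116, 104, 101, ...]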

The optional keyword in Protobuf 3 (since 3.15) restores field-presence tracking for scalar types. Without optional, a scalar field's default value (0 for numbers, "" for strings, false for booleans) is indistinguishable from absence on the wire and in the generated API. With optional, the field gets a has_ accessor in the generated code, and presence is preserved through encoding and decoding. New code should use optional for any scalar where "unset" is meaningfully different from "default."

Ecosystem reality

The Protobuf ecosystem is enormous, mature, and not always easy to navigate. The reference implementation is protoc, the Google-maintained schema compiler that generates code in C++, Java, Python, Objective-C, C#, Ruby, Dart, PHP, JavaScript, and Kotlin. Generated code in other languages comes from third-party plugins: protoc-gen-go for Go (now the official choice), prost for Rust, tonic for Rust gRPC, protoc-gen-ts for TypeScript, and several others. Quality varies, and the choice of plugin within a language sometimes matters: Go has had two generations of generated code, and the older gogoproto was meaningfully faster than the modern google.golang.org/protobuf for some workloads, at the cost of diverging from the canonical generator.

Buf (buf.build) is the most important development of the last decade in the Protobuf world. Buf provides a schema linter, a breaking-change detector, a remote schema registry (the Buf Schema Registry), and a build system that wraps protoc with sensible defaults. The breaking-change detector is the killer feature: it parses your schema diff in CI and rejects pull requests that change field numbers, change field types incompatibly, remove fields without reserving them, or do any of the other things that produce silent wire-level breakage. Adopting Buf is the single largest upgrade most Protobuf-using teams can make.

gRPC is the dominant RPC framework over Protobuf. The relationship between the two is a well-defined separation: Protobuf defines the message format and the service IDL; gRPC defines the request/response flow, the framing over HTTP/2, the streaming semantics, and the connection lifecycle. You can use Protobuf without gRPC (many do) and you can use gRPC without Protobuf (you can supply alternative codecs), but the canonical pairing is the dominant one.

Confluent Schema Registry has supported Protobuf since 2019. The Kafka ecosystem now has three first-class schema formats — Avro (historical default), Protobuf, and JSON Schema — and the choice between them inside Kafka is mostly aesthetic. Protobuf in Kafka is common enough that several large companies have moved off Avro specifically to align Kafka with their service-RPC schemas.

The text format (prototext) is the canonical human-readable representation of a Protobuf message. It is not JSON. It is also not stable across versions in the strict sense — the bytes round-trip, but the formatting may not. For configuration files, prototext is the right choice. For interchange with non-Protobuf systems, the JSON encoding (Protobuf 3 has a defined JSON mapping) is more portable.

Two ecosystem gotchas worth noting. First, the Python implementation is famously slow; the C++ extension (which most production code uses) is much faster than the pure-Python fallback, and on constrained environments where the extension is unavailable the performance gap is severe. Second, the Any type — Protobuf's mechanism for embedding arbitrary serialized messages — requires the consumer to know the embedded type's schema separately, which in practice means Any fields are usually the wrong choice and a oneof is what you wanted.

When to reach for it

Protobuf is the right default for new typed binary protocols between services you control. It is the right choice for gRPC-based microservices, period; the alternatives are weaker on some axis. It is the right choice for Kafka topics where the producer and consumer are in different languages and the schema must be enforced. It is a reasonable choice for typed configuration formats, though the ergonomics of prototext are dated.

It is also the right choice for any system where forward and backward compatibility under heterogeneous deployment is a hard requirement. The tagged-field model, combined with Buf's breaking-change detector, gives you the strongest static guarantee of compatibility that any wire format provides.

When not to

Protobuf is the wrong choice when the schema cannot be enforced between producer and consumer (public APIs to undifferentiated clients), when zero-copy access is required (FlatBuffers or Cap'n Proto), when columnar layout is required (Parquet, Arrow), when deterministic encoding is required (the Protobuf "deterministic" mode is a best-effort and is not canonical across versions), or when human-editable text is the primary representation (use a YAML or JSON-based config language and convert).

It is also the wrong choice when the operational cost of schema distribution exceeds the savings: small projects, prototypes, public-facing endpoints. JSON is operationally cheaper for those.

Position on the seven axes

Schema-required. Not self-describing (the bytes are tag-and-value without semantic field names; the schema is mandatory to interpret). Row-oriented. Parse rather than zero-copy. Codegen, with DynamicMessage available as a runtime escape hatch. Non-deterministic by spec; the deterministic mode is a best-effort opt-in. Evolution by tagged fields, with reserved as the explicit hygiene tool.

The cell Protobuf occupies — schema-required, tagged-field evolution, varint-dense, codegen-first — is the cell most new typed binary formats are compared against, and the comparison is hard to win.

Epitaph

Protobuf is the typed binary protocol that the rest of the industry spent twenty years failing to displace, and the only meaningful way to lose to it is to choose a format that is better at exactly one thing you care about more than tagged-field evolution.

Thrift

Thrift is the format that Protobuf displaced. It is also the format that arrived at most of Protobuf's design decisions independently and roughly contemporaneously, with a few differences worth understanding. The differences are not subtle, and the parts of Thrift that are better than Protobuf — the explicit required/optional distinction that its schema language preserved through its lifetime, the in-band protocol selection, the integrated RPC stack — are interesting precisely because Protobuf rejected them on operational grounds rather than design ones. Thrift is the working alternate-history of typed binary serialization.

Origin

Thrift was built at Facebook in 2007 by Mark Slee and a small team to solve the same problem Protobuf was solving at Google: heterogeneous service-to-service communication across a polyglot infrastructure where the build pipeline could not afford to redeploy every service in lockstep. Slee's team designed Thrift from scratch rather than adopting Protobuf, partly because Protobuf was not yet public, partly because Facebook needed an integrated RPC story, and partly because Slee believed Protobuf's approach to certain problems (notably field presence, which Protobuf 1 and 2 expressed via the required keyword) was correct and wanted to keep that mechanism while extending the format with multiple wire encodings.

Facebook open-sourced Thrift in 2007 and donated it to the Apache Software Foundation in 2008, where it became Apache Thrift. The project has been in continuous, if slow, maintenance ever since, with the bulk of the activity coming from Apache committers rather than Facebook (which has long since moved to a fork called fbthrift, maintained internally with periodic open-source drops).

The Apache Thrift project's center of gravity is now in the embedded and high-performance computing communities — places where the choice of wire format predates gRPC's dominance and where the cost of migration exceeds the cost of staying put. Thrift remains the external API surface for HBase, several legacy financial systems, and a long tail of internal services at companies that adopted it before 2015; it was also Cassandra's client interface until the Thrift API, deprecated in the 2.x line, was removed in 4.0. For new projects in 2026, Thrift is rarely the right choice; the ecosystem momentum has shifted decisively to Protobuf and gRPC. But the format is widely deployed enough that engineers will encounter it, and understanding it is worth the chapter.

The format on its own terms

Thrift is unusual among schema-first wire formats in defining multiple wire encodings within a single specification. The two encodings worth knowing are Binary Protocol (the original) and Compact Protocol (the dense one, added in 2008 to compete with Protobuf on size). There is also a JSON Protocol for debugging, a TupleProtocol for fixed-schema fast access, and a Multiplexed Protocol that wraps another encoding to support multiple services over one connection. The encoding is selected by the client and server at connection setup, not by the schema, and a single Thrift schema can be consumed in any of them.

Binary Protocol is straightforward and not particularly compact: each field is encoded as a one-byte type code, a two-byte field ID, and the value. Integers are big-endian fixed-width (i32 takes four bytes regardless of value, i64 takes eight). Strings are length-prefixed with a four-byte length. The encoding is easy to parse, easy to debug in a hex viewer, and pays a substantial size overhead for small values. It exists primarily for backward compatibility; new Thrift deployments use Compact Protocol.

Compact Protocol is the encoding worth understanding. It is philosophically similar to Protobuf: varint-encoded integers, packed field tags, length-delimited strings. The differences are mostly in the framing details. Compact Protocol packs the field type code and a delta from the previous field ID into a single byte; if the delta fits in four bits and the type code fits in four bits, the field tag is one byte total. Boolean values are folded entirely into the type code (type 1 means "bool true," type 2 means "bool false," with no payload byte), which is denser than Protobuf's booleans, where the tag byte is followed by a one-byte payload. Signed integers are zigzag-encoded, which Protobuf requires the schema to opt into via the sint32 type.
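
A sketch of the short-form header in Python, assuming the nibble layout just described; the helper is mine, and the long form (delta too large for four bits) is omitted:

def field_header(prev_field_id, field_id, type_code):
    delta = field_id - prev_field_id
    if 1 <= delta <= 15:
        return bytes([(delta << 4) | type_code])   # one byte total
    # Long form: a type byte with a zero delta nibble, then the field
    # ID written separately. Rare in practice; omitted from this sketch.
    raise NotImplementedError("long-form header omitted")

field_header(0, 1, 6)   # b'\x16': field 1, i64, as in the dump below
field_header(5, 6, 1)   # b'\x11': field 6, bool-true folded into type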

The data model is comparable to Protobuf's, with a few Thrift-specific additions. Thrift has a native set type and a native map type at the wire level, where Protobuf has only repeated fields and synthesizes maps from the schema language. Thrift's binary type is distinct from string at the schema level (Protobuf 3 also makes this distinction, but earlier versions did not). Thrift's closest equivalent to oneof is the IDL union type, which is encoded on the wire as an ordinary struct and relies on the generated code and convention to enforce that exactly one field is set.

The most interesting schema-level distinction is field presence. Thrift fields can be declared required, optional, or default. Required means the field must be present, and a decoder that does not see it raises an error. Optional means the field may or may not be present, with no penalty for absence. Default (the default for fields that don't declare otherwise) means the field may be absent on the wire, but the generated code treats absence as the schema-declared default value. Protobuf 2 had the same three modes; Protobuf 3 dropped required and conflated default and optional for scalar types (until 3.15 partially restored the distinction). The Thrift community kept all three and considers the loss of required a strict regression. The argument on the other side — that required is a schema-evolution landmine, because removing a field that consumers expect to be required produces a protocol break — is genuine, and the Protobuf team's experience apparently led them to conclude that the operational cost of required exceeded its safety benefit.

Wire tour

Schema:

struct Person {
  1: required i64 id;
  2: required string name;
  3: optional string email;
  4: required i32 birth_year;
  5: required list<string> tags;
  6: required bool active;
}

Encoded with Compact Protocol:

16 54                                        field 1 (delta 1, type i64=6), zigzag(42)=84
18 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65    field 2 (delta 1, type binary=8), len 12, "Ada Lovelace"
18 15 61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                field 3 (delta 1, type binary=8), len 21, "ada@..."
15 ae 1c                                     field 4 (delta 1, type i32=5), zigzag(1815)=3630
19                                           field 5 (delta 1, type list=9)
   28 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e  list header (count 2, type 8), len 13, "mathematician"
   0a 70 72 6f 67 72 61 6d 6d 65 72              len 10, "programmer"
11                                           field 6 (delta 1, type bool=true=1)
00                                           stop field

71 bytes — essentially identical to Protobuf for this payload, which is no accident. The two formats made similar choices about varint encoding, length prefixes, and tag packing, and the small differences (Thrift's zigzag-by-default for signed ints, Thrift's fold-bool-into-type-code, Thrift's explicit list-element-type byte) tend to wash out across mixed payloads. Thrift Compact occasionally wins by a few bytes for boolean-heavy payloads; Protobuf occasionally wins by a few bytes for unsigned-integer-heavy payloads. Neither format dominates.

The structural differences from Protobuf are visible in the bytes. The list field encoding includes an explicit element-type byte (28 — top nibble is the count 2, bottom nibble is the type 8 for binary/string), which Protobuf does not require because Protobuf's repeated fields are not natively typed at the wire level. The stop field byte (00) marks the end of the struct and is required; Protobuf has no stop marker because messages are length-delimited when embedded and rely on EOF when at the top level.

If email were absent, the encoding would skip those 23 bytes, and the next field's header byte would carry a delta of 2 instead of 1 to compensate. This delta encoding is the small detail that distinguishes Thrift Compact from Protobuf at the wire level and explains how Thrift packs the field tag into a single byte for most fields: the delta from the previous field's number is almost always small.

Evolution and compatibility

Thrift's evolution rules are closely parallel to Protobuf's. Adding a field with a new ID is safe in both directions if the field is declared optional; consumers without the field's schema skip it, producers without it omit it. Removing a field is safe if the removal is coordinated with consumers; the field ID should be retired and not reused. Renaming a field is safe at the wire level (IDs are what matter). Changing a field's type is mostly unsafe with a similar set of exceptions to Protobuf's.

The single substantial difference is the required keyword. A producer that omits a required field, or a consumer that does not recognize a required field, will either fail at decode time or produce undefined behavior depending on the language binding. This makes required a one-way door: once a field is declared required and deployed, removing it is a coordinated migration. The Thrift guidance is therefore "use required sparingly," which produces the question of why the keyword exists at all if every guidance document warns against using it. The Protobuf team eventually concluded that the answer is "it shouldn't" and removed it. The Thrift community kept it because the cases where it's right are real.

The deterministic-encoding question for Thrift is the same as for Protobuf: not specified at the wire level, achievable with care. Thrift has no canonical encoding subset, no mandated map ordering, no specified varint widths. Sign-and-verify schemes over Thrift typically canonicalize the bytes themselves rather than rely on the encoder.

Ecosystem reality

The Thrift ecosystem is mature, fragmented, and slowly contracting. The Apache Thrift project ships generators for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, JavaScript, Smalltalk, OCaml, Delphi, and a few others. Quality varies; the C++ and Java generators are first-class, the Python and Go generators are good, and the long tail is functional but not state of the art.

Facebook's fork, fbthrift, has diverged substantially from Apache Thrift over the years. fbthrift adds streaming support, a ContextStack for cross-cutting concerns, and a number of internal optimizations Facebook needed for its scale. The two are wire-compatible at the Compact Protocol level but not API-compatible at the source level. Most companies pick one or the other; few maintain both.

Twitter's Finagle uses Thrift extensively, and Finagle's Scrooge generator (for Scala) is the canonical Scala Thrift implementation. Apache HBase exposes a Thrift gateway as one of its primary external APIs. Apache Cassandra's Thrift API was the original client interface; it was deprecated in 2.x, removed in 4.0, and is now of historical interest only. Apache Hive uses Thrift internally for its metastore protocol; the Thrift-defined HiveMetaStore service is one of the more durable Thrift deployments in the analytics ecosystem.

Two ecosystem gotchas worth noting. First, the multiple-protocol design means that the wire format is not a property of the schema; it is a runtime configuration choice. A Thrift service in production may speak Binary, Compact, or JSON depending on how the server was started. Tools that snoop on Thrift traffic must detect the protocol from the first few bytes; some tools do this poorly, and traces from the wrong protocol look like noise.

Second, the integrated RPC stack — TServer, TTransport, TProtocol in the canonical naming — is not optional in many of the language bindings. You cannot easily use Apache Thrift's serialization without dragging in its server and transport classes. fbthrift makes this cleaner, as do Finagle's Scrooge bindings. For serialization-only uses, this overhead is real and worth weighing.

When to reach for it

Thrift is the right choice when interoperating with an existing Thrift-using ecosystem: HBase, Hive, fbthrift-using companies, the remaining Cassandra-via-Thrift deployments. It is a defensible choice for new systems where the integrated RPC stack matters and gRPC's HTTP/2 baseline is unwelcome (some embedded environments, some legacy network topologies).

It is the right choice when required field semantics are a hard requirement and the alternative would be to reimplement them in application code over Protobuf.

When not to

Thrift is not the right choice for new microservices in greenfield environments. gRPC plus Protobuf has won that space operationally — better tooling (Buf), better runtime libraries, better integration with modern observability stacks, broader language support — and the few axes on which Thrift is technically superior are not enough to overcome the ecosystem gap.

Thrift is also not the right choice when the operational cost of selecting a wire protocol at runtime is unwelcome. Protobuf's single wire format is, on balance, simpler to reason about than Thrift's three.

Position on the seven axes

Schema-required. Not self-describing. Row-oriented. Parse rather than zero-copy. Codegen-first, with runtime support via dynamic protocols. Non-deterministic by spec. Evolution by tagged fields, with required/optional/default distinguishing presence semantics.

Thrift's stance differs from Protobuf's in two places: the multiple-protocol design (Thrift permits Binary, Compact, JSON, and others; Protobuf has a single wire format) and the preserved required keyword. Both differences are intelligible as choices, and both produced operational costs that the Protobuf team explicitly chose to avoid.

A note on the required-field debate

The argument over required is the single most theologically charged disagreement in the schema-first wire format world, and it is worth getting the shape of it right. The case for required is that it documents an invariant the schema author considers load-bearing: this field cannot be absent without rendering the record meaningless. The decoder-side enforcement turns silent bugs (consumer reads garbage because producer omitted a field) into loud bugs (decode fails). The schema reader can see, at a glance, which fields are critical versus which are optional, and the generated code's API reflects that distinction.

The case against required is that it is a one-way door at the schema level. Once a field is required, you cannot remove it without coordinating the removal across every producer and consumer in your fleet. Coordination across a fleet of services in heterogeneous deployment states is exactly the problem schema-evolution mechanisms are supposed to solve, and required makes one of those problems insolvable: the field cannot be removed because removing it would break decoders, but if the field is no longer used by anyone, it is dead weight that everyone has to keep around forever. The Protobuf team's experience inside Google was that this scenario arose often enough to make required a net liability, and Protobuf 3 dropped it.

The Thrift community's experience appears to be different, perhaps because Thrift deployments are typically smaller and more self-contained than the cross-org Protobuf deployments at Google. A Thrift schema owned by a single team, deployed on a single release schedule, can use required safely. A Protobuf schema crossing dozens of teams and hundreds of services cannot. The disagreement is therefore not really about the keyword; it is about the deployment topology in which schema evolution happens. Both sides are right for their respective topologies.

Epitaph

Thrift is Protobuf's contemporaneous twin, with three wire formats and the courage to keep required; deployed widely, growing slowly, displaced operationally by gRPC.

Avro

Avro is the schema-first wire format that thinks about schemas differently from Protobuf and Thrift, and it is worth getting that difference clear before doing anything else with the format. Protobuf and Thrift identify fields by stable numeric tags emitted on the wire; the schema's job is to map those tags to types and names, and the wire format carries enough information that a decoder can skip fields it does not recognize. Avro identifies fields by position — the wire format is just the field values concatenated in schema order, with no tags at all — and relies on a mechanism called schema resolution to handle version skew between producers and consumers. The two approaches are not subtly different, and the consequences cascade through everything else about Avro: the wire format is denser, the schema is mandatory at decode time, the evolution rules are formal and machine-checkable, and the typical deployment shape involves a schema registry that the typical Protobuf deployment does not.

Origin

Avro was created by Doug Cutting in 2009 as part of the Apache Hadoop ecosystem. Cutting (who had previously created Lucene, Nutch, and Hadoop itself) wanted a serialization format that would be the canonical row representation in Hadoop's MapReduce pipeline and the canonical record format in HDFS. The constraints Cutting faced were the constraints of an analytical pipeline: huge volumes of records, all conforming to a known schema, written by jobs that knew the schema at write time and read by jobs that knew the schema at read time, with the two schemas potentially differing because the pipeline evolved over months or years.

The first design choice that fell out of those constraints was schema-with-data. Hadoop sequence files are large; embedding the schema once at the head of the file amortizes to nothing per record and gives every reader of that file the canonical interpretation of the bytes. The Avro Object Container File format codifies this: a magic number, a schema in JSON form, a synchronization marker, and then a series of compressed blocks of records, each block re-stating the synchronization marker so that a stream-aware reader can skip ahead. The schema is part of the file. The file is self-describing. The records inside the file are not.

The second design choice was positional wire encoding. Once the schema is known at decode time — guaranteed by the file format, and in Kafka by the convention of pairing Avro with Confluent's Schema Registry — there is no need for field tags. Records become the concatenation of their field values, encoded in schema-declared order, with each field's encoding determined by its declared type. The result is the densest of the schema-required wire formats in this book: no field numbers, no length prefixes on records, no type tags on values.

The third design choice was schema resolution. When the schema under which the bytes were written (the writer's schema) differs from the schema the consumer wants to interpret them as (the reader's schema), Avro defines explicit rules for reconciling the two. Field reordering, type promotion (int → long → float → double, in that order), default values for missing fields, and aliases for renamed fields are all handled by the resolution algorithm, which runs at decode time and produces the resolved record in the reader's schema.

These three choices — schema-with-data, positional encoding, and schema resolution — define Avro. Everything else in the format is plumbing.

The format on its own terms

Avro schemas are written in JSON. A primitive type is a JSON string: "int", "long", "string", "boolean". A complex type is a JSON object with a type field and additional properties. A record has a name, optional namespace, and a fields array where each field is an object with name, type, and optional default, doc, aliases, and order properties. An enum has a name and a symbols array. An array has an items type. A map has a values type and string keys. A union is a JSON array of types; the most common union is ["null", T], which encodes an optional T. A fixed is a fixed-length byte string with a name.

The wire encoding for each type is small and ruthlessly literal. Boolean: one byte, 0 or 1. Int and long: zigzag varint, exactly the same encoding Protobuf calls sint32/sint64. Float: four bytes, little-endian IEEE 754. Double: eight bytes, little-endian IEEE 754. String and bytes: a long (length) followed by that many UTF-8 or binary bytes. Records: the concatenation of their fields' encodings, in schema-declared order, with no separator. Enums: a long giving the symbol's index in the schema's symbols array. Arrays and maps: a sequence of blocks, each consisting of a count (long) and that many items, terminated by a zero-count block. Negative counts on array and map blocks indicate that a byte size for the block follows the count; this enables decoders to skip whole blocks without parsing their contents, which matters for analytics workloads.
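
The block-skipping convention is worth a sketch. Assuming a seekable stream and an array of string items, a reader can honor both block forms like this; the helpers are mine, not part of any Avro library:

def read_long(fo):
    result = shift = 0
    while True:
        byte = fo.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)      # undo zigzag

def skip_array_of_strings(fo):
    while True:
        count = read_long(fo)
        if count == 0:                 # zero-count block: array is done
            return
        if count < 0:                  # negative count: byte size follows
            fo.seek(read_long(fo), 1)  # skip the whole block with a seek
        else:                          # no size: must walk the items
            for _ in range(count):
                fo.seek(read_long(fo), 1)   # string: length, then bytes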

Unions are encoded as a long indicating which branch was selected (0 for the first member of the union, 1 for the second, etc.), followed by the value encoded according to that branch's type. For the common case ["null", "string"], encoding null produces the single byte 0x00, and encoding a string produces 0x02 followed by the string encoding. This is the optionality mechanism: optional fields are unions with null.

The order in which fields appear on the wire is the order they appear in the schema. There is no flexibility, no version information, no per-record schema identifier in the bytes. The bytes are uninterpretable without the schema, and the schema must be known by the decoder through some mechanism the format does not specify (file header, registry lookup, side-channel).

The Avro Object Container File adds the framing that makes self-described files practical. The file format is: a four-byte magic (Obj\x01), a header containing the schema in JSON and a randomly-generated 16-byte synchronization marker, and then a sequence of data blocks. Each block contains the count of records, the byte size of the compressed block, the compressed bytes, and the synchronization marker repeated. Compression is per-block and configurable (deflate, snappy, bzip2, xz, zstandard).

Wire tour

Schema:

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "birth_year", "type": "int"},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "active", "type": "boolean"}
  ]
}

Encoded:

54                                           id: zigzag(42) = 84
18 41 64 61 20 4c 6f 76 65 6c 61 63 65       name: len 12, "Ada Lovelace"
02                                           email union branch 1 (string)
   2a 61 64 61 40 61 6e 61 6c 79 74 69 63 61 6c
      2e 65 6e 67 69 6e 65                   email value: len 21, "ada@analytical.engine"
ae 1c                                        birth_year: zigzag(1815) = 3630, varint
04                                             tags array block of 2
   1a 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician"
   14 70 72 6f 67 72 61 6d 6d 65 72            "programmer"
00                                             tags array terminator (0-count block)
01                                           active: true

67 bytes — the densest schema-first encoding of our Person record in the book, narrowly beating Protobuf and Thrift. The wins come from three places. First, no field tags: the record is just the concatenation of values. Second, the union encoding for email is a single byte plus the string, where Protobuf and Thrift each spend a tag byte plus a length prefix. Third, the array encoding uses a single zigzag-varint count rather than a per-element tag.

The cost is also visible. The bytes alone are uninterpretable. There is no field number to scan for, no string key to match against. A reader that does not have the schema cannot tell where one field ends and the next begins, because the field boundaries are not marked in the bytes; they are implicit in the schema. This is the price of positional encoding, and Avro pays it willingly.

If email were absent (the field with default null), the encoding would emit 00 for the union branch (null), which is a single byte rather than the 23 bytes of the present case. The schema's default makes the absence well-defined: encoding a record without the email value produces the null branch, and decoding the null branch produces a record where email's deserialized value is the schema-declared default, which is null.
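
As a cross-check, the fastavro library (the mature Python implementation discussed later in this chapter) will reproduce this encoding. A sketch, assuming fastavro is installed:

import io
import fastavro

schema = fastavro.parse_schema({
    "type": "record", "name": "Person", "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "birth_year", "type": "int"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "active", "type": "boolean"},
    ],
})

record = {"id": 42, "name": "Ada Lovelace",
          "email": "ada@analytical.engine", "birth_year": 1815,
          "tags": ["mathematician", "programmer"], "active": True}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, record)
len(buf.getvalue())   # 67, assuming the writer emits the single-block
                      # array encoding shown above (fastavro's default)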

Evolution and compatibility

Avro's schema resolution rules are the most formal in this book. Given a writer's schema and a reader's schema, the resolution algorithm produces either a successful resolution (the bytes can be decoded into the reader's schema) or a structural error (the schemas are incompatible). The rules:

  • A field present in both schemas with compatible types is decoded through type promotion if the types differ.
  • A field present in the writer's schema but not the reader's is decoded and its value is discarded.
  • A field present in the reader's schema but not the writer's is filled in with the reader's schema-declared default. If no default is declared, the resolution fails.
  • An enum symbol present in the writer's schema but not the reader's is decoded as the reader-schema-declared default. If no default, the resolution fails.
  • A union with members in the writer's schema can be resolved against a reader's schema where the union members are a subset, with rules for null branches and type promotion.
  • Aliases on the reader's schema let a field be matched to a differently-named field on the writer's schema.

The four standard compatibility modes are:

  • Backward compatible: a reader using the new schema can read bytes written by the old schema. Field additions with defaults, field removals, and union promotions all preserve backward compatibility.
  • Forward compatible: a reader using the old schema can read bytes written by the new schema. Field additions are forward-compatible if the reader is willing to drop unknown fields (which Avro readers are, by spec); field removals require the field to have had a default in the old schema.
  • Full compatibility: both backward and forward compatible. Only schema changes that satisfy both rules are allowed.
  • No compatibility: schemas may change arbitrarily; consumers must coordinate with producers.
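
The resolution rules are mechanical enough to watch in action. A sketch with fastavro's schemaless reader, assuming the library: the writer's schema lacks email, the reader's declares it with a default, and resolution fills it in.

import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "id", "type": "long"}],
})
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"id": 42})
buf.seek(0)
fastavro.schemaless_reader(buf, writer_schema, reader_schema)
# {'id': 42, 'email': None}: the reader's default, per the rules above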

The Confluent Schema Registry, the canonical schema-distribution service for Kafka-based Avro deployments, enforces the chosen compatibility mode at registration time. A producer that tries to register a new schema that violates the topic's compatibility policy gets rejected at the registry, before any incompatible bytes can be written. This pre-deployment enforcement is the single most valuable feature of Avro-with-registry as an operational pattern, and it is the reason Avro has held up under years of organic schema evolution in large Kafka deployments where Protobuf would have required a lot more discipline.

The deterministic-encoding question for Avro is interesting. Records are encoded in field order with no flexibility, integer encodings are zigzag-varint with no width choice, strings are length-prefixed with no padding option, and the only places non-determinism creeps in are map key ordering (which the spec leaves unspecified, unlike record field order) and the choice of array block sizes. A canonicalization layer that sorts map keys and forces a single-block encoding is straightforward to add. Avro is not deterministic by spec, but it is closer to deterministic than Protobuf or Thrift.

Ecosystem reality

Avro's primary ecosystem is the Hadoop / Kafka / Spark / Flink analytics stack. Apache Avro is the canonical format for HDFS records in Hadoop deployments; it is one of three first-class formats in Confluent's Schema Registry (alongside Protobuf and JSON Schema), and historically the default; it is the canonical record format for Kafka Streams when typed records are needed; it is supported natively by Spark's from_avro/to_avro functions and by Flink's table API.

Outside the analytics stack, Avro's adoption is modest. The Avro RPC story (Avro IDL plus the Avro RPC spec) exists but is not widely deployed; gRPC and Thrift have taken that space. The Avro schema language has a few warts — JSON syntax for schemas is verbose, and the Avro IDL alternative syntax exists but lacks comparable tooling support — and the runtime libraries vary in quality across languages. The Java implementation is canonical and high-quality. The Python implementation (fastavro) is mature. The Go and Rust implementations are functional but less polished than their Protobuf equivalents.

The single most consequential ecosystem fact about Avro is that the schema-with-data file format and the schema registry pattern were right. Both have aged remarkably well. Both have proven robust to the kinds of long-term schema evolution that Protobuf-without-Buf struggles with. Both have spawned analogues for other formats — the Confluent Schema Registry now supports Protobuf and JSON Schema exactly because the operational pattern is more important than the underlying serialization. If a single Avro contribution outlives the format itself, it will be the registry pattern.

Two ecosystem gotchas. First, logical types — Avro's mechanism for layering semantic types like decimal, UUID, timestamp on top of primitive types — are implemented inconsistently across libraries. A timestamp encoded by the Java client may not round-trip through the Python client unless both are configured to recognize the same logical type. This is a known issue and is improving, but it bites.

Second, the JSON encoding of Avro is a separate spec from the binary encoding, and the two are not the same bytes. JSON representations of Avro values are useful for debugging and for interop with non-Avro consumers, but they should not be confused with Avro itself; the binary encoding is the format, and the JSON encoding is a debugging tool.

When to reach for it

Avro is the right choice for analytical workloads on the JVM-adjacent stack: Hadoop, HDFS, Hive, Spark, Flink. It is the right choice for Kafka topics where the schema-registry pattern is in place and the operational discipline of compatibility-checked schema evolution is desired. It is the right choice for long-lived data on disk where the schema must travel with the bytes (Object Container Files solve this perfectly).

It is a defensible choice for inter-service typed serialization when an operating registry is already in the stack and the team is comfortable with positional encoding.

When not to

Avro is the wrong choice for inter-service RPC in greenfield deployments; gRPC plus Protobuf has won there and the Avro RPC story does not compete. It is the wrong choice when the schema cannot be reliably distributed (public APIs, polyglot multi-organization deployments without registry infrastructure). It is the wrong choice for tiny payloads where the overhead of schema lookup or transmission exceeds the bytes saved.

It is the wrong choice when the team's mental model is schema-as-tagged-fields (Protobuf-flavored) rather than schema-as-positional-record (Avro-flavored); the conceptual mismatch produces operational pain.

Position on the seven axes

Schema-required. Self-describing only via container file framing — the bytes alone are not. Row-oriented (the columnar Avro variants discussed under Parquet inherit only the schema language). Parse rather than zero-copy. Codegen optional; runtime use through GenericRecord is common and idiomatic. Non-deterministic by spec but close to it; canonicalization is straightforward. Evolution by reader/writer schema resolution, with the strongest formal compatibility model in this book.

The cell Avro occupies — schema-required, positionally-encoded, resolution-based evolution — is unique among the formats in this book and is structurally well-suited to the analytical-data workloads it was designed for.

Epitaph

Avro is the schema-first format that decided to put the schema in the file rather than in the bytes, and the operational pattern that followed (the schema registry) is more durable than any specific wire format.

Schema Evolution Compared

The three preceding chapters covered Protobuf, Thrift, and Avro each on its own terms. The choice between them is rarely about wire-level efficiency, which is approximately the same across all three for typical payloads. The choice is almost always about how the format handles the case where the schema changes and the deployment cannot upgrade producers and consumers in lockstep. This chapter sets up that question directly: a small fixed list of schema-change scenarios, applied to all three formats in turn, with the rules and consequences laid out side by side.

The reason for treating this comparison as its own chapter rather than scattering it through the format chapters is that the comparison itself is a useful artifact. The right way to choose a schema-first format for a new system is to walk through the changes you expect to make to the schema over the next five years and ask which of them each format makes easy, which it makes hard, and whether the hard cases line up with the changes you actually need to make. Skipping this exercise produces formats chosen for the wrong reasons; doing it produces choices that hold up.

The list of scenarios is deliberately small. Real-world schema evolution is more varied than seven scenarios can capture, but seven is enough to surface the differences between the formats. Each scenario describes the change, the constraints on producers and consumers, and the result for each format.

Scenario 1: Add a new optional field with no default

The cleanest case. We have a Person schema. We want to add a country field that may or may not be present.

In Protobuf, the change is a one-line schema edit: optional string country = 7;. Field number 7 is new. Old producers do not emit it. New producers may emit it. Old consumers ignore it (the unknown-field code path skips it cleanly). New consumers receive either an absent field (if the producer is old) or a present field (if new). Nothing breaks. No coordination is required between producer and consumer deployment. This is the case Protobuf was designed for, and it works.

In Thrift, the change is a one-line schema edit: 7: optional string country;. The behavior is identical to Protobuf's: old producers omit, new producers emit, consumers gracefully handle either. Field IDs are the wire-level identity. Nothing breaks.

In Avro, the change requires a default value: {"name": "country", "type": ["null", "string"], "default": null}. The default is mandatory because Avro reader-writer schema resolution requires every field in the reader's schema that isn't in the writer's schema to have a default. Without the default, an old-producer/new-consumer combination fails resolution at decode time. With the default, the change is forward-compatible (old reader sees bytes from new writer and ignores the new field) and backward-compatible (new reader sees bytes from old writer and fills in the default).

In the schemaless self-describing formats (MessagePack, CBOR, BSON), the change is: emit the new key when the application has a value, don't emit it when it doesn't. There is no schema to update because there is no schema. Old consumers ignore the unknown key; new consumers handle the missing key as the application sees fit. The format is uninvolved.

Scenario 2: Add a new required field

The dangerous case. We want to add a country field that must be present in every record.

In Protobuf 3, the required keyword is gone, so this scenario is technically impossible at the schema level. The closest you can get is to add an optional field and enforce required-ness in application code. This is the answer the Protobuf team prefers, and the cost is real: the schema does not document the invariant, and consumers must remember to check.

In Thrift, the change is 7: required string country;. The behavior at first glance looks fine: new producers emit, new consumers expect, both sides updated together. The trap is the deployment sequence. If the producer ships first, old consumers receive a record with an unknown field and skip it; they do not know to expect the field, so they are unaffected. Fine. If the consumer ships first, it expects the field to be present, but old producers do not emit it, and the decoder fails loudly. A required field cannot be deployed without a strict deployment order: producers before consumers, every time.

In Avro, this scenario is also technically impossible without a default. Adding a field with no default and reading old data fails at resolution time. The pattern that approximates "required" in Avro is: add the field with a default in the schema, and have application code reject records where the value is the default. This is the same pattern Protobuf 3 uses, with the same costs.

In the schemaless formats, the scenario is purely an application concern. The format will not help.

The lesson is that adding a required field is a coordinated deployment in any format. Thrift will make the failure loud. Protobuf and Avro will make it quiet. Neither is automatically safer; both require the same operational care.

Scenario 3: Remove a field

We want to remove the email field. The field is currently optional.

In Protobuf, the change is reserved 3; (or reserved "email";) plus removal of the field declaration. The reserved keyword prevents future schema versions from reusing field number 3 for a new field of a different type. Old producers may continue to emit the field; new consumers see the field number as unknown and skip it. New producers do not emit the field; old consumers see no field with that number and treat it as absent (its default). The change is safe in both directions.
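
In schema form, with the unchanged fields elided, the removal looks like this (a sketch following the running Person example):

message Person {
  reserved 3;          // the old email field number; never reuse it
  reserved "email";    // optionally reserve the name too, protecting JSON/text-format users
  // remaining fields unchanged
}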

In Thrift, the change is the removal of the field declaration. Thrift does not have a reserved keyword in mainstream syntax, but the rule against reusing field IDs is identical: never reassign a removed ID. Old producers continue to emit; new consumers skip. New producers don't emit; old consumers see absence. If the field was required, the rules are the same as Scenario 2 in reverse: old consumers expecting the field will fail when new producers omit it. Removal of a required field is a coordinated deployment.

In Avro, removing a field requires the field to have had a default in its schema declaration (so that old readers, whose schemas still include the field, can fill in the default when new writers omit it). The schema registry will reject the change if the compatibility mode is "backward" or "full" and the field has no default. This is one of the cases where Avro's resolution-based model is more restrictive than Protobuf's tag-based model: in Protobuf you can remove a field without consequences (modulo reserved), and old data still decodes because the field number is just unknown; in Avro you have to have planned for the removal at the time of the field's introduction by giving it a default.

In the schemaless formats, removal means stop emitting the key. Consumers handle the missing key. There is no policy to enforce.

Scenario 4: Rename a field

We want to rename birth_year to year_of_birth.

In Protobuf, this is a non-event at the wire level: field number 4 still encodes the value, regardless of the field's source-level name. The bytes are unchanged. The cost is in the source code: every reference to the field name needs to be updated, generated code regenerated, and any code that uses reflection-by-name has to be migrated. The wire is fine; the source is the work.

In Thrift, the same: field IDs are wire-level identity, names are decorative. Rename the field, regenerate, redeploy.

In Avro, the wire encoding is positional and does not carry the field name. But the resolution algorithm matches reader-schema fields to writer-schema fields by name (with aliases as the explicit override). To rename, declare the new name and add the old name as an alias: {"name": "year_of_birth", "type": "int", "aliases": ["birth_year"]}. Without the alias, an old writer schema and new reader schema will fail to resolve the renamed field, and the value will be missing in decoded records.

In the schemaless formats, rename means start emitting the new key, optionally keep emitting the old one for compatibility, and have consumers handle both. There is no formal mechanism. The operational discipline is identical to Protobuf's, but spread across application code instead of schema files.

Scenario 5: Change a field's type

We want to change birth_year from int32 to int64. We also want to consider the harder case of changing it from int32 to uint32.

In Protobuf, int32 and int64 are wire-compatible: the varint encoding of small positive values is identical, and the decoder for int64 accepts the int32 wire bytes. int32 and uint32 are wire-compatible for non-negative values; for negative values the encodings differ in sign-extension behavior, which is why this change is documented as "compatible only if all values are non-negative." int32 to sint32 is not compatible because the encodings differ (zigzag vs. straight varint), and this is a common mistake. The compatibility table for Protobuf type changes is well-known and is the kind of thing breaking-change detectors like Buf check automatically.

In Thrift, the equivalent table exists but is shorter, with one caveat: i32 and i64 share the same zigzag varint value encoding in Thrift Compact, so the value bytes are compatible, but the per-field type code in the header differs between the two, and a strict decoder may skip the retyped field rather than widen it. i32 and i64 are wire-incompatible with string and binary outright, which use different wire types. There is no distinction between zigzag and straight varint at the schema level, which means Thrift does not have the int32-to-sint32 trap.

In Avro, type changes go through the resolution algorithm's type promotion rules: int → long → float → double, in that order. Changing birth_year from int to long is a forward and backward compatible change. Going the other direction (long to int) is not, because old data may include values larger than fit in an int, and the resolution will fail for those. There is no support for unsigned integer types in Avro, which is itself a small schema-language difference: schemas that need unsigned values encode them as long and have the application enforce the range.

In the schemaless formats, type changes are an application concern. The wire bytes for an integer carry just enough information to decode the integer; the application interprets the result.

Scenario 6: Reorder fields

We want to declare name before id in the schema.

In Protobuf, this is purely cosmetic. The wire format is keyed by field number, and field numbers are unchanged. Reordering the declarations affects nothing. The bytes are identical.

In Thrift, same as Protobuf. Field IDs are what matter.

In Avro, this is load-bearing. Avro encodes records in the order their fields appear in the schema, so reordering the declarations changes the wire format. A producer with the new schema and a consumer with the old schema will produce a catastrophic mismatch unless schema resolution is in play. Resolution matches fields by name, so a field reorder is technically compatible, but only if both schemas are available to the consumer at decode time. The wire bytes are different.

The conclusion is that in Avro, the schema is what travels, not just the bytes, and field order in the schema is significant. Treating the schema as plain JSON and running it through a tool that reorders the fields can produce wire incompatibility.

In the schemaless formats, reordering keys is permitted by spec (map ordering is unspecified) and tolerated by typical consumers. Deterministic-encoding requirements may impose a canonical key order, but the format itself does not.

Scenario 7: Change a field from optional to required

We want to make email required.

In Protobuf 3, this is impossible at the schema level (no required keyword). The application enforces required-ness. Switching from optional string email = 3; to a non-optional string email = 3; is a wire-compatible change, but it changes the API surface (the has_email() accessor disappears in newer proto3) and the application semantics (default values become indistinguishable from absence). This is the change Protobuf 3.0 made by accident and Protobuf 3.15 partially undid.
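
The presence difference is visible in the schema itself (a sketch; the field number follows the running example):

// proto3 since 3.15: explicit presence; a has_email() accessor is generated
optional string email = 3;

// proto3 without the keyword: no presence tracking; "" and unset are indistinguishable
string email = 3;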

In Thrift, changing optional to required is wire-compatible but operationally hazardous, as covered in Scenario 2. The change must roll out producers-first.

In Avro, an optional field is a union with null and a default of null. Making it required means removing null from the union. This is not compatible: old data with the field absent (encoded as the null branch) cannot be resolved against a reader's schema where the field is a non-null type. The field has to remain optional in the schema, and required-ness has to be enforced elsewhere.

In the schemaless formats, the change is purely application- side. The format is uninvolved.

A summary table

Scenario                      Protobuf              Thrift            Avro
Add optional field            Trivial               Trivial           Requires default
Add required field            Not in proto3         Producers first   Requires default
Remove field                  Trivial w/ reserved   Trivial           Field needs default
Rename field                  Source-only           Source-only       Requires alias
Type widening (int → long)    Wire-compatible       Wire-compatible   Resolution-promotion
Reorder declarations          Cosmetic              Cosmetic          Wire-significant
Optional → required           Discouraged           Hazardous         Incompatible

What the table actually means

The table is small enough to read quickly, and the differences are real, but the right reading is not "which format has more 'trivial' entries." Every format has roughly the same number of safe and unsafe scenarios. The differences are in which scenarios are safe and what kind of failure happens when you make an unsafe change.

Protobuf's failure mode for unsafe changes is usually silent: the bytes decode, but the values are wrong. Field-number reuse with a type change produces a decode that succeeds but yields garbage. Type-incompatible changes within the same wire type (int32 → sint32) produce values that look plausible but differ from the originals. The remedy is reserved plus a breaking-change detector like Buf, which catches these mistakes at schema-merge time.
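
What that looks like in practice: a minimal buf.yaml that turns on wire-level breaking-change checks, run against the main branch in CI (a sketch; the WIRE rule category and the --against flag are the relevant pieces):

version: v1
breaking:
  use:
    - WIRE

# in CI:
#   buf breaking --against '.git#branch=main'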

Thrift's failure mode is mixed. Compatible changes work cleanly. The required keyword turns some failures loud (the decoder errors on missing required fields), which is helpful in some deployments and harmful in others. There is no equivalent of Buf for Thrift in widespread use, which means breaking-change detection is mostly manual.

Avro's failure mode is loud and early. Schema resolution failures happen at decode time and are explicit, with messages that name the offending field. The Confluent Schema Registry catches incompatible schema changes at registration time and rejects them, which means many failures never reach a decoder. The cost is rigidity: changes that are "harmless" in Protobuf or Thrift (renaming a field, reordering declarations) require explicit metadata in Avro.

The choice between formats is therefore a choice between what kind of evolution discipline you want enforced where. Protobuf asks for discipline at the human level (use reserved, run Buf in CI). Thrift asks for discipline at the deployment level (sequence your rollouts). Avro asks for discipline in the schema itself (declare defaults, declare aliases). Each works; they just shift the cost to different places.

What about the schemaless formats?

MessagePack, CBOR, BSON, and the rest of the self-describing schemaless family have no formal evolution rules. They make every scenario "trivial" at the wire level, and the cost is paid downstream: in application code, in operational coordination, in tests that catch mistakes the schema would have caught.

For small teams, fast iteration, and schemas that change often without strong deployment-skew constraints, this is fine. For large organizations, slow rollouts, and schemas that need to stay compatible across many independent versions, the lack of formal rules is a chronic source of bugs. The right format is the one whose operational pattern matches your organization's deployment topology, and deployment topology is the part of the question almost nobody answers honestly when picking a format.

A practical recommendation

If you are starting a new system and asking which schema-first format to use, the right question is not which has the best wire encoding (they are all comparable) but which evolution model you can credibly enforce. If your organization has the operational muscle to run a schema registry and check compatibility at registration, Avro is the strongest choice and has aged exceptionally well. If your organization runs Buf or an equivalent breaking-change detector in CI, Protobuf is the strongest choice and is by far the most common. If neither infrastructure is in place, the schemaless options will produce fewer surprises in the short run and more in the long run; budget accordingly.

The one wrong answer is to choose a format on the assumption that you will adopt the surrounding evolution infrastructure later. Nobody adopts it later. The infrastructure ships with the format or it does not ship at all.

FlatBuffers

FlatBuffers is the format you reach for when the parse step is the bottleneck. The promise is simple and unconventional: the bytes on disk and the bytes in memory are the same bytes. There is no decode. You access fields directly out of the buffer, with offset arithmetic, and the offsets are arranged so that the access patterns common in the calling language are cheap and the access patterns rare in the calling language are still possible. This is not a trivial promise to keep, and the format pays for it in bytes, in code-generation complexity, and in constraints on what schemas can express. The trade is sometimes worth it. When it is, FlatBuffers is the cleanest expression of the idea on offer.

Origin

FlatBuffers was built at Google by Wouter van Oortmerssen, formerly of the game-development industry and at the time on Google's Android games team. The motivating use case was loading game assets at startup: a typical game ships with megabytes of structured asset metadata (level layouts, sprite descriptors, animation curves, audio manifests) that needs to be available immediately, and the parse step in formats like JSON or Protobuf was a measurable contributor to load time. Loading on a constrained device (a phone, a console) without a parse step would let games start faster and use less RAM during the load phase, which is approximately when devices have the least memory available.

FlatBuffers was open-sourced in 2014. Its early adoption was concentrated in games and embedded systems, where the parse-time constraint was tightest. The format then accreted a second constituency: ML model serialization. TensorFlow Lite uses FlatBuffers as its on-disk model format (the .tflite extension is literally a FlatBuffers buffer with a particular schema), partly because mobile inference is performance-sensitive in the same way mobile games are, and partly because FlatBuffers' zero-copy access maps well to the way ML inference engines want to read model weights. A third constituency is high-throughput RPC: Cocos2d-x games, internal Google services where Protobuf's parse cost was identified as a hotspot, and a few financial systems that wanted zero-copy without going all the way to SBE.

The format on its own terms

A FlatBuffers buffer is a self-contained byte sequence with the following high-level structure: a fixed-size root offset at the front pointing to the root table; the root table itself somewhere inside the buffer; and any out-of-line data (strings, vectors, embedded tables) elsewhere in the buffer, addressed by offsets from their referencing tables. The buffer is read by following pointers from the root, and the pointers are 32-bit offsets (mostly unsigned; the table-to-vtable link is signed) relative to the location of the offset itself.

The central data structure is the table. A table corresponds to what other formats call a struct or message: a named container of fields. The wire layout of a table has two halves. The data section holds the inline values (booleans, integers, floats, embedded fixed-size structs) and offsets-to-out-of-line-data (for strings, vectors, and embedded tables). The vtable is a small auxiliary structure that records, for each declared field, the offset within the data section where that field lives. Fields not present in a particular instance simply have a zero offset in the vtable; the reader checks the vtable, sees the zero, and reports the field as absent.

The vtable is shared across instances of the same table type when the field-presence pattern matches, which is a non-trivial optimization for structures emitted in batches. The vtable starts with two int16 fields (vtable size in bytes; inline size of the table's data in bytes) followed by an int16 offset for each declared field. The data section starts with an int32 offset back to its vtable, then the inline fields and the offsets-to-out-of-line.

Strings are stored elsewhere in the buffer. A string consists of a 4-byte length prefix, the UTF-8 bytes themselves, and a single null terminator (the null is not counted in the length but is included for C-string compatibility). The length prefix is at the offset pointed to by the string's containing field; the bytes follow. A string can be referenced from multiple places in the buffer if the encoder chooses to deduplicate, though most encoders do not bother.

Vectors are length-prefixed arrays of values. A vector of scalars holds the values inline; a vector of strings or tables holds offsets to the elements. Vectors of structs (which are inline fixed-size records, distinct from tables) hold the structs directly. The length is a 4-byte prefix; the elements follow.

A struct in FlatBuffers terminology is not a table. A struct is a fixed-layout, fixed-size, non-extensible record whose fields are inlined directly wherever the struct is used. Structs are denser than tables but cannot evolve: the layout is frozen at schema write time, and any change to a struct's fields breaks every buffer that contains it. Tables are the right choice for almost everything; structs are an optimization for known-stable, performance-critical fields where the per-field vtable lookup overhead matters.

Alignment is enforced throughout. Every value is aligned to its natural boundary (4-byte fields on 4-byte boundaries, 8-byte fields on 8-byte boundaries) within the buffer. The encoder inserts padding bytes wherever necessary to maintain alignment. This is why FlatBuffers buffers are larger than the equivalent Protobuf encodings: the format pays in padding for the privilege of direct-load-without-byteswap on aligned reads.

Wire tour

Schema:

table Person {
  id:uint64;
  name:string;
  email:string;
  birth_year:int32;
  tags:[string];
  active:bool;
}
root_type Person;

The buffer for our Person is harder to walk byte-by-byte than the preceding formats because the layout depends on the encoder's choices about ordering and alignment. The actual byte count is approximately 132 bytes when the encoder is a typical implementation (flatc's output, or an equivalent runtime encoder). The structure, described from the front of the buffer:

[ root offset (4 bytes): points to the root table ]
[ vtable for Person: 16 bytes ]
[ Person table: 32 bytes ]
[ ... out-of-line data: strings, the tags vector ]
[ end of buffer ]

The root offset is the first four bytes; reading those tells the caller where the Person table is. Following that offset lands on the Person table's first 4 bytes, which are a signed int32 offset to the vtable (signed because the vtable can sit on either side of the table, though encoders typically place it just before). The vtable is read to find the offsets of each field within the table.

Within the Person table, the field offsets resolve as follows: id is an inlined uint64 (8 bytes); name is a 4-byte offset to the "Ada Lovelace" string elsewhere in the buffer; email is a 4-byte offset to the email string; birth_year is an inlined int32 (4 bytes); tags is a 4-byte offset to the tags vector; active is an inlined byte (1 byte, padded to 4 for alignment). The total inline size is approximately 32 bytes once padding is accounted for.

The strings, stored out of line, take 17 bytes for "Ada Lovelace" (4 bytes length + 12 bytes UTF-8 + 1 null, plus 0-3 bytes of padding) and 26 bytes for the email (4 + 21 + 1, plus padding). The tags vector is a 4-byte length prefix and two 4-byte offsets (12 bytes), pointing to "mathematician" (4 + 13 + 1 = 18 bytes plus padding) and "programmer" (4 + 10 + 1 = 15 bytes plus padding).

Adding it all up: ~32 bytes for the table plus ~16 bytes for the vtable plus ~17 + 26 + 12 + 18 + 15 ≈ 88 bytes for the out-of-line data plus 4 bytes for the root offset and a few bytes of inter-field padding gives a buffer in the 130-150 byte range, depending on the encoder.

This is roughly twice the size of Protobuf and Thrift Compact. The cost is the price of the access pattern. Reading person->id() from a FlatBuffers buffer compiles, in C++, to little more than:

*reinterpret_cast<const uint64_t*>(table_ptr + 4)

That is a vtable lookup and a load: no parsing, no allocation. No other format in this book achieves that property without similar constraints.

If email were absent, the encoder would emit no string for it, the vtable's email offset would be 0, and the table reader's email() accessor would return null. The buffer would shrink by the email string's bytes (about 26-28 bytes) without changing the table's vtable layout or alignment. This is one of FlatBuffers' genuine wins: optional fields cost nothing when absent, and the absence is detectable at the cost of a single zero check in the vtable.
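
What the read side looks like in C++, assuming flatc-generated code for the Person table above (the header name follows flatc's _generated.h convention; buf is a pointer to a complete, properly aligned buffer):

#include "person_generated.h"

// No decode step: GetRoot just interprets the root offset at the front of the buffer.
const Person* person = flatbuffers::GetRoot<Person>(buf);
uint64_t id = person->id();  // vtable check plus a load

// Absent fields come back as nullptr, at the cost of a single zero check in the vtable.
if (const flatbuffers::String* email = person->email()) {
  // present: email->c_str() is NUL-terminated UTF-8
} else {
  // absent: the encoder never emitted the string
}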

Evolution and compatibility

FlatBuffers' evolution rules are stricter than Protobuf's and Thrift's, and the strictness is the price of zero-copy access.

The rule for tables is fields can only be added at the end of the schema, and once added, their position is permanent. Adding a field in the middle would shift the vtable layout and break every existing buffer. Adding at the end is safe because old vtables simply won't have the new field's slot, and old data won't have a non-zero offset there. New consumers reading old data see the new field as absent; old consumers reading new data ignore the new slot in the vtable.

Removing a field is supported by marking it deprecated in the schema; the field number stays in the vtable but the generated code no longer exposes it. Old buffers continue to work; new buffers omit the field's offset in the vtable.

Renaming a field is purely cosmetic at the buffer level (vtable positions are what matters). The source code change is parallel to Protobuf's: regenerate, redeploy.

Changing a field's type is mostly unsafe. The vtable assumes a specific size for each field (computed from the type), and changing the type changes the size, which corrupts the table layout. The schema-language rule is to add a new field with the new type and deprecate the old one.
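
Applied to the wire-tour schema, these rules compose like this (a sketch; country is a hypothetical appended field, and the deprecated email shows the removal pattern):

table Person {
  id:uint64;
  name:string;
  email:string (deprecated);  // slot kept in the vtable, accessor no longer generated
  birth_year:int32;
  tags:[string];
  active:bool;
  country:string;             // new fields are appended at the end, never inserted
}
root_type Person;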

Reordering field declarations is safe in FlatBuffers provided the fields carry explicit id attributes, because the vtable assigns each field a slot based on its ID rather than its source order; without explicit ids, the compiler assigns ids in declaration order, and reordering becomes a silent breaking change. This is unlike Avro and unlike human intuition, but it's part of the FlatBuffers contract.

Structs cannot evolve at all. Any change to a struct's fields breaks every buffer that contains it. The recommended pattern is to use structs only for layouts you are confident will never change, and to use tables for everything else.

The deterministic-encoding question for FlatBuffers is harder than for the parse-required formats. The encoder has freedom in vtable layout, in deduplication of strings, in ordering of out-of-line data, and in padding choices. Most encoders are not deterministic. A "force defaults" mode and a "minimum sizes" mode exist in the reference encoder but do not produce canonical output across languages. If you need byte-equality on FlatBuffers, you have to either build a canonicalizer or accept that hash-stable bytes are not on offer.

Ecosystem reality

The reference implementation, flatc, generates code for C++, C#, Go, Java, JavaScript, TypeScript, Lobster, Lua, PHP, Python, Rust, Swift, and Dart. The C++ generator is the canonical one; the others vary in maturity. The Rust crate (flatbuffers) is mature and performance-aware; the Go and Java generators are good; the Python generator is functional but the runtime is much slower than the others (Python's lack of cheap pointer arithmetic is the bottleneck).

TensorFlow Lite is the largest single deployment. Every TFLite model is a FlatBuffers buffer with the schema defined in schema.fbs in the TensorFlow source tree. Any tool that reads or writes TFLite models is, implicitly, a FlatBuffers consumer. The choice of FlatBuffers for TFLite was made for the same reasons it's been chosen elsewhere — load latency and memory footprint on constrained devices — and has held up well.

Cocos2d-x and Unity-adjacent toolchains use FlatBuffers for asset manifests. Several internal Google services use FlatBuffers for hot-path RPC; the public-facing services tend to use Protobuf, for the ecosystem reasons that argument always comes down to. A handful of financial trading systems use FlatBuffers for market data, sitting between SBE (which is even faster but harder to use) and Protobuf (which is easier but slower).

The most common ecosystem gotcha is that FlatBuffers buffers must be aligned in memory when read. On modern x86 and ARM, unaligned loads of common sizes are fine, and no one notices. On older architectures, on certain SIMD code paths, and on some embedded platforms, unaligned loads either crash or are dramatically slower. The reference C++ runtime checks alignment in debug builds and elides the check in release; code that bypasses the runtime's allocator (e.g., reading from a memory-mapped file at an unaligned offset) can produce platform-specific failures that are hard to diagnose. The FlatBuffers documentation is explicit about alignment, but the failure mode is severe enough that it bites teams new to the format.

The second gotcha is the tooling. Reading a FlatBuffers buffer without the schema is hard. There are dump tools (flatc itself can render a buffer back to JSON given the schema), but the schemaless inspectability of MessagePack and CBOR is not on offer. Operational tools have to know the schema, and the schema has to be versioned alongside the buffer. This is the same operational cost Protobuf imposes; FlatBuffers does not introduce a new problem here, but the lack of a JSON-equivalent text format makes ad-hoc inspection harder than Protobuf's prototext.

The third gotcha is build-system integration. flatc is a separate binary that needs to run as part of the build, and the choices about how (CMake, Bazel, Cargo build scripts, npm scripts) are left to the user. Mixed-language projects often end up with two or three flatc invocations producing parallel outputs. This is manageable but is more friction than Protobuf's mature build integrations.
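
The invocations themselves are simple; the friction is deciding where they run. A mixed-language project typically ends up with something like the following in its build scripts (paths hypothetical):

flatc --cpp    -o gen/cpp/    schemas/person.fbs
flatc --rust   -o gen/rust/   schemas/person.fbs
flatc --python -o gen/python/ schemas/person.fbs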

When to reach for it

FlatBuffers is the right choice when reads dominate writes, when latency-per-read matters, and when the data structure can fit within the format's constraints (immutable buffers, no cycles, fields-added-at-end evolution). The canonical cases: game asset loading, ML model loading, mmap'd file formats with random access, RPC where deserialization cost is the bottleneck.

It is a defensible choice for any high-volume on-disk format where parse-time would otherwise be the cost. TensorFlow Lite is the example; similar choices have been made for graph databases, spatial indexes, and a few message brokers' on-disk segment formats.

When not to

FlatBuffers is the wrong choice when writes are common, when the data is consumed once and discarded, when wire size matters more than read latency, when the schema is volatile (the fields-only-at-end rule is real), or when human-readable text inspection is required.

It is also the wrong choice when alignment-sensitive deployment is unwelcome (some embedded platforms) and when the build-system overhead of flatc is too much for the project size.

Position on the seven axes

Schema-required. Not self-describing. Row-oriented. Zero-copy, which is the format's defining axis position. Codegen, with no serious runtime alternative. Non-deterministic by spec; canonical forms achievable with effort. Evolution constrained: only-at-end for tables, immutable for structs.

The cell FlatBuffers occupies — schema-required, zero-copy, codegen-only, append-only evolution — is unique among the formats in this book and is the strongest expression of the zero-copy idea. Cap'n Proto, the next chapter, occupies a similar cell with different tradeoffs.

Epitaph

FlatBuffers is the format that asks "what if the parse step were not there?" and answers in 130 bytes of aligned padding; indispensable when reads dominate, expensive otherwise.

Cap'n Proto

Cap'n Proto and FlatBuffers are siblings. Both are zero-copy. Both were built by people with substantial prior experience designing wire formats. Both occupy the same approximate cell in the format design space. The differences between them are few and deliberate; they are exactly the kind of differences that illustrate why design choices that look similar from far away diverge sharply when you have to live with them. Cap'n Proto is the zero-copy format with a more thoroughgoing theory and a smaller ecosystem. Both facts are consequences of the same person making the same design choices.

Origin

Cap'n Proto was created by Kenton Varda, who had spent five years on the Protobuf team at Google before leaving to build Sandstorm, a self-hostable web application platform. Cap'n Proto's origin narrative is unusually candid: Varda's blog post announcing the format in 2013 framed it as "Protocol Buffers, in less time," with "less time" referring not to development effort but to wall-clock time spent encoding and decoding. The argument was that Protobuf's design choices, made before the rise of mmap-able persistent storage and high-bandwidth in-memory data exchange, paid for serialization overhead the modern systems they were being deployed in did not need to pay. The argument was substantive and correct.

The format has been stable since 2014 in its 1.x line, with a small number of additive changes (capability tables, more streaming operations, the pack-encoding mode that re-introduces a post-process step in exchange for size). The reference implementation is in C++, with a high-quality Rust binding (capnp and capnpc-rust) and a Java port. Other languages have ports of varying maturity. Sandstorm itself, while now a smaller project than it was at its peak, was the original primary user and shipped the format to enough independent developers to seed the ecosystem.

Outside Sandstorm, Cap'n Proto's most visible deployment is at Cloudflare, which uses it as the control-plane format for the Cloudflare Workers product and several internal systems. The Cap'n Proto RPC protocol — distinct from the serialization format, covered briefly below — underpins much of Cloudflare's distributed systems work. The Sandstorm-era concept of capability-based RPC, where references to remote objects can be passed as first-class values, was adopted by Cloudflare's architecture and has become a significant differentiator between Cap'n Proto's RPC and gRPC's.

The format on its own terms

A Cap'n Proto message is a sequence of segments, each of which is a contiguous block of words. A word is exactly eight bytes; this is fundamental to the format and is non-negotiable. Everything is word-aligned, and most data structures are word-sized or word-multiple. The frame at the front of a message specifies the segment count (minus one) and the size of each segment in words, and the segments follow.

The single most important data structure in Cap'n Proto is the pointer, which is a one-word value that references another location in the message. There are four pointer types: struct pointers (referring to a struct), list pointers (referring to a list), capability pointers (referring to a capability table entry, used by the RPC layer), and far pointers (referring to an entry in another segment). The low bits of a pointer encode the type; the remaining bits encode an offset (relative to the location of the pointer itself) and metadata describing the pointed-to value's shape.

A struct in Cap'n Proto is a fixed-layout record with two sections: a data section of scalars and a pointer section of references to out-of-line data. The data section is laid out by field size, with 8-byte fields first, then 4-byte, 2-byte, 1-byte, and finally bit-packed booleans. The pointer section follows, with one pointer per pointer-typed field, in declaration order. The struct's layout is determined by its schema; the wire format has no per-instance metadata about which fields are present.

This is the central design difference between Cap'n Proto and FlatBuffers. FlatBuffers stores per-instance metadata (the vtable) that records which fields are actually present in each instance. Cap'n Proto stores no such metadata: every instance has all fields, and absence is encoded by zero values (a zero pointer means the referenced field is null/empty; a zero scalar means the scalar is its default). The consequence is that Cap'n Proto's structs are denser than FlatBuffers' tables on average — there is no vtable overhead per instance — but absence and default values are indistinguishable for scalar fields, mirroring (and predating, in fact) the Protobuf 3 design choice that produced years of pain.

A list is a length-prefixed sequence of values. The header for a list (encoded in the pointer that references it) specifies the element count and the element size. Elements can be 0-bit (empty/void), 1-bit (booleans), 1-byte, 2-byte, 4-byte, 8-byte, pointer, or composite (variable-size, with a tag word at the front specifying the element layout). Lists of structs use the composite encoding; the tag word at the start of the list lets readers know the per-element data and pointer section sizes.

Text and Data are special cases of List(UInt8). Text is required to be valid UTF-8 and includes a null terminator, which is counted in the list's element count; Data has no encoding restrictions. Both are stored as pointers from their referencing struct to a separately-located list elsewhere in the same segment.

Wire tour

Schema:

@0xb59b1b1d4f1c1234;

struct Person {
  id        @0 :UInt64;
  name      @1 :Text;
  email     @2 :Text;
  birthYear @3 :Int32;
  tags      @4 :List(Text);
  active    @5 :Bool;
}

A simplified word-by-word view of the encoded message (each line is one 8-byte word, addresses on the left in word units):

00: 00 00 00 00 11 00 00 00         frame: segment count minus one = 0, 17 words in segment 0
01: 00 00 00 00 02 00 03 00         root pointer: struct, 2 data words, 3 pointer words
02: 2a 00 00 00 00 00 00 00           struct.data[0]: id = 42
03: 17 07 00 00 01 00 00 00           struct.data[1]: birthYear = 1815, active = true (bit 32)
04: 09 00 00 00 6a 00 00 00           struct.ptr[0]: list pointer for "name" (byte elements, count 13)
05: 0d 00 00 00 b2 00 00 00           struct.ptr[1]: list pointer for "email" (byte elements, count 22)
06: 15 00 00 00 16 00 00 00           struct.ptr[2]: list pointer for "tags" (pointer elements, count 2)
07: 41 64 61 20 4c 6f 76 65            "Ada Love"
08: 6c 61 63 65 00 00 00 00            "lace\0" plus padding (Text counts its NUL)
09: 61 64 61 40 61 6e 61 6c            "ada@anal"
0a: 79 74 69 63 61 6c 2e 65            "ytical.e"
0b: 6e 67 69 6e 65 00 00 00            "ngine\0" plus padding
0c: 05 00 00 00 72 00 00 00            tags[0]: list pointer (byte elements, count 14)
0d: 09 00 00 00 5a 00 00 00            tags[1]: list pointer (byte elements, count 11)
0e: 6d 61 74 68 65 6d 61 74            "mathemat"
0f: 69 63 69 61 6e 00 00 00            "ician\0" plus padding
10: 70 72 6f 67 72 61 6d 6d            "programm"
11: 65 72 00 00 00 00 00 00            "er\0" plus padding

Total: 144 bytes (8 bytes of frame header + 17 words of segment). This is comparable to FlatBuffers' size for the same record. The overhead comes from the same place: alignment padding for zero-copy access. Cap'n Proto's word-aligned discipline means the data section is sized in whole words: active costs a single bit, but the word holding birthYear and active is padded out to eight bytes regardless.

Two structural points are worth noting in the bytes. First, the absence of any per-instance metadata about field presence: every field has a slot, and absent fields have zero values in their slots. Second, the locality: the struct's data is in the words immediately following its pointer, and the strings it references are in the words immediately after the struct. This locality is intentional and matters for cache performance; encoders that scatter out-of-line data across the buffer pay measurable performance costs on real workloads.

If email were absent, the encoder would emit a null pointer in the email slot (a zero word) and would not allocate space for the email string. The buffer would shrink by the email string's three words (24 bytes; the pointer's slot is unchanged, since the pointer is part of the fixed struct layout), dropping the total to about 120 bytes. Distinguishing absent from empty for a Text field is straightforward: the pointer is null vs. pointing to a zero-length list. For scalar fields, however, absence is encoded as the default value, which is the trap.

The packed encoding

Cap'n Proto includes a packed encoding mode that compresses the canonical encoding by collapsing runs of zero bytes. The encoding is simple: for each 8-byte block, emit a tag byte where each bit indicates whether the corresponding byte is zero, followed by the non-zero bytes. The result is roughly 60-80% of the canonical size for typical schemas, at the cost of a per-message decompression step. The packed encoding is not zero-copy — reading it requires decompression to a normal buffer first — and so it sacrifices the format's defining feature in exchange for size. The choice between canonical and packed is made per use case.
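
A worked example, using the id word from the wire tour above:

canonical: 2a 00 00 00 00 00 00 00    one non-zero byte, in position 0
packed:    01 2a                      tag 0b00000001, then the non-zero byte

A word of all zeros packs to a zero tag followed by a count of additional consecutive all-zero words, which is why the sparse data sections of typical structs shrink so much under packing.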

Evolution and compatibility

Cap'n Proto's evolution rules are looser than FlatBuffers' and closer to Protobuf's, with caveats specific to the wire format.

Adding a new field is supported by appending it to the schema with the next available @N ordinal. The schema compiler computes the new struct's data and pointer section sizes; new structs have larger sections, but old buffers (with smaller sections) are readable because the schema compiler tracks the section sizes per schema version, and readers know to interpret old buffers using the old layout. Pointer fields can be added in the same way; old data has a smaller pointer section, and the new pointer field is absent (null) when reading.
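
In schema form (a sketch; country is a hypothetical new field):

struct Person {
  # fields @0 through @5 unchanged
  country @6 :Text;   # appended with the next ordinal; reads as null from old buffers
}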

Removing a field outright is not supported; the convention is to rename it to mark it obsolete and stop referencing it, which keeps the slot occupied. Reusing a slot is forbidden, with the same severity as reusing a Protobuf field number.

Renaming a field is purely cosmetic. Field IDs (the @N ordinals) are wire-level identity.

Changing a field's type is mostly unsafe. Numeric types of the same size are wire-compatible; types of different sizes shift the data section layout and corrupt old buffers. Text and Data are wire-compatible (both are List(UInt8)). Pointer-typed fields cannot change to scalar-typed fields and vice versa.

The aspect of evolution where Cap'n Proto is meaningfully more flexible than FlatBuffers is struct extension. A Cap'n Proto struct can be evolved by appending fields (new ordinals) without breaking old buffers, because the schema-compiler-tracked section sizes let readers handle the smaller-old-data case. FlatBuffers' vtable mechanism does the same, more dynamically. The two formats arrive at a similar evolution model from different starting points.

The deterministic-encoding question for Cap'n Proto is partially addressed by the canonical encoding mode in the spec, which defines a unique byte representation for any given message: all out-of-line data follows its referencing struct in a specific order, default values are not emitted, and trailing zero data words are stripped. Most encoders do not produce canonical encoding by default; the option is opt-in. With it, byte-equality is achievable. Without it, equivalent messages can produce different bytes.

Cap'n Proto RPC

Cap'n Proto includes a full RPC protocol whose key feature is promise pipelining: a client can issue a call against the return value of an earlier call without waiting for the earlier call to complete. The first call's response is referenced by a capability ID, and the second call uses that ID directly. The result is dramatic latency reduction for chained calls, especially across high-latency links. This feature has no analogue in gRPC (though gRPC's streaming covers some of the same use cases) and is the primary reason Cloudflare's distributed systems infrastructure uses Cap'n Proto rather than Protobuf.
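
A schema-level sketch of the shape, with hypothetical interfaces: a client holding a Directory capability can issue the read call against the not-yet-returned file in the same round trip as the open call, and the server applies both in order.

interface Directory {
  open @0 (name :Text) -> (file :File);
}

interface File {
  read @0 () -> (data :Data);
}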

The RPC layer is a separate concern from the serialization format. You can use Cap'n Proto's serialization without its RPC, and many deployments do. The RPC layer's adoption is narrower than the serialization's because few systems benefit enough from promise pipelining to justify the operational difference from gRPC.

Ecosystem reality

Cap'n Proto's ecosystem is smaller than Protobuf's by an order of magnitude and smaller than FlatBuffers' by a smaller but significant margin. The C++ implementation is high-quality and maintained. The Rust crate is mature and idiomatic; capnproto-rust is the most-used non-C++ binding by some margin. The Java port exists but lags. Other languages (Python, Go, JavaScript) have ports that are functional but not first-class.

The user base is concentrated. Sandstorm and Sandstorm-derivative projects use it. Cloudflare uses it heavily. A few research-grade distributed systems and a handful of blockchain-adjacent projects have adopted it. There is not much else in production. This is the opposite of the Protobuf situation, where every modern stack has some Protobuf in it; Cap'n Proto is the format you choose deliberately, with awareness that the language support outside C++ and Rust may be thin.

The most common ecosystem gotcha is the assumption that Cap'n Proto will replace Protobuf in the same role. It will not. The wire formats are different, the schema languages are different, and the ecosystem maturity is different. Cap'n Proto's strengths are specific (zero-copy, RPC pipelining, a slightly more thoughtful evolution model); choosing it for general-purpose typed serialization without those specific needs is choosing a smaller ecosystem for negligible gain.

When to reach for it

Cap'n Proto is the right choice when the workload benefits from zero-copy access and you are working in C++ or Rust and the RPC pipelining feature is either useful or irrelevant. The classic cases: high-throughput RPC on internal systems, mmap-able persistent formats, distributed systems where latency dominates throughput.

It is the right choice for systems that genuinely benefit from capability-based RPC, which is a smaller set than commonly assumed.

When not to

Cap'n Proto is the wrong choice when language support is required beyond C++ and Rust; the bindings exist but are thinner. It is the wrong choice when the target audience for the format is broad and includes teams unfamiliar with zero-copy formats; the operational learning curve is real. It is the wrong choice when wire size matters more than read latency (use Protobuf; or accept the packed-encoding tradeoff).

It is also the wrong choice when the data is human-edited configuration; the lack of a text projection comparable to Protobuf's prototext is a real ergonomic gap.

Position on the seven axes

Schema-required. Not self-describing (the bytes are schema-dependent). Row-oriented. Zero-copy. Codegen, with no serious runtime alternative. Canonical encoding mode available; non-canonical by default. Evolution by ordinal-tagged fields with schema-tracked section sizes.

Cap'n Proto's stance differs from FlatBuffers' on two axes worth naming. First, the absence of per-instance vtables: Cap'n Proto saves bytes per instance but loses the ability to distinguish absent from default for scalar fields. Second, the canonical encoding subset, which FlatBuffers does not formally specify; this makes Cap'n Proto more usable in signing and content-addressing contexts.

Epitaph

Cap'n Proto is the format Kenton Varda built when he decided his old team had taken Protobuf as far as the architecture they shipped against would allow; word-aligned, capability-aware, and loved by the people who use it.

SBE

SBE — Simple Binary Encoding — is the format for the case where microseconds matter and the cost of the rest of the format ecosystem is worth paying. It is the dominant format in low-latency financial trading, and most of the systems that use it would describe their choice as obvious. To the rest of the world, SBE looks austere, overspecified, and mildly hostile to ergonomics. Both reactions are correct. SBE is what you get when you optimize for one thing — nanosecond-scale encode and decode — at the explicit expense of everything else, and the systems for which that one thing is the binding constraint are happy to pay.

Origin

SBE was designed by Martin Thompson and the team at Real Logic around 2014, after Thompson had spent several years working on the LMAX Disruptor and adjacent low-latency systems. The format was explicitly built for the FIX Trading Community, the standards body behind the FIX protocol that powers most of the world's electronic trading. FIX is a text-based protocol designed in the early 1990s, and by 2014 the latency cost of parsing FIX messages had become a limiting factor for the fastest market participants. SBE was proposed as the binary successor — FIX/SP1, in the standard's nomenclature — and was adopted by the major exchanges and trading venues over the following years.

The design constraints were specific and unusual. The format had to be fast enough that encoding and decoding were essentially free in the hot path: a few CPU cycles per field, with no allocations, no branches that could mispredict, and no cache misses outside the loaded message buffer. It had to be backward-compatible across versions in the way that financial-data formats need to be (a trade booked on Monday must be readable on Friday after the schema has changed twice). It had to be implementable in any of the languages traders use (C++, Java, Rust, with C# and Python for the analytics tier), with consistent wire-level behavior. And it had to be auditable, because every byte of every message is potentially regulator-relevant.

The result is a format that looks more like a memory layout specification than a serialization spec. Reading and writing SBE messages, in C++ or Java, is equivalent to indexing into a fixed struct, and that is the point.

The format on its own terms

SBE messages have three layout regions, in order: a fixed-size block of scalar fields, repeating groups, and variable-length data fields. The fixed block is laid out at compile time according to the schema, with each field at a known byte offset, and accessed with literal struct-style memory references. The repeating groups and variable-length data fields follow the fixed block, in declaration order, with their own framing.

The schema language is XML — a choice that looks dated but is defensible given that XML is the lingua franca of the FIX community. The schema declares the types (composite types, enums, sets), the messages (each with a fixed-block layout and optional variable-length and group sections), and the metadata (schema ID, version, byte order, header types). The schema compiler generates accessor classes that wrap a raw byte buffer and provide typed methods: getId() reads the eight bytes at the declared offset and reinterprets them as a uint64; setId(value) writes the value at the same offset.

Fields in the fixed block have explicit byte offsets, which can be implicit (the compiler computes them in declaration order with appropriate alignment) or explicit (the schema overrides). Fields have presence attributes (required, optional, constant). Optional fields are encoded by sentinel values — a special "null value" defined for each type, indistinguishable on the wire from a value happening to equal the sentinel — which is a constraint inherited from the financial-data world's preference for fixed-width records.

Variable-length data fields are encoded as a length prefix (typically uint16 or uint32, defined by a composite type in the schema) followed by the bytes. They live at the end of the message, after the fixed block and any repeating groups, in the order they were declared.

Repeating groups are SBE's mechanism for arrays of records. A group has a dimension prefix (typically a composite of blockLength and numInGroup) and then numInGroup entries, each consisting of the group's own fixed block followed by the group's variable-length data. Groups can nest, but the nesting discipline is strict.

The message itself is preceded by a message header, which is a small composite type (defined in the schema, but commonly 8 bytes) giving the block length, template ID (which message type), schema ID, and schema version. The header is what lets a generic dispatcher route a buffer to the correct decoder.

The byte order of every field is specified at the schema level — typically little-endian for x86 deployments, big-endian for some exchange-specific deployments. The schema's byte order is part of the format's identity; bytes are not portable across schemas with different byte orders without an explicit conversion.

Wire tour

Schema (abbreviated):

<sbe:messageSchema id="1" version="1" byteOrder="littleEndian">
  <types>
    <composite name="messageHeader">
      <type name="blockLength" primitiveType="uint16"/>
      <type name="templateId" primitiveType="uint16"/>
      <type name="schemaId" primitiveType="uint16"/>
      <type name="version" primitiveType="uint16"/>
    </composite>
    <composite name="varStringEncoding">
      <type name="length" primitiveType="uint16"/>
      <type name="varData" primitiveType="uint8" length="0"/>
    </composite>
    <composite name="groupSizeEncoding">
      <type name="blockLength" primitiveType="uint16"/>
      <type name="numInGroup" primitiveType="uint16"/>
    </composite>
  </types>

  <sbe:message name="Person" id="1" blockLength="16">
    <field name="id"         id="1" type="uint64" offset="0"/>
    <field name="birthYear"  id="2" type="int32"  offset="8"/>
    <field name="active"     id="3" type="uint8"  offset="12"/>
    <data  name="name"       id="10" type="varStringEncoding"/>
    <data  name="email"      id="11" type="varStringEncoding"
                                     presence="optional"/>
    <group name="tags"       id="20" dimensionType="groupSizeEncoding">
      <data name="value"     id="21" type="varStringEncoding"/>
    </group>
  </sbe:message>
</sbe:messageSchema>

Encoded:

10 00 01 00 01 00 01 00                 message header: blockLength=16, templateId=1, schemaId=1, version=1
2a 00 00 00 00 00 00 00                 id = 42 (LE u64)
17 07 00 00                             birthYear = 1815 (LE i32)
01                                      active = 1 (u8)
00 00 00                                3 bytes padding (block is 16 bytes)
00 00 02 00                             tags group dim: blockLength=0, numInGroup=2 (LE u16 pair)
0d 00                                   tag[0] length = 13
6d 61 74 68 65 6d 61 74 69 63 69 61 6e  "mathematician"
0a 00                                   tag[1] length = 10
70 72 6f 67 72 61 6d 6d 65 72           "programmer"
0c 00                                   name length = 12 (LE u16)
41 64 61 20 4c 6f 76 65 6c 61 63 65     "Ada Lovelace"
15 00                                   email length = 21 (LE u16)
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65           "ada@analytical.engine"

92 bytes. Slightly larger than Protobuf or Avro for this payload, slightly smaller than FlatBuffers and Cap'n Proto. The size is not the headline; the access pattern is. To read id from this buffer, the C++ code generated by the SBE compiler does, approximately:

return *reinterpret_cast<const uint64_t*>(buffer + 8);

Eight is the offset of id after the 8-byte message header. The read is a single load. There is no parsing, no length check, no dispatch on type. The same is true of birthYear (offset 16) and active (offset 20). Variable-length and group fields require walking past the fixed block, but the walk is also straightforward: read the length prefix, advance, repeat.

The optional email field encodes the empty string when absent (length 0, no bytes). Distinguishing "email is empty" from "email is absent" requires a side channel; SBE's presence="optional" on a data field is not a presence flag in the bytes, only a schema-level hint that consumers may treat the empty case specially.

If email were absent (encoded as zero-length), the bytes would shrink by 23 bytes, and the encoded total would be about 69 bytes.

Evolution and compatibility

SBE's evolution rules are the strictest in this book and the strictest by design. The format makes a deliberate, sharp distinction between backward-compatible changes (which can be made to a schema in place) and breaking changes (which require a new schema version and a coordinated rollout).

The backward-compatible changes are:

  • Adding a new field at the end of the fixed block, provided the new schema's blockLength is updated. Old consumers read only the fields they know about at their original offsets and use the blockLength in the message header to skip the rest of the block; new consumers read the new field at its new offset. (The sketch after these rules makes the mechanism concrete.)
  • Adding a new variable-length data field at the end of the variable-length section.
  • Adding a new repeating group at the end of the groups section.
  • Adding new symbols to an enum (with care; the new symbols won't appear in old data).
  • Adding fields to a repeating group's block, with the same rules as for the message-level block.

The breaking changes are: anything that changes the byte offsets of existing fields, anything that changes the blockLength implicitly (without an explicit version bump), anything that reorders fields, anything that changes a field's type, and anything that removes a field. Breaking changes require a new template ID or a new schema version; consumers must check the message header and route to the appropriate decoder.
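
To make the blockLength mechanism concrete, here is a hand-rolled sketch in Rust — illustrative only, not the code the SBE compiler generates — of an old consumer reading the wire-tour message above. It knows only the original fixed fields and uses the header's blockLength to find the variable-length section, so anything a newer producer appended to the block is skipped without being understood.

// Illustrative only; real deployments use the SBE-generated codecs.
// Offsets follow the wire tour above: 8-byte message header, then the fixed block.
fn read_known_fields(msg: &[u8]) -> (u64, i32, u8) {
    let block_length = u16::from_le_bytes([msg[0], msg[1]]) as usize; // from the header
    let id = u64::from_le_bytes(msg[8..16].try_into().unwrap());
    let birth_year = i32::from_le_bytes(msg[16..20].try_into().unwrap());
    let active = msg[20];
    // The variable-length section starts after the header plus blockLength,
    // whatever that blockLength is -- fields appended by a newer schema are skipped here.
    let _var_section = &msg[8 + block_length..];
    (id, birth_year, active)
}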

The strictness is deliberate. SBE was designed for systems where schema changes are rare, audited, and carefully coordinated. The format does not try to make schema evolution graceful; it tries to make sure that schema changes break loudly when they break, so that no producer or consumer is silently misinterpreting bytes.

The deterministic-encoding question for SBE is trivial: the format is fully deterministic. Given a schema and a value, there is exactly one byte sequence that encodes it. This is a consequence of the fixed-offset layout, the absence of variable-width integer encoding, and the lack of optional padding. SBE bytes are hashable, signable, and comparable byte-for-byte without canonicalization.

Ecosystem reality

SBE's primary ecosystem is FIX-based electronic trading and the broader low-latency trading community. The reference implementation, Real Logic SBE, is open source and maintained on GitHub under the real-logic organization, alongside the Agrona utility library it builds on. It generates code for Java, C++, C#, Rust, and Go. The generated code is high-quality, low-allocation, and has been audited extensively by the financial firms that use it.

The CME Group, ICE, Eurex, NASDAQ, and most other major electronic exchanges publish their market-data and order-entry schemas in SBE. A trading firm wanting to consume those feeds generates the appropriate code from the published schemas and links it into their trading system. The wire-level details of how each exchange uses SBE differ in small ways (header conventions, optional-field sentinel values, group-prefix sizes), but the format itself is uniform.

Outside finance, SBE is rare. There are a few uses in hardware-in-the-loop simulation, a few in low-latency networking research, a few in academic teaching of binary formats. Aeron, the high-throughput messaging library by the same Real Logic team, uses SBE for its internal message types and is the most visible non-financial deployment.

The most common ecosystem gotcha is the mismatch between SBE's mental model and what most engineers expect from a serialization format. SBE is not meant to be a generic typed binary protocol; it is meant to be a memory-layout specification. Engineers who try to use SBE as a Protobuf replacement find the schema language clunky, the evolution rules austere, and the ergonomics of optional fields hostile. They are not wrong; SBE is not for them.

The second gotcha is that the schema XML is not always exchanged between trading partners; sometimes a partner will publish a PDF specification that you must convert by hand into an SBE schema. The PDF and the schema are supposed to agree, but the version control story for that agreement is uneven. New SBE adopters benefit from running their generated code against published sample messages before going live.

When to reach for it

SBE is the right choice when latency dominates every other concern and the problem domain has fixed-width fields, well-defined message templates, and a manageable number of message types. The canonical case is electronic trading. Adjacent cases include hardware-in-the-loop simulation, high-throughput sensor pipelines, and any embedded system where the schema is stable and the access patterns are time-critical.

It is a defensible choice for any system where the access pattern is overwhelmingly read, the schema is highly stable, and the benefits of zero-cost field access outweigh the schema-evolution constraints.

When not to

SBE is the wrong choice for almost any other workload. Generic typed binary serialization (use Protobuf). Schema-evolving systems (use Avro or Protobuf). Systems where latency is not the binding constraint (any of the others). Systems where ergonomics matter (anything but SBE).

It is also the wrong choice when the language is one SBE does not support well; the long tail of language bindings is thin compared to Protobuf's.

Position on the seven axes

Schema-required. Not self-describing (the message header tells a generic dispatcher which schema and template to use; the schema itself is still mandatory to decode the body). Row-oriented. Zero-copy, in the strictest sense available in this book — the bytes literally are the in-memory layout. Codegen-only. Fully deterministic by spec. Evolution by strict append-at-the-end with explicit schema version bumps for anything else.

The cell SBE occupies — schema-required, fixed-offset, fully-deterministic, append-only evolution — is the strictest possible expression of the zero-copy idea, and it is the right expression for the workloads it was designed for.

A note on the FIX/FAST predecessor

Before SBE, the FIX Trading Community standardized FIX/FAST, a binary encoding of FIX messages designed to compress the textual form aggressively at the cost of complex stateful encoders and decoders. FAST used template-based delta encoding: each field in a message could be declared as encoded relative to the previous message's value of the same field, or as a constant, or as a copy-from-previous, or as several other operators. The result was extremely small messages — sometimes a quarter the size of FIX text, and competitive with SBE on size — at the cost of an encoder/decoder state machine that was hard to implement correctly and harder to debug.

FAST was deployed at several major exchanges through the 2000s and into the 2010s. Most have migrated off FAST to SBE. The reasons were operational: FAST's stateful encoding meant that a dropped packet on the multicast feed could desynchronize the decoder, with nontrivial recovery; FAST's complexity meant that a small number of vendors dominated the implementation space, with the resulting licensing concerns; and the latency-of-decode advantage of SBE's zero-copy approach was decisive once 10 GbE networking made the size advantage of FAST less critical.

The reason this matters for an SBE chapter is that SBE inherited its constituency directly from FAST, and the design decisions in SBE — fixed-offset layout, sentinel-encoded optionality, strict schema-versioning — are partly reactions against the operational costs of FAST. SBE traded size for simplicity, and in the particular community SBE was built for, the trade was the right one.

Epitaph

SBE is the format for nanoseconds-matter, schemas-rarely-change, hash-the-bytes deployments; austere, audited, and the unspoken default of electronic trading.

rkyv

rkyv is the youngest format in this book, and the only one designed specifically for the type system of a single language. It is a zero-copy serialization framework for Rust, built around the observation that Rust's type system, lifetime tracking, and ownership rules are strong enough to make a zero-copy format type-safe in ways that no language-agnostic format can be. The result is a format that produces astonishingly ergonomic Rust APIs on top of a wire encoding that is competitive with FlatBuffers and Cap'n Proto. The cost is that the format is Rust-only and that its schema is effectively the source code; sharing rkyv buffers across language boundaries is not on offer.

Origin

rkyv (pronounced archive — the name is a phonetic abbreviation) was started by David Koloski in 2020. The motivation was the observation that Rust's serde ecosystem, while excellent for parse-style serialization, did not provide a path to zero-copy deserialization, and the existing zero-copy formats (FlatBuffers, Cap'n Proto) were uncomfortable to use from Rust because their generated APIs did not match Rust's idioms. Specifically, both FlatBuffers and Cap'n Proto produce accessor objects that hold references into a buffer, and integrating those accessors with Rust's lifetime and trait systems required either heavy unsafe-code use or accepting a degraded API.

rkyv took a different approach: rather than generating accessor objects from a schema, it derives an archived form of an existing Rust type via a procedural macro. The archived form is a sister type — ArchivedPerson for a Person — with the same fields but with each field's type replaced by its archived equivalent (ArchivedString for String, ArchivedVec<T> for Vec<T>, and so on). The archived form is laid out in memory exactly the way the bytes on disk are laid out, which means that casting a byte buffer to &ArchivedPerson (with appropriate alignment and validation) gives you direct access to the data through Rust's normal field-access syntax. There is no parsing, no copying, no allocation, and no API ceremony; the archived struct is the buffer.

The format and library are both under active development; rkyv 0.7 is the version most production code targets, with rkyv 0.8 and the move to rkyv 1.0 introducing significant breaking changes to the archived layout and the API. This is one of the few places in this book where I am writing about a moving target, and the chapter accordingly emphasizes the durable design ideas rather than the specific layout of any particular version.

The format on its own terms

An rkyv archive is a contiguous byte buffer containing the archived form of one or more values. Values are placed in the buffer in dependency order: leaf values first (raw bytes, fixed-width primitives), then containers that reference them (strings, vectors, structs containing strings or vectors), and finally the root value. A root pointer at a known position (typically just before the end of the buffer) gives the offset of the root value's location within the buffer.

References between values are encoded as relative pointers: a 4-byte signed offset from the location of the pointer to the location of the pointed-to value. Relative pointers are position-independent; an archive can be loaded at any address in memory, or memory-mapped from disk, and the pointers continue to resolve correctly. This is a stronger property than absolute pointers (which require the buffer to be loaded at a specific address), and it is the same basic strategy FlatBuffers and Cap'n Proto use for their offsets; rkyv chose relative pointers partly for the ergonomic reason that they map naturally onto Rust's reference and lifetime model.
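
A relative pointer is just a signed offset added to its own position. The following sketch shows the arithmetic; it is illustrative, not rkyv's actual RelPtr implementation.

// Resolve a 4-byte little-endian relative pointer stored at `pos` within `buf`.
// The target is the pointer's own position plus the signed offset it stores.
fn resolve_rel_ptr(buf: &[u8], pos: usize) -> usize {
    let off = i32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap());
    (pos as i64 + off as i64) as usize
}

Because the target is computed from the pointer's own position rather than from a base address, the same bytes resolve correctly wherever the buffer happens to live, including straight off an mmap.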

The archived form of a primitive type is the same as its in-memory form, with byte order chosen at archive time. The archived form of a struct is the concatenation of its archived fields, in declaration order, with appropriate alignment padding. The archived form of a String is ArchivedString, which holds a relative pointer to the UTF-8 bytes and a length; the bytes themselves live elsewhere in the buffer. The archived form of a Vec<T> is ArchivedVec<T>, a relative pointer plus a length, with the elements (each in their archived form) stored contiguously elsewhere. The archived form of an Option<T> is ArchivedOption<T>, a discriminant byte plus the archived value (if Some).

The schema is the Rust source code. The #[derive(Archive, Serialize, Deserialize)] proc macro on a struct or enum generates the archived form, the serialization function, and the deserialization function (the latter being optional, since the zero-copy access pattern usually obviates the need for full deserialization). The schema is effectively a Rust type declaration; there is no separate IDL.

Validation — checking that a byte buffer is in fact a well-formed archive of the expected type — is a separate concern, handled by the bytecheck crate. By default, rkyv assumes archived buffers were produced by a trusted writer; reading an untrusted buffer without validation can dereference invalid relative pointers, read out-of-bounds memory, or trigger undefined behavior. The bytecheck-based validation walks the buffer recursively, checking that all relative pointers land within the buffer and that all values are well-formed for their types. Validation is substantially slower than zero-copy access (it touches every byte of the buffer); the rkyv pattern is to validate once on input and then access without further checks.
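
The validate-once-then-access pattern looks roughly like this — a minimal sketch, assuming rkyv 0.7 with the validation feature enabled (the 0.8/1.0 API differs in names but not in shape), and a trimmed two-field Person for brevity:

use rkyv::{Archive, Deserialize, Serialize};

#[derive(Archive, Serialize, Deserialize)]
#[archive(check_bytes)]          // derives the bytecheck impl for the archived form
struct Person {
    id: u64,
    name: String,
}

fn handle_untrusted(bytes: &[u8]) -> Result<u64, Box<dyn std::error::Error>> {
    // Walks the buffer and checks every relative pointer and value -- the expensive step.
    let archived = rkyv::check_archived_root::<Person>(bytes)
        .map_err(|e| format!("invalid archive: {e}"))?;
    // From here on, access is plain field reads with no further checks.
    Ok(archived.id)
}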

Wire tour

Schema (Rust source):

use rkyv::{Archive, Serialize, Deserialize};

#[derive(Archive, Serialize, Deserialize)]
#[archive(check_bytes)]
struct Person {
    id: u64,
    name: String,
    email: Option<String>,
    birth_year: i32,
    tags: Vec<String>,
    active: bool,
}

The archive of our Person value is roughly 136-144 bytes, depending on alignment choices and the specific rkyv version. The high-level layout, with bytes addressed from the start of the buffer:

0x00:  41 64 61 20 4c 6f 76 65 6c 61 63 65    "Ada Lovelace"
0x0c:  00 00 00 00                            padding
0x10:  61 64 61 40 61 6e 61 6c 79 74 69 63
       61 6c 2e 65 6e 67 69 6e 65 00 00 00    "ada@analytical.engine" + padding
0x28:  6d 61 74 68 65 6d 61 74 69 63 69 61
       6e 00 00 00                            "mathematician" + padding
0x38:  70 72 6f 67 72 61 6d 6d 65 72 00 00
       00 00 00 00                            "programmer" + padding
0x48:  e0 ff ff ff 0d 00 00 00                tags[0]: rel-ptr to "math…", len 13
0x50:  e8 ff ff ff 0a 00 00 00                tags[1]: rel-ptr to "programmer", len 10
0x58:  2a 00 00 00 00 00 00 00                Person.id: 42
0x60:  a0 ff ff ff 0c 00 00 00                Person.name: rel-ptr, len 12
0x68:  01 00 00 00 a4 ff ff ff 15 00 00 00    Person.email: Some, rel-ptr to "ada@…", len 21
0x74:  17 07 00 00                            Person.birth_year: 1815
0x78:  d0 ff ff ff 02 00 00 00                Person.tags: rel-ptr to tag list, len 2
0x80:  01 00 00 00                            Person.active: true (with padding)
0x84:  d4 ff ff ff                            root pointer: rel-ptr back to Person at 0x58

Total: about 136-144 bytes depending on alignment. The exact bytes of the relative pointers vary with the encoder's chosen layout order; the important point is that 0x84 (the last four bytes of the buffer) is the root pointer pointing back to the Person struct at 0x58. A reader does:

let archived = rkyv::access::<ArchivedPerson, _>(&buffer)?;
let id = archived.id;                  // direct field access
let name: &str = archived.name.as_str();

archived.id is a load from a known offset relative to the buffer. archived.name.as_str() is a load of the relative pointer plus a load of the length, then a slice operation. No parsing, no allocation. The Rust borrow checker enforces that the returned references do not outlive the buffer, which is exactly the correctness guarantee the zero-copy pattern needs.

If email were None, the buffer would shrink by the email string's bytes (about 24 bytes with padding), the discriminant would read 0 instead of 1, and the pointer and length slots would go unused.

Evolution and compatibility

rkyv's evolution story is the youngest and least settled in this book. The format is stable within a single version of rkyv and a single version of the schema; cross-version evolution is an ongoing area of work.

Within a single rkyv version, evolving the schema follows the constraints of zero-copy formats generally — if anything, more tightly, because there is no tag or blockLength machinery to paper over differences. Adding a field changes the archived layout, so old archives can only be read with the old type definition. Removing a field changes the layout and breaks old archives. Reordering fields breaks old archives. Changing a field's type breaks old archives.

Community crates such as rkyv_versioned exist to handle multi-version archives — by emitting a schema version tag alongside the archive and dispatching to the appropriate archived-form definition — but version tagging is not part of the rkyv core. Production deployments that need long-lived archives usually layer their own versioning on top: a discriminator ahead of each buffer, with explicit migration logic from old archived forms to new ones.

This is the area where rkyv is most clearly a younger format than its competitors. Cap'n Proto and FlatBuffers spent years working out evolution stories; rkyv has spent most of its existence working out the type-system gymnastics that let the archived form be derived from arbitrary Rust types, which is genuinely difficult and a major contribution. The evolution story will firm up over time. For now, rkyv archives in production should be treated as schema-stable: if the schema must change, the migration is explicit.

The deterministic-encoding question for rkyv has the same answer as for the other zero-copy formats. The wire format depends on the encoder's layout choices, alignment decisions, and pointer ordering; canonical encoding is achievable but is not the default. Most rkyv deployments do not require byte-equality and do not configure for it.

Ecosystem reality

The rkyv ecosystem is Rust-only and concentrated. The reference implementation lives at github.com/rkyv/rkyv and is maintained actively. The bytecheck crate handles validation; the rkyv_dyn crate handles trait-object archiving; the rkyv_typename crate handles type identification across versions. A handful of derived crates (rkyv_with, rkyv_versioned, rkyv_codec) provide additional features.

The deployments that use rkyv are concentrated in a few areas. Game development is the largest: several Rust-based game engines and tools use rkyv for asset serialization, where the load-time benefits are decisive. Database internals are another: a few embedded databases and content-addressed stores use rkyv for on-disk record formats. Server-side use is smaller; the canonical RPC frameworks in the Rust ecosystem (tonic for gRPC, tarpc) do not use rkyv, and most Rust services choose Protobuf or MessagePack instead.

The ecosystem gotchas are typical of a young format. Pin-to-version discipline matters: a buffer produced by rkyv 0.7 is not readable by rkyv 0.8, and crates that expose rkyv archives in their public API have to coordinate version bumps with consumers. Validation is opt-in and easy to forget; archives from untrusted sources must be validated, and the failure mode for unvalidated bad data is undefined behavior, not a clean error.

The most subtle gotcha is alignment. Reading an ArchivedPerson from a buffer requires the buffer to be aligned to the struct's alignment requirement (8 bytes, in our example). Reading an unaligned buffer is undefined behavior: a crash on architectures that enforce alignment, and at best a slower misaligned access on those that tolerate it. The rkyv::AlignedVec type exists to make alignment automatic, but consumers of buffers from external sources (mmap, network) have to handle alignment themselves.
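
A minimal sketch of the defensive pattern for externally sourced bytes, again assuming rkyv 0.7 and reusing the trimmed Person type from the validation sketch above: copy into an AlignedVec, which over-aligns its backing storage, before validating and accessing.

// `network_bytes` arrived from a socket or file read and may not be 8-byte aligned.
fn access_external(network_bytes: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    let mut aligned = rkyv::AlignedVec::new();   // over-aligned backing storage
    aligned.extend_from_slice(network_bytes);    // one copy buys a safe, aligned view
    let archived = rkyv::check_archived_root::<Person>(&aligned)
        .map_err(|e| format!("invalid archive: {e}"))?;
    println!("id = {}", archived.id);            // zero-copy access from here on
    Ok(())
}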

When to reach for it

rkyv is the right choice when you are working in pure Rust, the workload benefits from zero-copy access, and the schema is expected to be stable for the lifetime of the format. The classic cases: game asset loading, on-disk record formats for content-addressed storage, mmap-able caches.

It is the right choice when the alternative is FlatBuffers or Cap'n Proto in Rust, and the developer ergonomics of those formats are unsatisfying. The Rust APIs rkyv produces are substantially nicer than the generated Rust bindings of either of its competitors.

When not to

rkyv is the wrong choice when cross-language compatibility is required; the format is Rust-only and will remain so. It is the wrong choice when archives must outlive schema versions and you are not prepared to write explicit migrations. It is the wrong choice when small-buffer overhead is unacceptable; the relative-pointer machinery has fixed overhead that does not amortize for tiny payloads.

It is also the wrong choice when validation is required on a hot path; bytecheck-based validation is fast but not free, and the cost is meaningful when the buffer is large.

Position on the seven axes

Schema-required (the schema is the Rust type). Not self-describing (reading requires the matching Rust type to be defined; there is no in-band schema descriptor). Row-oriented. Zero-copy. Codegen-only, via proc macros. Non-deterministic by default; canonical encoding achievable. Evolution constrained, with versioning typically layered above the format.

The cell rkyv occupies — schema-required, zero-copy, derived from existing source-language types rather than from an IDL — is the unique consequence of building a zero-copy format inside a language with strong type and lifetime systems. The format is the strongest demonstration in this book of the principle that the right serialization format for a single language can be much nicer than any cross-language format can be, and the strongest argument that cross-language compatibility is a feature with non-zero cost.

A note on the broader serde-binary picture

rkyv exists in a Rust ecosystem with several other binary serialization options, and the choice between them is worth understanding even if you ultimately pick rkyv. bincode is the de facto default serde binary format, compact and fast to parse; postcard is a no-std-friendly varint-encoded serde format used in embedded contexts, and usually the smaller of the two; speedy is a serde-adjacent format that prioritizes decode speed; abomonation is a zero-copy format that is deeply unsafe by design and predates rkyv. Each has its constituency.

The dimension on which rkyv is unique among Rust formats is zero-copy with a safe access pattern. bincode and postcard are parse-style; speedy is parse-style with optimization; abomonation is zero-copy but unsafe in a way Rust users have learned to avoid. rkyv's contribution is to make zero-copy access type-safe in Rust, and the type-safety is what justifies the relative-pointer machinery and the derived archived form. The cost is the constraints and the youth of the ecosystem; the benefit is that what you get works inside the Rust type system in a way no other format does.

For users in the Rust ecosystem who don't need zero-copy, the right choice is usually Protobuf via prost or tonic, MessagePack via rmp-serde, or CBOR via ciborium. rkyv is not the default Rust binary format; it is the format for the specific workloads where its zero-copy access is decisive.

Epitaph

rkyv is the format that asks what a zero-copy archive looks like when designed inside Rust's type system rather than around it; the ergonomics are the headline, and the wire format is the consequence.

Apache Arrow IPC

Apache Arrow is two things at once, and the relationship between them is worth getting clear before going further. Arrow the in-memory format is a specification for how columnar data should be laid out in RAM: validity bitmaps, contiguous data buffers, length-prefixed offsets, all aligned and padded so that any reader on the same machine can use the same bytes without copying. Arrow IPC is the wire encoding for moving Arrow-formatted data between processes — over a socket, in a file, through a shared memory segment. The two are designed together, and the wire format is essentially a dump of the in-memory layout with a small amount of framing metadata around it.

This chapter covers the IPC encoding because that is the part that ends up on disk and on the wire. The in-memory layout is described where necessary to explain what the bytes mean.

Origin

Arrow grew out of conversations between Wes McKinney (creator of pandas) and several other dataframe library authors in 2016. The problem they were solving was the Tower of Babel of analytical data: pandas, R data frames, Spark DataFrames, Apache Drill's query engine, Impala, and a dozen other systems all worked with column-oriented data, and all of them used different in-memory representations, and the cost of converting between representations when these systems wanted to interoperate was approaching the cost of the actual analysis. Every pair of systems that wanted to exchange a million rows had to write a converter, and the converter had to read every byte twice — once to parse the source format and once to construct the destination format.

Arrow's design proposition was that all of these systems should use the same in-memory format, so that interoperability would mean sharing buffers rather than converting them. The design committee was unusually broad — McKinney, Hadley Wickham, Jacques Nadeau, Julian Hyde, and others, drawn from the major dataframe and query engine projects — and the resulting specification was deliberately constrained: a small number of layouts (each well-defined), explicit handling of null values via validity bitmaps, and a metadata schema layer (using FlatBuffers, of all things) for describing column types and structures.

Arrow IPC was declared stable with the Arrow 1.0 release. Its adoption since then has been remarkable: pandas, Spark, R's arrow package, DuckDB, ClickHouse, Polars, DataFusion, and a long tail of analytical systems all speak Arrow IPC natively. Apache Parquet (covered in the next chapter) and Arrow IPC are the two pillars of the modern analytical data stack; Arrow is the in-flight and in-memory format, Parquet is the at-rest format.

The format on its own terms

Arrow IPC is a sequence of messages. Each message has a small metadata header (encoded as a FlatBuffers buffer, because the Arrow team chose FlatBuffers as the metadata format for the same zero-copy reasons that motivate Arrow itself) and a body of one or more buffers. The body buffers are the raw column data: arrays of validity bits, offsets, and values, laid out exactly as they would be in memory.

Two message types do most of the work. Schema messages describe the column layout: the column names, types, nullability, child columns (for nested types). A schema message has no body buffers; its content is entirely in the metadata. RecordBatch messages carry one batch of rows for a known schema. The metadata header gives the batch's row count and a list of buffer offsets-and-lengths within the body; the body itself is a contiguous block containing each column's buffers in the order the schema describes.

The framing for each message is: a 4-byte continuation marker (0xFFFFFFFF, distinguishing valid messages from end-of-stream), a 4-byte little-endian integer giving the metadata length, the metadata bytes, padding to 8-byte alignment, the body length (implicit, derivable from the metadata), and the body bytes. The end of a stream is marked by a 4-byte continuation marker followed by a zero metadata length.

There are two file formats and one streaming format that wrap the message stream. The Arrow IPC streaming format is the bare sequence of messages described above. The Arrow IPC file format adds a magic header (ARROW1), a footer with a metadata catalog that lists every record batch's location in the file (enabling random access), and a magic trailer (ARROW1). The file format is what the canonical .arrow file extension refers to and is what pyarrow.feather.write_feather produces (more on Feather in chapter 19).
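
To make the message sequence concrete, here is a minimal round trip through the streaming format using the Rust arrow crate (arrow-rs). This is a sketch; the API names are as of recent arrow-rs releases, and the module paths have shifted between major versions.

use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::UInt64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt64Array::from(vec![42u64])) as ArrayRef,
            Arc::new(StringArray::from(vec!["Ada Lovelace"])) as ArrayRef,
        ],
    )?;

    // Write: one schema message, one record batch message, then the end-of-stream marker.
    let mut buf: Vec<u8> = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buf, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }

    // Read: the schema comes out of the stream itself; the record batches follow.
    let reader = StreamReader::try_new(buf.as_slice(), None)?;
    for maybe_batch in reader {
        let batch = maybe_batch?;
        println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}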

The columnar buffers for a primitive-typed column are: a validity bitmap (one bit per row, indicating presence), and a data buffer (packed values, in the natural width of the type). For variable-length types (strings, binary), there is also an offsets buffer of int32 values, with the i-th string occupying bytes [offsets[i], offsets[i+1]) of the data buffer. For nested types (lists, structs), the column has child columns, each of which follows the same layout recursively.
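
The offsets-plus-data layout for variable-length columns can be seen directly through arrow-rs accessors — a sketch, with accessor names as of recent releases:

use arrow::array::{Array, StringArray};

fn main() {
    // Two rows: a value and a null. One validity bitmap, one offsets buffer, one data buffer.
    let names = StringArray::from(vec![Some("Ada Lovelace"), None]);
    println!("{:?}", names.value_offsets()); // [0, 12, 12]: row i spans offsets[i]..offsets[i+1]
    println!("{}", names.is_null(1));        // true: the validity bitmap marks row 1 absent
    println!("{}", names.value(0));          // "Ada Lovelace", sliced out of the shared data buffer
}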

This layout is what makes Arrow zero-copy across compatible systems. A pandas DataFrame and a Polars DataFrame, both using Arrow as their in-memory format, can share the same buffer for a column of integers. The bytes are aligned, the validity bits are in the standard position, the lengths and offsets are computed in the standard way. No conversion happens.

Wire tour

Encoding a single Person record as a one-row Arrow IPC stream is maximally inefficient — Arrow's overhead is fixed and amortizes over the row count — but produces an honest demonstration of the format. The schema, expressed as a FlatBuffers message, declares six fields with their types; that schema is roughly 400 bytes when encoded. The record batch metadata, also a FlatBuffers message, is about 200 bytes for our six columns; it lists the buffer offsets and lengths within the body. The body is the columnar data:

column 0 (id, uint64):
  validity bitmap: 1 byte (0x01, indicating row 0 is valid), padded to 8
  data: 8 bytes (42 little-endian) padded to 8
  total: 16 bytes

column 1 (name, string):
  validity bitmap: 1 byte (0x01), padded to 8
  offsets: 2 int32 values (0, 12), padded to 8
  data: 12 bytes "Ada Lovelace", padded to 16
  total: 32 bytes

column 2 (email, string nullable):
  validity bitmap: 1 byte (0x01), padded to 8
  offsets: 2 int32 values (0, 21), padded to 8
  data: 21 bytes, padded to 24
  total: 40 bytes

column 3 (birth_year, int32):
  validity: 8 bytes; data: 4 bytes padded to 8
  total: 16 bytes

column 4 (tags, list<string>):
  list validity: 8 bytes
  list offsets: 2 int32 (0, 2), padded to 8
  child: string column with 2 entries
    child validity: 8 bytes
    child offsets: 3 int32 (0, 13, 23), padded to 16
    child data: 23 bytes, padded to 24
    child total: 48 bytes
  list total: 8 + 8 + 48 = 64 bytes

column 5 (active, bool):
  validity: 8 bytes; data: 1 bit, padded to 8
  total: 16 bytes

Body total: 16 + 32 + 40 + 16 + 64 + 16 = 184 bytes. Add the schema message (about 408 bytes including framing), the record batch message (about 240 bytes including framing), and the end-of-stream marker (8 bytes: a continuation marker followed by a zero metadata length), and the on-the-wire byte count for this single record is approximately 840 bytes. This is the worst single-record showing of any format in this book by a wide margin.

The right way to read this number is to project it over a workload where the format makes sense. A million Person records in Arrow IPC, in a single record batch, would have approximately the same fixed metadata overhead (about 650 bytes) plus the columnar body. The body for a million records is roughly:

id:           ~8 MB (1M rows * 8 bytes)
name:         ~16 MB if names average 12 bytes (12 MB data + 4 MB int32 offsets)
email:        ~25 MB (21 MB data + 4 MB offsets)
birth_year:   ~4 MB
tags:         variable (most rows have a small tag list)
active:       ~125 KB (1M bits)

Per record, that is around 50 bytes plus the average tags size, which is competitive with Protobuf and below Avro. The fixed overhead amortizes to nothing. Arrow's per-record cost is low; its per-batch cost is high; and the right read is at the batch granularity.

Evolution and compatibility

Arrow IPC's evolution story is unusual because the format is designed primarily for in-memory interoperability rather than long-term archival. The schema is part of the stream, which makes streams self-describing in the sense that any reader can decode them given just the bytes. The schema can change from stream to stream, but it cannot change within a stream; once a schema message has been emitted, all subsequent record batches must conform to it.

For long-term storage of Arrow IPC files, the schema is in the file's footer, and the file is self-describing. Adding a column requires writing a new file with the new schema; old files with the old schema continue to be readable. There is no in-format mechanism for a single file to contain multiple schema versions, but Arrow files are typically managed by file-level orchestration (partitioned by date, by version) where the schema-per-file arrangement is acceptable.

Within a schema, the column types form a closed set: integers of various widths, floats, decimals, strings, binary, dates, timestamps, durations, intervals, lists, structs, unions, maps, and a handful of others. Arrow has been adding types incrementally (decimal128 was followed by decimal256, the various interval types were added over time), and consumers of older Arrow versions cannot read streams that use types they don't recognize. The canonical advice is to pin to a stable Arrow version for long-lived files and rely on Arrow's compatibility guarantees (which are documented and have held up well) for the interoperability cases.

The deterministic-encoding question for Arrow IPC is partially answered. The columnar layout is deterministic given a value: the buffers are the same bytes regardless of who produced them. The metadata, however, is FlatBuffers, and FlatBuffers is not deterministic by default. Two Arrow producers can produce byte-different metadata for the same logical schema, and so two Arrow IPC streams encoding the same data are not guaranteed to be byte-equal. For applications that need byte-equality, the body bytes can be hashed independently of the metadata.

The Arrow Flight protocol

A note worth including: Arrow IPC is the wire format underneath Arrow Flight, the gRPC-based service interface for moving Arrow data between systems at high throughput. Flight wraps Arrow IPC streams in gRPC's bidirectional streaming and adds metadata for authentication, schema discovery, and parallel data transfer. Flight is to Arrow IPC roughly what gRPC is to Protobuf: the canonical RPC layer.

Flight has been adopted by Dremio, several cloud data services, and many of the dataframe libraries that integrate with Arrow. The performance numbers Flight achieves (multi-gigabyte-per-second single-stream transfers, for typical analytical workloads) are unmatched by any other RPC over typed records, because the wire bytes are the in-memory format and the deserialization cost on both ends is essentially zero.

Ecosystem reality

Arrow's ecosystem is the largest in the columnar-format space and is growing. The reference implementations are in C++, Java, Rust, Go, Python (pyarrow), R (arrow), JavaScript (apache-arrow), C# (Apache.Arrow), and Julia. The C++ and Java implementations are first-class and feature-complete; the Rust implementation (arrow-rs) is mature and increasingly the basis for other high-performance analytical systems (DataFusion is built on it). Python and R use the C++ implementation under the hood.

The deployments that use Arrow IPC are concentrated in analytical workloads. Pandas can read and write Arrow files. Spark uses Arrow internally for its Python UDF execution and for PySpark/pandas interop. DuckDB can ingest and emit Arrow streams natively. ClickHouse has Arrow support for its Arrow and ArrowStream formats. Polars uses Arrow as its native in-memory format, full stop. The dataframe library landscape has converged on Arrow as the lingua franca, and the convergence has been the single largest improvement in analytical-data tooling in the last decade.

The most consequential ecosystem fact about Arrow is that the in-memory layout is genuinely zero-copy across systems. A query engine in Rust and a visualization library in Python can hold the same buffer simultaneously, with no conversion, and have the same view of the data. This is rare; almost no other format achieves it in practice, and Arrow's design is the proof that it is achievable when the systems agree to align their internal representations around the format.

The ecosystem gotchas are worth noting. First, the metadata overhead: for small batches, the FlatBuffers metadata can dominate the message size. The right pattern is to batch aggressively (thousands to millions of rows per batch), and consumers that emit one-row batches are paying enormous overhead. Second, dictionary encoding — Arrow's mechanism for representing low-cardinality strings as integer indices into a separate dictionary buffer — is supported by all the major implementations but with subtle differences in default behavior; producers and consumers should agree on dictionary policy explicitly. Third, the delta dictionary mechanism (which lets dictionaries grow across batches) is supported unevenly; sticking to the more restrictive replacement dictionary mode is safer for cross-implementation interop.

When to reach for it

Arrow IPC is the right choice for analytical data interchange between processes that can keep the bytes in memory: dataframe library interop, query-engine-to-dashboard streaming, parallel data transfer between cluster nodes. It is the right choice as the runtime format for modern analytical engines (DuckDB, Polars, DataFusion all use it natively).

It is the right choice for Arrow Flight services where high throughput is required.

It is a defensible choice for short-term on-disk caching of intermediate analytical results, especially if the next consumer is going to load the file into a dataframe library. The self-describing schema makes such files well-formed without external metadata.

When not to

Arrow IPC is the wrong choice for long-term archival storage; the schema-per-file arrangement is workable but Parquet is purpose-built for the at-rest case and is better at it. It is the wrong choice for transactional data (single records, frequent updates); the batch-oriented layout amortizes badly. It is the wrong choice for inter-service typed RPC where the messages are small business records; Protobuf or Avro fit those workloads better.

It is also the wrong choice when the consumers cannot use the in-memory format directly, because the principal benefit of Arrow is that the bytes are usable as memory. If the consumer is going to translate to its own representation anyway, the cost of Arrow IPC's metadata overhead is harder to justify.

Position on the seven axes

Schema-required (the schema is in the stream). Self-describing (the schema is in the stream — Arrow IPC is one of the few formats where these two are simultaneously true). Columnar. Zero-copy in the strict sense for compatible in-memory consumers. Codegen for the FlatBuffers metadata; runtime-typed otherwise. Body bytes are deterministic; metadata is not by default. Evolution within a stream is forbidden; cross-stream evolution is the application's concern.

Arrow's stance is the strongest expression of the columnar idea in this book and the strongest demonstration that self-describing and schema-required are not antonyms; the schema is part of the self-description.

Epitaph

Arrow IPC is the wire format for the in-memory layout that dataframe libraries finally agreed on; column-oriented, zero-copy across compatible systems, and the reason analytical data tooling in 2026 is so much less painful than it was in 2014.

Parquet

Parquet is the format that won at-rest analytical storage. It is the default file format for almost every modern data warehouse, the canonical output of Spark and Snowflake and BigQuery, the on-disk format that table-format projects (Iceberg, Delta Lake, Hudi) wrap without replacing, and the format that any new analytical system has to support before it can be taken seriously. The wire format is considerably more complex than anything else in this book, and the complexity is in service of a single property — predicate pushdown and column pruning at file granularity — that pays for itself in storage costs and query latency on every workload Parquet was built for.

Origin

Parquet was created jointly at Twitter and Cloudera in 2012-2013, based on the Dremel paper that Google had published in 2010. Dremel's central insight was that columnar storage with explicit repetition and definition levels could represent arbitrarily nested data structures losslessly while preserving columnar access patterns. Twitter's data engineering team had a Hadoop deployment processing petabytes of nested log data, and the existing options — row-oriented sequence files, the early columnar attempts in Hive RCFile and ORC — either lacked the nested-data story or lacked the broad ecosystem support necessary for deployment. Twitter and Cloudera collaborated on Parquet to fill the gap.

The format was donated to Apache in 2013 and graduated to top-level project status in 2015. Adoption since then has been uniform across the analytical-data world: every major open-source data engine supports Parquet for read and write, every major cloud data warehouse can ingest Parquet directly, and every modern table-format project (Apache Iceberg, Delta Lake, Apache Hudi) uses Parquet as its underlying file format. The format's specification has been remarkably stable; the original 2013 spec is largely unchanged in its essentials, with the later additions (the v2 encodings, stricter type metadata, the standardized LIST and MAP logical types) all backward-compatible.

The reason Parquet won is straightforward when the workload is clear. For analytical queries that touch a small fraction of columns out of many, and that filter a small fraction of rows out of many, columnar storage with statistics-driven row-group skipping reduces the I/O cost dramatically. A query that selects three columns from a thousand, and filters to the last day's worth of rows out of a year's, can read 0.1% of the file's bytes if the file is well-organized. The orders of magnitude this represents are why every cloud warehouse storage bill is denominated in Parquet bytes.

The format on its own terms

A Parquet file is a sequence of row groups, each containing column chunks, each containing one or more pages, all preceded and followed by magic numbers and concluded by a metadata footer encoded in Thrift.

The file's overall structure is:

[magic: "PAR1" (4 bytes)]
[row group 1]
  [column chunk 1.1: dictionary page, data page(s)]
  [column chunk 1.2: ...]
  ...
[row group 2]
  ...
[footer metadata: thrift-encoded FileMetaData]
[footer metadata length: 4 bytes little-endian]
[magic: "PAR1" (4 bytes)]

The footer metadata is read first by every Parquet reader. The last 8 bytes of the file give the metadata length and the trailing magic; the reader seeks back past that many bytes, reads the metadata into memory, and obtains a complete schema, the locations of every row group and column chunk, statistics (min/max/null count) for every column chunk, and the encodings used. This footer-first design means a reader can make all of its metadata-driven decisions (which row groups to skip, which columns to read, which pages to decompress) without reading any of the data pages.
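
A sketch of the footer-first read path with the Rust parquet crate (module paths as of recent arrow-rs/parquet releases; the file name is hypothetical). Everything printed here comes from the footer, before any data page is touched.

use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("people.parquet")?;       // hypothetical file
    let reader = SerializedFileReader::new(file)?;  // reads only the footer
    let meta = reader.metadata();

    println!("rows: {}", meta.file_metadata().num_rows());
    println!("row groups: {}", meta.num_row_groups());

    // Per-column-chunk statistics are what drive row-group skipping and column pruning.
    let rg = meta.row_group(0);
    for (i, col) in rg.columns().iter().enumerate() {
        println!(
            "column {}: {} bytes compressed, has stats: {}",
            i,
            col.compressed_size(),
            col.statistics().is_some()
        );
    }
    Ok(())
}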

A row group is a horizontal partition of the file's rows, typically containing a few hundred thousand to a few million rows. Each row group is self-contained: it has all of its columns' chunks, all the statistics, and can be processed independently. The row group is the granularity at which parallelism happens; multiple workers each read different row groups in parallel.

A column chunk is the contiguous bytes for a single column within a single row group. Column chunks contain one or more pages: a dictionary page (if the column uses dictionary encoding) followed by data pages. Each page has its own header, its own encoding, and its own optional compression. The page is the smallest unit at which encoding and compression are applied.

The page-level encodings are where Parquet's density comes from. PLAIN encoding writes the values verbatim (8 bytes per int64, 4 bytes per int32, length-prefixed for variable types). Dictionary encoding replaces values with integer indices into a dictionary page, which is dramatic for low-cardinality columns. RLE (run-length encoding) compacts runs of repeated values. Bit-packing combines small integer values into shared bytes. Delta encoding writes successive values as differences from the previous, which is dramatic for sorted or nearly-sorted columns like timestamps. Combinations of these (RLE + bit-packing, delta + dictionary) are common.

On top of the per-page encodings, each column chunk can be compressed with one of several codecs: Snappy, gzip, Brotli, LZ4, or Zstandard. The choice trades CPU cost for compression ratio; Snappy is the historical default, Zstandard is increasingly preferred for modern hardware where the decode cost is dominated by I/O.
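
A sketch of the write side, again with the Rust parquet crate and an arrow-rs RecordBatch built as in the Arrow chapter's example; the codec and row-group size are set through WriterProperties (names as of recent releases).

use std::fs::File;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// Write a batch of rows to a Parquet file with Snappy-compressed pages.
fn write_people(path: &str, batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)   // per-page compression codec
        .set_max_row_group_size(100_000)        // row-group granularity for skipping and parallelism
        .build();
    let file = File::create(path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;                        // buffers rows; row groups flush at the limit
    writer.close()?;                             // writes the Thrift footer and trailing magic
    Ok(())
}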

The Dremel-derived part is the encoding of nested data. A non-nested column (an int32 column, say) just has values. A column inside a list, or a column inside a struct that may be absent, requires extra information to know which value belongs to which row, especially when the row's nesting structure is irregular. Parquet encodes this with two streams of integers per column: repetition levels (which level of nesting starts a new list at this value) and definition levels (how many of the optional/list levels are present at this value). For our flat Person record there are no repetition levels and the definition levels are trivial; for a nested record (a Person with multiple addresses, each with multiple phone numbers) the levels become substantial and are the part of Parquet that takes a year to fully internalize.

The schema itself is part of the footer metadata, expressed as a Thrift-encoded recursive structure of primitive types (int32, int64, float, double, boolean, byte_array, fixed_len_byte_array, int96 for timestamps in older versions) wrapped in logical types (string, decimal, date, timestamp, list, map). The split between primitive and logical types is the format's mechanism for adding new high-level types over time without changing the wire format: new logical types can wrap existing primitives, and old readers that don't recognize the logical type fall back to the primitive.

Wire tour

Encoding our single Person record into a Parquet file is, again, maximally inefficient — Parquet's overhead is fixed per file and per row group, and a single-row file pays the full overhead. The structure looks like:

[PAR1 (4 bytes)]
[row group 1]
  [column chunk for id]
    [data page header (~30 bytes)]
    [data page: definition level (1 bit, packed) + value 42 (8 bytes PLAIN)]
  [column chunk for name]
    [data page header]
    [data page: definition level + length-prefixed "Ada Lovelace"]
  [column chunk for email]
    ...
  [column chunk for birth_year]
    ...
  [column chunk for tags (list<string>)]
    [data page: repetition levels, definition levels, length-prefixed strings]
  [column chunk for active]
    ...
[footer: thrift metadata describing schema, row group, column chunks, stats]
[footer length (4 bytes)]
[PAR1 (4 bytes)]

The total size for a single Person record is approximately 1.5 to 2 KB depending on the encoding choices and the verbosity of the metadata. The schema metadata alone is several hundred bytes; each column chunk has a 30-50 byte page header even if the data is tiny; the file footer with statistics for each column is substantial.

The right way to read this is again to project over a realistic workload. For a million Person records in a single Parquet file with row groups of, say, 100,000 rows each:

id (uint64): dictionary-encoded if the IDs are dense (rare for IDs); 
   otherwise PLAIN at 8 bytes/row, ~8 MB total.
name: usually PLAIN-encoded with snappy compression; the length 
   prefix plus UTF-8 bytes plus compression yields ~8 bytes/row 
   on average for English-language names.
email: same, ~14 bytes/row.
birth_year: dictionary-encoded if the year range is small; with 
   ~100 distinct years out of a million rows, dictionary plus 
   bit-packing yields ~1 byte/row. PLAIN encoding would be 4 
   bytes/row.
tags: list<string> with low-cardinality tags is heavily 
   dictionary-encoded; ~3 bytes/row average for our two-tag rows.
active: a single bit per row plus minimal overhead; ~1 KB total 
   for a million rows.

On a typical workload, Parquet averages 20-30 bytes per record, often less. In storage-cost terms this is below MessagePack and well below JSON or BSON, and the reduction comes without any loss of structure. The format is also queryable: a predicate that filters to active=true can use the active column's statistics to skip whole row groups; a query that selects only id and birth_year reads only those column chunks and ignores everything else.
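
A sketch of the selective read path with the Rust parquet crate: project two columns out of six and let the reader leave the other column chunks untouched. API names are as of recent arrow-rs/parquet releases; the file name and leaf indices assume the column order used throughout this chapter, and predicate pushdown (via a row filter) is available but omitted here.

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("people.parquet")?;                       // hypothetical file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;  // footer only, so far

    // Read just the leaf columns for id (index 0) and birth_year (index 3);
    // the other column chunks are never fetched or decompressed.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0, 3]);
    let reader = builder.with_projection(mask).build()?;

    for batch in reader {
        let batch = batch?;
        println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}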

Evolution and compatibility

Parquet's schema-evolution story is unusual and worth understanding in detail because it intersects with the table-format projects that sit on top of Parquet.

At the file level, the schema is fixed: every row group in a Parquet file conforms to the file's footer-declared schema. Adding a column means writing a new file with the new schema; the old files retain the old schema, and a query engine reading both must reconcile. The reconciliation rules are not part of Parquet itself; they are the responsibility of the table format (Iceberg, Delta Lake, Hudi) or the query engine (Spark, DuckDB) reading the files.

The conventional reconciliation rules are: a column present in some files but not others is treated as null in the files where it's absent; a column whose type has changed is reconciled via a small set of allowed type promotions (int32 → int64, etc.); a renamed column is matched by name across files (if the table format supports field-ID-based mapping, as Iceberg does, the mapping is explicit; otherwise it's name-based).

The deterministic-encoding question for Parquet is interesting. The format is deterministic given a fixed encoding strategy and a fixed compression codec, but the encoder has substantial freedom in row group boundaries, column chunk encoding choices, page sizes, dictionary contents, and statistics emission. Two Parquet files with the same logical content can be byte-different if the encoders make different choices. There is no canonical encoding; applications that need byte-equality (content-addressed storage, immutable file commits) typically hash the logical content rather than the bytes.

The biggest evolution decision Parquet has made was the v2 encodings introduced around 2017: DELTA_BINARY_PACKED for integers, DELTA_LENGTH_BYTE_ARRAY for variable-length strings, DELTA_BYTE_ARRAY for sorted strings, and a few others. These encodings are dramatically denser for the workloads they target (timestamps, sorted ID columns, sorted string columns) but were not universally supported by readers for several years after their specification. As of 2026, all major Parquet readers support v2 encodings, but pinning to v1 encodings is still common in deployments that need to support older readers.

Ecosystem reality

Parquet's ecosystem is the largest of any format in this book. The reference Java implementation (parquet-mr) is the canonical producer for the JVM stack: Spark, Hive, Hadoop, and most of the data engineering tools in that lineage. The C++ implementation (parquet-cpp) is the producer for the C++/Python stack: pandas, DuckDB, ClickHouse, and the broader Arrow-aligned tooling. The Rust implementation (parquet, part of the arrow-rs project) is the producer for the Rust stack: DataFusion, Polars in some configurations, the various analytical Rust projects.

The cloud data warehouses — BigQuery, Snowflake, Redshift, Athena — all support Parquet as their canonical external table format. A Parquet file in cloud object storage is the universal interchange format for analytical data; emit a Parquet file and any of the warehouses will read it.

The table-format projects layered on top of Parquet are the most important architectural development in analytical data of the past decade. Apache Iceberg, Delta Lake (originated at Databricks), and Apache Hudi all use Parquet as the underlying file format and add a metadata layer that handles ACID transactions, time travel, schema evolution, and partition management. The metadata abstraction is what makes data lakes behave like databases, and Parquet is the layer underneath it.

The most consequential ecosystem fact about Parquet is that it has converged with Arrow. The C++ Parquet implementation now lives inside the Arrow C++ codebase, and the Rust parquet crate is part of the arrow-rs project; reading a Parquet file into an Arrow record batch is the standard ingest path for most analytical engines; the Parquet schema and the Arrow schema are designed to map cleanly onto each other. The Parquet-Arrow combination is the at-rest plus in-memory pair that defines the modern analytical stack.

Ecosystem gotchas worth noting. First, timestamp encoding: older Parquet files used the int96 type for timestamps, which has been deprecated in favor of int64 with logical type TIMESTAMP. Mixing int96 and int64 timestamps across files in the same dataset is a common source of bugs. Second, decimal precision: Parquet supports decimals via fixed-length byte arrays with a logical type, and the precision must match between writer and reader. Third, nested type evolution: changing the structure of a list<> or map<> column is not always safe across versions, and Iceberg's field-ID-based mapping is the recommended way to handle nested schema evolution.

When to reach for it

Parquet is the right choice for at-rest analytical data. Period. There is no other format that competes for this role across the breadth of tooling Parquet supports.

It is the right choice as the underlying file format for table formats (Iceberg, Delta Lake, Hudi). It is the right choice for cloud-object-storage-backed data warehouses. It is the right choice for any analytical pipeline whose output will be consumed by more than one system.

When not to

Parquet is the wrong choice for transactional storage; the write-once batch-oriented file format does not support row-level updates without a table format on top. It is the wrong choice for small data sets where the fixed metadata overhead dominates. It is the wrong choice for streaming data ingestion where files need to be visible the moment a record arrives; the row-group discipline imposes batching.

It is also the wrong choice for inter-service RPC, log payloads, or any other workload where the columnar layout is not what the consumer wants. Use Protobuf, Avro, or MessagePack for those.

Position on the seven axes

Schema-required (the schema is in the file footer). Self-describing (the schema, statistics, and encoding metadata are all in the footer). Columnar. Parse rather than zero-copy in the strict sense, though the metadata-then-pages pattern enables substantial read avoidance. Codegen for the Thrift footer; runtime for everything else. Non-deterministic in bytes; canonical encoding not specified. Evolution is file-level; in-format evolution is not supported.

The cell Parquet occupies — schema-required, self-describing, columnar, predicate-pushdown-aware — is the canonical at-rest analytical format and the strongest expression of the columnar idea for storage.

Epitaph

Parquet is the at-rest format the analytical-data world settled on, because columnar storage with metadata-driven pruning is, on realistic queries, an order of magnitude cheaper than every alternative.

ORC

ORC and Parquet are siblings. They were designed in the same year, by overlapping communities, for the same workload, with overlapping goals. The technical differences are real but small; the ecosystem differences turned out to be decisive. Parquet took the Hadoop world, then the cloud world, then everything that came after; ORC retained strong adoption in the Hive-centric stacks where it was born and has been losing ground gradually ever since. Reading the two formats side by side is the clearest way to understand both, and reading ORC specifically — even briefly — is the right way to understand the parts of Parquet that look opaque until you have something to compare them with.

Origin

ORC, Optimized Row Columnar, was developed at Hortonworks in 2013 to replace RCFile, the early columnar format used by Hive. RCFile was rudimentary — fixed-row-group layout, no statistics, limited encoding choices — and Hive's query engine was leaving substantial performance on the table because of its limitations. Hortonworks designed ORC as a wholesale replacement: full statistics for predicate pushdown, type-aware encodings, lightweight indexes, support for nested types, and tight integration with Hive's vectorized execution engine.

ORC was contributed to Apache and released under the Apache 2.0 license. Its trajectory from there ran in parallel with Parquet's but in different orbits. Within the Hortonworks Data Platform — Hive, Tez, Hadoop with Hortonworks distributions — ORC became the default columnar format. Within the broader ecosystem (the Cloudera distribution, the cloud warehouses, the dataframe libraries, the new query engines), Parquet became the default. The Cloudera-Hortonworks merger in 2019 brought the two distributions together, and the combined company has supported both formats officially since then, but the broader gravitational pull of Parquet has continued.

The ORC project remains active. The format has had small additive improvements over the years (bloom filters, additional encodings, compression codecs), and the reference implementations (Java in the Hive lineage, C++ for native consumers) are maintained. ORC files in production environments are common; ORC files written by new pipelines outside the Hortonworks-derived stack are rare.

The format on its own terms

An ORC file is a sequence of stripes, each of which is a horizontal partition of the file's rows (stored column by column within the stripe), followed by a footer section containing metadata. The structure is:

[stripes: stripe 1, stripe 2, ...]
[file footer: protobuf metadata, schema, stripe locations]
[postscript: compression info, footer length, version]
[postscript length: 1 byte]

Each stripe contains:

[index data: position-aware indexes per column]
[row data: column streams, encoded and compressed]
[stripe footer: protobuf metadata for the stripe]

A column's data in ORC is the per-column equivalent of Parquet's column chunk. ORC further breaks each column down into multiple named streams: a PRESENT stream (the validity bitmap), a DATA stream (the actual values), a LENGTH stream (for variable-length types), a DICTIONARY_DATA stream (for dictionary encoding), and others depending on the column's type. The streams within a column are conceptually similar to Parquet's encoding choices but are exposed as separate, independently compressed substreams rather than packed into a single page byte sequence.

Encodings in ORC are oriented around type rather than chosen generically. Integers use RLE v2 (a sophisticated run-length encoding that adapts to data patterns), strings use either dictionary or direct encoding depending on cardinality, floats use direct encoding, booleans use bit-packing. Compression is per-stripe and per-stream, with codec choices including Snappy, zlib, LZO, Zstandard, LZ4, and Brotli. The codec is uniform across streams within a stripe but can differ between stripes if the file's metadata declares it.

The metadata format is, like Parquet's, a structured language — ORC uses Protobuf rather than Thrift, which is one of the small historical differences between the two formats and is operationally invisible to most users. The footer includes the schema, the per-stripe statistics (min, max, count, sum, etc.), and the file-level statistics aggregated across stripes. ORC also supports bloom filters on a per-column-per-stripe basis, which are statistics structures that can answer "this stripe definitely does not contain rows where column X equals Y" with high accuracy and low cost. Parquet has bloom filter support too, added later; ORC's was earlier and is more deeply integrated.

The schema follows a similar primitive/logical type split to Parquet's. ORC's nested-type story is STRUCT, LIST, MAP, and UNION at the schema level, encoded not with Parquet's Dremel-style repetition and definition levels but with the stream machinery already described: PRESENT bitmaps carry nullability and LENGTH streams carry repetition. The core idea — reconstructing nested rows from flat column streams — is the same; the wire mechanics differ.
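
The structure is easy to poke at from Python. A minimal sketch using pyarrow's ORC bindings — the Arrow C++ path discussed under ecosystem reality below — with an illustrative file name; it assumes a pyarrow build with ORC support:

import pyarrow as pa
import pyarrow.orc as orc

# Write a tiny table; one stripe is plenty for a single row.
table = pa.table({
    "id": pa.array([42], type=pa.int64()),
    "name": ["Ada Lovelace"],
    "birth_year": pa.array([1815], type=pa.int32()),
})
orc.write_table(table, "person.orc")

# The footer carries the schema and the stripe locations.
f = orc.ORCFile("person.orc")
print(f.schema)     # schema, read from the footer
print(f.nstripes)   # 1 for this tiny file

# Column pruning: only the name column's streams are read.
names = orc.read_table("person.orc", columns=["name"])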

Wire tour

A single Person record in an ORC file follows a structurally similar layout to Parquet:

[stripe 1]
  [index data: minimal (a single row)]
  [row data]
    [column 1 (id) streams: PRESENT (1 bit), DATA (8 bytes)]
    [column 2 (name) streams: PRESENT, LENGTH, DATA]
    [column 3 (email) streams: PRESENT, LENGTH, DATA]
    [column 4 (birth_year) streams: PRESENT, DATA (RLE v2)]
    [column 5 (tags) streams: PRESENT, LENGTH (list lengths), 
      child column STRING streams]
    [column 6 (active) streams: PRESENT, DATA]
  [stripe footer: protobuf-encoded stripe metadata]
[file footer: protobuf-encoded schema, stripe locations, statistics]
[postscript: compression codec, footer length, version]

The single-record file is approximately 1.5 to 2 KB, comparable to Parquet for the same payload. The fixed overhead — postscript, footer, stripe footer, per-stream headers — dominates the encoded size for tiny files, exactly as it does in Parquet, and amortizes to nothing for realistic batch sizes.

For a million Person records in a single ORC file with stripes of ~100,000 rows, the per-record cost averages around 25 bytes, similar to Parquet on the same workload. Where ORC distinguishes itself is in the I/O cost on certain access patterns: ORC's multi-stream per-column layout means that reading just the PRESENT bitmap (to count nulls without reading values) is a smaller I/O than Parquet's equivalent. For specific predicate-pushdown patterns where the access pattern reads one substream per column across all stripes, ORC can win modestly. Whether the modest win matters in practice depends on the workload.

Evolution and compatibility

ORC's evolution story is essentially identical to Parquet's. The schema is in the footer; files are written with a fixed schema; adding columns means writing new files; the reconciliation between files of different schemas is the table format's responsibility, not ORC's.

Within a file, the column types are fixed. Cross-file evolution proceeds by name (or by field ID, if the table format supports it). Type promotion rules are similar to Parquet's: int32 to int64 is safe, smaller-to-larger numeric promotions are safe, narrowing or sign changes are not.

The deterministic-encoding question for ORC is the same as for Parquet: the format is deterministic given fixed encoding strategy and compression codec choices, but the encoder has substantial freedom that is not constrained by the spec. Two ORC files with the same logical content can be byte-different.

The version differences within ORC are worth noting. ORC v0, written by Hive 0.11, was the original 2013 release; ORC v1, written by Hive 0.12 and later, brought RLE v2 and the broader encoding options and is the version virtually all production files use. Backward compatibility has been preserved: modern readers can read v0 files, but very old readers may not read newer files.

Ecosystem reality

ORC's ecosystem is concentrated in the Hadoop-Hive-Hortonworks lineage. Apache Hive uses ORC as its default columnar format. Apache Tez, Apache Pig, and other Hadoop ecosystem tools support ORC for read and write. Apache Spark supports ORC reads and writes; Spark's ORC support is high-quality and is maintained alongside its Parquet support, but the Spark community generally defaults to Parquet when given the choice.

The cloud data warehouses that support ORC do so as a secondary format. AWS Athena, Google BigQuery, and Snowflake can read ORC files, but their canonical external table formats are Parquet, and ORC is the format you reach for when you have legacy data that was already in ORC. Writing new ORC files is uncommon outside the Hortonworks-derived stacks.

The reference implementations are ORC Java (the canonical Hive implementation) and ORC C++ (used by C++/Python tools that need native ORC support). The Apache Arrow project includes ORC support in its C++ implementation, which is the path most modern Python and Rust tools take to read ORC files. The Java implementation is mature and feature-complete; the C++ implementation lags slightly on newer features.

The deployments where ORC retains strong adoption are: Hive metastores (where the metastore's table definitions specify ORC as the storage format), Hortonworks-derived data lakes that have not migrated to Parquet, several large enterprise environments where ORC was chosen in 2014 and the cost of migration has not yet been worth it. The trend has been migration toward Parquet, but the trend is slow, and ORC files in production will remain common for years to come.

The ecosystem gotchas worth noting. First, the timestamps issue: ORC has its own timestamp encoding history, distinct from Parquet's. Mixing ORC and Parquet timestamps in cross-format pipelines is a known source of bugs. Second, bloom filter configuration: ORC's bloom filters are powerful but disabled by default; enabling them requires explicit per-column configuration in the writer, and many ORC files in the wild lack them despite benefiting from them. Third, the nested-type story: ORC's nested-type encoding is structurally different from Parquet's, and converting between the two for nested data is more nuanced than for flat schemas.
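
On the bloom filter gotcha specifically: from PySpark, enabling them is a per-column writer option. A minimal sketch — the orc.bloom.filter.columns and orc.bloom.filter.fpp keys are the ORC writer's own configuration names, passed through by Spark's ORC data source; the output path is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(42, "Ada Lovelace")], ["id", "name"])

# Bloom filters are off by default and must be requested per column.
(df.write
    .format("orc")
    .option("orc.bloom.filter.columns", "name")  # columns to build filters for
    .option("orc.bloom.filter.fpp", "0.05")      # target false-positive rate
    .save("/tmp/person_orc"))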

When to reach for it

ORC is the right choice when interoperating with a Hive-centric stack: the metastore expects ORC, the existing data is in ORC, the query engines are tuned for ORC. It is a defensible choice when the bloom filter support is a performance differentiator for the workload and the Parquet bloom filter implementation in your chosen reader is weaker.

It is the right choice when you are extending an existing ORC deployment and the cost of switching to Parquet is not yet worth the modest gains.

When not to

ORC is the wrong choice for new analytical pipelines outside the Hive lineage. The ecosystem momentum is overwhelmingly with Parquet, and the technical advantages of ORC are too small to overcome the advantages of being in the format every modern tool prefers.

It is also the wrong choice when interoperating with non-JVM analytical stacks (Polars, DuckDB, modern Rust query engines all prefer Parquet) and when interfacing with cloud data warehouses where Parquet is the canonical external table format.

Position on the seven axes

ORC's stance on the seven axes is essentially identical to Parquet's: schema-required, self-describing (footer), columnar, parse with metadata-driven read avoidance, codegen for the Protobuf footer, non-deterministic in bytes, file-level evolution.

The single point of difference, perhaps, is the texture of the encodings. ORC's per-column multi-stream layout is more granular than Parquet's pages-per-column-chunk; this gives ORC a slight advantage on certain patterns and a slight disadvantage on others. The advantage is real but modest, and the modest size of the advantage is the substantive answer to why ORC has not replaced Parquet despite predating it in some sense and being designed by people with similar expertise.

Why the difference between ORC and Parquet stuck

It is worth spending a few hundred words on why two formats this similar produced such different ecosystem outcomes, because the answer is not that one format is technically superior. The answer is that ecosystem momentum compounded around small early advantages that ended up being decisive.

Parquet had three early advantages. First, Twitter and Cloudera together had broader reach than Hortonworks alone — Twitter's open source contributions had visibility outside the Hadoop world, and Cloudera's distribution was the more popular of the two Hadoop distributions in 2013-2014. Second, Parquet's documentation and specification were stronger early on, with cleaner writeups, more external blog posts, and more conference talks. Third, the C++ implementation of Parquet (parquet-cpp, later folded into arrow-cpp) was started earlier than ORC's, which mattered disproportionately because the dataframe library ecosystem (pandas, followed by everything that imitated pandas) needed C++ support to build Python bindings, and that dependency chain pulled the broader non-JVM ecosystem toward Parquet.

The technical merits diverged less than the ecosystem-level momentum. By 2017, Parquet's encoding repertoire had expanded to match ORC's (the v2 encodings closed most of the density gap), and Parquet's bloom filter support had been added (closing the predicate-pushdown gap). At that point the formats were roughly equivalent on the wire, but the ecosystem had already declared a winner: every new analytical tool added Parquet support as the priority, and ORC support — when it appeared at all — was secondary.

The Cloudera-Hortonworks merger in 2019 was the formal end of the race. The combined company supports both formats officially, but the default for new pipelines is Parquet, and the recommendation for new deployments is Parquet. Existing ORC files continue to be read and processed; new ORC files are written when the existing data is in ORC and the cost of migration exceeds the benefit of switching.

This is the canonical lesson about ecosystem momentum in technology choices. Two technically equivalent formats, designed in parallel by overlapping communities, can end up with wildly different adoption curves based on early decisions that compound. By the time the formats reach technical parity, the choice has been made.

A note on the Hive transactions story

One area where ORC retained a meaningful technical lead for several years is ACID transactions on data lake tables. Hive added transaction support around 2016 using ORC as the underlying file format and a delta-files scheme for concurrent updates: each transaction wrote a new ORC file containing inserts, updates, and deletes, and the Hive query engine reconciled them at read time. The mechanism worked but was specific to ORC and tied to Hive.

The table format projects (Iceberg, Delta Lake, Hudi) eventually generalized this idea: any file format underneath, ACID semantics in the table format layer above. Iceberg explicitly chose to be file-format-agnostic and supports both Parquet and ORC. Delta Lake chose Parquet exclusively. Hudi supports both. The generalization has made Hive's ORC-specific transaction story largely obsolete; the equivalent capability now exists for Parquet via Iceberg or Delta Lake.

This is mentioned because it was, for a period, a genuine reason to choose ORC over Parquet for certain workloads. That reason has been retired by the table-format layer. The cell ORC occupied — columnar storage with ACID semantics on top — is now occupied by Parquet plus Iceberg, with the ACID semantics handled at a layer above the file format and the ORC dependence eliminated.

Epitaph

ORC is Parquet's contemporaneous twin, slightly better on a few axes that didn't matter; preserved by the Hortonworks lineage, displaced everywhere else by ecosystem gravity.

Feather

Feather is a curious entry in this book because it has, in a real sense, been absorbed by Arrow IPC. Feather V2 — the version anyone should be writing in 2026 — is byte-identical to the Arrow IPC file format. Feather V1 — the version that existed from 2016 to 2019 — is a separate format with its own wire encoding and its own quirks. The Feather extension (.feather) lives on as a convention, not a distinct format. Reading the chapter is the right way to understand why a format can survive its own merger and what role a "fast interchange" file format played in the analytical-data ecosystem before Arrow consumed the role.

Origin

Feather was created in 2016 by Wes McKinney (pandas) and Hadley Wickham (the tidyverse, and at the time the dominant figure in the R data ecosystem) to solve a specific problem: pandas DataFrames and R data frames had no common file format that round-tripped quickly. The available options were CSV (slow, lossy on types), RDS (R-only), pickle (Python-only), and HDF5 (slow, complex, language-portable but operationally heavy). Researchers wanting to move a data frame from R to Python — a routine task in the collaborative academic and industrial workflows McKinney and Wickham worked in — had to choose between unsatisfactory options.

Feather was the answer. It was designed in a single weekend at the RStudio offices, with the explicit goal of being the fastest possible format for pandas-and-R interop. The wire format was columnar (because both pandas and R DataFrames are columnar), used FlatBuffers for metadata (because metadata speed mattered), and deliberately avoided compression and indexing (because the goal was speed, not size).

Feather V1 was successful enough that the researchers it was built for adopted it widely. It also turned out to be a useful prototype for the broader columnar-interop problem, and the lessons learned from Feather V1 fed directly into Arrow's design. By 2019, the Arrow project had matured to the point where its file format (Arrow IPC, covered in chapter 16) was a strict superset of what Feather V1 needed to be, and McKinney made the decision to unify: Feather V2 was specified as a stable, equivalent name for Arrow IPC's file format. Existing Feather V1 files continued to be readable; new Feather files would be Arrow IPC files with the .feather extension.

This is the main reason this chapter is short: Feather V2 is Arrow IPC, and Arrow IPC has its own chapter. The novel content about Feather is largely about V1, which exists in production mostly as legacy data that has not been re-saved.

Feather V1 on its own terms

Feather V1 is structurally similar to Arrow IPC but with several differences that matter at the wire level. The file structure is:

[magic: "FEA1" (4 bytes)]
[column data: each column's bytes, contiguous]
[file metadata: FlatBuffers-encoded schema and column locations]
[metadata length: 4 bytes]
[magic: "FEA1" (4 bytes)]

The columns are stored as their raw bytes — primitive types in their natural representation, strings as offsets-plus-data, with no compression. The metadata at the end of the file gives the schema and the byte offset of each column within the file. A reader opens the file, seeks to the end, reads the metadata, and then can load any column directly without parsing the others.

The differences from Arrow IPC are subtle. Feather V1's metadata schema is a custom FlatBuffers schema, not Arrow's; the schema fields have similar shapes but are not interchangeable. Feather V1 does not support nested types (no list, no struct, no map) — every column must be a primitive or a string. Feather V1 does not support extension types or logical types beyond the basic set. Feather V1 does not support multiple record batches per file; the whole file is a single batch.

These limitations were intentional. Feather V1's goal was speed, and the constraints made the implementation simple. The constraints also limited the format's applicability beyond pandas-and-R interop, which is why Arrow's superset of Feather V1 features eventually replaced it.

Wire tour

Feather V1 cannot hold our Person record faithfully: nested lists are not supported, so tags would have to be flattened to a separate row-per-tag table or encoded as a delimiter-separated string, and neither is satisfying. Setting tags aside, the file structure for a single record looks like:

[FEA1 magic (4 bytes)]
[id column: 8 bytes (int64 = 42)]
[name column: two 4-byte offsets (0, 12) + 12 bytes "Ada Lovelace"]
[email column: similar to name]
[birth_year column: 4 bytes (int32 = 1815)]
[active column: 1 byte (boolean)]
[file metadata: ~300-500 bytes of FlatBuffers schema]
[metadata length: 4 bytes]
[FEA1 magic (4 bytes)]

For the degenerate case of a single record, Feather V1 comes to around 500-700 bytes, most of it metadata. For a million records, it is essentially the raw columnar bytes plus a small constant metadata overhead, which makes Feather V1 dramatically faster than CSV, dramatically smaller than pickle, and competitive with Arrow IPC for the same payload (which is unsurprising because Arrow IPC is its successor).

The Person record's tags field has to be handled specially in Feather V1. The conventional workaround is to encode it as a JSON-formatted string inside a string column, with the consumer parsing the JSON to recover the list. This is one of the limitations that drove Feather V2's adoption.

Feather V2 / Arrow IPC equivalence

Feather V2 is a strict alias for the Arrow IPC file format. The files have the ARROW1 magic at start and end (not FEA1), the metadata is Arrow's standard schema and record batch metadata, the column layouts are Arrow's standard layouts. The only difference between an Arrow IPC file and a Feather V2 file is the file extension (.arrow versus .feather), and even that is conventional rather than required.

Tools that read Feather files in 2026 read them as Arrow IPC files. The pyarrow library exposes a read_feather function for historical reasons; under the hood it dispatches to the Arrow IPC reader. The R arrow package does the same. Feather V1 files are still supported as a legacy read path — the readers detect the FEA1 magic and dispatch to the V1 codec — but new files are always V2.

This means that the chapter on Arrow IPC is the chapter on Feather, structurally. Everything in chapter 16 about Arrow IPC's schema, columnar layout, record batches, and metadata applies unchanged to Feather V2.
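
The equivalence is observable directly in pyarrow. A minimal sketch — compression="uncompressed" is used here so the bytes on disk are a plain Arrow IPC file; the default LZ4-compressed output also round-trips through the IPC reader:

import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"id": [42], "name": ["Ada Lovelace"]})
feather.write_feather(table, "person.feather",
                      compression="uncompressed")

# The same file opens through the generic Arrow IPC reader: same
# ARROW1 magic, same metadata, same record batches.
with pa.OSFile("person.feather", "rb") as f:
    reader = pa.ipc.open_file(f)
    print(reader.read_all().equals(table))  # True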

Evolution and compatibility

Feather V1 has no formal evolution story. The format is fixed, the schema is in the metadata, and changing a column's type or adding a column means writing a new file. Cross-file evolution is the consumer's responsibility.

The transition from V1 to V2 is the format's only meaningful evolution event. Files written before 2019 are typically V1; files written after are V2. Tools that read both detect the magic number. There is no in-format mechanism for upgrading a V1 file to V2; the straightforward recipe is to read with the V1 reader and write with the V2 writer.

The deterministic-encoding question for Feather is the same as for Arrow IPC: the body bytes are deterministic given a value; the metadata is FlatBuffers-encoded and is not deterministic by default. Feather V1 is similarly non-deterministic.

Ecosystem reality

Feather's ecosystem is the pandas-and-R intersection, which is narrower than either community's ecosystem alone. The tools that read and write Feather files are pyarrow, the R arrow package, and a small number of derivative tools (Polars, DuckDB) that inherited Arrow IPC support and exposed it through a Feather-shaped API.

The deployments that use Feather in 2026 fall into two categories. First, short-term interchange: a Python pipeline produces a Feather file, an R pipeline consumes it, and the file is deleted within hours. This is the pattern McKinney and Wickham designed for, and it works exactly as advertised. Second, medium-term caching: an analytical pipeline computes an intermediate result, saves it to a Feather V2 file, and reuses the file across multiple downstream queries. This is also reasonable, and Feather's zero-compression approach makes the cache hits fast at the cost of disk space.

The ecosystem gotcha worth noting is that the Feather V1 reader is gradually being deprecated in some tools. As of 2026, pyarrow still supports V1 reads, but the V1 writer was removed several versions ago. Files written today will be V2; files read today may be either, and code that explicitly handles both is rare. Migration of legacy V1 files to V2 is straightforward and is the recommended path for any V1 data that needs to be preserved long-term.

When to reach for it

Feather V2 (which is Arrow IPC) is the right choice for short-term or medium-term columnar interchange between processes that share in-memory layouts: pandas-to-R, pandas-to-Polars, Spark-to-DuckDB, Python-to-Rust through Arrow.

It is the right choice for caching analytical intermediates where the next consumer will load the file into a dataframe library. The zero-copy load path is the principal benefit.

It is the right choice in any pandas or R workflow where Feather has historically been used; the format remains supported and is the recommended replacement for older alternatives like pickle.

When not to

Feather is the wrong choice for at-rest analytical storage; Parquet is purpose-built for that case and is dramatically smaller on disk. Feather files are not compressed by default and are correspondingly larger.

It is the wrong choice for inter-service RPC, log payloads, or any workload that is not fundamentally columnar. The format's design is specifically for dataframe-to-dataframe interchange.

It is also the wrong choice when the operational story requires distinguishing Feather and Arrow IPC files; they are the same format, and treating them as different is a source of confusion.

Position on the seven axes

Feather V2 inherits Arrow IPC's stance on every axis: schema-required, self-describing, columnar, zero-copy across compatible in-memory consumers, codegen for FlatBuffers metadata, body deterministic and metadata not, file-level evolution.

Feather V1 differs in two ways. The schema vocabulary is smaller (no nested types). The format is uncompressed by spec. Otherwise the position on the axes is the same.

A note on the broader "interchange formats that lost" picture

Feather V1's fate — absorbed by a successor format from the same ecosystem — is unusual but not unique. Several other "interchange formats" have followed similar trajectories, and the pattern is worth recognizing.

Pickle (Python) and RDS (R) were the language-specific competitors Feather was designed against. Both remain widely used within their own languages and remain hostile to cross-language use. Both will be used by Python and R programmers respectively indefinitely; neither will become a cross-language interchange format.

HDF5 was the cross-language alternative that predated Feather. It is more capable than Feather V1 ever was — supports arbitrary nesting, datasets of arbitrary dimension, attribute metadata, and self-describing schemas — and is widely used in scientific computing, but its complexity made it a poor fit for the interactive interchange use case Feather targeted. HDF5 remains the right choice for scientific datasets where the file format's expressive richness matters; for dataframe interchange, Arrow IPC has supplanted it.

MsgPack-NumPy and the various MsgPack-based formats for arrays existed but never gained the broader-than-MessagePack adoption they would have needed to compete. MessagePack itself is schemaless and not columnar; the layered formats on top tried to fill the gap, but the gap was filled instead by Arrow.

The pattern across these is that the cross-language interchange problem for tabular data has, after years of attempts, converged on Arrow as the answer. Feather's position in the convergence is that it was the prototype that proved the design space, then joined the bigger project that emerged from its lessons. Few formats meet such a graceful end.

A note on the lifespan of file formats

Feather offers a small lesson worth absorbing: a format can be good enough for the use case it targets and outgrown by a broader format from the same community, and the right outcome is for the targeted format to be subsumed rather than maintained as a parallel option. Feather V2's strict equivalence with Arrow IPC is the cleanest version of this outcome. Files do not change. Tools do not break. The vocabulary shifts (people say "Arrow file" where they used to say "Feather file"), and the artifacts merge gracefully into the larger ecosystem.

The opposite outcome — a small targeted format kept on life-support next to a broader successor — is more common in this book and is generally a sign of accumulated technical debt rather than a thoughtful design choice. Feather V1 lives on as read-only legacy support; Feather V2 is the format you write today; the path forward is unambiguous. This is what graceful format obsolescence looks like, and it is rare.

Epitaph

Feather is the file format that solved pandas-to-R interop in a weekend, then had its job absorbed by Arrow IPC, and lives on as the file extension you put on your Arrow files when the next consumer is more comfortable saying "Feather."

ASN.1 (BER/DER/PER)

ASN.1 is the format that runs the world without anyone noticing. The TLS handshake on the connection that loaded this page used ASN.1. The cellular signaling protocol that delivered the request to the server used ASN.1. The X.509 certificate that authenticated the server uses ASN.1. The LDAP directory queries that authorized the user used ASN.1. SNMP packets, Kerberos tickets, smart card protocols, MPEG audio metadata, and biometric standards all use ASN.1 in some encoding. The format predates almost every other format in this book, and it is, by a large margin, the most deployed binary serialization technology on Earth.

This chapter is harder to write than most because ASN.1 is not one format. It is a schema language with a family of encoding rules that produce dramatically different bytes for the same logical value. BER, DER, CER, PER, OER, XER, and JER are all encoding rules; the schema is one ASN.1 module that any of these encoders can consume. The right way to read ASN.1 is to understand the schema language first, then pick whichever encoding rule is relevant to your context.

Origin

ASN.1 (Abstract Syntax Notation One) was specified by the ITU-T (at the time CCITT) in 1988 as part of the OSI protocol stack. The motivation was that the various OSI layer specifications all needed a way to describe message structures, and the committee wanted a single notation that could be used across protocols rather than each protocol inventing its own. ASN.1 was the notation. The original encoding rule, BER (Basic Encoding Rules), was the wire format. Subsequent encoding rules were added over the following decades as the OSI vision faded but ASN.1 turned out to be useful for non-OSI work: DER (Distinguished Encoding Rules) was specified for cryptographic uses where deterministic encoding was required; PER (Packed Encoding Rules) was specified for bandwidth-constrained environments like cellular signaling; OER (Octet Encoding Rules) was specified more recently to combine PER's density with simpler encoder/decoder logic.

The ITU has continued to maintain the ASN.1 specifications. The current spec is from 2021, with updates issued periodically. The specifications themselves are dense, technical, and cover use cases that range from the canonical (X.509 certificates, signed with DER) to the obscure (PER-encoded layer 3 protocol messages in 5G NAS signaling). Reading the specs is a serious undertaking; fortunately, most users of ASN.1 do not need to read the specs, because the format is consumed through battle-tested compilers and runtime libraries that hide the details.

ASN.1's adoption inside cellular networks is essentially universal. 3GPP's LTE and 5G specifications define every wire-format message in ASN.1 with PER encoding rules. The signaling layers in these networks — RRC, NAS, S1AP, NGAP, and many more — exchange billions of ASN.1-PER-encoded messages per second across the world's mobile infrastructure. Inside cryptography, X.509 (the format of every TLS certificate), PKCS#7 (the format of signed and encrypted documents in S/MIME), and OCSP (online certificate status protocol) are all ASN.1 with DER encoding. Inside identity, Kerberos and LDAP both use ASN.1 with BER. Inside operations, SNMP uses ASN.1 with BER for every query and response. Inside multimedia, MPEG-7 metadata is ASN.1.

The deployment scale is, in literal terms, larger than any other format in this book. The fact that most engineers never knowingly touch ASN.1 is testament to the format's success at being infrastructure: it works, the libraries exist, and the specifications are stable enough that the wire format on a 2024 LTE signaling channel is interoperable with the wire format on a 1995 GSM signaling channel.

The format on its own terms

ASN.1 has two parts that need to be discussed separately: the schema language and the encoding rules.

The schema language is similar in role to Protobuf's .proto or Avro's JSON schemas: it describes the types and structures that will be encoded. ASN.1's type system is rich. Primitive types include INTEGER (arbitrary precision), BOOLEAN, REAL, BIT STRING, OCTET STRING, UTF8String, NumericString, PrintableString, OBJECT IDENTIFIER (a sequence of integers identifying a global named entity), and several others. Constructed types include SEQUENCE (an ordered fixed list of fields), SET (an unordered fixed set of fields), SEQUENCE OF and SET OF (variable-length lists), and CHOICE (tagged union). Optional fields are indicated by the keyword OPTIONAL; default values by DEFAULT. Extension points are indicated by ... and allow new fields to be added in backward-compatible ways.

A schema looks like this:

Person ::= SEQUENCE {
    id          INTEGER,
    name        UTF8String,
    email       UTF8String OPTIONAL,
    birthYear   INTEGER,
    tags        SEQUENCE OF UTF8String,
    active      BOOLEAN
}

The schema language allows constraints on values (e.g., INTEGER (0..99) for an integer in 0-99), constraints on string lengths, and constraints on the size of SEQUENCE OF. These constraints are not just documentation; PER and OER use the constraints to encode fields more compactly. A constrained INTEGER (0..99) is encoded in PER as 7 bits; the unconstrained INTEGER takes more.
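
A minimal sketch of the constraint effect, using the open-source asn1tools compiler mentioned later in this chapter (the Demo module name and Score type are illustrative):

import asn1tools

MODULE = """
Demo DEFINITIONS ::= BEGIN
    Score ::= INTEGER (0..99)
END
"""

# Unaligned PER packs the constrained integer into 7 bits (range 100).
uper = asn1tools.compile_string(MODULE, "uper")
print(len(uper.encode("Score", 42)))   # 1 byte: 7 bits plus padding

# BER spends a tag byte and a length byte on the same value.
ber = asn1tools.compile_string(MODULE, "ber")
print(len(ber.encode("Score", 42)))    # 3 bytes: tag, length, value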

The encoding rules take a schema and a value and produce bytes. Different encoding rules produce different bytes:

BER (Basic Encoding Rules) is TLV: every value is encoded as a Tag, a Length, and a Value. Tags are bytes that include a class (universal, application, context-specific, or private), a form bit (primitive or constructed), and a tag number. Lengths are encoded as a single byte for short values (0-127) or a length-of-length followed by length bytes for longer values. Values are encoded according to their type: INTEGERs as two's-complement big-endian bytes, OCTET STRINGs as raw bytes, SEQUENCEs as the concatenation of their components.

DER (Distinguished Encoding Rules) is a strict subset of BER: the same tags, lengths, and values, but with additional constraints that produce a unique byte representation for each value. DER mandates the smallest possible length encoding, the shortest INTEGER representation (no leading zeros except for sign), the canonical TRUE encoding (FF, all bits set), and a specific ordering for SET fields. DER bytes are deterministic; the same value always produces the same bytes.

PER (Packed Encoding Rules) is fundamentally different. PER does not encode tags or lengths in the bytes for fixed-position fields; instead, the encoder and decoder both know the schema and use it to determine where each field is. Optional fields are indicated by a single bit at the start of a SEQUENCE (a "preamble") that says which optional fields are present. Integers are encoded in the minimum number of bits required by their constraint range. PER produces dramatically smaller bytes than BER for the same schema, at the cost of being uninterpretable without the schema.

OER (Octet Encoding Rules) is a newer encoding designed to combine PER's density with simpler implementation. OER aligns to octet boundaries, uses fixed-width integers where the schema's constraints permit, and skips the preamble bit-packing optimization that PER uses. OER is denser than BER and slower than PER but simpler than both; it has been adopted in some 5G specifications and in some LWM2M deployments.

XER (XML Encoding Rules) and JER (JSON Encoding Rules) encode ASN.1 values as XML and JSON respectively. They are useful for debugging and for interop with non-ASN.1 systems; they are not what you would choose for production wire bytes.
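
Because the schema is one artifact and the encoding rules are another, switching rules is a one-argument change in most toolchains. A minimal sketch with asn1tools over the Person schema defined above (byte counts should land close to the wire tours below, modulo codec details):

import asn1tools

PERSON = """
Demo DEFINITIONS ::= BEGIN
    Person ::= SEQUENCE {
        id          INTEGER,
        name        UTF8String,
        email       UTF8String OPTIONAL,
        birthYear   INTEGER,
        tags        SEQUENCE OF UTF8String,
        active      BOOLEAN
    }
END
"""

value = {
    "id": 42, "name": "Ada Lovelace",
    "email": "ada@analytical.engine", "birthYear": 1815,
    "tags": ["mathematician", "programmer"], "active": True,
}

# Same module, same value; only the encoding rules differ.
for codec in ("der", "uper"):
    spec = asn1tools.compile_string(PERSON, codec)
    encoded = spec.encode("Person", value)
    assert spec.decode("Person", encoded) == value   # round-trips
    print(codec, len(encoded), "bytes")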

Wire tour

Encoding our Person record. The schema:

Person ::= SEQUENCE {
    id          INTEGER,
    name        UTF8String,
    email       UTF8String OPTIONAL,
    birthYear   INTEGER,
    tags        SEQUENCE OF UTF8String,
    active      BOOLEAN
}

DER encoding:

30 4c                                        SEQUENCE, length 76
   02 01 2a                                  INTEGER, length 1, value 42
   0c 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65 UTF8String, length 12, "Ada Lovelace"
   0c 15 61 64 61 40 61 6e 61 6c 79 74 69 63
        61 6c 2e 65 6e 67 69 6e 65           UTF8String, length 21, "ada@analytical.engine"
   02 02 07 17                               INTEGER, length 2, value 1815
   30 1b                                     SEQUENCE OF, length 27
      0c 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   UTF8String, length 13, "mathematician"
      0c 0a 70 72 6f 67 72 61 6d 6d 65 72            UTF8String, length 10, "programmer"
   01 01 ff                                  BOOLEAN, length 1, TRUE (0xff per DER)

78 bytes total. Slightly larger than Protobuf for this payload, slightly smaller than the schemaless self-describing formats. The bytes are deterministic by spec: the same value always produces exactly these bytes, in this order, with these lengths. This is the property that makes DER the wire format for X.509 certificates and other signed documents — the bytes can be hashed and signed, and the signature verifies against the canonical encoding from any producer.

A few details worth noting in the DER tour. The INTEGER for id takes one byte (0x2a) because 42 fits in a single signed byte. The INTEGER for birthYear takes two bytes (0x07 0x17) because 1815 doesn't fit in one signed byte (the high bit must be 0 for a positive number). The BOOLEAN encodes TRUE as 0xff (all bits set); BER permits any non-zero byte to mean TRUE, but DER requires 0xff specifically. The SEQUENCE OF uses tag 0x30 (SEQUENCE in BER terminology — ASN.1 conflates the encoding of SEQUENCE and SEQUENCE OF, distinguishing them only at the schema level).
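
The TLV discipline makes DER bytes explorable by hand. A minimal sketch of a walker for the dump above — short-form lengths only, which covers this example; real DER also uses a long form for lengths over 127:

def walk_tlv(buf, pos=0, end=None, depth=0):
    # Walk a run of TLVs, recursing into constructed values.
    end = len(buf) if end is None else end
    while pos < end:
        tag, length = buf[pos], buf[pos + 1]   # short-form length byte
        constructed = bool(tag & 0x20)         # bit 6 of the tag byte
        print("  " * depth + f"tag=0x{tag:02x} len={length}")
        if constructed:
            walk_tlv(buf, pos + 2, pos + 2 + length, depth + 1)
        pos += 2 + length

Fed the 78 bytes above, it prints the outer SEQUENCE, its six children in order, and the two strings nested inside the SEQUENCE OF.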

PER encoding (unaligned PER, assuming the schema declares no constraints) for the same record. Two things to know up front: the preamble is a bit map of which OPTIONAL fields are present, and an unconstrained INTEGER is encoded length-prefixed — 42 becomes a length byte (01) followed by the value (2a), with no type tag. The full encoding is roughly:

80                                           preamble (1 bit): email present
01 2a                                        id: length 1, value 42
0c 41 64 61 20 4c 6f 76 65 6c 61 63 65       name: length 12, "Ada Lovelace"
15 61 64 61 40 61 6e 61 6c 79 74 69 63 61
   6c 2e 65 6e 67 69 6e 65                   email: length 21, "ada@..."
02 07 17                                     birthYear: length 2, value 1815
02                                           tags count: 2
0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   tag 1: length 13, "mathematician"
0a 70 72 6f 67 72 61 6d 6d 65 72             tag 2: length 10, "programmer"
01                                           active: 1 bit, value true,
                                              padded to a byte boundary

Roughly 68 bytes as sketched — modestly smaller than DER, primarily because PER does not emit a type tag for each field. With aligned PER and explicit constraints in the schema (e.g., birthYear INTEGER (1800..2200)), the savings increase: a constrained integer in that range encodes in 9 bits, not 16, and the boolean encodes in a single bit packed into the preamble. For schemas with many constrained fields, PER routinely produces encodings 30-50% smaller than DER.

If email were absent, the DER encoding would skip the email TLV (24 bytes) and the SEQUENCE length would shrink accordingly. The PER encoding would change the preamble bit to 0 and skip the email value, saving 22 bytes plus the preamble shift.

Evolution and compatibility

ASN.1's evolution mechanism is the extension marker (...), which appears in a SEQUENCE definition to indicate that future versions may add fields after the marker. Old decoders that encounter an extended message skip the post-marker fields they don't recognize; new encoders include them. The effect is analogous to Protobuf's tag-based skipping but is built into the schema language explicitly.

A schema with an extension marker:

Person ::= SEQUENCE {
    id          INTEGER,
    name        UTF8String,
    email       UTF8String OPTIONAL,
    birthYear   INTEGER,
    tags        SEQUENCE OF UTF8String,
    active      BOOLEAN,
    ...
}

A future version that adds a country field after the marker remains compatible with old decoders; old decoders see the extension and skip it. New decoders see the new field and process it.

The schema language also supports version brackets — explicit groupings of fields by schema version — which give finer control over how extensions are handled. Telecom protocols use these extensively; cryptographic protocols generally do not.

The deterministic-encoding question for ASN.1 is settled by the choice of encoding rules. DER is fully deterministic. CER is also deterministic but with different rules suited to streaming. PER is deterministic given a schema and a value, though aligned PER's padding rules add some bytes that vary based on alignment. BER is non-deterministic — multiple valid encodings of the same value exist — and BER bytes should not be hashed.

Ecosystem reality

ASN.1's ecosystem is split between specialized tooling and the implementations buried inside the major standards. The specialized tooling is professional-grade, often commercial: OSS Nokalva, Marben, and a few others sell ASN.1 compilers that handle the full standard with all extensions. Open-source implementations exist (asn1c is the canonical C compiler, asn1tools is the Python implementation, the Bouncy Castle Java library includes ASN.1 support for cryptographic uses) and are mature for their respective scopes.

Inside the major standards, ASN.1 is consumed without ceremony. Every TLS implementation has a DER decoder for X.509 certificates; the decoder is part of OpenSSL, BoringSSL, and every other TLS library, and it just works. Every Kerberos implementation has a BER decoder for tickets. Every LDAP server has a BER decoder for queries. Every cellular base station has a PER encoder/decoder for the relevant 3GPP messages.

The ecosystem gotcha that bites engineers when they first encounter ASN.1 is that the format depends on the encoding rules, and the bytes for a single value can differ dramatically between BER, DER, and PER. Code that produces DER and code that expects BER will disagree, even though both are processing the same schema. Reading an ASN.1 deployment requires identifying which encoding rule is in use, which is usually obvious from context (X.509 is DER, LDAP is BER, 3GPP is PER) but is sometimes documented obscurely.

A second gotcha is implicit vs. explicit tagging. ASN.1 schemas can declare tags as IMPLICIT or EXPLICIT, and the choice affects the wire bytes. This is one of the historically confusing parts of ASN.1 and is the source of many implementation incompatibilities between hand-rolled parsers and standards-conformant libraries. Use a real ASN.1 compiler; do not write your own parser unless you are interoperating with a known small subset of features.

When to reach for it

ASN.1 is rarely the right choice for new general-purpose protocols. It is the right choice when interoperating with an existing ASN.1-using ecosystem: you are implementing X.509 extensions, you are writing a cellular base station, you are extending an LDAP schema, you are working with MPEG-7 metadata.

It is a defensible choice when the requirements include extreme schema stability, multi-decade specifications, and the operational infrastructure of standards bodies. The OSI legacy continues to matter in domains where international standardization is the governance model: telecommunications, aviation, biometric identification, government identity systems.

It is the right choice for cryptographic protocols specifically, because DER's deterministic encoding is essential and the legacy of X.509-and-friends means every implementation can read it.

When not to

ASN.1 is the wrong choice for general-purpose typed binary serialization in 2026. The schema language's expressive power exceeds what most applications need; the encoding rules add operational complexity; the tooling is more expensive than the modern alternatives; and the engineering culture around the format expects more rigor than most teams have appetite for.

It is also the wrong choice when human readability or hex-dump debugging matter; the bytes are not meant to be read directly, and the encoder/decoder is the only way in.

Position on the seven axes

ASN.1's stance varies by encoding rule. The schema is always mandatory. DER and PER are not self-describing; the bytes alone are uninterpretable. BER is partially self-describing because the TLV structure includes type tags that can be parsed without the schema for some types, though full interpretation still requires the schema. The format is row-oriented. Parse rather than zero-copy. Codegen-first via mature compilers. DER is fully deterministic; PER is mostly deterministic; BER is not. Evolution via extension markers, with explicit per-version schema brackets when needed.

The cell ASN.1 occupies — schema-rich, encoding-rule-pluggable, extension-marker-evolved — is the deepest expression of "schema language and wire format are separate concerns" in this book. It is also the format with the longest production track record by several decades.

Epitaph

ASN.1 is the format that runs the cellular network and the public key infrastructure; old, capacious, encoding-rule-pluggable, and the strongest argument that schema languages and wire formats are separate concerns.

XDR

XDR is the format that runs the Network File System and, less expectedly, the Stellar blockchain network. It is one of the oldest binary serialization standards still in active use, predating Protobuf by a decade and a half, and its design choices — fixed-width big-endian fields with 4-byte alignment, positional encoding without field tags, no built-in schema evolution mechanism — feel austere by modern standards. The austerity is deliberate: XDR was built for a world where the cost of CPU cycles to encode and decode was the bottleneck, not the cost of bytes on the wire. That world is mostly gone, and yet XDR persists in the places where it was good enough.

Origin

XDR was specified by Sun Microsystems in RFC 1014 in 1987, revised as RFC 4506 in 2006, and has been stable since. The motivation was Sun's own RPC system (later called ONC RPC), which needed a wire format for the arguments and return values of remote procedure calls. The constraints were straightforward: fast to encode and decode on the workstation hardware of the late 1980s (SPARC, MC68000, VAX), portable across the architectures Sun and its customers used, and simple enough that engineers could implement it correctly.

XDR fell out of those constraints with surprisingly little friction. Every value is fixed-width, big-endian, and aligned to a 4-byte boundary. The schema language describes types in C-like syntax. The encoder is a few hundred lines of C. The decoder similarly. The wire format is exactly what you would get if you wrote out C struct fields one at a time with explicit padding — which, in fact, is approximately what most XDR encoders do.

ONC RPC and XDR became the foundation of NFS, Sun's network file system, and through NFS, XDR achieved a deployment scale that was substantial throughout the 1990s and remains substantial today. NFSv3 and NFSv4 both use XDR for their wire formats; every NFS server and client in the world parses XDR. The format also became the basis for several other Sun-era protocols (NIS, the Network Information Service; rwall; rusers; a half-dozen others) that have mostly faded.

The unexpected modern revival of XDR came from the Stellar network and its smart-contract platform Soroban. The Stellar team chose XDR for their wire format in 2014, on the reasoning that it was simple, deterministic, well-specified, and had mature implementations in the languages they cared about. The choice was contrarian — the rest of the blockchain world was gravitating toward Protobuf or custom binary formats — and it has held up. Stellar's transactions, ledger entries, and contract calls are all XDR-encoded.

The format on its own terms

XDR's data model has primitive types and constructed types, and the rules for each are short.

The primitive types: integer (32-bit signed, big-endian), unsigned integer (32-bit unsigned, big-endian), hyper (64-bit signed), unsigned hyper (64-bit unsigned), float (IEEE 754 single, big-endian), double (IEEE 754 double, big-endian), quadruple (IEEE 754 quad, used rarely), and boolean (encoded as a 32-bit integer with value 0 or 1). Every primitive is fixed-width; there is no varint encoding, no zigzag, no compact form. A 32-bit integer always takes 4 bytes; a 64-bit integer always takes 8.

The constructed types: enumerations (encoded as their underlying integer); structures (encoded as the concatenation of their fields); fixed-length arrays (encoded as the concatenation of their elements); variable-length arrays (encoded as a 4-byte length followed by the elements, padded to a 4-byte boundary at the end); fixed-length opaque (raw bytes, padded to a 4-byte boundary); variable-length opaque (4-byte length plus bytes plus padding); strings (variable-length opaque, with UTF-8 by convention but technically uninterpreted); optional data (a 4-byte boolean indicating presence, followed by the value if present); discriminated unions (a tag value followed by the variant's encoding).

Every variable-length value pays for alignment to a 4-byte boundary. A string of 5 bytes encodes as 4 bytes of length, 5 bytes of content, and 3 bytes of zero padding — 12 bytes for a 5-byte string. Strings of length 4n encode in 4n+4 bytes (length plus content with no padding). The padding is bandwidth that is spent on alignment, but the alignment makes encoding and decoding trivial: every field starts at a known offset, and the decoder advances by exactly the field's encoded size.
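
The length-plus-padding rule fits in a few lines. A minimal sketch (hypothetical helper names; production code uses rpcgen output or a library):

import struct

def xdr_opaque_var(data: bytes) -> bytes:
    # Variable-length opaque: 4-byte big-endian length, content,
    # zero padding to the next 4-byte boundary.
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def xdr_string(s: str) -> bytes:
    # Strings are variable-length opaque, UTF-8 by convention.
    return xdr_opaque_var(s.encode("utf-8"))

assert len(xdr_string("hello")) == 12          # 4 + 5 + 3 padding
assert len(xdr_string("Ada Lovelace")) == 16   # 4 + 12, no padding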

There is no schema-in-bytes representation. The schema is the XDR file (a text artifact, similar to a Protobuf .proto), and both ends are expected to have it. The wire bytes are uninterpretable without the schema.

Wire tour

Schema:

typedef string str<>;

struct Person {
    unsigned hyper id;
    string name<>;
    str *email;
    int birth_year;
    str tags<>;
    bool active;
};

The string<> notation declares a variable-length string. XDR's declaration grammar does not nest, so the optional string and the array of strings both go through the str typedef: str *email declares an optional string, and str tags<> declares a variable-length array of variable-length strings. Encoded:

00 00 00 00 00 00 00 2a                     id: 8 bytes (BE u64) = 42
00 00 00 0c                                  name length: 12
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace" (no padding needed, 12 % 4 == 0)
00 00 00 01                                  email present (boolean true)
00 00 00 15                                  email length: 21
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65 00 00 00      "ada@analytical.engine" + 3 bytes padding
00 00 07 17                                  birth_year: 4 bytes (BE i32) = 1815
00 00 00 02                                  tags count: 2
00 00 00 0d                                  tags[0] length: 13
6d 61 74 68 65 6d 61 74 69 63 69 61 6e
   00 00 00                                  "mathematician" + 3 bytes padding
00 00 00 0a                                  tags[1] length: 10
70 72 6f 67 72 61 6d 6d 65 72 00 00         "programmer" + 2 bytes padding
00 00 00 01                                  active: 4 bytes (BE u32) = 1 (true)

104 bytes. Larger than every modern variable-length encoding — about 50% larger than Protobuf, 60% larger than Avro — and the overhead is essentially all alignment padding plus fixed-width integers. The id field alone takes 8 bytes; in Protobuf it took 2; in Avro it took 1. The active boolean takes 4 bytes; in MessagePack it took 1.

What XDR buys with those extra bytes is read latency. Every field is at a known offset within the buffer once the variable-length fields before it have been walked. There is no varint decoding and no length-prefix scanning for fixed-width fields. The big-endian byte order was free on Sun's own big-endian SPARC and Motorola hardware; the byte-swap cost fell on little-endian machines like the VAX and, later, x86, and modern x86 and ARM have native byte-swap instructions that make the cost negligible today. The decoder is trivial. The encoder is trivial. The format is trivial. Trivial is the design.

If email were absent, the encoding would be:

... (id, name as before)
00 00 00 00                                  email present: false
... (birth_year, tags, active)

The optional field is a 4-byte presence flag plus, if present, the value. When email is absent, 28 bytes disappear: the 4 bytes of length plus 21 bytes of UTF-8 plus 3 bytes of padding. The presence flag itself remains.

Evolution and compatibility

XDR has no formal evolution mechanism. The schema is fixed; the wire bytes are fixed; changes to the schema break compatibility with old bytes unless the change is "additive at the end of a struct" — and even then, the format does not require old decoders to skip unknown trailing bytes, so the convention of adding-at-the-end is enforced by social agreement, not by the spec.

The conventional approaches to XDR schema evolution are:

  • Add fields by versioning the entire message. Define a new type, and have the producer emit either the old or the new type based on what the consumer expects. The consumer dispatches on some discriminator (often part of the protocol envelope, not XDR itself).
  • Use discriminated unions. Define a union type whose variants represent different versions of the message. New versions add new variants; old consumers ignore variants they don't recognize (after some discriminator parsing).
  • Tolerate trailing bytes. Some consumers are written to read exactly the fields they expect and to ignore anything after; new versions can append fields and continue to work. This is fragile and is not part of the spec.

In practice, XDR-using systems handle evolution at the protocol layer, not at the format layer. NFSv3 and NFSv4 are different protocols with different XDR schemas; they do not share wire format compatibility, and both ends of a connection negotiate which protocol to use. The Stellar network does similar versioning: the protocol version is part of the envelope, and major changes mean a new protocol version with a new XDR schema.

The deterministic-encoding question for XDR is unambiguous: the format is fully deterministic. Given a schema and a value, there is exactly one byte sequence that encodes it. The fixed-width representation, the absence of length-encoding choices, and the mandated padding all produce a unique encoding. XDR bytes are hashable and signable without canonicalization.

Ecosystem reality

XDR's ecosystem is bimodal: extremely mature in the Sun-RPC and NFS lineage, freshly active in the Stellar lineage, and almost nonexistent elsewhere.

The Sun-RPC tooling includes rpcgen, the canonical XDR-to-C compiler, which has been distributed with every Unix-like operating system since the late 1980s. ONC RPC implementations exist for every language that has been used for systems programming in the last 35 years. NFS implementations on every operating system include XDR. The format is wire-stable across decades; an NFS client from 2005 talks to an NFS server from 2025 (assuming both agree on protocol version), with the XDR encoding being the unproblematic layer.

The Stellar tooling is more modern: TypeScript, Rust, and Go implementations of XDR maintained by the Stellar Foundation, with codegen tools that produce idiomatic types in each language. The Stellar XDR schema is a substantial document — hundreds of types defining the network's data model — and the codegen pipelines are heavy enough to be a routine concern for Stellar developers. The choice of XDR has held up well; Stellar's transactions are deterministic, the wire format is stable across protocol versions, and the implementations are interoperable.

Outside these two communities, XDR is rare. A few legacy enterprise systems still use it. Some scientific computing systems use it for data exchange. None of these communities are growing.

The most consequential ecosystem gotcha is the gap between XDR and modern protocol design. New protocols built today rarely benefit from XDR's strengths and are penalized by its weaknesses. The lack of schema evolution, the wire-size overhead, and the lack of modern tooling like Buf's breaking-change detection make XDR a high-friction choice. The Stellar team's decision to use it was made before some of the modern alternatives existed and has been preserved by inertia and by the value of stability in a financial network.

When to reach for it

XDR is the right choice when interoperating with NFS, Sun RPC, or the Stellar/Soroban ecosystem. It is the right choice for deterministic encoding requirements where the alternative would be a custom hand-rolled format and where a mature spec is preferable.

It is a defensible choice when extreme stability across decades is the binding constraint and the schema-evolution constraints are manageable.

When not to

XDR is the wrong choice for new general-purpose typed binary serialization. Protobuf, Avro, and CBOR all do the job with better wire density, better evolution stories, and better tooling.

It is the wrong choice when bytes-on-the-wire matter; XDR's fixed-width fields and 4-byte alignment cost real bandwidth on realistic payloads.

Position on the seven axes

Schema-required. Not self-describing. Row-oriented. Parse rather than zero-copy, although the fixed-width discipline makes parsing nearly trivial. Codegen-first via rpcgen and similar tools. Fully deterministic by spec. Evolution by social convention; no in-format mechanism.

XDR's stance on the axes is the simplest of any format in this book: the spec is short, the choices are uniform, and the trade-offs are transparent. The price of the simplicity is that modern formats have improved on every axis where XDR is paying a cost — except determinism, where XDR remains as good as any.

A note on Sun RPC's adjacent influence

It is worth a brief detour into Sun RPC, because the format inherits much of XDR's character and because Sun RPC is the ancestor of every RPC framework that came after.

Sun RPC was designed alongside XDR — they are sibling specifications, both from 1987 — and defined a complete RPC protocol stack: program numbers, procedure numbers, version numbers, authentication flavors, and a wire format that wrapped XDR with framing and metadata. The protocol was simple and effective. Every Unix operating system shipped a portmap daemon to track which RPC services were running on which ports, and rpcinfo was a routine tool for service discovery.

The lineage of modern RPC frameworks runs through Sun RPC. CORBA (designed in the early 1990s) was a more elaborate response to the same problem, with object references and type system features that Sun RPC lacked. DCE RPC, Java RMI, and eventually gRPC and Thrift all owe direct architectural debts to Sun RPC. The fact that gRPC's designers studied Sun RPC carefully is not an academic detail; the choice to make program identifiers stable across versions, to support multiple authentication flavors, and to keep the wire framing simple all trace back through XDR to Sun's original design.

The reason Sun RPC didn't itself become the dominant RPC framework is mostly the security model: Sun's authentication flavors were rudimentary, and the protocol shipped with AUTH_NONE as a viable option. NFSv3, in particular, had famously weak authentication, which mostly worked because the deployments were behind LAN firewalls. As networks became more open, the insufficiency of Sun RPC's security model became disqualifying for new deployments, and the protocols that succeeded it (CORBA's IIOP, gRPC over HTTP/2 with TLS) all centered authentication and transport security in ways Sun RPC did not.

XDR survives Sun RPC's decline because it can be used outside the RPC framework. XDR-as-a-format is a viable independent choice, even when XDR-as-part-of-an-RPC-protocol has been displaced. The Stellar example is one such use; the NFSv4 example, where XDR is still in production but as part of a more modern protocol stack, is another.

Epitaph

XDR is the format that ran NFS and now runs Stellar; austere, fully deterministic, and exactly as evolved as it needs to be — which is to say, not very.

Borsh and SCALE

Borsh and SCALE are the formats blockchain protocols pick when they need a deterministic binary serialization that produces the same bytes for the same value across implementations and across versions of the runtime. Both come from the second-generation blockchain ecosystem, both emerged in roughly the same time frame (2018-2020), and both took the same general approach: fixed-width little-endian encoding, no schema-in-bytes, deterministic by construction. They differ in their length-encoding strategies and in their schema languages, but the cell of the design space they occupy is the same. Reading them together is the right way to understand both, because the differences highlight choices each format made about the same questions.

Origin

Borsh — the name is a backronym for Binary Object Representation Serializer for Hashing — was created by the NEAR Protocol team in 2019 as the canonical serialization format for the NEAR blockchain. NEAR's smart contract platform needed a serialization that produced bytes equal across producers (so that contract authors could hash state and signatures could be verified across nodes) and that worked from Rust, since contracts on NEAR compile to WebAssembly from Rust. Borsh was the answer: a serde-compatible Rust library plus a small spec defining the wire format. The spec is short, the implementation is small, and the design choices are conservative.

SCALE — Simple Concatenated Aggregate Little-Endian Encoding — was created by the Parity Technologies team for the Substrate blockchain framework, which underlies Polkadot, Kusama, and a broader ecosystem of Substrate-based chains. SCALE's motivation was the same as Borsh's: a deterministic serialization for the runtime, the wire format, and the on-chain state. Substrate is also a Rust framework, which means SCALE is also a serde-adjacent Rust library; the spec is platform-independent, but the canonical implementation lives in the parity-scale-codec crate.

The two formats overlap in their intended use case to a degree that, in retrospect, suggests one of them might not exist if the two communities had been more aware of each other early on. Borsh's design was finalized roughly contemporaneously with SCALE's, and the choices each made — small but real differences in how lengths are encoded, how options are tagged, how strings are handled — diverge enough that the formats are not interchangeable. Both are now well-established within their respective ecosystems, and the duplication is permanent.

The format on its own terms

Borsh's encoding rules are straightforward. Primitive integers are encoded as their natural width in little-endian byte order: u8 is one byte, i32 is four bytes little-endian, u64 is eight bytes little-endian. Booleans are one byte (0 or 1). Floating point uses IEEE 754 little-endian. Arrays of statically known size encode as the concatenation of their elements with no length prefix. Variable-length collections (Vec<T>, String, HashMap) encode as a 4-byte little-endian length prefix followed by their elements. Option<T> encodes as a one-byte discriminant (0 for None, 1 for Some) followed by the value if Some. Result<T, E> encodes as a discriminant byte (0 for Ok, 1 for Err) followed by the variant. Structs encode as the concatenation of their fields in declaration order. Enums encode as a one-byte discriminant followed by the variant's payload.
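A quick probe of those rules, assuming the borsh crate's 1.x API (borsh::to_vec, with the derive feature enabled); the asserted bytes follow mechanically from the rules above.

use borsh::BorshSerialize;

#[derive(BorshSerialize)]
struct Probe {
    maybe: Option<u32>, // 1-byte discriminant, then the value if Some
    items: Vec<u8>,     // 4-byte LE length prefix, then the elements
}

fn main() {
    let probe = Probe { maybe: Some(7), items: vec![1, 2] };
    let bytes = borsh::to_vec(&probe).unwrap();
    assert_eq!(bytes, [1, 7, 0, 0, 0, 2, 0, 0, 0, 1, 2]);
}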

The schema is the Rust source code; the format relies on serde-derive-style macros to produce serializers and deserializers from struct and enum definitions. There is no separate IDL. Cross-language Borsh implementations exist for JavaScript, TypeScript, Python, Go, and a few others, but the canonical schema is the Rust struct.

SCALE's encoding rules are similar with one substantial difference: lengths use compact encoding, a variable-width integer encoding designed to minimize bytes for small values. A compact integer's low two bits indicate the encoding mode: mode 0 (low bits 00) is a 6-bit value packed into the remaining bits of a single byte (values 0-63); mode 1 (01) is a 14-bit value in two bytes (values 64-16383); mode 2 (10) is a 30-bit value in four bytes (values 16384-(2^30 - 1)); mode 3 (11) is a big-int mode where the next byte gives the byte count of a little-endian integer (values larger than 2^30). The compact encoding is used for the lengths of Vec<T> and String, for the count of elements in collections, and in a few other places where the spec calls for it.
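The mode selection is mechanical enough to sketch. This is an illustrative hand-rolled encoder for values below 2^30, not the parity-scale-codec implementation:

fn compact_encode(n: u32, out: &mut Vec<u8>) {
    match n {
        0..=63 => out.push((n as u8) << 2), // mode 0: six bits, one byte
        64..=16_383 => {
            let v = ((n << 2) | 0b01) as u16; // mode 1: two bytes
            out.extend_from_slice(&v.to_le_bytes());
        }
        16_384..=1_073_741_823 => {
            let v = (n << 2) | 0b10; // mode 2: four bytes
            out.extend_from_slice(&v.to_le_bytes());
        }
        _ => unimplemented!("mode 3 (big-integer) elided"),
    }
}

fn main() {
    let mut buf = Vec::new();
    compact_encode(12, &mut buf);  // 12 << 2 = 0x30: the name length in the tour below
    compact_encode(100, &mut buf); // (100 << 2) | 1 = 0x0191, little-endian
    assert_eq!(buf, [0x30, 0x91, 0x01]);
}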

SCALE also differs from Borsh in Option encoding for booleans: Option<bool> is encoded as a single byte with three possible values (0 = None, 1 = Some(false), 2 = Some(true)), where Borsh would use two bytes (one for the discriminant, one for the bool). This optimization saves a byte for the common option-of-bool case.
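The rule fits in a few lines of code. Strict rejection of the out-of-range byte on decode is this sketch's assumption about how a conforming decoder behaves:

fn encode_option_bool(v: Option<bool>) -> u8 {
    match v {
        None => 0x00,
        Some(false) => 0x01,
        Some(true) => 0x02,
    }
}

// Outer None signals an invalid byte; the inner value is the decoded Option<bool>.
fn decode_option_bool(b: u8) -> Option<Option<bool>> {
    match b {
        0x00 => Some(None),
        0x01 => Some(Some(false)),
        0x02 => Some(Some(true)),
        _ => None, // not a valid Option<bool> byte
    }
}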

Both formats share the lack of a schema in the bytes. The producer and consumer must agree on the schema out of band; the format provides no metadata for type recovery. Both formats are strict about their declared types: a u32 is always exactly four bytes, with no varint compression of small values. This rigidity is what produces the determinism: there are no encoding choices, no width selection, no padding options.

The schema language for Borsh is documented at borsh.io and is expressed via Rust's type system. The schema language for SCALE is documented in the Substrate documentation and uses a slightly different vocabulary — a "compact" field type, "Vec" and "BoundedVec", "BTreeMap", and so on — but the underlying correspondence is the same.

Wire tour

Borsh schema (Rust):

#[derive(BorshSerialize, BorshDeserialize)]
struct Person {
    id: u64,
    name: String,
    email: Option<String>,
    birth_year: i32,
    tags: Vec<String>,
    active: bool,
}

Encoded:

2a 00 00 00 00 00 00 00                     id: u64 LE = 42
0c 00 00 00                                  name length: u32 LE = 12
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace"
01                                           email Option discriminant: Some
15 00 00 00                                  email length: u32 LE = 21
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                "ada@analytical.engine"
17 07 00 00                                  birth_year: i32 LE = 1815
02 00 00 00                                  tags count: u32 LE = 2
0d 00 00 00                                  tags[0] length: 13
6d 61 74 68 65 6d 61 74 69 63 69 61 6e       "mathematician"
0a 00 00 00                                  tags[1] length: 10
70 72 6f 67 72 61 6d 6d 65 72                "programmer"
01                                           active: bool = true

90 bytes. The fixed-width lengths (4 bytes each for every variable-length quantity) account for most of the difference between Borsh and the varint-using formats. Five lengths (name, email, tags count, tags[0], tags[1]) take 20 bytes total in Borsh; Protobuf's varint encoding of the same lengths would take 5 bytes.

SCALE schema (Rust, with parity-scale-codec derive):

#[derive(Encode, Decode)]
struct Person {
    id: u64,
    name: String,
    email: Option<String>,
    birth_year: i32,
    tags: Vec<String>,
    active: bool,
}

Encoded:

2a 00 00 00 00 00 00 00                     id: u64 LE = 42
30                                           name compact length: 12 (mode 0, value << 2 | 0 = 48 = 0x30)
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace"
01                                           email Option: Some
54                                           email compact length: 21 (21 << 2 = 84 = 0x54)
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                "ada@analytical.engine"
17 07 00 00                                  birth_year: i32 LE = 1815
08                                           tags compact count: 2
34                                           tags[0] compact length: 13 (13 << 2 = 52 = 0x34)
6d 61 74 68 65 6d 61 74 69 63 69 61 6e       "mathematician"
28                                           tags[1] compact length: 10 (10 << 2 = 40 = 0x28)
70 72 6f 67 72 61 6d 6d 65 72                "programmer"
01                                           active: bool = true

75 bytes. The compact encoding shrinks the length prefixes from 4 bytes each to 1 byte each (since all our lengths fit in 6 bits). Five 1-byte compacts replace five 4-byte fixed lengths, saving 15 bytes overall.

If email were absent, Borsh would replace the 01 discriminant with 00 and skip the value entirely, saving 25 bytes (the 4-byte length prefix plus the 21 string bytes). SCALE would do the same, saving 22 bytes (the 1-byte compact length plus the string).

Evolution and compatibility

Both formats have the same answer to the schema-evolution question, which is essentially: don't. The wire format is positional and rigid; the schema cannot change without breaking every consumer that has the old schema; new fields cannot be added without a coordinated upgrade.

In practice, both ecosystems handle evolution at the protocol level rather than the format level. NEAR's smart contract upgrades are versioned by the runtime; SCALE-using chains version their state schema by spec_version (a runtime metadata field) and migrate state at upgrade time. The format is not asked to handle skew; the runtime is.

This is a clean separation of concerns and is appropriate for the deployment context. Blockchains have atomic upgrade points (block heights at which the runtime changes); they do not have the heterogeneous-deployment problem that Protobuf and Avro were designed for. The lack of in-format evolution is therefore not the cost it would be in a microservices context.

The deterministic-encoding question is the entire reason these formats exist. Both are fully deterministic by spec: given a schema and a value, exactly one byte sequence is produced. Floats encode their bit representation directly (NaN bit patterns are preserved). Maps that have unstable iteration order in the source language must be encoded with sorted keys (Borsh's spec mandates this for HashMap; SCALE's BTreeMap is sorted by Rust's construction). The bytes are hashable, signable, and comparable byte-for-byte without canonicalization.

Ecosystem reality

Borsh's ecosystem is concentrated in NEAR and the broader Solana-adjacent universe; Solana itself uses Borsh for many of its SPL programs and tooling, even though its own native serialization is a different format. The reference Rust implementation is the canonical one; JavaScript and TypeScript implementations are mature; Python and Go implementations exist and are used by off-chain tools. The format spec is short and stable.

SCALE's ecosystem is concentrated in Substrate and Polkadot. The canonical implementation is parity-scale-codec in Rust; the JavaScript implementation @polkadot/types is used by the Polkadot JS API and is feature-complete. Implementations in Python, Go, C++, and Java exist for various Substrate-adjacent uses. The format spec is part of the Substrate documentation and is also stable.

Outside their respective blockchain ecosystems, neither format has significant adoption. There is no good general-purpose case for choosing Borsh or SCALE over Protobuf or CBOR; the determinism guarantee is real but is not unique (CBOR's deterministic encoding is comparably strong), and the lack of evolution support is a disadvantage in the contexts where most binary serialization happens.

The most consequential ecosystem gotcha is the schema-source question. In a Rust-only deployment, the schema is the source code, and changes to a struct's fields propagate through the type system. In a multi-language deployment (Rust contracts plus JavaScript clients, say), the schema lives in two places and must be kept in sync. Both ecosystems have tooling to mitigate this: Borsh has a schema-export tool that generates a JSON description of a Rust type's serialization layout, which client libraries can consume; SCALE relies on Substrate's metadata system to publish a runtime's type definitions for JavaScript clients to deserialize. Both tools work; both are points of friction.

When to reach for them

Borsh is the right choice for NEAR contracts and Solana SPL programs. SCALE is the right choice for Substrate-based chains. Both are reasonable choices for any system that needs deterministic binary serialization with a well-specified format and a Rust-first implementation, and that does not need in-format schema evolution.

For new general-purpose use cases, either is a defensible choice when the alternative would be a hand-rolled binary format and the team values the existing libraries. CBOR with deterministic encoding is the more obvious general-purpose alternative, but Borsh and SCALE are simpler in some ways: smaller spec, less optionality, faster to implement.

When not to

Neither is the right choice when in-format schema evolution is a hard requirement. Neither is the right choice when bytes-on-the-wire matter and the alternative is a varint format like Protobuf or CBOR (Borsh in particular pays substantial bytes for fixed-width length prefixes). Neither is the right choice when the cross-language ecosystem outside Rust matters; both have implementations in other languages, but the implementations are secondary.

For non-blockchain use cases where the appeal is "deterministic binary format," CBOR with the deterministic encoding profile is typically the stronger choice — better tooling, more languages, broader ecosystem.

Position on the seven axes

Schema-required. Not self-describing. Row-oriented. Parse rather than zero-copy. Codegen via Rust's derive macros, with runtime fallbacks for dynamic schemas. Fully deterministic by spec. No in-format evolution mechanism.

The cell Borsh and SCALE occupy — schema-required, fixed-width-little-endian, fully-deterministic, no-evolution — is a coherent choice for the workloads they target and a poor choice for almost anything else. The fact that two such similar formats exist is an ecosystem accident, not a sign that the cell is naturally divided in two.

A note on the broader blockchain-format landscape

It is worth situating Borsh and SCALE in the broader landscape of binary formats in blockchain protocols, because the determinism-and-no-evolution choice these formats make is shared by a number of others.

Bitcoin's serialization is its own format — a custom little-endian binary encoding with VarInt-style length prefixes and explicit versioning at the protocol level. The bytes are deterministic; the format is unspecified outside Bitcoin's source code. Ethereum uses RLP (Recursive Length Prefix), which is simpler than Borsh but similar in spirit: fixed encoding rules, deterministic output, positional layout. Cosmos chains use a Protobuf-derived format called Amino (now mostly replaced by direct Protobuf with strict encoding rules); Tendermint's wire format is Protobuf-shaped. Stellar uses XDR (covered in chapter 21).

The pattern across blockchain formats is that determinism is non-negotiable and evolution happens at the runtime/protocol level, not the format level. The formats that fit this pattern have remarkably similar shapes: fixed-width fields, no in-band schema, deterministic encoding rules. The handful of formats that tried to bring richer schema features (Avro, Protobuf with deterministic mode) into blockchain contexts had to disable or work around the features that don't survive the determinism requirement.

Borsh and SCALE are therefore not idiosyncratic; they are representatives of a coherent format family that emerged from a specific deployment context. The family's design choices are driven by the consensus mechanism, not by general serialization preferences.

A note on canonical bytes vs. canonical structure

One subtlety worth flagging is the difference between canonical bytes (the format produces the same bytes for the same logical value) and canonical structure (the format normalizes the value before encoding). Borsh and SCALE produce canonical bytes given a value; they do not normalize the value first. If the source-language type is a HashMap whose iteration order is unstable, the encoder must sort keys before emitting, which is the producer's responsibility, not the format's. SCALE's documentation is explicit about this; Borsh's is implicit and relies on the Rust types (specifically, on using BTreeMap rather than HashMap when ordering matters).

This is where determinism-by-spec most often fails in practice: the format produces bytes from a value, but the value itself was constructed with unspecified ordering, and so two producers that "look" like they produced the same value produce different bytes. The fix is always at the application layer: sort inputs, use ordered collections, normalize before encoding. It is not a flaw of the format; it is a consequence of the format being honest about what it can and cannot guarantee.
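A minimal sketch of that application-layer fix: project the unordered map through an ordered collection before encoding, so the emitted entry order is a function of the value alone. The serialization step itself is elided; the point is the normalization.

use std::collections::{BTreeMap, HashMap};

fn canonical_entries(m: &HashMap<String, u32>) -> Vec<(&String, &u32)> {
    // BTreeMap iterates in key order regardless of how m was built.
    let sorted: BTreeMap<_, _> = m.iter().collect();
    sorted.into_iter().collect()
}

fn main() {
    let a: HashMap<_, _> = [("x".to_string(), 1), ("y".to_string(), 2)].into();
    let b: HashMap<_, _> = [("y".to_string(), 2), ("x".to_string(), 1)].into();
    // Same logical value, different construction order, same projection;
    // encoding the projection yields identical bytes from both producers.
    assert_eq!(canonical_entries(&a), canonical_entries(&b));
}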

Epitaph

Borsh and SCALE are blockchain-flavored deterministic binary formats: rigid by design, fast to implement, perfectly suited to the on-chain consensus contexts they were built for, and rarely the right choice anywhere else.

NBT

NBT is the format Minecraft uses, and that is approximately the entire reason it deserves a chapter in this book. The format is not technically remarkable, the design choices are conventional, and outside Minecraft and its derivatives there is no production deployment of any size. What makes NBT interesting is the pedagogical value: it is a clean, complete, schemaless self-describing binary format with a distinct flavor — big-endian, type-tagged, recursively nested, designed by a single person for a single application — that has been deployed at scale, evolved under pressure, and survived the application's transition from hobby project to billion-user platform. Reading it is the right way to understand what a hand-rolled binary format looks like when its constraints are clear and its scope is bounded.

Origin

NBT — Named Binary Tag — was created by Markus Persson, who designed and built Minecraft for its first several years. The format was introduced in Minecraft Alpha around 2010 to replace the previous world-storage format, which had been a packed binary representation of block IDs that did not have room for the metadata Persson wanted to add. NBT was designed to carry that metadata: every value is tagged with a type, every value can have a name, and compound values can nest arbitrarily.

Persson designed NBT in roughly an afternoon, by his own account, based on the requirements for storing player inventory, world chunks, and the various entity properties Minecraft wanted to persist. The format is documented in a short specification on Mojang's wiki. It has been stable since 2010, with one notable addition (Long Array, added in 2017 for representing block-light data more efficiently). The wire format itself has not changed.

NBT's deployment is, in raw byte terms, enormous. Every Minecraft world is millions of NBT files; every Minecraft server replicates NBT-encoded chunks to every player; every Minecraft plugin that wants to persist state writes NBT. The Minecraft modding community has built tooling around NBT — viewers, editors, converters — that constitutes a substantial subculture of the wider Minecraft ecosystem. The Bedrock Edition of Minecraft uses a slightly different NBT variant (little-endian, with some structural adjustments) but is recognizably the same format.

Outside Minecraft, NBT has been adopted by a few derivatives and fan projects. Some Minecraft-adjacent tools use it for shared configurations. A handful of voxel-game engines have borrowed it for their save formats, on the reasoning that the format is well-known to the modding community. Nothing else of consequence.

The format on its own terms

NBT is a tree of named, tagged values. Every value in an NBT document (with one exception, noted below) begins with a one-byte tag indicating its type, followed by a two-byte big-endian name length and that many UTF-8 bytes of name. The value's payload follows.

The exception is values inside a TAG_List. List elements share a single type (declared in the list header) and have no individual tags or names; they are just payloads, concatenated.

The thirteen tags:

TAG_End         (0)  - terminates a TAG_Compound; no name, no value
TAG_Byte        (1)  - 1-byte signed integer
TAG_Short       (2)  - 2-byte signed integer (BE)
TAG_Int         (3)  - 4-byte signed integer (BE)
TAG_Long        (4)  - 8-byte signed integer (BE)
TAG_Float       (5)  - 4-byte IEEE 754 float (BE)
TAG_Double      (6)  - 8-byte IEEE 754 double (BE)
TAG_Byte_Array  (7)  - 4-byte length, then that many bytes
TAG_String      (8)  - 2-byte length, then that many UTF-8 bytes
TAG_List        (9)  - 1-byte element type, 4-byte count, elements
TAG_Compound    (10) - sequence of named values, terminated by TAG_End
TAG_Int_Array   (11) - 4-byte length, then that many BE int32s
TAG_Long_Array  (12) - 4-byte length, then that many BE int64s

Big-endian throughout. No varint encoding. Strings use modified UTF-8 (Java's variant, where U+0000 is encoded as two bytes 0xC0 0x80 to allow null-termination — a Java-ism that trips up non-Java implementations regularly). Compound values are schemaless maps where the keys are names and the values are typed; lists are homogeneous arrays of one element type.
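A sketch of the null-byte quirk (to_modified_utf8 is a hypothetical helper, not from any NBT library); Java's variant also encodes supplementary-plane characters as surrogate pairs, which this sketch elides.

fn to_modified_utf8(s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for b in s.bytes() {
        if b == 0x00 {
            out.extend_from_slice(&[0xC0, 0x80]); // overlong encoding of U+0000
        } else {
            out.push(b); // valid UTF-8 never contains 0x00 except for U+0000 itself
        }
    }
    out
}

fn main() {
    assert_eq!(to_modified_utf8("a\u{0}b"), [0x61, 0xC0, 0x80, 0x62]);
}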

The format has no schema language. The structure of an NBT document is whatever the producer chose to emit; consumers walk the tree, dispatch on tags, and extract values they care about. This is the schemaless model in its purest form, with the addition of a type tag system that gives more type information than JSON does.

The data model has one notable peculiarity: lists declared as empty must declare their element type, and the canonical choice for a "list of unknown type" is TAG_End (which is otherwise the compound terminator). This produces the occasional bug in implementations that don't expect TAG_End to appear as a list's element type.

The compression story is part of the format. NBT files are almost always compressed: gzip is the default, zlib is also common. The Minecraft world file format wraps each NBT chunk in zlib compression and stores them in custom region files. A bare uncompressed NBT file is rare in production; consumers must detect the compression and inflate first.
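The sniffing step is small but mandatory. A sketch, assuming the usual magic bytes: gzip begins 1f 8b, a zlib stream's first byte is almost always 78, and a bare file begins with a tag byte (typically 0a, the root TAG_Compound).

enum NbtCompression {
    Gzip,
    Zlib,
    Uncompressed,
}

fn sniff(header: &[u8]) -> NbtCompression {
    match header {
        [0x1f, 0x8b, ..] => NbtCompression::Gzip,
        [0x78, ..] => NbtCompression::Zlib,
        _ => NbtCompression::Uncompressed, // hand the bytes straight to the NBT parser
    }
}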

Wire tour

Encoding our Person record. The natural NBT representation:

0a 00 00                                     TAG_Compound, name "" (length 0)
                                             — root compound
   04 00 02 69 64                             TAG_Long, name "id" (length 2)
   00 00 00 00 00 00 00 2a                    value 42 (BE i64)

   08 00 04 6e 61 6d 65                       TAG_String, name "name"
   00 0c                                       string length 12
   41 64 61 20 4c 6f 76 65 6c 61 63 65         "Ada Lovelace"

   08 00 05 65 6d 61 69 6c                   TAG_String, name "email"
   00 15                                       string length 21
   61 64 61 40 61 6e 61 6c 79 74 69 63
      61 6c 2e 65 6e 67 69 6e 65               "ada@analytical.engine"

   03 00 0a 62 69 72 74 68 5f 79 65 61 72    TAG_Int, name "birth_year"
   00 00 07 17                                value 1815 (BE i32)

   09 00 04 74 61 67 73                      TAG_List, name "tags"
   08                                          element type: TAG_String
   00 00 00 02                                 count: 2
   00 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e
                                              "mathematician" (length-13 UTF-8)
   00 0a 70 72 6f 67 72 61 6d 6d 65 72         "programmer"

   01 00 06 61 63 74 69 76 65                TAG_Byte, name "active"
   01                                          value 1 (true encoded as byte)

   00                                         TAG_End: closes root compound

135 bytes uncompressed. Compressed with gzip, the output drops to about 100 bytes, depending on the compressor settings. Compared to MessagePack's 104 bytes for the same record, NBT uncompressed is about 30% larger, primarily because every value carries a name (a 2-byte length plus the UTF-8 bytes) and every integer is full-width. Gzipped NBT looks competitive with MessagePack on size, but that comparison is misleading: it sets compressed NBT against uncompressed MessagePack, and gzipping the MessagePack bytes as well would reopen most of the gap.

The Person record's email field needed a workaround. NBT compounds do not have a native concept of optional; a field is either in the compound or it is not, with no marker for absence. The straightforward representation of "absent email" is to omit the TAG_String for email entirely. The straightforward representation of "present email" is to include it. This is the same approach JSON, MessagePack, and CBOR use, and the schema-evolution implications are the same: the application has to know which fields it expects and handle the missing-key case.

The boolean active is encoded as TAG_Byte with value 1. NBT does not have a TAG_Bool; the convention is to use TAG_Byte for booleans, with 0 meaning false and 1 meaning true. Some NBT documents use other byte values; consumers should be liberal in what they accept (anything non-zero) and strict in what they emit (0 or 1).

Evolution and compatibility

NBT has no formal evolution story. Adding a field means starting to emit a new key in the compound; consumers that don't know about the key skip it. Removing a field means stopping emission; consumers that expect the key handle it as missing. Renaming a field is a breaking change that requires application-level coordination. Type changes are coordinated similarly.

In practice, Minecraft's evolution has been managed through a combination of strict versioning and pragmatic tolerance. The data version number embedded in every save file lets the game tell which schema version produced the data, and Mojang ships data fixers that migrate older NBT structures to the current format on load. The data fixer codebase is substantial — thousands of lines of Java handling the cumulative migrations across more than a decade — and is the operational equivalent of what a schema-evolution-aware format would handle in the format itself.

The deterministic-encoding question for NBT is unsettled. The format itself is mostly deterministic: integer widths are fixed, string encodings are fixed, list element ordering is preserved. The non-deterministic parts are compound member ordering (NBT compounds are unordered maps, and most encoders preserve insertion order but the spec does not require it) and the modified-UTF-8 encoding's edge cases (rare in practice). Hashing NBT bytes for content addressing is workable if you control both ends and mandate sorted compound members; otherwise it is a known source of bugs.

Ecosystem reality

NBT's ecosystem is concentrated entirely in Minecraft and its derivatives. The canonical Java implementation is in Mojang's own codebase (closed source for the game itself, but the wire format is documented and several open-source implementations exist). Open-source implementations of varying quality exist in Java (e.g., the Querz NBT library), Python (NBT-py), Rust (the nbt and fastnbt crates), JavaScript (prismarine-nbt), Go, C#, and a handful of others. The Bedrock variant has its own implementations that handle the little-endian and structural differences.

The tooling worth knowing about: NBTExplorer is a long-standing GUI tool for inspecting and editing NBT files; mcedit and its successors include NBT editors; the Minecraft commands ecosystem uses SNBT (Stringified NBT) for in-game configuration of entities and items, with a JSON-shaped textual syntax that maps to NBT semantically. SNBT is what most Minecraft players actually type when they configure things; the binary NBT is what the game stores.

The most consequential ecosystem fact about NBT is that the modding community has produced extraordinary tooling on top of the format. The Minecraft Wiki documentation of NBT structures for every block, entity, and item type is one of the most detailed schema descriptions of any binary format I have seen, maintained collaboratively by people who have read the game's bytecode to figure out what each field means. This is what schemaless documentation looks like when the community has sufficient incentive: the format provides no help, and the community provides everything else.

The ecosystem gotchas. First, the modified-UTF-8 encoding is a common source of cross-language bugs; non-Java implementations that use standard UTF-8 will produce subtly wrong bytes for strings containing the null character. Second, the compression detection: NBT files in the wild may be uncompressed, gzipped, or zlibbed, and consumers must auto-detect (typically by examining the first byte). Third, list-of-TAG_End: empty lists must declare TAG_End as their element type, and consumers that don't expect this will mis-parse.

When to reach for it

NBT is the right choice when you are working in the Minecraft ecosystem: writing a server plugin, a world-generation tool, a modpack manager, or anything else that consumes Minecraft data. Outside that ecosystem, NBT is rarely the right choice; CBOR or MessagePack do the same job with more support and less specific weirdness.

It is a defensible choice for hobby projects where the appeal is using the format that runs Minecraft, but the practical reasons are thin.

When not to

NBT is the wrong choice for general-purpose typed binary serialization. The format's eccentricities (modified UTF-8, the ubiquitous compression assumption, the lack of a schema language, the bare-tree data model) mean that integrating NBT into a non-Minecraft system imports complexity for no real gain.

It is also the wrong choice for high-performance applications; the per-value name string overhead and the integer-width inflexibility produce encodings larger than modern alternatives without offsetting benefits.

Position on the seven axes

Schemaless. Self-describing. Row-oriented (NBT is fundamentally a tree, but a single record is a row-shaped compound). Parse rather than zero-copy. Runtime. Mostly deterministic in bytes but not by spec. No formal evolution mechanism.

NBT's stance on the axes is essentially MessagePack's, with the Java-specific UTF-8 quirk and the per-value name overhead being the principal differences. The format is what you get when you design a schemaless self-describing binary format in 2010 without studying the existing options too carefully — which is fine, because NBT serves Minecraft's needs adequately, and the broader ecosystem has not needed it to be more.

A note on the modding-community schema documentation

Worth a brief detour because the pattern is unusual. The Minecraft Wiki's documentation of NBT structures is, for many practical purposes, the schema for the format. The game itself does not ship a schema document; the modding community has reverse-engineered the format by reading the game's bytecode, comparing emitted bytes across versions, and writing up the results. The documentation covers every block entity, every item, every entity type, every chunk-format detail. Updates appear within hours of a new game version.

This is what schemaless plus a sufficiently motivated community looks like in steady state: the format provides nothing, the community provides everything, and the documentation is sometimes better than what a format with an in-band schema language could produce. The lesson is not that schemaless formats can substitute for schemas in general — they can't — but that for a specific single-vendor application, the operational discipline of "the community will document everything because they need to" can fill in the gaps. Outside Minecraft, this pattern is rare.

The other lesson worth absorbing is that schema documentation, when it exists outside the format itself, drifts from the format in subtle ways that bite at the edges. The Minecraft Wiki has had documented mismatches with the actual game over the years — fields that were renamed without the wiki being updated, types that were promoted, fields that were silently removed. Each mismatch is debugged, each is fixed, and the system continues to work. But the latency between a format change and a documentation update is real, and consumers who depend on the documentation are exposed to that latency. A format with an in-band schema avoids the latency by definition.

A note on the SNBT text format

SNBT, Stringified NBT, deserves a paragraph because it is the text projection of NBT that most Minecraft players type without realizing they are touching a binary format. SNBT looks like JSON with a small set of differences: types are indicated by suffixes (42L for a long, 1.5f for a float), arrays of typed integers are written [I; 1, 2, 3], and compound and list syntax mirrors JSON's. The Minecraft commands ecosystem uses SNBT extensively; when a player types /give @s diamond_sword{Damage:0,Enchantments: [{id:"sharpness",lvl:5}]}, the curly-brace block is SNBT, parsed into NBT, and attached to the item.

SNBT is not a separate format; it is a textual encoding of NBT values, equivalent in expressiveness. Tools that read NBT files often have an SNBT mode for human-friendly output. The fact that SNBT exists is part of why NBT survives despite its eccentricities: the format has a usable text projection, and the text projection is what most users actually interact with.

Epitaph

NBT is the format Markus Persson wrote in an afternoon to give Minecraft player inventories a place to live; thirteen tags, one documentation wiki, and a billion users who have never heard of binary serialization formats.

ROS msgs

ROS messages are the wire format of the Robot Operating System, which despite the name is not an operating system but a publish-subscribe middleware used to coordinate the components of a robot — sensors, actuators, planners, perception modules — that need to share data at high frequency. ROS has two major versions that use different wire formats: ROS 1 uses a custom binary format defined by the ROS team, and ROS 2 uses CDR (Common Data Representation) from the OMG/DDS world. Both are interesting, both are deployed widely in the robotics community, and both illustrate a particular set of design choices that matter when the consumer is a robot's wheels and the producer is a robot's camera at sixty frames per second.

Origin

ROS was created at Willow Garage in 2007 by Brian Gerkey, Morgan Quigley, and others, as the framework underlying the PR2 robot research platform. The design assumption was that a robot is a collection of cooperating processes, possibly distributed across multiple computers, exchanging data over a network at rates ranging from once an hour (battery state) to a thousand times a second (control loops). The framework needed a wire format that would handle this rate range, support a polyglot ecosystem of researchers writing nodes in C++, Python, and other languages, and allow message types to be defined by the application without running a code-generation tool every time.

ROS 1's wire format was straightforward and homemade. A message type was defined in a .msg file, the ROS build tools generated language-specific bindings, and the wire format was a flat concatenation of fields in declared order — little-endian, fixed width for primitives, length-prefixed for variable-length fields. The format was not particularly novel; what mattered was the integration with the ROS framework's transport layer (TCPROS and UDPROS), the topic-based pub/sub, and the language bindings.

ROS 2 was a complete rewrite, started in 2014 and stabilizing around 2017 with the first ROS 2 distribution. The motivation for ROS 2 was largely about transport: the team wanted to move to DDS (Data Distribution Service), an OMG standard for real-time publish-subscribe with quality-of-service guarantees, multicast discovery, and broad industrial adoption (DDS is used in aerospace, defense, autonomous vehicles, and several other domains where the requirements outstrip what TCPROS could provide). Adopting DDS meant adopting CDR, the wire format DDS specifies. ROS 2 messages are still defined in .msg files, but they compile to OMG IDL definitions under the hood, and the bytes on the wire are CDR.

The two formats are not interchangeable, and a ROS 1 node cannot talk to a ROS 2 node directly. Bridges exist (the ros1_bridge package translates messages), but they are operational seams in the architecture rather than format-level interop.

The format on its own terms

ROS 1's wire format is the simpler of the two and is worth covering first.

A ROS 1 message is a sequence of fields, encoded in declaration order, with no headers, no padding, and no per-field metadata. Primitives are little-endian fixed-width: int32 is 4 bytes, int64 is 8 bytes, float32 is 4 bytes, and so on. Strings are encoded as a 4-byte length prefix (uint32) followed by the UTF-8 bytes with no padding. Variable-length arrays are encoded as a 4-byte length prefix followed by the elements; fixed-length arrays are encoded as the elements with no length prefix. Booleans are 1 byte (0 or 1).

There are no optional fields. ROS 1 messages have no concept of absence; every field declared in the schema is present in every encoded message. The convention for "no value" is sentinel encoding: empty strings, NaN floats, sentinel integers (often the value's wraparound or extreme representation). Schema authors who want optionality have to layer it on top, typically by emitting a separate "valid" field alongside the optional value.

The schema language is ROS's .msg syntax, which is similar to flat C-struct declarations:

uint64 id
string name
string email
int32 birth_year
string[] tags
bool active

Each line declares a field with a type and a name. Types include primitives (int8 through int64, uint8 through uint64, float32, float64, string, bool, time, duration), arrays (using [] or [N] for variable or fixed length), and references to other message types defined in .msg files (allowing nested messages). The build tools generate the language bindings from the schema.

ROS 1's distinctive evolution mechanism is the MD5 hash. Every message type's schema is hashed (the MD5 of a canonical text representation of the schema and its dependencies), and the hash is part of the topic registration. A subscriber's expected hash must match the publisher's emitted hash; if they differ, the subscription is rejected. This is a hard rejection, not a graceful degradation; ROS 1 will not let a subscriber receive messages from a publisher with a different schema.
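A sketch of the gate's shape, using the md5 crate; the real computation first canonicalizes the .msg text and substitutes the hashes of nested message types, which this sketch skips.

fn msg_hash(canonical_schema_text: &str) -> String {
    format!("{:x}", md5::compute(canonical_schema_text.as_bytes()))
}

fn main() {
    let v1 = "uint64 id\nstring name\n";
    let v2 = "uint64 id\nstring name\nstring email\n";
    // Any schema change yields a different hash; a mismatched hash
    // means the subscription is rejected outright.
    assert_ne!(msg_hash(v1), msg_hash(v2));
}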

The MD5 mechanism is what makes ROS 1 deployments operationally manageable at modest scale and operationally painful at larger scale. A schema change requires updating every node that uses the message type, simultaneously, before the system can run again with the new schema. ROS 1's tooling helps with this — the build system propagates schema changes through dependent packages — but the format itself imposes the hard requirement.

ROS 2's wire format is CDR, the OMG's Common Data Representation. CDR is very similar in spirit to XDR — fixed-width, byte-order-prefixed, with rules for padding to natural alignment — but with modern conveniences. CDR messages begin with a 4-byte representation header: the first byte is reserved (0), the second byte indicates byte order and encoding flavor (0x00 for big-endian plain CDR, 0x01 for little-endian plain CDR, 0x02 and 0x03 for parameter-list CDR with respective byte orders), and the next two bytes are options. The body follows, encoded according to the selected representation.
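Decoding that header takes a few lines; a sketch following the mode values just described (real DDS stacks recognize additional encoding identifiers).

#[derive(Debug, PartialEq)]
enum CdrRepr {
    BigEndianPlain,        // 0x00
    LittleEndianPlain,     // 0x01
    BigEndianParamList,    // 0x02
    LittleEndianParamList, // 0x03
}

fn parse_cdr_header(buf: &[u8]) -> Option<CdrRepr> {
    if buf.len() < 4 || buf[0] != 0x00 {
        return None; // first byte is reserved
    }
    let repr = match buf[1] {
        0x00 => CdrRepr::BigEndianPlain,
        0x01 => CdrRepr::LittleEndianPlain,
        0x02 => CdrRepr::BigEndianParamList,
        0x03 => CdrRepr::LittleEndianParamList,
        _ => return None,
    };
    // buf[2..4] carry options; ignored here.
    Some(repr)
}

fn main() {
    // The header from the wire tour below: little-endian plain CDR.
    assert_eq!(
        parse_cdr_header(&[0x00, 0x01, 0x00, 0x00]),
        Some(CdrRepr::LittleEndianPlain)
    );
}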

CDR's data types are similar to XDR's: primitives are fixed-width, strings are length-prefixed (with a NULL terminator counted in the length), sequences are length-prefixed arrays, structs are the concatenation of fields. Padding is added between fields to align each to its natural boundary; this is a slight elaboration on XDR's blanket 4-byte alignment, allowing 8-byte fields to be aligned to 8-byte boundaries.

CDR has an extension point, XCDR2 (Extended CDR version 2), which is the encoding most ROS 2 implementations actually use. XCDR2 supports optional fields (via a parameter-list representation), more flexible alignment, and a richer schema evolution story than plain CDR. The OMG IDL files that ROS 2's build tools generate from .msg files specify XCDR2 by default in modern ROS 2 distributions.

The schema for ROS 2 lives in two places: the .msg file (which the user writes) and the OMG IDL file (which the build tools generate). The IDL file is the canonical wire-format spec; the .msg file is the user-facing convenience.

Wire tour

Schema (Person.msg):

uint64 id
string name
string email
int32 birth_year
string[] tags
bool active

ROS 1 encoding (little-endian throughout, no header):

2a 00 00 00 00 00 00 00                     id: u64 LE = 42
0c 00 00 00                                  name length: 12
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace"
15 00 00 00                                  email length: 21
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                "ada@analytical.engine"
17 07 00 00                                  birth_year: i32 LE = 1815
02 00 00 00                                  tags count: 2
0d 00 00 00                                  tags[0] length: 13
6d 61 74 68 65 6d 61 74 69 63 69 61 6e       "mathematician"
0a 00 00 00                                  tags[1] length: 10
70 72 6f 67 72 61 6d 6d 65 72                "programmer"
01                                           active: 1 byte = 1 (true)

89 bytes. Essentially identical to Borsh in layout, which is no accident; both formats made the same conservative choices. The one-byte difference from Borsh's 90 is the email field's Option discriminant: ROS 1 has no optional fields, so the email string is encoded directly with no presence byte.

The Person record's email cannot be marked as absent in ROS 1 without an out-of-band convention. The cleanest convention is to add a separate bool email_present field; the ad-hoc convention is to use an empty string and have consumers treat empty as absent. Neither is satisfying, and both are common in practice.

ROS 2 (XCDR2 little-endian) encoding:

00 01 00 00                                  CDR header: LE plain
2a 00 00 00 00 00 00 00                     id: u64 LE = 42
0d 00 00 00                                  name length: 13 (12 + null terminator)
41 64 61 20 4c 6f 76 65 6c 61 63 65 00       "Ada Lovelace\0"
00 00 00                                     padding to next 4-byte boundary
16 00 00 00                                  email length: 22 (21 + null)
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65 00             "ada@...\0"
00 00                                        padding to 4-byte boundary
17 07 00 00                                  birth_year: i32 LE = 1815
02 00 00 00                                  tags count: 2
0e 00 00 00                                  tags[0] length: 14
6d 61 74 68 65 6d 61 74 69 63 69 61 6e 00    "mathematician\0"
00 00                                        padding
0b 00 00 00                                  tags[1] length: 11
70 72 6f 67 72 61 6d 6d 65 72 00             "programmer\0"
01                                           active: 1 byte = 1

104 bytes. The differences from ROS 1 are: the 4-byte header at the start; the null terminators on strings (counted in the length); and the alignment padding before multi-byte fields (a bool aligns to a single byte, so no padding precedes the final field). CDR's alignment overhead is real but modest for a single-record payload.

Evolution and compatibility

ROS 1's evolution story is the MD5-hash gate. Schemas can change, but the change requires every node that uses the message type to be rebuilt and redeployed before the topic can be subscribed to again. There is no graceful degradation; mismatched schemas mean no communication.

ROS 2 inherits from XCDR2 a more flexible story. Optional fields can be added without breaking older subscribers; older subscribers see the new fields as absent. Type promotions are supported. Renaming and reordering are still breaking. The schema-hash mechanism in ROS 2 is the RIHS hash (ROS Interface Hash Standard), which is similar in role to ROS 1's MD5 but is computed differently and supports the optional-field semantics.

The deterministic-encoding question for ROS messages is straightforward for ROS 1 (the format is deterministic — fixed widths, no choices) and slightly more nuanced for CDR-based ROS 2 (deterministic given the byte order, but the byte order can be either; XCDR2's parameter-list mode allows optional fields whose absence is encoded by omission, which can produce non-deterministic ordering). Most ROS deployments do not care about byte-equality; when they do, they fix the byte order and prohibit XCDR2's parameter-list optional-field encoding.

Ecosystem reality

The ROS ecosystem is concentrated entirely in robotics. Open Robotics maintains both ROS 1 (now in maintenance mode) and ROS 2 (active). Major ROS 2 distributions ship every 1-2 years, with LTS releases for production deployments.

The DDS layer underneath ROS 2 has multiple vendor implementations: Eclipse Cyclone DDS (open source), eProsima Fast DDS (open source, default in many ROS 2 distributions), RTI Connext DDS (commercial, widely used in automotive and aerospace), and OpenDDS (open source, less common). Each implementation has its own characteristics — different transport options, different QoS defaults, different debugging tools — and ROS 2 deployments typically choose one based on their requirements.

The rosbag format (ROS's recording format for messages) is itself a binary format that wraps ROS messages with timestamps and topic metadata. Rosbag2 (the ROS 2 version) is significantly more flexible than the original; both have their own ecosystems of playback tools and analytics frameworks.

The ecosystem gotcha worth noting is the bridge problem. ROS 1 and ROS 2 cannot directly interoperate; the ros1_bridge package exists but is operationally finicky, especially for complex message types or high-rate topics. Many robotics organizations have mixed ROS 1 / ROS 2 deployments, and the bridge is a substantial operational burden. The migration story for ROS 1 → ROS 2 is, by this measure, the long tail of the ecosystem's operational attention.

A second gotcha is the quality-of-service configuration in ROS 2 / DDS. Different QoS settings (best-effort vs. reliable, volatile vs. transient-local, deadline guarantees) produce silently different behavior, and mismatched QoS between publisher and subscriber means no messages flow. The format itself is fine; the surrounding configuration is the operational problem.

When to reach for them

Use ROS messages if you are working in robotics. Otherwise, do not. The formats are coupled tightly to their respective frameworks (ROS 1's TCPROS, ROS 2's DDS), and using them outside those frameworks loses the integration that justifies the formats' existence.

If you specifically need CDR (the wire format under ROS 2) without ROS, use a DDS implementation directly. CDR is widely deployed outside robotics — in automotive infotainment systems, aerospace control systems, and the broader DDS ecosystem — and is a reasonable choice for any system where DDS's pub/sub semantics are useful.

When not to

ROS messages are the wrong choice for inter-service RPC, log payloads, configuration, or any application that is not built on a publish-subscribe middleware. The format choices are optimized for the middleware integration, not for general serialization.

Position on the seven axes

ROS 1: schema-required, not self-describing, row-oriented, parse, codegen, deterministic, evolution gated by MD5 hash. ROS 2 (CDR): similar, with XCDR2 adding optional-field support and a more nuanced determinism story.

Both formats illustrate a design point — robotics middleware formats — that has substantial deployment but limited general interest, and the right way to use either is through the surrounding ROS framework rather than as a standalone serialization format.

A note on rosbag and the message-replay use case

The persistence story for ROS messages deserves a paragraph because it is one of the operational corners of the ecosystem that engineers new to robotics often miss. rosbag (in ROS 1) and rosbag2 (in ROS 2) are file formats and tools for recording the messages that flow over a system's topics, with timestamps, so that the recording can be replayed later for debugging or analysis. The recordings are essential for robotics development: a lab robot's behavior in an unusual situation is captured as a rosbag, sent to a developer, and replayed in simulation to reproduce the bug.

The rosbag formats are themselves binary serialization formats — wrappers around the underlying ROS messages with framing, indexing, and metadata. rosbag1 has a custom format with its own quirks; rosbag2 supports multiple storage backends (the default is sqlite3, with mcap as a popular alternative). MCAP, the format underlying many rosbag2 deployments, is itself an interesting binary format: it stores any kind of timestamped messages, not just ROS, and has gained adoption outside the ROS ecosystem in robotics-adjacent applications (autonomous vehicle data logging, drone telemetry, sensor fusion pipelines).

This is the second time in this book a format has spawned a recording format of its own — the first was Avro Object Container Files. The pattern is worth noting: serialization formats that are used for high-rate streaming workloads tend to grow companion formats for archival and replay, and the companion formats often outlive the original. MCAP is now specified as a generic timestamped-message container that ROS happens to use; rosbag is now an interface, not a format.

Epitaph

ROS messages are robotics' default wire format; ROS 1's homemade binary serves nodes-talking-to-nodes adequately, ROS 2's CDR piggybacks on DDS's industrial pedigree, and both are bought together with the framework rather than chosen on their own merits.

Bond

Bond is the format Microsoft built when it had concluded — like Google before it and Facebook contemporaneously — that it needed a typed schema-first wire format for inter-service traffic. Bond arrived after Protobuf and Thrift were already established, and its design choices reflect that arrival. Bond's authors had the benefit of seeing what worked and what did not in its predecessors and made small but considered refinements: richer schema expressiveness, multiple wire encodings selectable per use case, a "bonded" type for pass-through serialization that lets a service forward a message without knowing its full schema. The format has been used at substantial scale inside Microsoft for over a decade. Its gradual displacement by Protobuf inside Microsoft, and its near-invisibility outside, are the reasons it is in this book rather than fighting Protobuf for the chapter on schema-first wire formats.

Origin

Bond was developed at Microsoft and open-sourced in 2015. Its internal use predated the public release by several years; Bing, the search engine, used Bond for inter-service traffic, as did the Cosmos data warehouse (the internal Microsoft system, not the Azure offering of similar name) and several other large properties. The team behind Bond was led by Adam Sapek and included engineers with substantial experience in distributed systems serialization.

The design goals of Bond were a slight reframing of Protobuf's: support typed schemas with formal evolution rules, support multiple wire encodings (because different use cases want different tradeoffs between size and parse speed), support forward-compatible "bonded" fields that let services route or forward messages without parsing them fully, and support a richer type system than Protobuf's, including non-nullable references, nullable wrappers, and various container types.

The format has been stable since the public release. Microsoft's internal use continued through 2024 in the systems that adopted it early, but new services inside Microsoft increasingly chose Protobuf, and the Bond GitHub repository's commit history shows the slow tempo of a project transitioning from active development to maintenance. The reference implementations (C++, C#, Python, Java) are still maintained, but the broader Microsoft engineering center of gravity has moved to gRPC and Protobuf.

Outside Microsoft, Bond has had minimal adoption. A handful of external projects use it, mostly in cases where a Bond-using Microsoft codebase was open-sourced and the dependency came along. The format is not on the radar of any organization choosing a serialization format from scratch in 2026.

The format on its own terms

Bond's schema language is similar to Protobuf's and Thrift's, with some additions:

namespace example;

struct Person {
    0: required uint64 id;
    1: required string name;
    2: nullable<string> email;
    3: required int32 birth_year;
    4: required vector<string> tags;
    5: required bool active;
}

Field ordinals (the 0:, 1:, etc.) play the same wire-level role as Protobuf's field numbers and Thrift's field IDs: they are the stable identifier for the field across schema versions. Field declarations have presence modifiers — optional (the default, with default-value semantics similar to Protobuf 3's), required, or required_optional — and can use Bond's richer type modifiers: nullable<T> for an explicitly nullable wrapper, bonded<T> for the pass-through type, vector<T> and list<T> for ordered collections, set<T> for unordered, map<K, V> for keyed lookups.

Bond supports a handful of types beyond what Protobuf and Thrift offer in their core specs: wstring (UTF-16 strings, reflecting Bond's .NET roots), blob (binary blobs as a distinct type from byte vectors), and bonded<T> (described below).

The bonded<T> type is Bond's most distinctive contribution. A bonded<T> field carries a serialized T as opaque bytes, with just enough metadata to identify its type and version. A service can read the bonded field's metadata, decide whether to fully parse it, forward it as bytes to another service, or store it without ever decoding it. This is useful for routing services that need to look at a few fields of a message and pass the rest through; with a bonded field, the routing service does not need to depend on the schema for the inner type. Protobuf has something similar in its Any type, but bonded<T> is more explicit about what the inner type is and is generally easier to work with.
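To make the routing pattern concrete, here is a sketch of an envelope schema; the Envelope struct and its fields are hypothetical, but the shape is the standard bonded pass-through idiom:

struct Envelope {
    0: required string destination;
    1: uint32 priority;
    2: bonded<Person> payload;
}

The router deserializes Envelope, reads destination and priority, and forwards the still-serialized payload bytes untouched; only the final consumer pays the cost of materializing the Person.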

Bond defines several wire encodings:

Compact Binary is the default: tag-and-length-prefixed, varint-encoded integers, similar in shape to Thrift Compact and Protobuf. This is the encoding used most often.

Fast Binary is a faster-to-encode-and-decode variant that trades some size for performance: explicit width fields, less varint encoding. The "fast" name is relative; Compact Binary on modern hardware is rarely the bottleneck.

Simple Binary is a minimal encoding that omits some metadata and is suitable for fixed-schema scenarios where evolution is not required.

Simple JSON is a JSON-shaped encoding for debugging and interop with non-Bond systems. Bond also has an XML encoding for the same role.

The encoding is selected at runtime by the service; a single schema can be deserialized from any of the encodings, and a service can mix encodings on different topics if appropriate.

Wire tour

Encoding our Person record with Compact Binary:

3a 00                                        struct header (compact binary)
   c0 2a                                     field 0 (id), uint64, value 42
   c1 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65 field 1 (name), string len 12
   c2 15 61 64 61 40 61 6e 61 6c 79 74 69 63
        61 6c 2e 65 6e 67 69 6e 65            field 2 (email), nullable string,
                                              has-value flag plus string len 21
   c3 ae 1c                                  field 3 (birth_year), int32, zigzag(1815)
   c4 02                                     field 4 (tags), vector<string>, count 2
        0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e
        0a 70 72 6f 67 72 61 6d 6d 65 72
   c5 01                                     field 5 (active), bool
00                                           struct terminator

Total approximately 75 bytes. Comparable to Thrift Compact (which made similar tag-encoding choices) and Avro for this payload. The breakdown mirrors Thrift Compact's: field tags take 1-2 bytes, varint integers take their natural width, length-prefixed strings dominate the size.

The exact bytes here are approximate; Bond's Compact Binary has specific framing details (the leading struct header, the encoding of the field tag with type information, the terminator) that differ slightly from Thrift Compact, but the sizes work out within a few bytes of each other. Bond's field-tag encoding packs the field ordinal and the type information into a single byte for small ordinals and types, with a multi-byte fallback for larger values.

If email were the null branch of nullable<string>, the encoding would emit a single byte indicating absence (0x00 in the nullable wrapper), saving the 22 bytes of the string. Field 2's tag would still be present.

Evolution and compatibility

Bond's evolution rules are similar to Protobuf's. Adding a field with a new ordinal is forward and backward compatible. Removing a field requires the ordinal to be retired (Bond has no reserved keyword; the convention is simply to never reuse ordinals). Changing a field's type is mostly unsafe, with similar exceptions to Protobuf's (numeric promotions, nullable to non-nullable when the value is always present). Renaming is source-only; ordinals are wire-level identity.
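A sketch of those rules applied to the Person schema from earlier; the added field is hypothetical:

struct Person {
    // fields 0 through 5 unchanged from the version shown earlier

    // v2: new field, new ordinal, default presence. Old readers skip it;
    // new readers see the default when reading old data.
    6: string nickname;

    // Ordinal 7 once belonged to a removed field. Bond has no reserved
    // keyword, so never reusing it is a matter of convention and review.
}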

Bond's required keyword has the same operational hazard Thrift's has: removing a required field is a coordinated deployment, and required fields are conventionally avoided unless absence is genuinely a fatal error.

Bond's bonded<T> type adds an interesting evolution wrinkle. A service that holds bonded data does not need the schema for the inner type; it can forward the bonded bytes to another service that does have the schema. This means a schema change to the inner type only affects services that fully parse it; routing or forwarding services are unaffected. This is a real operational benefit at scale, and it is one of the reasons Bing's architecture used Bond for as long as it did.

The deterministic-encoding question for Bond is the same as for Protobuf: not specified at the wire level, achievable with care. Bond does not have a canonical encoding subset. Map ordering, varint widths, and a few other choices are encoder-specific. Applications that need byte-equality canonicalize separately.

Ecosystem reality

Bond's ecosystem is small. The reference implementations are in C++, C#, Python, and Java; all four are mature. The Compact Binary encoding is what production deployments use; the other encodings are debugging or interop conveniences.

Internal Microsoft systems that use Bond include Bing's search infrastructure, the Cosmos data warehouse, and several Azure services that predate the gRPC migration. Outside Microsoft, Bond appears in the open source projects Microsoft has released that came with Bond dependencies — a few research codebases, some distributed systems experiments — and in a handful of independent projects that adopted it deliberately. None of these are large.

The ecosystem gotcha worth noting is the gradual displacement. Microsoft's recommendation for new services has shifted to Protobuf and gRPC, and the Bond ecosystem's tooling has not kept up with Protobuf's. Buf-equivalent tools for Bond do not exist; the breaking-change detection that makes large-scale Protobuf deployments manageable is mostly missing for Bond. New services choosing Bond in 2026 inherit this gap and have to build the operational discipline themselves.

A second gotcha is the multi-encoding architecture. Bond's flexibility in supporting Compact Binary, Fast Binary, Simple Binary, and JSON variants is theoretically useful but operationally cumbersome: the encoding has to be configured per service, the choice of encoding is part of the deployment contract, and tooling that snoops on Bond traffic has to handle all of them. This is the same operational cost Thrift's multi-protocol design imposes, and it is one of the reasons Protobuf's single wire format is, on balance, easier to live with.

When to reach for it

Bond is the right choice when interoperating with an existing Bond-using Microsoft codebase, where the dependency exists and the cost of switching is high. It is a defensible choice when the bonded<T> pass-through type is a hard requirement and Protobuf's Any is unsatisfying.

For new general-purpose use cases, Bond is rarely the right choice. Protobuf is better-supported, better-tooled, and better-understood; Avro is a more thoughtful answer to a similar problem; the case for choosing Bond in 2026 has to be specific.

When not to

Bond is the wrong choice for new microservices in greenfield environments outside the Microsoft ecosystem. It is the wrong choice when ecosystem maturity matters; the breaking-change detection, up-to-date language bindings, and broad community support that Protobuf has are all stronger than Bond's equivalents.

It is also the wrong choice when the multi-encoding flexibility is unwelcome; new deployments rarely benefit from supporting more than one wire encoding.

Position on the seven axes

Schema-required. Not self-describing. Row-oriented. Parse rather than zero-copy. Codegen-first via the Bond compiler, with runtime support via reflection-style APIs. Non-deterministic by spec. Evolution by tagged fields with an explicit bonded<T> mechanism for pass-through.

The cell Bond occupies — schema-required, tagged-field, multi-encoding, with a forward-compatibility-friendly pass-through type — is in the same neighborhood as Protobuf and Thrift, with small distinctive features. The features are real but did not prove decisive enough to overcome the broader ecosystem gravity around Protobuf.

A note on the bonded pattern and the Any / oneof comparison

The bonded<T> type deserves more attention than the chapter gave it, because it represents a design choice that Protobuf has struggled with and that Bond got right early.

The problem bonded<T> solves: a service in a routing or forwarding role often handles messages whose inner payload is addressed to a different service. The routing service needs to look at a few fields (the destination, the priority, perhaps a correlation ID) but does not need — and arguably should not have — full schema knowledge of the inner payload. If it had full schema knowledge, it would have a build-time dependency on every inner-message schema in the system, which is operationally expensive and architecturally bad.

Protobuf's answer to this has been the Any type, which carries a type URL and an opaque bytes buffer. The receiver looks up the type URL and decides whether to unpack. The Any type works, but it has friction: type URLs are matched by string rather than by integer ID, the unpacking machinery requires the inner schema to be loaded into the runtime separately, and identifying the inner type relies on URL-resolver conventions that are not uniformly implemented.

bonded<T> declares the inner type explicitly in the schema. A field of type bonded<Person> is known to be a Person, but the serialized bytes are kept opaque until the field is actually accessed. The routing service can read the bonded<Person> field's metadata, look at the version, decide whether to forward or fully parse, and act accordingly. The schema knows the type; the runtime decides when to materialize it.

This is a small distinction but a meaningful one. Any says "the inner type is whatever the type URL says it is, and we'll figure it out at runtime." bonded<T> says "the inner type is T, and we'll choose at runtime whether to actually deserialize it." The latter is more typed and more amenable to static analysis. Bond got this right; Protobuf has not, and the result is that forwarding patterns in Protobuf are operationally harder than they need to be.

A note on Microsoft's path away from Bond

Several articles and conference talks from Microsoft engineers over the past few years have addressed the gradual move away from Bond toward Protobuf. The reasons cited are operational rather than technical: Protobuf has the better breaking-change detector (Buf), better cross-team tooling, better language support outside the languages Bond invested in, better integration with modern observability stacks, better gRPC integration. The technical merits of Bond — bonded<T>, the richer type system, the multi-encoding flexibility — have not been compelling enough to overcome those operational gaps.

The lesson worth absorbing is that in long-running format choices, operational maturity outranks technical merit. Bond was technically a slight refinement of Protobuf; Protobuf became operationally a substantial improvement on Bond. The latter mattered more.

Epitaph

Bond is Microsoft's contribution to the schema-first wire format genre; technically thoughtful, operationally outclassed by Protobuf, gradually displaced inside its own birthplace.

Postcard and bincode

bincode and postcard are the two binary formats most Rust programmers encounter when they wire up serde for the first time. They are both serde-compatible, both Rust-first, both small in spec and small in implementation. They differ in target — bincode for general-purpose serde-binary use, postcard for the embedded and no_std contexts that bincode does not address — and in encoding philosophy. They share a significant property: neither is a serialization format in the way Protobuf or Avro are. They are both layouts of serde's data model, which means the schema is implicit in the Rust types being serialized, and cross-language use is effectively an afterthought.

Origin

bincode was created by Ty Overby in 2014, near the dawn of the Rust serde ecosystem. The original goal was simple: provide a binary format that any serde-Serialize type could be encoded into and any serde-Deserialize type could be decoded from. bincode's first wire format was minimal: fixed-width little-endian integers, length-prefixed strings, no metadata. Its original design predated many of serde's more sophisticated features and produced a format that was fast, simple, and not particularly small.

bincode 2.0 was released in 2023 with a substantially revised wire format. The 2.0 default encodes integers as varints and options more compactly, and a configuration system lets users opt in or out of various encoding choices. The 2.0 format is wire-incompatible with 1.x, a deliberate break to enable the smaller default encoding; 1.x continues to be maintained for projects that have on-disk data in the older format.

postcard was created by James Munns in 2019, with the explicit goal of providing a serde-compatible binary format that worked in no_std and embedded contexts where bincode was not a fit. postcard's design constraints were stricter: no allocator dependency, predictable wire size, varint encoding for integers (because embedded systems often handle small values), and a deterministic encoding suitable for embedded firmware checksums and signatures. The format spec is short, the canonical implementation is in pure Rust, and the format has been stable since version 1.0.

Both formats are concentrated in the Rust ecosystem and have minimal cross-language adoption. There are unofficial postcard implementations in C and Python; bincode-compatible reimplementations exist for Go and JavaScript but are not widely used. The lack of cross-language adoption is not a flaw of either format; it is a consequence of their being designed to encode serde's data model, which does not exist outside Rust.

The format on its own terms

bincode 2.0's default configuration produces wire bytes that are structurally similar to postcard's. Integers are varint-encoded, strings are varint-length-prefixed, options are tagged with a discriminant byte, sequences are length-prefixed. The differences are in the details: postcard's varint is LEB128-style, while bincode 2.0 uses its own varint scheme (a discriminant byte that escalates to wider fixed-width forms for larger values); bincode 2.0's option encoding takes 1 byte for None and 1+payload for Some; postcard's is the same.

bincode 1.x's default configuration is different. Integers are fixed-width little-endian (u64 always 8 bytes, i32 always 4 bytes), lengths are u64 (8 bytes per length prefix), and the result is dramatically larger than the 2.0 encoding for the same data. Many projects on bincode 1.x configured it to use "VarintEncoding" via the configuration system, which approximated the 2.0 default; many did not, and the larger format is what their on-disk data uses.

postcard's encoding is uniformly varint-based. The varint encoding is LEB128-style: each byte contributes 7 bits to the encoded value, with the high bit indicating whether more bytes follow. Signed integers are zigzag-mapped to unsigned before varint encoding. Lengths use the same varint. There is no fixed-width mode; the format is varint-throughout.

Both formats handle Rust's algebraic data types (enums) by encoding the variant discriminant as a varint integer followed by the variant's data. This is straightforward for non-recursive enums and works fine for recursive ones up to the limits of the serializer's stack depth.
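A sketch of what that looks like on the wire, assuming postcard's encoding rules (variant index as a varint, then the payload); the Command enum is hypothetical:

use serde::Serialize;

#[derive(Serialize)]
enum Command {
    Ping,        // encodes as: 00
    Set(u8),     // Set(5) encodes as: 01 05
    Log(String), // Log("hi") encodes as: 02 02 68 69
}

fn main() {
    let bytes = postcard::to_allocvec(&Command::Set(5)).unwrap();
    assert_eq!(bytes, [0x01, 0x05]);
}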

The schema for both formats is the Rust source code. There is no separate IDL, no compiled schema artifact, no schema registry. The encoder is generated by serde's procedural macros from the Rust type's Serialize and Deserialize derive; the decoder similarly. Cross-version compatibility, when it matters, requires the consumer's type definition to match the producer's exactly, or to be wire-compatible according to the format's narrow rules.

Wire tour

Schema (Rust source, identical for both):

#[derive(serde::Serialize, serde::Deserialize)]
struct Person {
    id: u64,
    name: String,
    email: Option<String>,
    birth_year: i32,
    tags: Vec<String>,
    active: bool,
}

postcard encoding:

2a                                           id: varint(42) = 1 byte
0c                                           name length: varint(12)
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace"
01                                           email Option discriminant: Some
15                                           email length: varint(21)
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                "ada@analytical.engine"
ae 1c                                        birth_year: zigzag-varint(1815) = 2 bytes
02                                           tags count: varint(2)
0d                                           tags[0] length: 13
6d 61 74 68 65 6d 61 74 69 63 69 61 6e       "mathematician"
0a                                           tags[1] length: 10
70 72 6f 67 72 61 6d 6d 65 72                "programmer"
01                                           active: bool true = 1 byte

66 bytes. Among the smallest encodings in this book — narrowly beating Avro's 67 bytes and Protobuf's 71. The win comes from the universal varint encoding (lengths and integers both shrink to their natural width) plus the absence of any per-record framing overhead.

bincode 1.x with default (fixed-width) encoding:

2a 00 00 00 00 00 00 00                     id: u64 LE = 8 bytes
0c 00 00 00 00 00 00 00                     name length: u64 LE = 8 bytes
41 64 61 20 4c 6f 76 65 6c 61 63 65          "Ada Lovelace"
01                                           email Some discriminant
15 00 00 00 00 00 00 00                     email length: u64 LE = 8 bytes
61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                "ada@analytical.engine"
17 07 00 00                                  birth_year: i32 LE = 4 bytes
02 00 00 00 00 00 00 00                     tags count: u64 LE = 8 bytes
0d 00 00 00 00 00 00 00                     tags[0] length: u64 = 8 bytes
6d 61 74 68 65 6d 61 74 69 63 69 61 6e       "mathematician"
0a 00 00 00 00 00 00 00                     tags[1] length: u64 = 8 bytes
70 72 6f 67 72 61 6d 6d 65 72                "programmer"
01                                           active: bool = 1 byte

110 bytes, 44 more than postcard's 66 for the same data. The five fixed-width 8-byte length and count prefixes account for 35 of those bytes (they take 40 where varints take 5), the 8-byte id for another 7, and the 4-byte birth_year for the last 2. bincode 1.x deployments that have not configured varint encoding pay this cost on every record.

bincode 2.0 with default settings is wire-similar to postcard, within a few bytes — both use varint encoding, both prefix strings the same way, both encode options identically. The two formats are not byte-identical (the varint encodings differ slightly, and the option discriminant byte may differ in specific edge cases) but the size difference for typical payloads is small.

If email were None, both formats would replace the 01 Some discriminant with 00 and skip the value. postcard and bincode 2.0 each save 22 bytes (the length byte plus the 21-byte string); bincode 1.x saves 29 (the string plus its 8-byte length prefix).

Evolution and compatibility

Neither format has a meaningful schema-evolution story. Both are positional encodings of serde's data model: change the Rust type, change the wire format. Adding a field, removing a field, reordering fields, or changing a field's type all break compatibility with old data.

The conventional pattern for evolving postcard or bincode data is explicit versioning: prepend a version discriminant to every record, dispatch on the version to pick the right Rust type, migrate explicitly. This works but requires application discipline; the formats provide no help.
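A minimal sketch of that pattern; the type names are hypothetical, and the enum discriminant doubles as the version tag because it is the first thing on the wire:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct PersonV1 { id: u64, name: String }

#[derive(Serialize, Deserialize)]
struct PersonV2 { id: u64, name: String, email: Option<String> }

#[derive(Serialize, Deserialize)]
enum StoredPerson {
    V1(PersonV1),
    V2(PersonV2),
}

// Readers migrate at decode time; writers always emit the newest variant.
fn upgrade(stored: StoredPerson) -> PersonV2 {
    match stored {
        StoredPerson::V1(p) => PersonV2 { id: p.id, name: p.name, email: None },
        StoredPerson::V2(p) => p,
    }
}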

Some Rust ecosystems handle this with serde's #[serde(default)] attribute, which lets a deserializer accept missing fields and substitute a default value. This is a per-field opt-in and works for forward compatibility (new code reading old data, where the old data lacks new fields). Backward compatibility (old code reading new data) is a different problem and requires either dropping unknown fields explicitly (which is not the default) or encoding the new fields in a way old code can skip.
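A sketch of the attribute in use; the Settings type and its fields are hypothetical:

use serde::Deserialize;

#[derive(Deserialize)]
struct Settings {
    endpoint: String,
    // Added in a later version. When the deserializer reports this field
    // as missing, serde substitutes u32::default() rather than erroring.
    #[serde(default)]
    retry_limit: u32,
}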

Both formats are deterministic by construction. Given a value and a configured encoder, exactly one byte sequence is produced. There are no choices in the wire format and no padding. The one caveat is map-key ordering: Rust's BTreeMap iterates in key order, but HashMap does not, and serializing a HashMap produces non-deterministic bytes unless the keys are sorted first.
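A sketch of the map-key caveat; the types are hypothetical:

use serde::Serialize;
use std::collections::{BTreeMap, HashMap};

#[derive(Serialize)]
struct StableCounts {
    // BTreeMap iterates in key order: same value, same bytes, every run.
    counts: BTreeMap<String, u32>,
}

#[derive(Serialize)]
struct UnstableCounts {
    // HashMap iteration order varies from run to run, so the encoded
    // bytes vary too, silently breaking hashing and signatures.
    counts: HashMap<String, u32>,
}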

Ecosystem reality

bincode's user base is broad within the Rust ecosystem. It is the default binary format for many Rust projects: cache files, inter-process communication on a single machine, on-disk record formats. Many crates that need a binary format default to bincode because it is the path of least resistance. The 1.x → 2.0 migration has fragmented this slightly; older crates and older data are 1.x, newer crates default to 2.0, and the two are wire-incompatible. Migration is straightforward but requires attention.

postcard's user base is concentrated in embedded Rust: drone firmware, microcontroller projects, IoT devices that run Rust through frameworks like Embassy, RTIC, and Tock. The no_std constraint matters in those contexts and rules out bincode (which historically had std dependencies — bincode 2.0 has reduced this but not fully). postcard's varint encoding, deterministic behavior, and small implementation make it a reasonable fit for the firmware-checksum and over-the-wire-control use cases that embedded systems care about.

Both formats are used outside their primary niches, but the choice in those cases is usually contingent on the project's dependencies. A general-purpose Rust service might choose either, and the choice rarely has consequences large enough to warrant a formal evaluation.

Three ecosystem gotchas are worth noting. First, the bincode configuration system: bincode (especially 1.x) has many configuration knobs, and different settings produce wire-incompatible output. Code that round-trips through a default bincode encoder and a varint-configured bincode decoder will fail. The recommendation is to pick a configuration once, document it, and stick to it; a sketch follows below.

Second, the no_std story: bincode 2.0's no_std mode works but has subtle differences from the std mode, especially around how allocations are handled. Embedded projects should test carefully. postcard's no_std story is the format's design center and is correspondingly more polished.

Third, the Rust version dependency: both formats depend on serde, and serde's wire-impact-changes are rare but real. A serde update that affects the discriminant encoding for an enum type can change the bytes; the change is typically opt-in but can surprise long-lived data deployments.
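On the first of these, a minimal sketch of pinning the configuration in one place, assuming bincode 2's serde integration (config::standard, encode_to_vec, decode_from_slice):

use bincode::config;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct CacheEntry {
    key: String,
    payload: Vec<u8>,
}

fn main() {
    // Defined once; every encode and decode in the codebase goes through
    // this value, so the wire format cannot silently drift.
    let cfg = config::standard();

    let entry = CacheEntry { key: "a".into(), payload: vec![1, 2, 3] };
    let bytes = bincode::serde::encode_to_vec(&entry, cfg).unwrap();
    let (decoded, _bytes_read): (CacheEntry, usize) =
        bincode::serde::decode_from_slice(&bytes, cfg).unwrap();
    assert_eq!(decoded, entry);
}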

When to reach for them

Use postcard when working in embedded Rust or any context where no_std support and varint encoding matter.

Use bincode for general-purpose Rust binary serialization where serde compatibility is the priority and cross-language use is not. New deployments should use bincode 2.0 with the default configuration.

Either is a reasonable choice for inter-process communication between Rust processes on the same machine, on-disk caches that the same Rust binary will read, and similar workloads.

When not to

Neither is the right choice when cross-language use is required; both formats are Rust-only in spirit and, for practical purposes, in implementation.

Neither is the right choice when schema evolution is a hard requirement; both formats are positional and provide no in-format mechanism for handling skew.

Neither is the right choice when long-term archival is the goal; the Rust-type-as-schema dependency makes archival fragile.

Position on the seven axes

Schema-required (the schema is the Rust type). Not self-describing. Row-oriented. Parse rather than zero-copy. Codegen via serde's proc macros. Deterministic by construction. No formal evolution mechanism.

Both formats sit in the same cell — Rust-first serde-binary encodings — with the difference being the embedded vs. general-purpose niche. Neither is a serious choice outside Rust.

A note on the broader serde-binary picture

Several other formats live in the same Rust serde-binary niche and deserve brief mention because they show how the niche has been explored differently.

ron (Rusty Object Notation) is a textual serde format that shares the data model but is not binary. It is included here only to clarify that ron is the "human-readable serde format" companion to bincode and postcard, not a binary alternative.

MessagePack via rmp-serde is the cross-language alternative. Rust projects that need to serialize through serde and also interop with non-Rust consumers typically use rmp-serde rather than bincode or postcard. The wire format is MessagePack (covered in chapter 4), not a Rust-specific encoding.

CBOR via ciborium and serde_cbor is the standardized alternative. Same logic as MessagePack: cross-language interop through a non-Rust-specific format, accessed through serde for Rust ergonomics.

speedy is a Rust binary format that prioritizes decode speed, with a custom (non-serde) derive macro. It produces smaller bytes than bincode 1.x and decodes faster than either bincode or postcard, but has a smaller user base.

abomonation is a Rust zero-copy format predating rkyv. It is strictly unsafe in the Rust safety sense — reading abomonation data requires the reader to trust the bytes — and has been largely superseded by rkyv (chapter 15) for the zero-copy use case.

The shape of the niche is therefore: bincode and postcard for Rust-only serde-binary, rmp-serde and ciborium for cross-language serde-binary, rkyv for zero-copy, and a long tail of specialized options. The choice is usually forced by one of the above constraints, and the format is then determined.

A note on schema-as-source-code

bincode and postcard share with rkyv the property that the schema is the source code. This is not unique to Rust — many language-specific binary formats do this — but Rust's strong type system and explicit derive macros make the source-as-schema approach particularly clean. The cost is that the format is unusable outside the language; the benefit is that schema changes are simply Rust changes, which the type system tracks and which the developer is going to make anyway.

The trade is sharper for the Rust-binary formats than for any other category in this book because Rust's type system is so strong. A type change in a Rust struct produces compile errors in every consumer in the same workspace; the formats inherit this property and let it serve the role of schema-evolution checking. Outside Rust, a similar pattern is achievable but less clean (Java's serialization, Python's pickle, Go's gob — all share aspects of source-as-schema, all are clunkier than the Rust equivalents).

The lesson is not that source-as-schema is universally good. The lesson is that it is particularly good in Rust, because the language's existing type machinery is strong enough to substitute for what other formats need a separate IDL to provide. The right question to ask before choosing a Rust-only binary format is whether forgoing cross-language use is acceptable for the workload; if it is, the source-as-schema approach is materially nicer than the alternatives.

Epitaph

postcard and bincode are Rust-flavored serde-binary formats: small, fast, deterministic, and shaped by the language's type system to the point of being effectively unportable.

Hessian

Hessian is the format that runs Dubbo, and through Dubbo it runs a substantial fraction of the inter-service traffic at the largest Chinese internet companies. Outside that context, Hessian is a historical curiosity from the Java RPC world of the early 2000s, preserved by inertia and by the specific operational characteristics of the Java enterprise ecosystem. Reading Hessian is the right way to understand a particular flavor of binary RPC that the Anglosphere mostly skipped, and a particular set of design choices that made sense in their original context but have not aged well.

Origin

Hessian was created by Scott Ferguson at Caucho Technology around 2000, as part of the Resin Java application server. Caucho was a significant player in the early-2000s Java application server space — Resin was a competitor to BEA WebLogic and IBM WebSphere — and Hessian was Caucho's answer to the question of how Java RPC should work. The competition at the time was Java RMI (Java's language-specific RPC protocol with a complex serialization format), SOAP-over-HTTP (the XML-based protocol with verbose envelopes), and the early CORBA implementations. Hessian's pitch was that Java RPC should be smaller than SOAP, simpler than RMI, and cross-language by design.

Hessian 1 was specified and released; Hessian 2 followed with significant wire-format improvements (more compact encodings, better support for object references and cycles, a more orthogonal type system). Hessian 2 is the version most production systems use. The format spec is a few pages of plain text, the implementations are small, and the cross-language story is reasonable in the languages that have implementations (Java, Python, JavaScript, C++, PHP, .NET, Ruby).

The format would have remained a footnote in Java RPC history if not for Apache Dubbo. Dubbo is a Java RPC framework originally developed at Alibaba in 2008, open-sourced in 2011, and donated to Apache in 2018. Dubbo uses Hessian as its default serialization format for inter-service messages, and Dubbo's adoption inside Alibaba and the broader Chinese internet ecosystem (JD.com, Pinduoduo, ByteDance, and many others) is enormous. Estimates put the message volume of Dubbo deployments at hundreds of billions per day. Most of that traffic is Hessian-encoded.

The non-Dubbo deployments of Hessian are smaller and aging. Some older Spring Java services use Hessian via Spring's HTTP invoker; some financial systems in the JVM ecosystem use it for legacy reasons; Caucho's own products continue to support it. None of these are growing.

The format on its own terms

Hessian 2's wire format is similar in shape to MessagePack and CBOR but with a Java-specific flavor: the type system reflects Java's, the encoding handles object references explicitly (so that a Java object graph with shared references is preserved on the wire), and field names are encoded as strings rather than integer tags.

Every value begins with a tag byte. The tag byte's bits encode both the type and, for short values, the value or its length. Hessian's tag space is compact: small integers (-16 through 47) encode as a single byte (0x90 plus the value); longer integers escalate through two- and three-byte compact forms to a full form of a tag byte plus four bytes. Strings have a similar tiered encoding: short strings (up to 31 characters) take a single tag byte plus the UTF-8 bytes (Hessian uses modified UTF-8, like NBT, which is a Java-ism); longer strings have a length prefix.

Maps and lists use a fixed-size encoding when the size is known at encode time and a stream encoding when it is not. The stream encoding is similar to CBOR's indefinite-length form: a start marker, the elements, and an end marker.

Object references are Hessian's distinctive feature. When an object is serialized, Hessian assigns it an integer reference and emits its fields. If the same object appears later in the stream — directly, or as a member of another object — Hessian emits a reference rather than re-serializing. This preserves shared references and handles cyclic graphs (a tree with a back-pointer, say) without infinite recursion.

The class definitions in Hessian 2 are emitted into the stream the first time an instance of the class appears. A class definition gives the class name and the ordered list of field names; subsequent objects of that class are emitted as a reference to the class definition followed by the field values in order. This is a hybrid between schemaless self-describing (the class definition is in the stream, so the stream is self-contained) and schema-required (the field order is positional within the class definition). The result is denser than MessagePack for streams with many objects of the same class — the field names amortize over instances — and more self-describing than Protobuf, since the class definition is in the bytes.

The data model is broadly Java-shaped: integers (with a 64-bit long type), doubles, strings (Unicode), booleans, null, lists, maps, and objects (with class definitions). There are also specific types for dates (encoded as 64-bit milliseconds since epoch) and binary blobs, and the class-definition machinery extends to arbitrary user-defined serializable types.

Wire tour

Encoding our Person record. The class definition is emitted first, followed by an instance:

43                                           class definition tag
06 50 65 72 73 6f 6e                         class name length 6, "Person"
96                                           field count: 6 (compact)
02 69 64                                       field 0 name: "id"
04 6e 61 6d 65                                 field 1 name: "name"
05 65 6d 61 69 6c                              field 2 name: "email"
0a 62 69 72 74 68 5f 79 65 61 72              field 3 name: "birth_year"
04 74 61 67 73                                 field 4 name: "tags"
06 61 63 74 69 76 65                           field 5 name: "active"

60                                           object instance (compact class reference, class 0)
ba                                             id: 42 (single-byte compact integer, 0x90 + 42)
0c 41 64 61 20 4c 6f 76 65 6c 61 63 65         name: short string len 12
15 61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                  email: short string len 21
cf 17                                          birth_year: int 1815 (two-byte compact form)
58 92                                          tags: fixed-length list, count 2 (0x90 + 2)
   0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician"
   0a 70 72 6f 67 72 61 6d 6d 65 72            "programmer"
54                                           active: boolean true

As with the Bond tour, the exact bytes are approximate; the shape is what matters. The class-definition portion is just under 50 bytes, dominated by the field name strings. The instance portion is just under 70. Total for a single Person: about 115 bytes, somewhat larger than MessagePack and substantially larger than Protobuf.

The class definition pays for itself when many instances of the same class are emitted in the same stream. For 1,000 Person records in a single Hessian stream, the class definition is emitted once (just under 50 bytes) and the instances each cost just under 70 bytes (no per-instance class definition, no per-instance field names). The amortized per-record cost approaches Protobuf's, though Protobuf would still be smaller in absolute terms for the same payload.

The object-reference encoding shines when the data has shared structure. If two Person records share a Department reference, Hessian emits the Department once and references it twice, saving the bytes of the duplicate. MessagePack and Protobuf have no analogous mechanism; they would emit the Department twice. For data with high sharing, Hessian's wire-size advantage can be substantial.

If email were absent, Hessian would emit a null marker (0x4e) in the email slot. The class definition still includes the field; the instance just has null where the value would have been. This is the same approach Avro takes with ["null", "string"] unions.

Evolution and compatibility

Hessian's evolution story is mediated by the class definition emitted in the stream. Adding a new field requires updating the class definition; readers that have a different class definition locally must reconcile somehow.

The conventional reconciliation is by name: the class definition gives field names, and the consumer matches them to its local type's fields. Fields in the stream that the consumer does not know about are skipped; fields in the consumer's local type that are not in the stream are filled with default values (null for references, zero for primitives). This is a graceful-evolution model and has been used effectively in long-running Dubbo deployments.

Renaming a field is a breaking change unless the consumer side implements field-name aliasing manually. Reordering fields in the class definition is permitted (the stream's class definition is authoritative), but this can cause subtle bugs if a consumer caches the class definition incorrectly. Type changes follow similar rules to Protobuf's: numeric promotions are usually safe, type changes within different categories (int → string) are not.

The deterministic-encoding question for Hessian is mostly unresolved. The format does not specify a canonical encoding subset. Object references can be assigned in any order; map ordering is unspecified; integer encodings have multiple valid forms. Two Hessian encoders producing the same logical value can produce different bytes. Applications that need byte-equality have to canonicalize separately, as with Protobuf and Thrift.

Ecosystem reality

The Hessian ecosystem is bimodal. Inside the Dubbo-using ecosystem (which is to say, inside the Chinese internet at scale), Hessian is canonical, the implementations are mature, and the format has been hardened by a decade and a half of high-volume production. The Java implementation in Dubbo is the canonical one; bindings exist in many languages but are typically thinner.

Outside Dubbo, Hessian appears in older Java systems (legacy Spring HTTP invokers, Caucho Resin deployments, older financial systems), and in a small number of cross-language tools that adopted it because Hessian was the path of least resistance for talking to a Dubbo service.

The cross-language story varies by language. Python's python-hessian and pyhessian libraries exist and are maintained. JavaScript has implementations of varying completeness. C++ and .NET implementations exist but are second-class. Ruby's implementation is limited.

The most consequential ecosystem fact about Hessian is that it is closely associated with Dubbo, and Dubbo's support for multiple serialization formats gives its users an exit path. Dubbo can use Hessian, Protobuf, JSON, or several others as its serialization layer; the choice is per-service, and newer Dubbo deployments have been slowly migrating toward Protobuf for the ecosystem reasons that make Protobuf win in most other places. New Dubbo services in 2026 often use Protobuf rather than Hessian, even though Hessian remains the default.

Two ecosystem gotchas are worth noting: the modified-UTF-8 issue (the same Java-ism as NBT's, producing bytes that don't round-trip through standard UTF-8 implementations) and class-definition caching (consumers that cache class definitions across connections can get out of sync if a producer evolves the class). Both are well-known in the Dubbo community and have documented workarounds.

When to reach for it

Hessian is the right choice when interoperating with Dubbo or with an existing Hessian-using Java system. It is rarely the right choice for new systems outside that context.

It is a defensible choice when the object-reference preservation is a hard requirement and the alternative would be to manually deduplicate references in the application code; few formats handle shared references the way Hessian does.

When not to

Hessian is the wrong choice for new microservices in greenfield environments. Protobuf, gRPC, and the modern alternatives have better tooling, better cross-language support, better breaking-change detection, and broader community knowledge.

It is also the wrong choice when ecosystem maturity matters; the non-Java implementations are functional but not first-class.

Position on the seven axes

Schemaless-with-class-definitions-in-stream (a hybrid). Self-describing (the class definitions are in the stream). Row-oriented. Parse rather than zero-copy. Codegen via class definitions, with runtime support for dynamic types. Non-deterministic by spec. Evolution by name-matching across class definitions, with graceful skipping of unknown fields and defaulting of missing ones.

The cell Hessian occupies — in-stream class definitions as schema, object-reference-aware, Java-flavored — is unusual; that combination of design choices is what distinguishes Hessian from MessagePack, CBOR, and the schema-first formats. The distinctiveness is real but has not produced enough leverage to overcome the broader ecosystem gravity.

A note on the object-reference mechanism

Hessian's handling of object references deserves a longer look, because it is the design feature that most distinguishes Hessian from the other formats in this book and the one that maps least well onto modern serialization patterns.

The mechanism, in plain terms: when an object is serialized, it is assigned an integer index. The first time the object appears in the stream, its full encoding (class definition reference plus field values) is emitted. Each subsequent appearance of the same object is emitted as a reference to the integer index. The decoder maintains a table of seen objects and resolves references on demand.
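A sketch, in Rust, of the decoder-side state the mechanism implies; the types are hypothetical and the wire details are elided:

// Hypothetical decoded-value type; a real Hessian decoder's is richer.
#[derive(Clone, Debug)]
enum Value {
    Int(i64),
    Str(String),
    Object { class: usize, fields: Vec<Value> },
    Ref(usize), // back-reference to the Nth previously decoded object
}

struct Decoder {
    // Grows for the lifetime of the stream. This table is the reason a
    // Hessian stream cannot be read from the middle without replaying
    // the prefix that populated it.
    refs: Vec<Value>,
}

impl Decoder {
    fn record(&mut self, v: Value) -> usize {
        self.refs.push(v);
        self.refs.len() - 1
    }

    // One level of resolution; a full decoder must also handle cycles.
    fn resolve(&self, v: &Value) -> &Value {
        match v {
            Value::Ref(i) => &self.refs[*i],
            other => other,
        }
    }
}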

This works for shared references and cyclic graphs. It can also deduplicate repeated emission of the same object, provided the encoder's identity tracking sees the two appearances as the same object; in Java that means the same reference, not merely equal values, so how much deduplication happens depends on how the object graph was built.

The cost of this is operational. The encoder must track every serialized object in a table, which grows unboundedly with the size of the stream. For very large streams, the table memory is a real concern. Decoders similarly must allocate a table for resolution. Streams that use heavy reference sharing benefit; those that don't pay the table overhead for nothing.

The other cost is that Hessian's bytes are stateful. A reference to object index 5 means nothing without knowing what object indices 0 through 4 were. Streams cannot be trivially partitioned; a consumer that wants to read the second half of a Hessian stream must replay the first half to populate the reference table. This is awkward for several modern access patterns (random access, parallel processing, lazy decoding) and is one of the reasons Hessian has not spread to those use cases.

Modern formats handle reference sharing differently. Protobuf does not preserve references at all; if a producer has two fields that point to the same object, both fields are serialized in full, with no de-duplication. Applications either accept the duplication or de-duplicate at the application layer. This is less efficient on the wire but simpler operationally.

The trade Hessian made was right for the Java RPC context of 2000-2010, where typical payloads were small enough that the table overhead was negligible and shared references were common in domain models. The trade ages less well in 2026, where streams are bigger, parallel processing is more common, and the operational cost of stateful encodings is more painful. Reading Hessian's design today is the right way to understand a feature that was reasonable in its time and that the rest of the format landscape has chosen not to replicate.

Epitaph

Hessian is the Java-flavored binary RPC format that runs Dubbo and not much else; technically interesting in its handling of object references and class definitions, ecosystem-bound to a context most engineers outside Asia rarely encounter.

Decision Frameworks

A book that has spent twenty-six chapters describing formats should not now resolve them into a flowchart that says "for X problem, use Y format." That kind of recommendation is what most articles about serialization formats produce, and it is wrong often enough that the genre is mostly noise. The decision is not amenable to flowcharting. The decision is amenable to structured questions you ask yourself in a useful order, where the answer to each question narrows the candidate set and clarifies what the remaining trade is. This chapter is that set of questions.

The order matters. The first questions are the ones whose answers disqualify the most candidates; later questions discriminate among the survivors. The questions are not all the same kind: some are about the workload (how many records, how often, how big), some are about the deployment topology (who controls what, what gets upgraded together), some are about the team (what languages, what operational appetite), some are about the data itself (what types matter, how much nesting). All of them are important, and asking them in the wrong order leads to wrong answers.

Question 1: Are you crossing a process, machine, or time boundary?

This is the question chapter 1 introduced and that the rest of the book has reinforced. The boundary determines which formats are even candidates.

For a process boundary (two parts of one running program, two processes on one machine, one process talking to itself across threads), the relevant axis is encode/decode latency, not size. Zero-copy formats are first-class candidates here: FlatBuffers, Cap'n Proto, rkyv, Apache Arrow IPC. Parse-style formats also work but rarely justify their parse cost.

For a machine boundary (one host to another, possibly different architectures, possibly different language runtimes), the relevant axes are size, schema portability, and encode/decode speed in roughly that order. Schema-first formats (Protobuf, Avro, Thrift, Bond) are first-class candidates. Self-describing formats (MessagePack, CBOR) are candidates when schemalessness is preferred. Zero-copy formats are candidates if the bytes are loaded into memory and accessed many times before being discarded; they are usually not candidates when the bytes are written-once-read-once.

For a time boundary (bytes written today, read in years), the relevant axes are schema durability, format stability, and the availability of decoders into the indefinite future. ASN.1 has the longest track record. Avro's Object Container Files put the schema in the file. Parquet's columnar layout with metadata is designed for archival. Most of the schemaless formats are also defensible because they carry their own type information. Schema-required formats without schema-with-data (Protobuf, Cap'n Proto, FlatBuffers) require an out-of-band schema archival plan that is often missing.

The boundary question disqualifies whole families. If you are crossing a time boundary, you should not be using FlatBuffers without an explicit plan to archive the schema for as long as the bytes need to be read. If you are crossing a process boundary, you should not be using JSON unless you have measured the cost and decided to pay it.

Question 2: Who controls the schema, and how often does it change?

The answer is one of: I control both producer and consumer, and I can deploy them together; I control both but they deploy independently; I control one side but not the other; I control neither.

If you control both sides and deploy them together, you have no schema-evolution problem. Almost any format works. Choose on other axes (size, speed, ergonomics).

If you control both sides but they deploy independently (microservices in heterogeneous deployment), you have a schema-evolution problem. The format choice should be driven by which evolution model fits your deployment topology. Tagged-field formats (Protobuf, Thrift) are easier on small teams that can follow conventions; resolution-based formats (Avro) are stronger when a registry can enforce compatibility automatically. Both require operational discipline that schemaless formats do not.

If you control one side but not the other (a public API, an SDK consumed by external clients), the format must be resilient to heterogeneous client implementations. Self-describing formats work well; schema-first formats with mature codegen work also. The choice often falls on whichever format has the broadest client library coverage in the languages you expect clients to use.

If you control neither side (interop with an external standard), the format is determined for you. Read the spec; build to it.

Question 3: What is the workload's read/write ratio?

A serialization format's costs are paid asymmetrically. Encoding costs are paid by writers; decoding costs are paid by readers; storage costs are paid wherever the bytes live. Different formats shift these costs differently.

If reads dominate writes by orders of magnitude, zero-copy formats become attractive. The decode cost amortizes to nothing; the encode cost is paid once. FlatBuffers and Cap'n Proto are designed for this. Apache Arrow IPC is too, in the in-memory case. Even Parquet's metadata-driven access pattern is a kind of read-favoring optimization.

If writes dominate reads, the encode cost matters more. Compact varint formats (Protobuf, MessagePack) are competitive here; zero-copy formats are penalized because their alignment constraints add bytes that the writer pays to lay down.

If read and write rates are roughly equal, the formats with balanced costs (Avro, MessagePack, Protobuf) are usually the right answer. The choice between them is on other axes.

For analytical workloads with heavily skewed read patterns (many queries reading small column subsets out of huge tables), columnar formats win unambiguously: Parquet at rest, Arrow in flight.

Question 4: Does the workload require deterministic encoding?

Determinism — the property that the same value always encodes to the same bytes — is required when bytes are signed, hashed, or content-addressed. Cryptographic protocols, blockchain consensus, content-addressable storage, deterministic builds, and signature-verification all need it.

If determinism is required, the candidate set narrows substantially. ASN.1 DER is fully deterministic. CBOR has a deterministic encoding subset. Borsh, SCALE, and SBE are deterministic by construction. Cap'n Proto has a canonical encoding mode. Protobuf, Thrift, MessagePack, Avro, BSON, FlatBuffers, and the columnar formats are not deterministic by spec; making them deterministic requires careful canonicalization that is often not worth the effort.

If determinism is not required, this question disqualifies nothing.

Question 5: What is the language landscape?

Some formats are language-portable; some are not. The format choice must include the question of whether good implementations exist in every language that needs to consume the bytes.

Formats with broad language support: Protobuf, JSON Schema, MessagePack, CBOR, BSON, Avro. These have first-class implementations in most languages used in production.

Formats with moderate language support: FlatBuffers, Cap'n Proto, Thrift, Parquet, Arrow. These have good support in major languages and weaker support in long-tail languages.

Formats with single-language or single-ecosystem support: rkyv (Rust), bincode (Rust), postcard (Rust), Smile (JVM), Hessian (JVM-centric, with thin support elsewhere), NBT (Minecraft ecosystem).

If your language landscape is broad, choosing a single-language format imposes a translation layer at every boundary. If your landscape is narrow, the single-language format may give better ergonomics than a portable one.

Question 6: Does the data have structural features that constrain the choice?

Some data shapes are awkward in some formats. Worth checking explicitly:

  • Heavily nested structures: Parquet, ORC, and Cap'n Proto have substantial machinery for nesting. FlatBuffers' nesting story is workable but uses tables-within-tables in ways that add overhead. NBT's nesting is uniform and clean. Schemaless formats handle nesting trivially.
  • Cyclic references: most formats cannot represent them directly. Hessian preserves them via its object-reference mechanism. Most others require flattening at the application layer.
  • Heavy repeated string columns: dictionary-encoded columnar formats (Parquet, ORC) handle this dramatically better than any row-oriented format. Smile's key sharing helps in the schemaless case.
  • Floating point with NaN preservation: most formats preserve NaN bit patterns; some (older JSON-based encodings) do not. Check if it matters.
  • Arbitrary-precision integers and decimals: CBOR, Ion, ASN.1, and a few others handle these natively. Most formats do not and require encoding-as-strings or as-bytes.
  • Optionality at scale: formats vary in how cheaply they encode "field absent." Avro, Hessian, and the schemaless formats are cheapest. Borsh, postcard, and Bond sit in the middle. SBE is the most expensive (it requires sentinel values).

If your data has one of these features and the candidate format handles it badly, the workaround usually becomes a permanent operational cost.

Question 7: What is the operational appetite of the team?

This is the question most often unasked. Schema-evolution-aware formats require operational infrastructure: a schema registry, a breaking-change detector, a build process that regenerates code on schema changes. Zero-copy formats require alignment discipline. Columnar formats require batching infrastructure. Determinism requires either a format that provides it by spec or a canonicalization layer the team builds and maintains.

If the team has the appetite for the surrounding infrastructure, formats that demand it (Protobuf with Buf, Avro with Confluent Schema Registry, Parquet with table format projects) work well. If the team does not, formats that demand less infrastructure (MessagePack, CBOR, JSON) work better despite their other drawbacks.

The single most common cause of long-term format-choice regret is that a team chose a format whose operational requirements exceeded what the team was prepared to build, and so the format runs without its intended supports and produces the bugs the supports were meant to prevent.

A decision matrix

Putting the questions together, the typical paths look like:

Workload                                    Recommended          Backup choice
Internal microservice RPC, polyglot         Protobuf + gRPC      Avro + registry
Internal microservice RPC, single-language  rkyv (Rust) or Bond  Protobuf
Public API, broad clients                   JSON                 Protobuf with JSON
Streaming events with schema registry       Avro                 Protobuf
Analytical at-rest                          Parquet              ORC
Analytical in-memory interchange            Arrow IPC            Feather V2 (= Arrow)
Game asset loading                          FlatBuffers          rkyv (Rust-only)
Hot-path RPC with read-dominance            Cap'n Proto          FlatBuffers
Cryptographic signing                       ASN.1 DER            CBOR canonical
Blockchain on-chain encoding                Borsh / SCALE        Custom format
Embedded telemetry                          postcard             CBOR
Configuration and human edits               not binary           Protobuf prototext
Document database                           BSON                 (avoid; use a database)
Telecom protocol                            ASN.1 PER            (no good alternative)
Robotics middleware                         ROS msgs (CDR)       (no alternative)

The matrix is a starting point, not an answer. Read the relevant chapters; weigh the trade-offs against your specific context.

A few principles to internalize

A handful of cross-cutting observations from the book worth pulling out:

The format is an answer to a question; choose the format whose question you are actually asking. If the question is "how do I encode my Rust types densely for a single-machine cache," postcard is right and Avro is wrong. If the question is "how do I move records between heterogeneous services with strong evolution guarantees," Avro is right and postcard is wrong. The questions are not interchangeable.

Operational maturity outranks technical merit. Bond is slightly better-designed than Protobuf in several specific ways and has lost decisively. The reason is that Protobuf has better tooling, better documentation, broader community knowledge, and better integration with the surrounding stack. The same calculus applies to most format choices: the format with the larger ecosystem and better tools usually wins, regardless of whether it is technically superior.

Schemaless is not free. Schemaless formats look like they let you skip the work of defining a schema. They do not. The schema still exists; it is in the application code, distributed across producers and consumers, undocumented and unenforced. Schemaless formats are the right choice when the schemalessness is what you actually want; they are the wrong choice when you wanted a schema but didn't want to do the work.

Zero-copy is a constraint, not a feature. Zero-copy formats trade encoding flexibility for read latency. The trade is right for some workloads and wrong for most. Choose zero-copy when the read latency dominates your performance budget, not because the marketing materials claim speed.

Determinism is a feature you must choose explicitly. Most formats are non-deterministic by default. If you need deterministic encoding, choose a format that provides it by spec or commit to building canonicalization yourself. Both are real costs; pretending otherwise leads to silent hashing bugs years later.

Almost all benchmarks lie. The next chapter (Anti-Patterns) covers the specific ways. The general advice: do not let published benchmarks make your format choice.

A worked decision example

Concretizing the framework on a hypothetical: a team is building a new event-sourcing system for a financial-services back office. Events are written to durable storage, replayed for state reconstruction, and audited periodically. Producers and consumers are independently deployable services in Java and Python. Throughput is moderate; latency is not the binding constraint; audit and replay must work indefinitely.

Question 1: this is primarily a time boundary problem. The events live for years. Schema-with-data is essential. ASN.1, Avro container files, and Parquet are first-class candidates. Schema-required formats without in-band schema (Protobuf, Cap'n Proto) require an explicit schema-archival plan.

Question 2: producer/consumer schema control is split, with independent deployment. Schema evolution is a real concern. Avro with a registry and tagged-field formats with discipline both work.

Question 3: read/write ratio is heavily read-dominated (events written once, read on every replay). Zero-copy is not relevant because the events are batched into a stream. Compact encoding matters because storage cost compounds.

Question 4: determinism is not required (events are identified by ID, not by content hash).

Question 5: language landscape is Java + Python — both well supported by Avro and Protobuf, less well by some of the others.

Question 6: events are flat-ish records with occasional nesting. No special structural features.

Question 7: operational appetite for a registry exists (the audit requirement justifies the investment).

The recommendation falls out: Avro with the Confluent Schema Registry, stored as Avro Object Container Files. The schema is in the file (good for the time boundary), evolution is registry-enforced (good for the deployment topology), the Java and Python implementations are mature, and the storage cost is competitive.

Note that the recommendation is not "the best format" in the abstract. It is the format whose answers to the seven questions match the team's situation. A different team in a different context (microsecond-latency RPC, single-language deployment, no registry tooling) would land on a different format.

On the cost of revisiting the choice

A final observation worth stating explicitly: the cost of choosing the wrong format is mostly paid at migration time. Production data in a binary format is a substantial commitment; moving it to a new format requires either dual-writing (both formats during a transition window), translating (running producer-side and consumer-side translation code, adding latency), or freezing (stopping the migration when "good enough"). All three are expensive, and all three accumulate their own bugs.

The right time to make the format choice carefully is before production data starts piling up. Once you have a petabyte of Parquet, you are not switching to Arrow's IPC file format casually, even though Arrow IPC is structurally similar. Once you have a fleet of services using MessagePack, you are not switching to Protobuf without a coordinated effort. Once you have a Hive metastore full of ORC files, you are not switching to Parquet without a migration project.

The book's working assumption is that you are reading this before you have made the choice, or while you are considering whether the existing choice is right for the next phase of your system. The decision framework is meant to be used in that window. Outside it, the framework is mostly aspirational; the costs of switching usually outweigh the costs of staying.

Anti-Patterns

This chapter is a catalog of the ways teams misuse binary serialization formats in production, drawn from observation of many systems and from the consistent shapes of the resulting incidents. The catalog is not exhaustive; new anti-patterns emerge as systems evolve. But the patterns described here recur often enough that recognizing them in your own systems is disproportionately valuable. Each entry describes the pattern, explains why it is tempting, and names the failure mode it produces.

The patterns are presented roughly in order of frequency, with the most common first. They are not equally severe; some produce gradual technical debt, while others produce specific incidents that page on-call. Both kinds matter.

Reusing a field number

The single most common Protobuf anti-pattern is reusing a field number that was previously assigned to a different field. The pattern: a field is added with number 5, deployed, used for a year, then removed when its purpose is obsolete. Some time later, a developer adds a new field, looks at the schema, sees that 5 is "available" (not currently used), and assigns the new field number 5.

Why it is tempting: the old field is gone, the schema looks like 5 is free, the developer wants the next sequential number rather than skipping ahead. The schema looks fine.

The failure mode: production data written under the old schema still has field-5 bytes in some buffers. Consumers under the new schema will decode those bytes as the new field's type. If the types are wire-compatible (string and bytes are both length-delimited; varint and fixed64 are not), the decode succeeds and produces garbage. If the types are wire-incompatible, the decode fails with a parse error that sometimes looks like network corruption. Either way, the bug appears intermittently and is hard to localize because the byte history of the field is not visible from the schema.

The fix: the reserved keyword in Protobuf, the equivalent discipline in Thrift, the schema registry's compatibility check in Avro. Use Buf or its equivalent to detect this at PR time. The problem is not in the format; it is in the lack of enforcement infrastructure.
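
To make the fix concrete, here is a minimal .proto sketch (message and field names are invented for the example). Once the number and the name are reserved, protoc rejects any future field that tries to claim them:

```protobuf
syntax = "proto3";

message Order {
  // Field 5 used to be `string coupon_code`, removed long ago. Reserving
  // both the number and the name makes the compiler reject any reuse.
  reserved 5;
  reserved "coupon_code";

  string id = 1;
  int64 total_cents = 2;
}
```

A breaking-change detector such as Buf goes one step further: it compares the schema against its previous version and flags a deleted field that was not reserved, so the discipline does not depend on the developer remembering.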

Treating optional fields as required

The pattern: a field is declared optional in the schema (or, in Protobuf 3 before the optional keyword was restored, scalar fields are implicitly optional with default values), but the application code reads the field and assumes it is always present. Tests pass because the test data always has the field. Production breaks when a producer omits the field.

Why it is tempting: defining a field as required makes the schema document the invariant. But Protobuf 3 dropped required for good reasons (Chapter 11), so there is no schema-level way to express "this must be present"; the application is supposed to enforce it. Many do not.

The failure mode: consumers crash, return wrong defaults, or silently skip records when the field is absent. The producer that omitted the field may be operating correctly per the schema; the consumer is the one with the bug. But the bug manifests at the consumer, and the diagnosis often blames the producer.

The fix: enforce required-ness in application code at the boundary. Document it in the schema's comments. If you are crossing a schema boundary where required-ness is contractual, use a format (Thrift, Avro with explicit non-nullable types) that expresses required-ness at the schema level.
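
A sketch of what boundary enforcement can look like in Python, assuming a hypothetical generated module order_pb2 whose proto3 schema declares `optional string customer_id = 3;`:

```python
# Boundary check for a field the application treats as required.
# For proto3 `optional` scalars, HasField distinguishes "absent" from
# "present but equal to the default value".

def require_customer_id(order) -> str:
    if not order.HasField("customer_id"):
        raise ValueError("producer omitted customer_id; rejecting at the boundary")
    return order.customer_id
```

The point of putting the check at the boundary is that the failure is explicit, attributable, and visible in the consumer's error budget, instead of surfacing later as a wrong default deep in business logic.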

Hashing non-canonical encodings

The pattern: an application needs to hash or sign serialized bytes for caching, deduplication, or signature purposes. The chosen format does not specify a canonical encoding, but the application hashes the bytes anyway, on the assumption that "the bytes are the bytes."

Why it is tempting: hashing bytes is cheap and obvious. The nuance about canonical encoding is not always documented at the top of the format's spec.

The failure mode: two producers in different languages (or different versions of the same library) produce different bytes for the same logical value. The hashes differ. Cache hit rates collapse, deduplication breaks, signatures fail to verify across producer changes. The bug typically appears after a library upgrade or a producer migration, often months after the original deployment.

The fix: use a format that specifies canonical encoding (CBOR canonical, ASN.1 DER, Borsh, SCALE, SBE). If you must use a non-canonical format, hash the value, not the bytes — typically by canonicalizing first (sorting map keys, normalizing representations) before serialization, then hashing.
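
A minimal sketch of hashing the value rather than the bytes, using sorted-key JSON as the stand-in canonical form; a production version would also pin down float and Unicode normalization:

```python
import hashlib
import json

def logical_hash(value) -> str:
    # Canonicalize first: key order and whitespace no longer depend on
    # which library, language, or insertion order produced the value.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same logical value, different construction order, same hash.
assert logical_hash({"a": 1, "b": 2}) == logical_hash({"b": 2, "a": 1})
```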

Choosing format based on benchmarks

The pattern: a team chooses a format based on a published benchmark. The benchmark says format X is fastest. The team adopts X. Production reveals that X is not fastest in their specific workload, or that the speed advantage is dwarfed by other factors.

Why it is tempting: benchmarks are concrete. They produce numbers. Numbers feel objective. A 30% speed advantage in a published benchmark is hard to argue with.

The failure mode: the published benchmark's workload differs from yours. The format authors' implementations are tuned in ways the benchmark exercises and your code does not. The compression codec, the buffer sizes, and the message shapes all matter, and the benchmark fixed them at values that are not yours. The result is a format choice based on a number that does not predict production behavior, with the operational costs of the chosen format paid in full while the speed benefit is partial or absent.

The fix: benchmark with your own workload, on your own hardware, with your own access patterns, before making a decision. Benchmark also after the fact, to verify the choice was right. Treat published benchmarks as input to your hypothesis, not as the conclusion.
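
The shape of such a measurement is small. A sketch, with msgpack as one stand-in candidate and a stub in place of real recorded traffic:

```python
import json
import time

import msgpack  # assumption: msgpack-python installed, as one candidate


def load_production_sample():
    # Stand-in: in a real run, replay recorded production messages here,
    # not a synthetic shape the benchmark author invented.
    return [{"user_id": i, "event": "click", "ts": 1_700_000_000 + i}
            for i in range(1_000)]


payloads = load_production_sample()

def decode_time(encode, decode, rounds=50):
    blobs = [encode(p) for p in payloads]
    start = time.perf_counter()
    for _ in range(rounds):
        for blob in blobs:
            decode(blob)
    return time.perf_counter() - start

print("json   ", decode_time(lambda p: json.dumps(p).encode(), json.loads))
print("msgpack", decode_time(msgpack.packb, msgpack.unpackb))
```

The harness itself is trivial; the value is entirely in the `payloads` line.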

Re-encoding through JSON

The pattern: an application has data in some binary format. To work with it, the application decodes to language-native objects, encodes to JSON, parses the JSON, and uses the parsed result. The JSON step is in the middle for no good reason, often as a side-effect of using a JSON-centric library that happens to also support the binary format.

Why it is tempting: JSON is what most application code expects. Working in JSON-shape internally is comfortable.

The failure mode: every JSON round-trip loses something — integer precision (above 2^53), byte strings (which JSON cannot represent), distinct types (number vs. string vs. bool), and floating-point edge cases. The application accumulates bugs at the JSON boundary, often subtly. The performance is dramatically worse than direct binary-to-binary work.

The fix: do not re-encode through JSON. Use libraries that deserialize directly into your domain types from the binary format, and serialize directly back out without an intermediate text representation. This is more work in the short run and saves much pain over time.
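
The losses at the JSON boundary are easy to demonstrate. A Python sketch (the bytes problem appears immediately here; the 2^53 integer problem appears in runtimes whose numbers are IEEE-754 doubles):

```python
import base64
import json

record = {"id": 2**60, "payload": b"\x00\xff\x10"}

try:
    json.dumps(record)
except TypeError as e:
    print(e)  # bytes are not representable in JSON at all

# The usual workaround, base64, silently turns bytes into str. After a
# round trip, a consumer can no longer tell encoded bytes from text.
encoded = {"id": record["id"],
           "payload": base64.b64encode(record["payload"]).decode("ascii")}
back = json.loads(json.dumps(encoded))
print(type(back["payload"]))  # <class 'str'>, not bytes
```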

Storing config in binary

The pattern: an application uses Protobuf (or any binary format) for its inter-service messages. The same format is then used for the application's configuration files. The config is now in binary (opaque to cat and grep) or in prototext (readable but arcane); either way, it cannot be confidently edited by anyone without the right tooling.

Why it is tempting: the format is already in the codebase. Using it for config eliminates a dependency on a separate config language. The schemas can be reused.

The failure mode: operations engineers cannot inspect the config without specialized tools. Configuration changes are harder to review (PRs show binary diffs or arcane prototext). Debugging a production issue at 3 a.m. is harder. Migration to a new tool chain becomes a project rather than a script.

The fix: use a text-based config language (YAML, TOML, JSON, HCL) for human-edited configuration. If you want type-safe configs, use a schema language whose binding generates the config-validation code without forcing the config bytes to be binary.

Premature columnar

The pattern: a team chooses Parquet or Arrow for data that is not columnar in any meaningful sense. Single-record events, small JSON-shaped messages, and transactional records that are written and then updated all end up in Parquet because "Parquet is the modern standard."

Why it is tempting: Parquet has good ecosystem support. Tools exist for everything. The team has heard that Parquet is efficient.

The failure mode: Parquet's overhead — metadata, row group boundaries, page headers, footer with statistics — dominates the file size for small or transactional payloads. A million tiny Parquet files use more storage than a million tiny JSON files once metadata is counted. Reads are slower because the metadata-driven access pattern that makes Parquet fast on analytical queries does nothing for transactional ones. Writes are awkward because Parquet is batch-oriented and individual rows are not its unit of work.

The fix: pick the row-oriented format for row-oriented workloads. Use Parquet for analytical workloads where columnar layout genuinely matters, and not for everything else.
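
The overhead is easy to measure for yourself. A sketch assuming pyarrow is installed (exact byte counts vary by library version):

```python
import json
import os

import pyarrow as pa
import pyarrow.parquet as pq

row = {"user_id": 42, "event": "click", "ts": 1_700_000_000}

pq.write_table(pa.Table.from_pylist([row]), "one_row.parquet")

print("parquet:", os.path.getsize("one_row.parquet"), "bytes")
print("json:   ", len(json.dumps(row)), "bytes")
# The Parquet file carries a schema, row-group and page headers, column
# statistics, and a footer: machinery that pays off only over many rows.
```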

The polyglot premature optimization

The pattern: a team is building a single-language application (say, in Rust). They choose a cross-language format (Protobuf or Avro) on the assumption that someday the system might need to talk to a different language. That day does not come. The cost of the cross-language format — the codegen step, the schema files, the build complexity — is paid forever for benefits never realized.

Why it is tempting: future-proofing feels prudent. Choosing a cross-language format is the conservative choice.

The failure mode: the codegen and IDL machinery slow the team down on every schema change. The advantages of the host language's type system (strong types in Rust, classes in Java, named tuples in Python) are not fully usable because the generated types stand between the application and the data. The ergonomics of the system suffer for a feature that is not used.

The fix: if cross-language interoperability is not a current requirement, use the host language's native serialization (rkyv or postcard for Rust, Java's serialization or Kryo for Java, pickle or msgspec for Python). The cost of switching to a cross-language format if needed is real but bounded; the cost of using one when not needed is paid forever.

Failing to archive the schema

The pattern: a team uses Protobuf, FlatBuffers, or Cap'n Proto for archival data. The bytes are written to durable storage. The schema is in a Git repository that gets reorganized, renamed, or deleted years later. The bytes still exist; the schema does not.

Why it is tempting: the schema lives in the source tree. The source tree feels permanent.

The failure mode: years later, the bytes are unreadable. The team that wrote them has moved on. The Git repository has been restructured. The compiler is several versions newer and generates different code from the same schema. The data is, for practical purposes, lost.

The fix: archive the schema with the bytes. Avro Object Container Files do this automatically. Parquet does it via the file footer. Other formats require explicit operational discipline: copy the schema to a write-once archive every time it changes, version the archive, document the recovery procedure. The format will not remind you.
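
For contrast, a sketch of what schema-with-data looks like in practice, using the fastavro library:

```python
import fastavro  # assumption: fastavro installed

schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "kind", "type": "string"},
    ],
}

with open("events.avro", "wb") as out:
    fastavro.writer(out, schema, [{"id": 1, "kind": "click"}])

# Years later, the file alone is enough: the writer schema is in the header.
with open("events.avro", "rb") as f:
    print(fastavro.reader(f).writer_schema)
```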

Schema in code only

The pattern: a team uses a serde-binary format (bincode, postcard, rkyv, or Java serialization) for their wire format. The schema is the source code. There is no separate IDL, no schema document, no contract artifact.

Why it is tempting: schema-as-code is convenient. The type system enforces correctness within the language. Why have a separate schema?

The failure mode: any non-source consumer (a debugging tool, a data-pipeline migration, a forensic investigation, a different language client added later) cannot decode the bytes without having the source code. The source code is sometimes available; sometimes the developer who wrote it has moved on; sometimes the source has evolved and the old bytes are incompatible. The schema is undocumented, and "the schema is the code" is a sentence that means "the schema is uncomputable from the bytes."

The fix: if you want schema-as-code, accept that the format is single-language and the bytes are not interpretable without the source. If that is acceptable, the choice is fine. If it is not acceptable, use a format with an explicit IDL.

Map ordering assumed stable

The pattern: an application encodes a map (Java HashMap, Python dict, Rust HashMap, Go map) into a format that does not specify map ordering. The application then assumes the bytes are stable across encodes — for testing, for caching, for hashing. Different runs produce different bytes for the same data, and the assumption fails.

Why it is tempting: in some language runtimes the iteration order of a map is almost stable (Go's intentional randomization notwithstanding); developers extrapolate from "almost stable" to "stable" without verifying.

The failure mode: tests that compare byte outputs flake. Caches keyed on hashes miss intermittently. Two replicas produce different bytes for the same data, and reconciliation logic trips.

The fix: use ordered map types (Rust's BTreeMap, Java's LinkedHashMap or TreeMap, Python's insertion-ordered dict in 3.7+) or iterate in sorted order when emitting, and document the ordering as a contract. For deterministic encoding, sort keys before emitting.
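
A sketch of the failure and the fix, assuming the msgpack-python library as the encoder:

```python
import msgpack  # assumption: msgpack-python installed

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}          # same logical map, different insertion order

print(msgpack.packb(a) == msgpack.packb(b))   # False: order leaks into bytes

def canonical_packb(d: dict) -> bytes:
    # Sort keys before emitting; dict preserves the sorted order (3.7+).
    return msgpack.packb(dict(sorted(d.items())))

print(canonical_packb(a) == canonical_packb(b))  # True
```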

Forgetting that compression sits underneath

The pattern: a team compares formats on uncompressed size and chooses the smallest. The deployment uses compression at the storage or transport layer. After compression, the size differences are smaller or reversed.

Why it is tempting: format size is what marketing materials publish. Compression is "an implementation detail."

The failure mode: the format choice was driven by uncompressed density. The actual on-disk or on-wire bytes after gzip/snappy/zstd are sometimes within a few percent across formats. The format choice solved the wrong problem; the team paid the operational cost of the format for a benefit that compression would have provided anyway.

The fix: measure the post-compression size if compression is in your stack. The format choice should be based on the bytes that actually reach storage or wire, not the bytes the format produces before compression.
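
A sketch of the measurement, with zlib standing in for whatever codec your stack actually uses:

```python
import json
import zlib

import msgpack  # assumption: msgpack-python installed

records = [{"user_id": i, "event": "click", "ts": 1_700_000_000 + i}
           for i in range(10_000)]

candidates = {
    "json": json.dumps(records).encode("utf-8"),
    "msgpack": msgpack.packb(records),
}
for name, raw in candidates.items():
    compressed = zlib.compress(raw, level=6)
    print(f"{name:8} raw={len(raw):>9,}  compressed={len(compressed):>9,}")
# The raw gap between formats is large; the compressed gap is usually
# much smaller. Measure on your data, with your codec, before deciding.
```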

A meta-pattern: the wrong-question format choice

The patterns above are mostly specific. They have one thing in common: each starts with the team using a format whose answer to "what should you optimize for" does not match the question the team is actually asking. Protobuf optimizes for cross-language typed serialization with tagged-field evolution. If you are asking how to encode your Rust types densely for a single-machine cache, Protobuf is the wrong answer to the right question, even though it would be the right answer to a different question.

The recurring fix is therefore to ask, before choosing or defending a format, what question is this format an answer to? Then ask whether that is the question your system is actually asking. If it is not, the format is going to fight you, often in ways that don't manifest until production data has accumulated. If it is, the format will mostly cooperate.

This is the meta-pattern that ties the chapter together. Each specific anti-pattern is a manifestation of choosing a format whose intended question does not match the team's actual one. The next chapter, on migration paths, describes what to do once you have made one of these mistakes and need to walk back from it.

Migration Paths

The chapters before this have been mostly forward-looking: what to choose, what to avoid, why one format suits a workload better than another. This chapter is backward-looking. You have a system that already uses some serialization format. The format is wrong for where the system is going, or the format has accumulated technical debt the team can no longer absorb, or the team has decided to consolidate around a different format. You need to migrate. The chapter describes what migration looks like in practice, the patterns that work, and the patterns that look like they should work and do not.

The single most important fact to internalize about format migrations is that they are mostly an operational problem, not a technical one. The wire-level conversion between any two sensible formats is straightforward; tools exist for most pairs; the encoding logic is, in raw lines of code, small. What is hard is doing the conversion while production is running, against existing data, with consumers and producers that cannot all be upgraded simultaneously, and without breaking the availability or correctness of the system during the transition.

The four migration patterns

There are essentially four patterns for migrating from format A to format B. Most real migrations are combinations of these.

The rewrite. Stop writing format A. Convert all existing data to format B in a batch job. Restart writing as format B. This is the simplest pattern but requires either downtime or a system that can tolerate brief unavailability. Rewrites work for small datasets, internal systems, and cases where a maintenance window is acceptable.

The dual-write. Write to both format A and format B during the transition. Consumers continue reading format A. New consumers read format B. Once all consumers are migrated, stop writing format A and delete the old data. This pattern requires storage for two copies of the data during the transition and disciplined consumer migration but is otherwise low-risk.

The translate-on-read. Continue writing format A. Add a translation layer that reads format A and serves format B to consumers that want it. Migrate consumers one at a time. Once all consumers are migrated to expect format B, switch the producer to write format B and remove the translation layer. This pattern adds latency and operational complexity to the read path during the transition but does not require dual storage.

The translate-on-write. Switch producers to write format B. Add a translation layer that converts format B back to format A for consumers that have not yet migrated. Migrate consumers one at a time. Once all are migrated, remove the translation layer. This is the inverse of translate-on-read and has the inverse trade-offs: migration cost is on the write path during transition.

The four patterns can be combined. A common combined pattern is dual-write with translate-on-read: produce both formats during the transition, but also expose a translation endpoint so that emergency consumer needs can be served without waiting for the official migration to complete.
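
As a sketch of the dual-write shape: a producer with hypothetical encode_a/encode_b codecs, an injected store, and a flag that doubles as the rollback lever:

```python
class DualWriteProducer:
    """Writes format A always; format B while the migration flag is on."""

    def __init__(self, store, encode_a, encode_b, write_b: bool = True):
        self.store = store
        self.encode_a = encode_a
        self.encode_b = encode_b
        self.write_b = write_b  # flip off to roll back instantly

    def publish(self, key: str, record: dict) -> None:
        self.store.put(f"{key}.a", self.encode_a(record))      # legacy readers
        if self.write_b:
            self.store.put(f"{key}.b", self.encode_b(record))  # migrated readers
```

Real systems add per-path error handling and metrics, but the structure is this small; the hard part is the consumer migration around it.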

Migration scenarios

A handful of specific migrations are worth walking through, both because they are common and because they illustrate the patterns in concrete form.

JSON to Protobuf

The most common migration. The team has been using JSON over HTTP, has hit scale or cost limits, and wants to switch to Protobuf over gRPC.

The pattern: dual-write. The producer continues to expose a JSON HTTP endpoint while adding a Protobuf gRPC endpoint. The schema for both is generated from the same Protobuf source (the JSON encoding of Protobuf is well-defined; tools generate the JSON-shaped REST API from the same .proto file). Consumers migrate one at a time from HTTP/JSON to gRPC/Protobuf. The JSON endpoint is removed once all consumers have migrated.

The traps: the JSON-encoding rules for Protobuf differ from ad-hoc JSON conventions in subtle ways. A Protobuf int64 field serializes to a JSON string by spec (because JSON cannot represent 64-bit integers exactly), but most ad-hoc JSON conventions emit them as numbers. If consumers were written against the ad-hoc encoding, they break when migrated to the spec-compliant Protobuf-JSON. The fix is to pick the JSON encoding rules deliberately and document them; tools like protojson make this manageable.
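
The int64 behavior is easy to see with the Python protobuf runtime. A sketch assuming a hypothetical generated module event_pb2 whose schema declares `int64 id = 1;`:

```python
from google.protobuf import json_format

import event_pb2  # hypothetical module generated by protoc

msg = event_pb2.Event(id=2**40)
print(json_format.MessageToJson(msg))
# prints {"id": "1099511627776"} -- a JSON *string*, by spec, because
# JSON numbers cannot represent the full int64 range exactly.
```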

The other trap: consumers who depended on JSON's flexibility (adding ad-hoc fields, sending strings where numbers were expected, mixing types) discover the rigidity of Protobuf the hard way. Each consumer migration becomes a small project rather than a one-liner.

Protobuf v2 to v3

The second most common. The team has Protobuf 2 schemas with required fields, custom default values, extensions, and other features that Protobuf 3 does not support directly.

The pattern: rewrite the schemas, with mechanical translation where possible. Required fields become optional fields with application-level enforcement. Default values are removed (or expressed via the application layer). Extensions become explicit embedded messages. The wire format is unchanged for compatible fields, so old binaries continue to work; the schema-level changes do not break existing data.

The traps: code that depended on Protobuf 2's required-field enforcement breaks silently when the field is no longer required. The fix is to add explicit checks in the application layer at the same time as the schema migration.

The Protobuf 3.15 optional keyword for scalar fields changes the migration story. Schemas that need explicit absence semantics for scalars can use optional after the migration, which restores the Protobuf 2 has-bit behavior.

Avro to Protobuf

The opposite-direction analytical migration. A team using Avro with Confluent Schema Registry decides to move to Protobuf for better tooling alignment with their non-Kafka services.

The pattern: dual-write at the producer side, with both Avro and Protobuf schemas registered for the same Kafka topics (Confluent supports this directly). Consumers migrate one at a time. The Avro schema is retired once all consumers are Protobuf.

The traps: Avro's schema resolution semantics (the algorithm that reconciles reader and writer schemas) does not map directly onto Protobuf's tagged-field model. Fields with non-trivial defaults in Avro must have the defaults reproduced in application code on the Protobuf side. Renames that were handled via Avro aliases need to become explicit field-rename coordinations in Protobuf. The migration can preserve compatibility, but the operational discipline differs from what the team is used to.

bincode 1.x to 2.x

A Rust-ecosystem migration. The team has on-disk data in bincode 1.x format and wants to move to 2.x for the smaller encoding and the better default behavior.

The pattern: translate-on-read. The reader detects the format version and decodes appropriately; bincode is not self-describing in either version, so the version marker is something the application writes alongside the payload (a header byte, a file-naming convention). New data is written in bincode 2.x. Old data is migrated lazily (rewritten in 2.x format when read) or eagerly (a batch job rewrites all existing files).

The traps: bincode 1.x had several configuration modes that produced wire-incompatible output. Code that produced bincode 1.x with default settings will be readable; code that produced 1.x with VarintEncoding configured will be subtly different. The migration must handle both.

Row format to Parquet

A common analytical migration. The team has been writing JSON-lines or CSV files for their analytical pipeline and wants to move to Parquet for better query performance.

The pattern: dual-write at the producer if the producer can support both, or batch-conversion at the storage layer. The new files are Parquet; the old files are converted in chunks. The analytical queries are updated to read Parquet, with the row formats supported as a fallback for old data during the transition.

The traps: the schema must be inferred or provided. CSV has no schema; JSON-lines often does not either. Parquet requires one. Inferring the schema from existing data sometimes produces surprises (fields that were always integers but turn out to have one record with a string; nullable fields that were never null in the sample). Schema validation must happen during conversion, with the surprises documented and fixed at the data layer.

A second trap: Parquet's batching expectations. Writing one Parquet file per record produces enormous metadata overhead. The conversion job must batch records into reasonable row group sizes, which sometimes forces architectural changes upstream (streaming consumers must wait for batches; latency budgets shift).
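
A sketch of a conversion job that respects both traps, using pyarrow: the schema is inferred from the first batch (later batches whose inferred schema disagrees are rejected by the writer rather than silently coerced), and rows are accumulated into row groups rather than written one at a time:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

def convert_jsonl(src: str, dst: str, batch_size: int = 100_000) -> None:
    writer = None
    batch = []

    def flush():
        nonlocal writer
        table = pa.Table.from_pylist(batch)   # schema inferred from the data
        if writer is None:
            writer = pq.ParquetWriter(dst, table.schema)
        writer.write_table(table)             # one row group per batch
        batch.clear()

    with open(src) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                flush()
    if batch:
        flush()
    if writer is not None:
        writer.close()
```

In practice the schema should be validated and pinned explicitly rather than inferred per run; inference is shown here because it is where the surprises surface.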

Hessian to Protobuf

A Dubbo-ecosystem migration. The team is on Apache Dubbo with Hessian and wants to move to Protobuf over Dubbo.

The pattern: Dubbo natively supports both formats. The configuration is per-service. The migration is service-by-service: update each service to support both Hessian and Protobuf, update consumers to expect Protobuf, switch the producer to emit Protobuf only, decommission the Hessian path.

The traps: Hessian preserves shared object references; Protobuf does not. Code that depended on object-reference preservation (typically code with deeply linked domain models) breaks when migrated. The fix is to denormalize at the application boundary, which is sometimes a substantial refactor.

ORC to Parquet

A Hortonworks-lineage migration. The team is on Hive with ORC and wants to move to Parquet, typically as part of a broader move from Hive to Spark or to a cloud data warehouse.

The pattern: batch convert. ORC and Parquet both have mature readers, and tools like orc-tools and parquet-tools can convert between them with reasonable fidelity. New data is written as Parquet; old data is converted on a planned schedule.

The traps: type fidelity at the boundary. Timestamps are the classic case: the deprecated int96 timestamp encoding and the int64-timestamp-with-logical-type representation are not byte-for-byte equivalent, and the conversion must choose between them deliberately. Decimal precision representations may differ. Nested type mappings require attention, especially for unions and complex maps.

The second trap: existing query engines that expected ORC's specific predicate-pushdown patterns may need tuning when moved to Parquet. The performance after migration is sometimes worse than before until the queries are adjusted to the new format's strengths.

Common cross-cutting concerns

Across all migration patterns, several concerns recur and are worth listing explicitly.

Schema versioning during migration. If the schema is also changing during the migration (new fields, removed fields), the migration becomes substantially harder. Combining a format migration with a schema migration is generally a mistake. Do one at a time: format migration first, schema migration after, or vice versa, but not both at once.

Test data for both formats. Production migrations require test data in both the source and target formats. Generating this from scratch is laborious; the cleanest source is to record real production traffic and translate it. The translation logic is also the migration logic, which is convenient.

Monitoring during transition. Track the percentage of producers and consumers on each format. Track decode failures on each side. Track the operational cost of the translation layer (latency, error rate). The migration is in flight as long as both formats are being produced or consumed.

Rollback planning. Migrations sometimes have to be rolled back. The dual-write pattern makes rollback easy: stop writing the new format, switch consumers back. The translate-on-read pattern makes rollback harder because the producer has already moved. Pick the pattern with the rollback story you can live with.

The long tail. Migrations are easy at 0% and 100% and hard at 80%. The last consumer or last producer is always somewhere specific that nobody remembered: a forgotten cron job, a dormant region's failover replica, a script run quarterly by a person who has since left. The migration is not done until that consumer is migrated or formally retired. Plan for the long tail; it is the largest part of the calendar.

Migrations that should be avoided

Some migrations are not worth doing. The format choice was defensible; the operational cost of changing it exceeds the benefit; the team should redirect its attention to other problems.

The clearest case is migrating between formats with similar trade-off profiles. MessagePack to CBOR is rarely worth the effort; the formats are essentially equivalent in most workloads. Avro to Protobuf inside an analytical pipeline is rarely worth it; the schema-evolution mechanisms differ but the wire-level performance is comparable. ORC to Parquet inside a Hive deployment is sometimes not worth it; the marginal gains do not justify the project cost.

The case for migration is strongest when the existing format has a real, observable cost — a recurring incident class, a storage bill that compounds, a hiring constraint because nobody knows the format — and the candidate format eliminates that cost. The case is weakest when the migration is purely aesthetic.

A note on the cost of always-migrating

Some teams convince themselves that migrations are easy and schedule them recurrently: "we'll migrate to the new format every two years to stay current." This is a mistake. Migrations are expensive in calendar time and in attention; recurring migrations crowd out other work. The right cadence is "migrate when the existing format has a cost that justifies the project," not "migrate on a schedule."

The corollary is that the right cadence for choosing the initial format is "carefully, with an eye to the next decade." Format choices are sticky. They should be made deliberately, not casually, and they should be allowed to stay made for as long as they continue to be defensible.

A worked migration playbook

To make the patterns concrete, here is a worked playbook for a hypothetical large migration: a team has a five-year-old Java service stack using Hessian over Dubbo, and is migrating to Protobuf over gRPC. The team is around 50 engineers, the data flows through about a hundred internal services, and the migration target is six months.

Month 1: Baseline. Document the existing Hessian schemas as Protobuf .proto files. Set up the Buf-equivalent tooling for the new schemas. Stand up gRPC infrastructure (load balancers, service mesh integration, observability hooks) alongside the existing Dubbo infrastructure. Identify pilot services that will migrate first.

Month 2-3: Dual-write rollout. Update producer services to support both Hessian and Protobuf. Configure Dubbo to expose both protocols on each service. Pick three to five pilot services for end-to-end migration; migrate their consumers to gRPC. Monitor for decode failures, performance regressions, and unexpected schema mismatches.

Month 4-5: Bulk migration. Roll the dual-write pattern out to all services. Migrate consumers in batches, prioritized by service criticality (most critical last, so failures appear in less critical services first). Track the percentage of traffic on each protocol per service.

Month 6: Decommission. Once all consumers are migrated, stop writing Hessian. Remove the Hessian client libraries from the build. Clean up the dual-protocol Dubbo configuration. Archive the old Hessian schemas alongside any historical data that might still need to be readable.

Forever after: long-tail vigilance. The forgotten consumer will appear three months after the migration is "complete." A quarterly cron job, a disaster-recovery script, an obscure monitoring system. Each one costs a day or two to migrate. Plan for this; budget the time as part of the migration.

The playbook is approximately what every successful Hessian-to-Protobuf migration I have observed has looked like. Variants exist (translate-on-read instead of dual-write for systems where producer changes are expensive; faster timelines when the team is smaller; slower timelines when the data flow is more complex), but the structure is the same.

Summary

Migrations are operational problems. Pick a pattern that matches your tolerance for transition complexity (dual-write, translate-on-read, translate-on-write, rewrite). Plan for the long tail. Avoid migrating between similar formats. Avoid combining format migrations with schema migrations. Monitor during the transition. Have a rollback plan.

The book has now described the formats, the axes, the common mistakes, and the migration paths. The remaining chapters cover what to do when the right answer is to roll your own format, and the appendices collect the wire-tour material that has been distributed across the format chapters.

When It's Defensible

The rest of this book has been about formats other people built. This part is about the case where you should consider building one yourself, and what that costs. Building a serialization format is unfashionable advice. The fashionable advice — use Protobuf, you fool — is right roughly 95% of the time and catastrophically wrong the other 5%, and the 5% includes some of the most important systems people build. The question this chapter answers is when you are in the 5%.

The short answer is almost never. The long answer is that there are specific properties no off-the-shelf format provides, and if your system needs one of those properties, an off-the-shelf format will fight you for the rest of the system's life. The cost of building your own format is high but bounded. The cost of fighting an unsuitable format is low per-incident and unbounded over time.

The cases that justify rolling your own

A custom format is defensible when all of the following are true: the existing formats fail to provide a property your system requires, the property is load-bearing for the system's correctness or performance, you have the engineering budget to build and maintain the format for the system's lifetime, and the team is committed enough to the format that they will not abandon it after the original author leaves.

These conditions are conjunctive. Failing any of them produces a worse outcome than just choosing an existing format and living with its limitations. Many of the systems I have seen with custom formats had only the first two conditions; the third and fourth were assumed and turned out to be wrong, and the system ended up either replacing the custom format under duress or maintaining it badly forever.

The specific cases that have justified rolling your own, in my observation:

Hard real-time constraints. A system with a hard real-time deadline (a hardware-in-the-loop simulator, a control system, a microcontroller-bounded protocol) sometimes cannot afford the overhead of a general-purpose format's encoder/decoder, even the zero-copy ones. The constraint is specific (this loop must finish in 50 microseconds; the existing decoder takes 80) and non-negotiable. A custom format tuned to the exact data shape and access pattern can sometimes save the necessary cycles.

A truly unusual data shape. Most data is records-with-fields. Some data is not. Sparse matrices, k-d trees, custom graph structures, time-series with irregular sampling, voxel grids — these have natural representations that the row-and-column orientation of mainstream formats does not capture gracefully. Forcing the data into a Protobuf or a Parquet schema usually produces an encoding that is correct but 5× the size of the natural representation. If the data is large enough that the 5× compounds, a custom format saves real money.

Specific cryptographic requirements. Some cryptographic protocols have format requirements that go beyond "deterministic encoding." They require specific byte layouts to match a published specification, particular ordering of authenticated and encrypted fields, framing that lets a verifier check signatures without parsing every byte. Building the format from scratch lets the requirements drive the design. Trying to retrofit those requirements onto a general-purpose format often produces a format that is slightly wrong in ways that take cryptographic auditing to discover.

Hard latency at extreme scale. When the deployment is large enough, the per-request CPU cost of a serialization format becomes a budget item. A 10% improvement at a billion requests per second is real money in datacenter capacity. The case for a custom format here is harder to make in 2026 than it was in 2010 — modern formats have closed most of the gap — but the case is sometimes still real, and the systems that built custom formats here in the past tend to keep them.

Single-vendor formats with no expected interop. A vendor's internal format that will only ever be read by their own code, on their own hardware, can sometimes justify a custom design that maximizes ergonomics for the vendor's specific stack. Apple's PLIST is one example of this; certain game console formats are another. The lack of any need for interop removes most of the cost of being non-standard.

Embedded or constrained-device protocols. When the device has 2 KB of RAM and the protocol must fit in 32 bytes, the existing formats' fixed overhead is sometimes too much. Some LWM2M deployments, some smart-card protocols, and a few similar contexts have justified custom formats that no other system will ever need.

The cases that look defensible and aren't

Several cases sound like they justify a custom format and don't.

"We need it to be small." Almost never justifies a custom format. The compression layer underneath the format usually closes the size gap; the wire-format choices that matter for size are well-understood and present in the modern formats. If you need bytes that are dramatically smaller than what Protobuf or CBOR produce, you need either a domain-specific encoding (varint with prior knowledge of the value range) or a better compressor, not a custom format.

"We need it to be fast." Rarely justifies a custom format in 2026. The mainstream formats have been tuned aggressively over the past decade. A custom format you build over a reasonable time budget will be slower than prost or fxamacker/cbor. Speed claims for new custom formats almost always come from comparing against poorly-tuned implementations of the alternatives.

"We have a unique set of constraints." Often the constraints are not as unique as the team believes. The unique constraint is typically expressible as "I want format A's evolution model with format B's speed and format C's tooling." This is wishing for a format that does not exist, but the wish is not justification for building one; the right answer is to pick the format whose trade-offs are least painful for you.

"We don't want a build-time dependency." Sometimes justifies postcard or bincode (Rust-only) over Protobuf (needs codegen). Almost never justifies a new format. The build-time dependency on protoc or its equivalent is a known cost; trading it for the cost of maintaining a custom format is usually a bad trade.

"We want full control." Usually a sign that the team has not understood what off-the-shelf formats provide. Full control also means full responsibility: every bug, every interop problem, every migration is yours to handle alone. The mainstream formats provide a community that handles many of these issues, and giving up that community costs more than the control is worth.

The cost ledger

Building a serialization format has costs that are easy to underestimate. A partial list:

The encoder. A few hundred to a few thousand lines of code per language, plus tests, plus documentation. Initial cost: a few engineer-weeks per language.

The decoder. The same shape of cost, perhaps slightly higher because the decoder has to handle malformed input gracefully.

The schema language and compiler. If the format is schema-required, the schema language must be specified, the compiler must be built, and the build-system integration must be designed. This is months of work for a complete implementation.

The test suite. Including round-trip tests, malformed input tests, schema-evolution tests, performance tests, and cross-language interop tests if the format is multi-language. A real test suite for a mature format is thousands of test cases, accumulated over years.

The documentation. Specification, tutorials, examples, FAQ. The level of polish needed for a format other engineers will adopt is significant. Internal-only formats can have lighter documentation, but even those need enough detail that a new team member can be productive.

The tooling. Hex viewers, schema linters, breaking-change detectors, language bindings beyond the initial set, IDE plugins. Mature ecosystems have all of these. Custom formats typically have none and will need them as the deployment grows.

The maintenance. Bug reports from users, edge cases that weren't tested, language-binding contributions that need review, schema-evolution rules that need clarification. The maintenance cost is open-ended and continues for as long as the format is in production.

The opportunity cost of not investing in something else. Every hour spent on the custom format is an hour not spent on the system's core problem. Custom formats are rarely the system's core problem.

The cost ledger is not meant to be discouraging. It is meant to be honest. Some systems pay this cost willingly because the benefit is worth it. Most do not, and they pay it anyway, because the cost was not estimated correctly at the start.

A small case study

A team I observed (anonymized) built a custom binary format for their internal RPC system around 2014. The justification was "none of the existing formats are fast enough." The format was a careful design with several genuinely clever ideas: bit-packed field tags, a specialized encoding for sparse maps, in-place deduplication of strings via a per-message dictionary. The encoder and decoder were a few thousand lines of C++; the performance was indeed about 30% faster than the contemporaneous Protobuf implementation in their workload.

By 2020, the format had become the team's biggest liability. The original author had left. The schema language had not been formalized; new fields were added by hand-editing wire-format constants. The cross-language story (Python, Go, JavaScript) had been delegated to teams who lacked the original author's expertise and produced subtly different implementations. Performance was no longer competitive with modern Protobuf implementations because the custom format had not been tuned for newer hardware.

The team migrated to Protobuf in 2022. The migration took six months of focused effort and was expensive in calendar time. The 30% speed advantage of 2014 turned out, in retrospect, to have been bought at a multi-million-dollar price once the maintenance and the migration were tallied.

This is the typical trajectory of custom formats that were defensible at the time of their creation. The conditions that justified the format change; the format does not adapt; the team eventually pays to migrate. The systems that have not gone through this trajectory tend to be ones with truly unusual constraints (specific cryptographic requirements; hard real-time bounds) where the alternative formats remain unsuitable.

A note on the formats in this book that started as custom

It is worth observing that most of the formats covered in chapters 4-27 began as custom formats. Some team built them for their own use, then realized other people had similar needs and published the spec. Protobuf was internal to Google for years. Thrift was internal to Facebook. Cap'n Proto came out of Sandstorm. SBE came out of LMAX. NBT came out of Minecraft. The line between "custom format" and "publicly-released format" is not a property of the format; it is a property of the publication decision.

This is the optimistic reading: if you build a custom format that is genuinely good, and you are willing to invest in the documentation and ecosystem, you might end up adding to the catalog rather than haunting your team's technical-debt log. The realistic reading is that the formats covered in this book are the survivors. For every published custom format, there are many that didn't survive — built, used briefly, abandoned. The survivor bias is large.

Summary

The case for rolling your own format is real but narrow. Specific cryptographic requirements, hard real-time constraints, genuinely unusual data shapes, and embedded contexts can justify it. Most claimed justifications — speed, size, flexibility, control — do not, because the existing formats have closed the relevant gaps and the cost of the custom format exceeds the benefit.

If you do decide to build, the next chapter is about what you owe your future self: the operational discipline that distinguishes custom formats that survive from those that become liabilities.

Position on the seven axes

Custom formats can occupy any cell in the seven-axis space, by construction. The axes are still useful as design questions: when designing your format, take a position on each axis explicitly, document the position, and stick to it. Custom formats that drift across the axes — adding self-description here, breaking determinism there, supporting evolution one way and then the other — produce the worst of all worlds.

The discipline of designing the format with the seven axes in mind is the same discipline you would apply when reading another format's specification. The axes are vocabulary; the format is your specific application of the vocabulary.

A note on the conjunctive nature of the conditions

Worth re-emphasizing because the conjunction is what most teams get wrong. A custom format requires all four conditions: an unmet property, a load-bearing requirement, a sustainable budget, and committed team ownership. Failing any one of these turns the format from a feature into a liability.

The most commonly missing condition is the fourth: committed team ownership. The format is designed by a person who is deeply interested in formats; that person leaves; the new team inherits the format and treats it as a black box; the format ossifies and gradually becomes unmaintainable. This pattern is so common in industry that I have come to consider the fourth condition the most important to evaluate. Will the team maintain this format for ten years is a harder question than it sounds, and "yes" usually requires either institutional investment in format expertise or a format simple enough that ordinary engineers can keep it running.

The second most commonly missing condition is the second: the load-bearing requirement. Teams convince themselves that a property is load-bearing when it is merely preferred. The clear test: if you took the existing format and added the property manually at the application layer, would the system be materially worse? If the answer is "it would be slightly slower" or "it would be slightly less ergonomic," the property is preferred, not load-bearing, and the cost of a custom format exceeds the benefit.

The other two conditions — unmet property and sustainable budget — are usually easier to evaluate honestly. Teams know whether they have a property the existing formats lack, and they know whether they have engineering hours available. The conjunctive discipline of checking all four is what tends to be missed.

A note on adopting an existing format with extensions

A common middle ground worth considering before going full custom is to adopt an existing format and use its extension mechanism for the parts that don't fit. CBOR's tag system, Protobuf's Any, MessagePack's extension types, ASN.1's OCTET STRING-of-anything: most mainstream formats have a way to wrap arbitrary bytes inside a standardly-encoded outer structure. The application layer interprets the wrapped bytes; the format layer just carries them.

This pattern preserves the ecosystem benefits of the standard format (libraries, tooling, debugging, schema-evolution discipline) while letting the application handle the parts that genuinely need custom encoding. The result is usually better than rolling a fully custom format and almost always better than fighting a format whose model does not fit.

Epitaph

A custom binary format is a commitment to maintain it for as long as the bytes exist; defensible when the alternative is fighting an unsuitable format forever, but rare enough that "build my own" should be the conclusion of an analysis, not its premise.

What You Owe Your Future Self

You have decided to build a custom binary format. The previous chapter argued you should not, and now we are past that argument. This chapter is about what you owe your future self, and the future engineers who will inherit your format, in terms of operational discipline. The discipline is the difference between a custom format that survives a decade and a custom format that becomes the team's biggest liability inside three years.

The advice in this chapter is not technical. The technical choices have been covered in the format chapters; you now know the design space well enough to make them. The advice here is operational: things to do once the format exists, that determine whether it stays usable.

Write the spec

The first thing you owe your future self is a written specification. Not a comment in the source code. Not a wiki page that drifts. A specification that describes the wire format precisely enough that someone with no access to your source code could implement an interoperable encoder and decoder.

A real specification covers, at minimum: the byte-level encoding of every type your format supports; the framing rules that say where one value ends and the next begins; the rules for handling malformed input; the alignment, byte order, and endianness conventions; the schema language (if any); the versioning and evolution rules; the canonical-encoding rules (if any); the magic numbers or signatures that identify your format's bytes.

The specification should be testable. Specifically, you should be able to write a conformance test suite — a set of test cases that any conforming implementation must pass — derived from the specification. If the spec cannot generate a test suite, the spec is not precise enough.

Specifications drift from implementations. The drift is the single largest cause of "the spec says X, the code does Y, which is right" debates that consume engineer-years. The mitigation is to keep both in version control, in the same repository, with a review discipline that requires changes to the implementation to update the spec and vice versa.

The specifications for the formats in this book vary in quality. Protobuf's spec is excellent. Avro's is good but has gaps that the implementations have papered over. MessagePack's is okay. NBT's is essentially community-maintained on the Minecraft Wiki. The good specs make their formats survive. The gappy specs produce implementations that disagree on edge cases, which produces operational pain.

Magic numbers and version bytes

Every byte sequence you produce should start with bytes that identify the format. A four-byte magic number is the convention for files; a single-byte tag is enough for in-memory uses. The magic should be: short (you do not want to spend many bytes on identification), distinctive (it should not collide with the magic of other common formats), and stable (once chosen, it should never change).

Examples worth emulating: Parquet's PAR1 (four printable ASCII bytes, easily greppable in hex dumps); the Cap'n Proto buffer header (no magic per se, but the segment-count field at position 0 is structurally distinctive); ASN.1's tag byte distinctiveness for SEQUENCE (0x30, easy to recognize).

Following the magic, you should have a version byte. The version should not start at 0 (so that someone seeing a zero byte after the magic immediately knows something is wrong). Modern conventions start at 1 and bump on every wire-incompatible change.

A version byte is a contract: it tells future implementations which decoder to use. Format authors who omit the version byte because "we won't ever change the format" are wrong, and the omission is the single regret most format authors voice in retrospect. Add the version byte. The cost is one byte per record. The benefit is the ability to evolve.
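
A sketch of the discipline for a hypothetical format (the magic bytes here are invented for the example):

```python
MAGIC = b"XFM1"        # hypothetical 4-byte magic: printable, greppable
CURRENT_VERSION = 1    # starts at 1; a 0 after the magic is always an error

def frame(payload: bytes) -> bytes:
    return MAGIC + bytes([CURRENT_VERSION]) + payload

def unframe(buf: bytes) -> tuple[int, bytes]:
    if len(buf) < 5 or buf[:4] != MAGIC:
        raise ValueError("not an XFM buffer")
    version = buf[4]
    if version == 0 or version > CURRENT_VERSION:
        raise ValueError(f"unsupported XFM version {version}")
    return version, buf[5:]
```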

The conformance test suite

A conformance test suite is the single most important durability asset your format can have. It is a set of (input, expected output) pairs that any conforming implementation must reproduce exactly.

The pairs cover: encoding test cases (given a value, the bytes your encoder should produce); decoding test cases (given bytes, the value your decoder should reconstruct); malformed-input cases (given bytes that violate the format, the specific error the decoder should report); evolution cases (given bytes from an old schema and a new schema, the value a new decoder should produce); canonical-encoding cases (given a value, the unique canonical bytes a deterministic encoder should produce).

The conformance suite is what lets you write a second implementation, in a different language, and verify that it produces the same bytes as the first. It is what lets you upgrade the encoder and verify that the new version is backward-compatible. It is what lets a junior engineer fix a bug in the format and verify they did not break anything.

A conformance suite is not the same as a unit test suite. Unit tests verify that individual functions in your encoder behave correctly. The conformance suite verifies that the bytes are right, by checking against an authoritative set of expected outputs. The conformance suite should be checked in to source control, in a format that can be consumed by any language implementation, and updated only with care.

The formats that have lasted have conformance suites. Protobuf's test data files are part of the open-source distribution. Avro has test cases that any new implementation must pass. CBOR's RFC includes a substantial appendix of test vectors. ASN.1 has the IETF-maintained test corpus.

Schema versioning, even if you don't think you need it

If your format has a schema (even an implicit one — every format has a schema, the question is whether it is in a file or in your head), version it explicitly. Schemas evolve. Pretending they do not, and then evolving them anyway without version markers, produces bytes that no decoder can interpret correctly because the bytes' meaning depends on the schema version that produced them.

The schema version can be embedded in every record (cheapest when records are large; expensive when records are small), embedded in the file or stream header (cheapest when records are short and the stream is long), or embedded in the protocol envelope (when the format is part of an RPC system). Pick one and stick with it. Multiple schema-version mechanisms in the same system are how you get bugs that take a year to reproduce.

Document the schema-evolution rules. Specifically: which changes are backward-compatible (new code reads old data), which are forward-compatible (old code reads new data), and which are breaking. Test the rules with the conformance suite.

Decoder hostility

Your decoder will, at some point, be given input that it does not expect. The input might be a corrupted file, a mid-flight network error, an adversarial payload, or a bug in another decoder that produced wrong bytes. The decoder must handle this input gracefully — meaning, it must report the failure explicitly rather than crashing, looping, or producing garbage.

Specifically, decoders should:

  • Validate every length prefix: a string declaring 4 GB of content must not trigger a 4 GB allocation; the decoder must reject the length before allocating.
  • Bound recursion: nested structures must have a maximum depth, configurable, with a sensible default.
  • Bound iteration: any varint or length-encoded count must have a maximum, with a sensible default that depends on the field's expected use.
  • Reject extra bytes after a record: trailing data after a fully-parsed record is suspicious; the decoder should report it rather than silently dropping it.
  • Handle alignment misses: if your format requires aligned buffers and the buffer is unaligned, the decoder should report the misalignment, not produce undefined behavior.

These rules sound like security advice. They are. Custom formats deployed in untrusted contexts tend to surface as CVEs several years after they reach production, and the CVE is almost always a violation of one of the rules above. Bake the rules in from the start.
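
A minimal sketch of the length-prefix rule in C, with illustrative limits (the cap, the depth bound, and the helper name are assumptions, not from any particular library):

#include <stddef.h>
#include <stdint.h>

#define MAX_STRING_LEN (1u << 20)  /* illustrative cap: 1 MiB */
#define MAX_DEPTH      32          /* illustrative recursion bound, enforced in the
                                      recursive descent (not shown) */

/* Validate a declared string length before allocating or advancing.
 * Rejects lengths that exceed the configured cap or the bytes remaining
 * in the buffer; a 4 GB prefix fails both tests immediately. */
int check_len(uint32_t declared, size_t remaining) {
    if (declared > MAX_STRING_LEN) return -1;
    if (declared > remaining)      return -1;
    return 0;
}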

The migration story

Plan how the format will change before it has changed. Specifically: How will you add a field? How will you remove a field? How will you change a field's type? How will you bump the format version? What does a version-2 decoder do when it encounters version-1 bytes? What does a version-1 decoder do when it encounters version-2 bytes?

The answers should be in the spec. They should be tested in the conformance suite. They should be enforced by tooling that prevents schema changes that violate the rules.

The fundamental discipline: changes that break old data are not allowed without a version bump, and version bumps are not allowed without a coordinated migration plan. Every change is either compatible or it is a project. There is no third option.

A note on tooling investment

Custom formats need tooling. Beyond the encoder and decoder themselves, you will need: a hex viewer that understands your format (essential for debugging); a schema linter (if you have a schema language); a breaking-change detector (the equivalent of Buf for Protobuf); language bindings for every language your consumers use; integration with your build system; CI tests that verify cross-implementation compatibility.

The tooling is not optional. It is the difference between a format that is debuggable and a format whose users curse it. Budget for the tooling at the same time you budget for the format itself, and treat the tooling as part of the format's specification.

The pattern that works: the canonical implementation is in the language the team is most comfortable in; the spec is in a neutral format (Markdown or HTML); the conformance suite is in a language-neutral format (JSON or YAML, with byte data in hex); the tooling lives in the same repository as the spec; and new language bindings are required to pass the conformance suite before being accepted.

What you owe people who inherit your format

The team that inherits your format will be smaller, less expert, and less invested in the format than you were. Your job is to write enough down that they can be productive without rebuilding your knowledge from scratch.

This is mostly the spec, the conformance suite, and the tooling. But it is also the rationale. Every nontrivial design decision should have a written justification: why this byte order, why this varint encoding, why this evolution rule. The rationale is what lets the next team change the format intelligently rather than just preserving it as a black box.

The good open-source formats do this through release notes, RFC-style design documents, and substantive commit messages. Custom formats often do not, because the team doesn't think they're publishing. They are publishing — to the next team — and the next team will benefit from being treated as a real audience.

Summary

What you owe your future self, in a list:

  • A precise written specification, in version control, kept in sync with the implementation.
  • A magic number and a version byte at the start of every byte sequence.
  • A conformance test suite that any implementation must pass, in a language-neutral format.
  • Explicit schema versioning, even when you think you don't need it.
  • A hostile decoder that bounds resource use and reports malformed input clearly.
  • A documented migration story, with rules for compatible changes and explicit version bumps for incompatible ones.
  • Tooling investments commensurate with the format's deployment scale.
  • A written rationale for every nontrivial design decision.

This list is the difference between a format that survives the loss of its original author and a format that does not. The investment is real; budget for it.

The next chapter walks through a worked example of designing a custom format end to end, with the discipline above applied.

A note on social maintenance

The technical disciplines above are necessary and not sufficient. Custom formats also have social maintenance requirements that are easy to underestimate.

Format ownership needs to be assigned, formally, to a specific person or team. The owner is responsible for: triaging implementation bug reports, reviewing changes to the spec or conformance suite, mediating disputes about whether a particular change is compatible, accepting new language bindings, and documenting the format's release notes. Without a designated owner, the format drifts: changes get made by whoever is nearby, the spec falls behind, the conformance suite stops running, and the format becomes a liability.

Format ownership is a job that needs slack in the team's budget. It is not a 0% time activity. The minimum sustainable level is probably 5-10% of one engineer's time, with spikes during schema-change periods. Teams that nominate an owner without allocating the time produce owners who are nominally in charge of a format they cannot actually maintain.

The formats in this book that have survived all have ownership arrangements that work. The formats that have not survived all share the absence of such arrangements.

A note on the path to publication

If you build a custom format that turns out to be useful beyond your team, there is a question of whether to publish it. The calculus has changed in the past decade. Open-sourcing a serialization format used to be a meaningful contribution that attracted users and validation. It is now a long-tail decision with mixed outcomes: published formats accumulate users who file bug reports, request features, and sometimes fork the project; the maintenance burden grows beyond what one team can sustain; the format either becomes a real community project or a maintainer-burnout liability.

If you are going to publish, plan for the maintenance burden explicitly. Document a clear governance model: who can merge PRs, how the spec is updated, how breaking changes are decided. The formats that have published successfully (Cap'n Proto, Borsh, FlatBuffers) all have governance models that work; the formats that have published unsuccessfully often had clear technical merit but inadequate governance.

If you are not going to publish, the discipline in this chapter is still required for internal use. The audience is just smaller.

A Worked Example

This chapter designs a custom binary format end to end, with the discipline of the previous two chapters applied. The example is deliberately small enough to fit in a chapter and large enough to demonstrate the design process. The format we will design is a deterministic compact event log format for a hypothetical embedded telemetry system — a use case that has, in this author's observation, occasionally justified custom formats in the wild.

The exercise is not a recommendation that you should design this format. It is a recommendation that, if you are going to design a format, you should walk through the steps described here.

The use case

Suppose we are designing the on-device storage format for an embedded telemetry device — a sensor module attached to a piece of industrial equipment that records measurements once per second. The device has 32 KB of RAM, a flash-based storage of a few megabytes, no real-time clock (timestamps are sequence numbers), and it talks to a central server via a satellite link that bills per byte and is available a few minutes per day.

The records are: sequence number (uint32), six float32 measurements, an event-type discriminant (enum of 16 values), a boolean fault flag, and a 1-byte status code.

The constraints derived from the use case:

  • Hard byte budget. Each record must encode in a small, predictable number of bytes; uplink time is metered.
  • Hard CPU budget. Encoding must complete in a few hundred cycles per record on a 50 MHz Cortex-M0.
  • Append-only writes. Records are written once and never updated.
  • Random read access. Field-service tools must be able to read individual records without scanning the whole file.
  • Multi-decade durability. Records may be read by tools 20 years after they were written.
  • No allocator. The decoder runs on a microcontroller that cannot afford dynamic allocation.
  • Cryptographic signing of batches. Each batch of N records is hashed and signed; the signature is verified at the server.

Several formats from earlier chapters are candidates. Cap'n Proto and FlatBuffers are zero-copy, but their alignment and metadata overhead is too high for the byte budget. Protobuf is not deterministic (signing requires canonicalization). MessagePack is not deterministic and does not give predictable per-record byte sizes. CBOR with deterministic encoding is the closest fit, but the per-record varint encoding wastes bytes for the fixed-size fields, and the floats encode to 5 bytes (1-byte prefix + 4-byte float) rather than 4 directly.

This is a case — not a common one, but a real one — where the mainstream formats are close to right but not quite. A custom format can save a few bytes per record, which over the device's lifetime is meaningful.

Apply the axes

Before writing any encoding rules, we take an explicit position on each of the seven axes from chapter 2.

Schema-required. The schema is fixed in firmware; the device's firmware version determines the schema. We do not need a schema in the bytes.

Not self-describing. The records are decoded by tools that have the schema available. There is no need for type tags.

Row-oriented. Each record is independently meaningful. We do not need columnar layout.

Parse rather than zero-copy. The records are small enough that parsing is cheap. Zero-copy alignment overhead is not worth its cost here.

Codegen-first. The schema is small; the codegen is a script that emits a C struct and a C decoder. No runtime reflection.

Fully deterministic. The signing requirement makes determinism non-negotiable.

Strict evolution rules. Schema changes happen via firmware upgrade. Multiple firmware versions may be deployed in the field at once; the format must be able to identify which version produced a record.

This is the point of the exercise: by going through the axes explicitly, we have produced a specification for what the format should do, before we have written any wire-format details. The format is now a target with clear properties.

The wire format

A record is exactly 32 bytes:

Offset  Size  Field                Description
0       1     format_version       0x01 for the format we are designing
1       1     firmware_version     0x00-0xff, matches the firmware that produced
2       4     sequence_number      uint32 little-endian
6       1     event_type           enum 0-15
7       1     status_code          uint8
8       1     fault                0x00 or 0x01
9       1     reserved             must be 0x00
10      4     measurement_0        float32 little-endian
14      4     measurement_1        float32 LE
18      4     measurement_2        float32 LE
22      4     measurement_3        float32 LE
26      4     measurement_4        float32 LE
30      2     measurement_5_low    low 16 bits of float32 LE (split for alignment)
32              total (the high 16 bits of measurement_5 do not fit)

That last field is awkward; let's reconsider. A 32-byte record with six floats forces some structural compromise. Two options:

Option A: keep full float32 measurements and pad the record to 36 bytes.

Offset  Size  Field
0       1     format_version
1       1     firmware_version
2       4     sequence_number
6       4     measurement_0
10      4     measurement_1
14      4     measurement_2
18      4     measurement_3
22      4     measurement_4
26      4     measurement_5
30      1     event_type
31      1     status_code
32      1     fault
33      1     reserved (must be 0)
34      2     reserved (must be 0)
36              total

Option B: pack the measurements as float16 (12 bytes for six measurements; precision sufficient for our sensor) and stay at 24 bytes:

Offset  Size  Field
0       1     format_version
1       1     firmware_version
2       4     sequence_number
6       2     measurement_0 (float16)
8       2     measurement_1
10      2     measurement_2
12      2     measurement_3
14      2     measurement_4
16      2     measurement_5
18      1     event_type
19      1     status_code
20      1     fault
21      3     reserved
24              total

The choice depends on the actual precision of the sensor. Suppose float16 is sufficient. Option B saves 12 bytes per record over Option A and lets the device store 50% more records in the same flash budget. Option B it is.

The alignment is deliberate where it can be: every float16 starts at a 2-byte boundary and can be read with a native halfword load. The one exception is the uint32 sequence number at offset 2, which is only 2-byte aligned; the decoder assembles it from individual bytes (a Cortex-M0 faults on unaligned word loads), which costs a few cycles and nothing else.
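
A sketch of that sequence-number read (the offset follows the Option B table; the names are ours):

#include <stdint.h>

#define OFF_SEQ 2  /* uint32, only 2-byte aligned within the 24-byte record */

/* Assemble the sequence number from individual bytes; byte loads are
 * always safe, and a Cortex-M0 faults on unaligned 32-bit loads. */
static inline uint32_t rec_seq(const uint8_t *rec) {
    return (uint32_t)rec[OFF_SEQ]
         | (uint32_t)rec[OFF_SEQ + 1] << 8
         | (uint32_t)rec[OFF_SEQ + 2] << 16
         | (uint32_t)rec[OFF_SEQ + 3] << 24;
}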

Batches and signatures

Records are grouped into batches. A batch header is 16 bytes:

Offset  Size  Field
0       4     magic (ASCII "TEL1")
4       1     format_version
5       1     batch_format_version
6       2     record_count
8       8     batch_sequence_number
16              records start here

After the records, a signature trailer:

Offset  Size  Field
0       64    Ed25519 signature over the batch header and records
64              total

The whole batch is therefore: 16 bytes of header + (24 bytes × record count) + 64 bytes of signature. For a typical batch of 600 records (10 minutes at 1 Hz), that is 16 + 14400 + 64 = 14480 bytes, or about 24.13 bytes per record amortized. The satellite link cost per record is reduced from a CBOR-based implementation by about 20%, which over a multi-year deployment is real money.
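
The framing arithmetic, as a sketch in C (the macro names are ours):

#define BATCH_HDR_SIZE 16
#define RECORD_SIZE    24
#define SIG_SIZE       64

/* Total batch size for n records; BATCH_SIZE(600) == 14480. */
#define BATCH_SIZE(n)  (BATCH_HDR_SIZE + (n) * RECORD_SIZE + SIG_SIZE)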

The specification

A real spec for this format would document:

  1. Magic number. The four bytes 0x54 0x45 0x4c 0x31 (TEL1) identify a batch.
  2. Format version. The single byte at offset 4 of the batch header. Currently 0x01. Future versions bump this byte.
  3. Batch format version. Reserved for batch-level layout changes. Currently 0x01.
  4. Record format. Exactly 24 bytes, layout as described above.
  5. Field semantics. Each field's meaning, valid range, and encoding rules.
  6. Float16 encoding. IEEE 754 half-precision, little-endian.
  7. Reserved fields. Must be zero on encode; on decode, must be verified zero (otherwise reject the batch as corrupted).
  8. Endianness. Little-endian throughout.
  9. Signature. Ed25519 signature over (batch header || records), computed by the device's signing key, verified against the device's public key registered at the server.
  10. Determinism. The format is fully deterministic. Given the field values, exactly one byte sequence is produced.
  11. Schema evolution rules. Format version 0x01 is the initial version. To add a field, bump the format version and document the new layout. Firmware that does not recognize a format version must reject the batch and report the version mismatch.
  12. Decoder requirements. Decoders must validate the magic, the format version, the reserved zero bytes, and the signature before exposing data to consumers.

The specification, written out, is several pages of plain prose. It should live in version control, alongside the firmware code.

The conformance suite

The conformance suite for this format is small. Sample test cases:

  • Empty batch: header + zero records + signature. Decoder must accept; reader sees no records.
  • Single record: known field values, known bytes. Encoder must produce these bytes; decoder must reconstruct these values.
  • Wrong magic: bytes XEL1... instead of TEL1.... Decoder must reject.
  • Wrong format version: bytes valid except offset 4 is 0x02. Decoder for version 1 must reject.
  • Non-zero reserved: bytes valid except a reserved byte is 0x01. Decoder must reject.
  • Bad signature: bytes valid except the signature trailer is random. Decoder must reject.
  • Truncated batch: the bytes end mid-record. Decoder must reject.
  • Specific field encodings: each field encoded with extreme values (zero, max, NaN for floats), with the expected bytes documented.

The suite is checked in as a JSON file: an array of test cases, each with a description, input bytes (in hex), expected outcome (value reconstruction or specific error code), and rationale. Any new implementation of the format must pass every case.
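
One entry in such a file might look like this (the field names and the error-code string are illustrative, not a fixed schema):

[
  {
    "description": "wrong magic: first byte corrupted",
    "input_hex": "58454c3101010200...",
    "expect": { "error": "BAD_MAGIC" },
    "rationale": "Decoders must validate the magic before reading anything else."
  }
]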

The implementation

The encoder and decoder, in C, are small. The encoder writes fields at known offsets in a 24-byte buffer; the decoder reads fields at known offsets. Both are a few dozen lines of code.
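
A sketch of the encoder's shape (the struct and names are ours; the float32-to-float16 conversion is assumed to happen upstream):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t seq;
    uint16_t meas[6];  /* measurements, already converted to float16 bits */
    uint8_t  event_type, status, fault;
} record_t;

/* Write one 24-byte record at the fixed offsets from the spec. */
void encode_record(uint8_t out[24], const record_t *r, uint8_t fw_version) {
    memset(out, 0, 24);               /* reserved bytes must be zero */
    out[0] = 0x01;                    /* format_version */
    out[1] = fw_version;              /* firmware_version */
    out[2] = (uint8_t)(r->seq);       /* sequence_number, uint32 LE, */
    out[3] = (uint8_t)(r->seq >> 8);  /* written byte-wise */
    out[4] = (uint8_t)(r->seq >> 16);
    out[5] = (uint8_t)(r->seq >> 24);
    for (int i = 0; i < 6; i++) {     /* six float16 measurements, LE */
        out[6 + 2 * i]     = (uint8_t)(r->meas[i]);
        out[6 + 2 * i + 1] = (uint8_t)(r->meas[i] >> 8);
    }
    out[18] = r->event_type;          /* enum 0-15 */
    out[19] = r->status;
    out[20] = r->fault;               /* 0x00 or 0x01 */
    /* out[21..23] remain zero (reserved) */
}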

The signing logic uses Ed25519 from a vetted library (libsodium, or an embedded equivalent). Key management is the operational responsibility of the deployment, not the format.
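
With libsodium, the server-side verification is a few lines. A sketch, with buffer handling simplified and key distribution out of scope:

#include <sodium.h>

/* Verify the Ed25519 trailer over (header || records). Returns 0 on success. */
int verify_batch(const unsigned char *batch, unsigned long long len,
                 const unsigned char pk[crypto_sign_PUBLICKEYBYTES]) {
    if (len < 16 + 64) return -1;              /* header + signature minimum */
    unsigned long long signed_len = len - 64;  /* everything before the trailer */
    return crypto_sign_verify_detached(batch + signed_len, batch, signed_len, pk);
}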

A Python decoder, for the field-service tools, is similarly a few dozen lines. A Rust decoder, for the central server, is similar. All three implementations are validated against the same conformance suite.

What the team owes the future

The team designing this format owes its future self:

  • The specification, written, in version control.
  • The conformance suite, in a language-neutral format.
  • A tool that takes a hex dump of a batch and prints the decoded fields, for debugging.
  • A schema-versioning policy that documents how new format versions are introduced, deployed, and consumed.
  • A migration plan for the inevitable case where the next generation of the device requires a slightly different schema.
  • An owner with budgeted time to maintain the format.

The cost of all of this, for a format this small, is real but manageable: maybe two engineer-weeks of initial investment, a few hours per quarter of maintenance, a defined fallback to a different format if the current design hits a hard limit. The benefit is a format that does its job for the device's deployment lifetime without becoming a maintenance burden.

Why this exercise was worth doing

The walkthrough above looks small, and the format is small. The exercise was worth doing for a reason that is easy to miss: every step of the design corresponds to a chapter or section of this book. The use-case analysis came from chapter 1's boundary discussion. The axis-by-axis position came from chapter 2. The byte-level layout was informed by the wire tours in the format chapters. The evolution rules came from chapters 11 and 30. The discipline of the spec, conformance suite, and ownership came from chapter 32.

A team that designs a custom format without doing this exercise typically ends up with a format that is correct but lacks one or more of these pieces, and the missing pieces are what later become liabilities. The book has been a long argument for treating serialization formats as designs that deserve deliberate attention, and this chapter is the demonstration that the deliberate attention is, in fact, possible.

A final note

The format we designed is hypothetical. The hypothetical use case is one I have seen real teams build versions of, with varying success. The teams that succeeded did the steps above. The teams that failed skipped them, usually because the format felt small and they thought they could do it from memory.

Format design is one of those activities where the discipline is the work. The technical decisions are usually small; the operational follow-through is what determines whether the format survives. If you take one thing from this part of the book, take that.

The next section — the appendices — collects the wire-tour material from the format chapters into a single side-by-side reference, so that you can see the same Person record encoded across every format in this book at a glance.

A note on what this exercise deliberately did not include

The example above did not encode our Person record. That was a deliberate choice. Person is the exemplar for the existing formats — a record with optional fields, variable-length strings, a list of strings, a boolean. The hypothetical telemetry format we designed has none of those shapes. Person in this format would require expanding the format substantially (variable-length string encoding, optional fields, list encoding), at which point the format would be reinventing CBOR or MessagePack badly.

This is a useful reminder: custom formats earn their place by being the right answer to a specific question. The telemetry format is the right answer for fixed-record-shape, byte-budget-constrained, deterministic-signed embedded telemetry. It is the wrong answer for variable-shaped business records like Person. Trying to use it for Person would produce a worse format than either MessagePack or Avro.

If you find yourself extending your custom format to handle a shape it wasn't designed for, the right move is usually to declare that shape outside the format's domain and use a different format for it. The custom format covers its niche; the rest of the system uses something else.

A note on the relationship between this book and your judgment

Thirty-three chapters and four appendices is a lot of material about serialization formats. The book has tried to be honest about what each format is for, where it succeeds, where it fails, and how to choose between them. None of the chapters tell you what to do; they describe what is possible and what the trade-offs are. The decision is, and remains, yours.

The book's principal claim is that the decision is worth making deliberately. The format choice will outlast most technical decisions you make in your career. The systems built on a format inherit its constraints, and the constraints are not always the ones the format's marketing materials advertise. Reading the spec, walking the bytes, mapping the format onto the seven axes: these are activities that pay back many times over the modest time they take.

If you do that, the book has done its job. The next time someone in your organization says "we use Protobuf" or "we use Parquet," you will know what is being claimed, what is being omitted, and what questions to ask before assuming the sentence ended the conversation.

Appendix A: Hex Tours

This appendix collects the wire-tour bytes for our Person record across every format in the book, side by side. The intent is that you can scan vertically through the table to compare how each format handles the same logical value. The chapter that introduces each format walks the bytes in detail; the appendix is the reference for the bytes themselves.

The Person record, restated:

id          uint64            42
name        string            "Ada Lovelace"
email       optional string   "ada@analytical.engine"
birth_year  int32             1815
tags        list<string>      ["mathematician", "programmer"]
active      bool              true

For columnar formats (Arrow IPC, Parquet, ORC, Feather), a single record is uncharacteristic of the format's intended workload; the encoding shown is for one record, with the realistic per-record amortized cost noted in parentheses where applicable.

Size summary

The same Person record, encoded by every format in this book:

Format                   Bytes (single record)   Notes
postcard                 66                      Tied for smallest
bincode 2.0 (default)    ~66                     Effectively tied with postcard
Avro                     67                      Smallest schema-first format
Protobuf                 71
Thrift Compact           71                      Effectively tied with Protobuf
ASN.1 PER (unaligned)    ~73                     Smaller with constraints
SCALE                    75
Bond Compact Binary      ~75
ASN.1 DER                78                      Fully deterministic
ROS 1                    89
Borsh                    90
Smile                    ~90                     Single-record (key sharing not engaged)
SBE                      92                      Includes 8-byte message header
ROS 2 (CDR / XCDR2)      ~100                    Includes 4-byte CDR header
MessagePack              104
XDR                      104
CBOR                     105                     One byte over MessagePack on id
bincode 1.x (default)    110                     Fixed-width 8-byte length prefixes
Hessian 2 (single)       ~125                    Class-def overhead amortizes per stream
FlatBuffers              ~130                    Aligned + padded for zero-copy
NBT                      135                     Uncompressed; gzip cuts to ~100
Cap'n Proto              144                     Word-aligned
rkyv                     ~144                    Comparable to Cap'n Proto
BSON                     148                     Array as document with stringified keys
Arrow IPC                ~836                    Single record; ~50 bytes/record at scale
Parquet                  ~1500-2000              Single record; ~25 bytes/record at scale
ORC                      ~1500-2000              Single record; ~25 bytes/record at scale
Feather V2               ~836                    = Arrow IPC file format

The headline observation from the table is that the schema-first formats cluster between 67 and 100 bytes, the schemaless self-describing formats cluster between 100 and 150, the zero-copy formats cluster between 130 and 150, and the columnar formats are catastrophically inefficient for single records but dominate at scale.

Schema-first row formats

Protobuf (71 bytes)

08 2a                                        field 1 (id), varint, value 42
12 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65    field 2 (name), len 12, "Ada Lovelace"
1a 15 61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                field 3 (email), len 21, "ada@..."
20 97 0e                                     field 4 (birth_year), varint, value 1815
2a 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e field 5 (tags), len 13, "mathematician"
2a 0a 70 72 6f 67 72 61 6d 6d 65 72          field 5 (tags), len 10, "programmer"
30 01                                        field 6 (active), varint, value 1

Thrift Compact (71 bytes)

16 54                                        field 1 (delta 1, type i64=6), zigzag(42)=84
18 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65    field 2 (string), len 12, "Ada Lovelace"
18 15 61 64 61 40 61 6e 61 6c 79 74 69 63
   61 6c 2e 65 6e 67 69 6e 65                field 3 (string), len 21, "ada@..."
15 ae 1c                                     field 4 (i32), zigzag(1815)=3630
19 28                                        field 5 (list header: size 2, elem type string)
   0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   len 13, "mathematician"
   0a 70 72 6f 67 72 61 6d 6d 65 72            len 10, "programmer"
11                                           field 6 (bool=true=1)
00                                           stop field

Avro (67 bytes)

54                                           id: zigzag(42) = 84
18 41 64 61 20 4c 6f 76 65 6c 61 63 65       name: len 12, "Ada Lovelace"
02                                           email union branch 1 (string)
   2a 61 64 61 40 61 6e 61 6c 79 74 69 63 61 6c
      2e 65 6e 67 69 6e 65                   email value: len 21, "ada@..."
ae 1c                                        birth_year: zigzag(1815)
04                                             tags array block of 2
   1a 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician"
   14 70 72 6f 67 72 61 6d 6d 65 72            "programmer"
00                                             tags array terminator
01                                           active: true

ASN.1 DER (78 bytes)

30 4c                                        SEQUENCE, length 76
   02 01 2a                                  INTEGER, length 1, value 42
   0c 0c 41 64 61 20 4c 6f 76 65 6c 61 63 65 UTF8String len 12
   0c 15 61 64 61 40 61 6e 61 6c 79 74 69 63
        61 6c 2e 65 6e 67 69 6e 65           UTF8String len 21
   02 02 07 17                               INTEGER 1815
   30 1b                                     SEQUENCE OF len 27
      0c 0d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e
      0c 0a 70 72 6f 67 72 61 6d 6d 65 72
   01 01 ff                                  BOOLEAN TRUE (DER: 0xff)

Schemaless self-describing formats

MessagePack (104 bytes)

86                                           map of 6 entries
  a2 69 64                                     key "id"
  2a                                             value 42 (positive fixint)
  a4 6e 61 6d 65                               key "name"
  ac 41 64 61 20 4c 6f 76 65 6c 61 63 65       value "Ada Lovelace" (fixstr 12)
  a5 65 6d 61 69 6c                            key "email"
  b5 61 64 61 40 61 6e 61 6c 79 74 69 63
     61 6c 2e 65 6e 67 69 6e 65                value "ada@..." (fixstr 21)
  aa 62 69 72 74 68 5f 79 65 61 72             key "birth_year"
  cd 07 17                                     value 1815 (uint16, BE)
  a4 74 61 67 73                               key "tags"
  92                                             array of 2
    ad 6d 61 74 68 65 6d 61 74 69 63 69 61 6e   "mathematician"
    aa 70 72 6f 67 72 61 6d 6d 65 72            "programmer"
  a6 61 63 74 69 76 65                         key "active"
  c3                                           true

CBOR (105 bytes)

a6                                           map of 6 entries
  62 69 64                                     key "id" (text string len 2)
  18 2a                                        value 42 (uint, 1-byte follow)
  64 6e 61 6d 65                               key "name"
  6c 41 64 61 20 4c 6f 76 65 6c 61 63 65       value "Ada Lovelace"
  65 65 6d 61 69 6c                            key "email"
  75 61 64 61 40 61 6e 61 6c 79 74 69 63
     61 6c 2e 65 6e 67 69 6e 65                value "ada@..."
  6a 62 69 72 74 68 5f 79 65 61 72             key "birth_year"
  19 07 17                                     value 1815 (uint, 2-byte follow)
  64 74 61 67 73                               key "tags"
  82                                             array of 2
    6d 6d 61 74 68 65 6d 61 74 69 63 69 61 6e
    6a 70 72 6f 67 72 61 6d 6d 65 72
  66 61 63 74 69 76 65                         key "active"
  f5                                           value true

BSON (148 bytes)

94 00 00 00                                  document length 148 (LE)
12 69 64 00                                  type i64, key "id"
   2a 00 00 00 00 00 00 00                   value 42
02 6e 61 6d 65 00                            type string, key "name"
   0d 00 00 00                               len 13 (12 + null)
   41 64 61 20 4c 6f 76 65 6c 61 63 65 00    "Ada Lovelace\0"
02 65 6d 61 69 6c 00                         type string, key "email"
   16 00 00 00                               len 22
   61 64 61 40 61 6e 61 6c 79 74 69 63 61
      6c 2e 65 6e 67 69 6e 65 00             "ada@...\0"
10 62 69 72 74 68 5f 79 65 61 72 00          type i32, key "birth_year"
   17 07 00 00                               value 1815 LE
04 74 61 67 73 00                            type array, key "tags"
   2c 00 00 00                               inner doc length 44
   02 30 00                                    type string, key "0"
      0e 00 00 00 6d 61 74 68 65 6d 61 74 69 63 69 61 6e 00
   02 31 00                                    type string, key "1"
      0b 00 00 00 70 72 6f 67 72 61 6d 6d 65 72 00
   00                                          inner doc terminator
08 61 63 74 69 76 65 00                      type bool, key "active"
   01                                        value true
00                                           document terminator

Zero-copy and fixed-layout formats

FlatBuffers (~130 bytes)

The byte-level layout depends on encoder choices and alignment. The structure is: 4-byte root offset, out-of-line strings and the tags vector, the Person table's vtable (16 bytes), and the Person table's data section (32 bytes including padding). Total in the 130-150 byte range.

Cap'n Proto (144 bytes)

00 00 00 00 11 00 00 00         frame: 0 segments-minus-1, 17 words (136 bytes)
00 00 00 00 02 00 03 00         root pointer: 2 data words, 3 ptr words
2a 00 00 00 00 00 00 00         id = 42
17 07 00 00 01 00 00 00         birthYear=1815, active=true (low bit)
0d 00 00 00 6a 00 00 00         pointer to "Ada Lovelace" (13 bytes incl NUL)
19 00 00 00 b2 00 00 00         pointer to email (22 bytes incl NUL)
21 00 00 00 16 00 00 00         pointer to tags list (2 pointers)
... (out-of-line text and list data; see chapter 13 for the full walk)

SBE (92 bytes)

10 00 01 00 01 00 01 00         message header
2a 00 00 00 00 00 00 00         id = 42
17 07 00 00                     birthYear = 1815
01                              active = 1
00 00 00                        padding
0c 00 41 64 61 20 4c 6f 76 65 6c 61 63 65         name (len 12 + bytes)
15 00 61 64 61 40 ...                              email (len 21 + bytes)
00 00 02 00 0d 00 6d 61 ... 0a 00 70 72 ...        tags group (count 2 + entries)

rkyv (~144 bytes)

The exact layout depends on the rkyv version and the encoder's ordering choices. The general shape: out-of-line strings first, then the tags list of pointers, then the Person struct, with a root pointer at the end of the buffer.

Deterministic and constrained formats

Borsh (90 bytes)

2a 00 00 00 00 00 00 00         id: u64 LE
0c 00 00 00                     name length: u32 LE
41 64 61 20 4c 6f 76 65 6c 61 63 65   "Ada Lovelace"
01                              email Some discriminant
15 00 00 00                     email length: u32 LE
61 64 61 40 ...                 "ada@..."
17 07 00 00                     birth_year: i32 LE
02 00 00 00                     tags count: u32 LE
0d 00 00 00 6d 61 74 ...        "mathematician"
0a 00 00 00 70 72 6f ...        "programmer"
01                              active

SCALE (75 bytes)

2a 00 00 00 00 00 00 00         id: u64 LE
30                              name compact length: 12
41 64 61 20 4c 6f 76 65 6c 61 63 65   "Ada Lovelace"
01                              email Some
54                              email compact length: 21
61 64 61 40 ...                 "ada@..."
17 07 00 00                     birth_year: i32 LE
08                              tags compact count: 2
34 6d 61 74 ...                 mathematician
28 70 72 6f ...                 programmer
01                              active

XDR (104 bytes)

00 00 00 00 00 00 00 2a         id: u64 BE
00 00 00 0c                     name length: 12
41 64 61 20 4c 6f 76 65 6c 61 63 65   "Ada Lovelace" (no padding, len 12)
00 00 00 01                     email present flag
00 00 00 15                     email length: 21
61 64 61 40 ... 65 00 00 00     "ada@..." + 3 bytes padding
00 00 07 17                     birth_year: i32 BE
00 00 00 02                     tags count: 2
00 00 00 0d 6d 61 74 ... 6e 00 00 00   "mathematician" + 3 padding
00 00 00 0a 70 72 6f ... 72 00 00      "programmer" + 2 padding
00 00 00 01                     active: 1 (bool as u32)

Other notable encodings

postcard (66 bytes)

2a                              id: varint(42)
0c                              name length: varint(12)
41 64 61 20 4c 6f 76 65 6c 61 63 65
01                              email Some
15                              email length: varint(21)
61 64 61 40 ... 65              "ada@..."
ae 1c                           birth_year: zigzag-varint(1815)
02                              tags count: varint(2)
0d 6d 61 74 ...                 "mathematician"
0a 70 72 6f ...                 "programmer"
01                              active

NBT (135 bytes uncompressed)

0a 00 00                        TAG_Compound, name ""
04 00 02 69 64 00 00 00 00 00 00 00 2a       TAG_Long "id" = 42
08 00 04 6e 61 6d 65 00 0c 41 64 61 ...      TAG_String "name" = "Ada Lovelace"
08 00 05 65 6d 61 69 6c 00 15 61 64 61 ...   TAG_String "email" = "ada@..."
03 00 0a 62 69 72 74 68 5f 79 65 61 72 00 00 07 17  TAG_Int "birth_year" = 1815
09 00 04 74 61 67 73 08 00 00 00 02 ...      TAG_List "tags" of String, 2 entries
01 00 06 61 63 74 69 76 65 01                TAG_Byte "active" = 1
00                              TAG_End

Hessian 2 (~125 bytes single instance)

The class definition is emitted before the first instance:

43 06 50 65 72 73 6f 6e 96      class def: 6 fields named "Person"
   02 69 64 04 6e 61 6d 65 ...  field name strings
60                              instance reference (first class)
ba                              id: 42 (compact long)
0c "Ada Lovelace"               name (short string)
15 "ada@analytical.engine"      email
59 17 07                        birth_year (3-byte int form)
58 96 0d "math..." 0a "prog..."  tags list
54                              active = true

How to read this appendix

The bytes above are the same Person record. Reading vertically through the formats, you can see the design choices in their purest form: which formats spend bytes on field names, which on length prefixes, which on alignment padding, which on type tags, which on framing, and which manage to encode the value in near-minimal size by trading off all of these.

The single most useful exercise the appendix supports is to look at any specific design choice across formats. How does each format encode 42? Look at the first byte after each format's header. How does each format encode the boolean true? Look at the last byte before the terminator. How does each format encode the absence of email? Compare the encodings above against what each format would produce without the field: for some formats the difference is a missing key; for others, a flipped presence bit; for others, a discriminant byte that changes from 1 to 0.

The appendix is a compact view of the book's central thesis: the same value, encoded by formats with different design priorities, produces dramatically different bytes. The bytes are the design.

Appendix B: Benchmark Methodology

This book has refused to publish a benchmark shootout. The foreword explained briefly why, and several chapters touched the topic in passing. This appendix collects the substantive argument: most published serialization benchmarks are misleading, the misleading ones are usually misleading in predictable ways, and the small number that are honest are dramatically less useful than they look.

The appendix also describes what an honest benchmark looks like, in case you want to run one yourself. The conclusion will probably be that you should not.

The seven ways benchmarks lie

The implementation, not the format. A serialization benchmark measures the implementation, not the format. Two Protobuf implementations can differ in encode/decode speed by 5× or more. A benchmark of "Protobuf vs. CBOR" using the slowest Protobuf implementation against the fastest CBOR library produces a defensible-looking result that is actually a comment on library quality. The problem is that the benchmark's title suggests it is about the formats; readers extract format-level conclusions from implementation-level data.

The mitigation: every published benchmark should name the exact library and version under test, and the conclusions should be phrased in terms of those libraries, not the formats.

The payload that suits the format under test. Pick a payload with many small integers, and Protobuf wins on size. Pick a payload with many strings, and the gap shrinks. Pick a payload with deeply nested optional fields, and Avro looks bad while schemaless formats look comparable. The choice of payload determines the conclusion. Benchmarks that present results from a single payload, especially one chosen by the author, should be treated as a directional hypothesis rather than a definitive finding.

Cold-start vs. steady-state misalignment. Some benchmarks measure the first encode/decode of a fresh process, including JIT warmup, allocation overhead, and code-path priming. Some measure the steady state after thousands of iterations. The ratio of these two numbers can be 10× or more for some formats, and the choice of which to report depends on the use case. Benchmarks that do not specify which they measured can be off by an order of magnitude in either direction.

Allocator overhead. Encoding and decoding allocate memory for the in-memory representation. The allocator's cost is sometimes the bulk of the benchmark, and the allocator's behavior depends on the language runtime, the OS, the binary's heap configuration, and the memory pressure on the system. A benchmark of "decode speed" that runs in a constrained-memory container produces different numbers than one running on a large machine.

Compression layer ignorance. Most production deployments compress the wire bytes (gzip, snappy, zstandard). Benchmarks that report uncompressed sizes ignore the layer that often dominates the actual size on the wire. Two formats that produce substantially different uncompressed bytes may produce similar compressed bytes, and the size question for the deployment is about compressed bytes.

Network-stack omission. A benchmark of "RPC speed" that measures only the encode-and-decode time misses the network stack, the TLS layer, the proxy, and the queueing delays that in practice dominate end-to-end latency. The format choice matters in proportion to its share of the total time, and that share is often single-digit percent.

Selective quoting. Benchmark results that show one format "3× faster" sometimes come from comparisons where the absolute time is microseconds, and the 3× is from 1µs to 3µs. The relative number is real; the absolute number is below the noise floor of the actual production workload. Benchmark quoting that strips the absolute numbers loses critical context.

The eighth way benchmarks lie: composition

These seven failure modes compound. A benchmark that uses an old library version, on a payload chosen for its format's strengths, in cold-start mode, without compression, reporting only relative numbers, is wrong in five ways at once. The failures do not cancel; they compound. A benchmark with five of these problems is essentially noise, regardless of how careful the rest of its methodology was.

This is the genuine reason the book has refused to publish a shootout. A benchmark that I would believe (no compounding failures, fair payload selection, multiple library versions, both cold-start and steady-state numbers, with-and-without compression, in a real network stack) would take weeks to produce, would apply only to one specific workload, and would need to be re-run every time a library updated. The result would be modestly useful for that specific workload and actively misleading for any other. The honest answer was to not publish.

What an honest benchmark looks like

If you must run one yourself, the structure is:

  1. Define the workload precisely. What is the payload shape? How many records per message? What is the access pattern (encode, decode, partial decode, random access)? What is the compression layer? What is the language runtime, the OS, the hardware?

  2. Choose libraries deliberately. Use the most-recent stable version of each format's primary library, with the configuration most production deployments would use. If multiple libraries exist for a format, run all of them or document which you chose and why.

  3. Measure both encode and decode separately. Encode time is paid by writers; decode by readers; the costs are not symmetric and many workloads care about one more than the other.

  4. Measure size at multiple compression levels. Uncompressed, compressed with the codec you would use in production, and ideally with two or three codecs for comparison.

  5. Measure cold-start and steady-state separately. Both numbers are useful for different workloads; reporting only one obscures the other. (A minimal harness separating the two appears after this list.)

  6. Measure resource use. Memory allocations, peak heap, CPU utilization. Format choice can affect these even when throughput numbers are similar.

  7. Run on representative hardware. A benchmark on a 64-core server tells you nothing about a 4-core embedded device. If your deployment target differs from the benchmark machine, note it explicitly.

  8. Run repeatedly with statistical rigor. Single runs are noise. Report means, standard deviations, and outlier counts. Outlier behavior often differs more between formats than mean behavior.

  9. Include a real workload, not a synthetic one. Synthetic workloads (random data, fixed-size records) miss the characteristics of real data (skewed distributions, correlated fields, locality patterns) that affect format performance.

  10. Publish the code and the data. A benchmark that cannot be reproduced is not a benchmark.
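
As a minimal sketch of steps 3 and 5, here is the shape of a harness that separates cold-start from steady-state. The function encode_person is a hypothetical stand-in for whatever library call is under test; a real harness would also record per-iteration samples for the statistics in step 8:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern size_t encode_person(uint8_t *buf, size_t cap);  /* hypothetical */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    uint8_t buf[256];

    /* Cold start: the very first call, including any lazy initialization. */
    uint64_t t0 = now_ns();
    encode_person(buf, sizeof buf);
    uint64_t cold_ns = now_ns() - t0;

    /* Steady state: mean over many iterations after a warmup phase. */
    for (int i = 0; i < 10000; i++) encode_person(buf, sizeof buf);
    t0 = now_ns();
    for (int i = 0; i < 100000; i++) encode_person(buf, sizeof buf);
    uint64_t steady_ns = (now_ns() - t0) / 100000;

    printf("cold-start %llu ns, steady-state mean %llu ns\n",
           (unsigned long long)cold_ns, (unsigned long long)steady_ns);
    return 0;
}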

Why most benchmarks fail

The methodology above is expensive. Producing one benchmark for one workload takes a person-week. Producing benchmarks for the range of workloads you actually care about takes more. Most people do not have the time, and so they publish a benchmark that took an afternoon and skipped the steps that would have cost most of the work.

The afternoon-benchmark genre is not malicious. It is the result of well-meaning engineers trying to inform a community discussion about format choice. The community would be better served by no benchmark at all than by an afternoon-benchmark, because the afternoon-benchmark is treated as authoritative and informs decisions that should have been driven by other considerations.

What format-choice decisions should be driven by instead

Almost everything except benchmarks. The decision frameworks chapter laid out the questions to ask. The format chapters laid out the trade-offs each format makes. The right inputs to a format-choice decision are: the workload's structural properties (boundary, schema control, read/write ratio, determinism, language landscape, data shape, operational appetite), the format's track record on the specific properties that matter, and the team's familiarity with the surrounding tooling. Speed is somewhere on the list, but rarely near the top.

The corollary is that if a format is fast enough for your workload, the speed difference between it and an alternative is mostly irrelevant. This is the case for almost every workload. The cases where speed dominates the choice are real but rare. Most teams are choosing between formats whose speeds are within 2× of each other, and within 2× the speed difference is almost always smaller than the difference in operational properties.

A small acknowledgment

The argument above is not that no benchmarks should ever be run. It is that benchmarks should be run for your specific workload, under your specific conditions, at the time you make the decision, and that published benchmarks should be treated as background context rather than as authoritative input. The distinction is small but important. Run your own benchmarks if the format choice is on the margin. Do not let someone else's benchmarks make the choice for you.

A summary list

Things to remember about benchmarks:

  • The benchmark measures the implementation, not the format.
  • The payload determines the result; the wrong payload produces the wrong conclusion.
  • Cold-start and steady-state numbers can differ by an order of magnitude.
  • Allocator overhead, network stack, and compression all matter and are often missing.
  • Selective quoting (relative numbers without absolute context) is misleading.
  • Composing several of these failures is what produces the benchmarks that fail most badly.
  • Honest benchmarks are expensive to produce.
  • Most format-choice decisions should not be benchmark-driven.

The next appendix is the glossary, where the vocabulary introduced across the book is collected for reference.

Appendix C: Glossary

A reference for the vocabulary used throughout the book. Terms are listed alphabetically. Cross-references to other glossary entries are italicized; cross-references to chapters are by chapter number.

Alignment. The requirement that a multi-byte value start at a byte offset divisible by some power of two. Zero-copy formats (Cap'n Proto, FlatBuffers, rkyv, Apache Arrow) require alignment so that values can be read with native CPU loads. The cost is padding bytes between fields.

Append-only evolution. A schema-evolution rule that allows new fields to be added at the end of a record but forbids any other change. FlatBuffers and Cap'n Proto use this rule.

ASN.1. Abstract Syntax Notation One. A schema language specified by the ITU-T, with a family of encoding rules (BER, DER, PER, OER, etc.) that produce bytes from values. Chapter 20.

Backward compatibility. A schema-evolution property: code written for the new schema can read bytes produced under the old schema.

BER. Basic Encoding Rules. The original ASN.1 encoding, TLV-shaped and non-deterministic. Chapter 20.

Big-endian. A byte-ordering convention where the most significant byte of a multi-byte value comes first.

Bond. Microsoft's schema-first wire format. Chapter 25.

Borsh. Binary Object Representation Serializer for Hashing. A deterministic binary format from the NEAR Protocol ecosystem. Chapter 22.

BSON. Binary JSON. The wire format used by MongoDB. Chapter 6.

Canonical encoding. A subset of a format's encoding rules that produces a unique byte representation for each value. CBOR, ASN.1 DER, Cap'n Proto, and several others have canonical-encoding subsets.

Cap'n Proto. A zero-copy schema-first format with a capability-based RPC layer. Chapter 13.

CBOR. Concise Binary Object Representation. RFC 8949. Chapter 5.

CDR. Common Data Representation. The OMG-standardized wire format used by DDS and ROS 2. Chapter 24.

Codegen. Generation of language bindings from a schema at build time. Most schema-first formats have a codegen step.

Columnar. A data layout where all values of one field across many records are stored contiguously, in contrast to row-oriented layouts. Parquet, ORC, and Apache Arrow are columnar. Chapters 16-19.

Compression. A separate layer applied on top of a serialization format to reduce wire size. Common codecs include gzip, snappy, zstandard, LZ4. The format choice is orthogonal to the compression choice but the two interact.

Conformance suite. A set of test cases, in a language-neutral format, that any implementation of a format must pass.

Decoder hostility. A defensive-coding posture where the decoder validates input thoroughly to handle malformed, adversarial, or corrupted bytes. Chapter 32.

Delta encoding. A compression technique where successive values are encoded as differences from the previous value. Effective for sorted or nearly-sorted columns; used in Parquet and ORC.

DER. Distinguished Encoding Rules. The deterministic subset of ASN.1 BER, used for X.509 certificates. Chapter 20.

Deterministic encoding. A property of a format where the same value always produces the same bytes. Required for hashing, signing, and content-addressable storage.

Dictionary encoding. A compression technique where a column's values are replaced by integer indices into a separate dictionary buffer. Effective for low-cardinality columns; used in Parquet, ORC, Apache Arrow.

Discriminant. A byte or integer in a serialized union or enum that identifies which variant the value is.

Evolution. Schema change over time. The format's evolution rules determine which changes are safe and which are breaking. Chapter 11.

Extension marker. ASN.1's ... syntax that indicates fields may be added in future versions; old decoders skip the new fields cleanly. Chapter 20.

Field tag. A small integer (or, in some formats, a string) that identifies a field within a record. Tagged-field formats use field tags as the wire-level identifier; positional formats do not.

FlatBuffers. A zero-copy schema-first format originally for mobile games. Chapter 12.

Forward compatibility. A schema-evolution property: code written for the old schema can read bytes produced under the new schema (typically by ignoring unknown fields).

Hessian. A self-describing Java RPC format with object-reference preservation. Chapter 27.

IDL. Interface Definition Language. The schema language of a format. Protobuf's .proto files, Thrift's .thrift files, Cap'n Proto's .capnp files are all IDLs.

Ion. Amazon's typed binary-and-text data format. Chapter 7.

Length-delimited. A wire-format pattern where a value is preceded by an integer giving its byte length, followed by that many bytes. Used for strings, byte arrays, and nested messages in many formats.

Little-endian. A byte-ordering convention where the least significant byte of a multi-byte value comes first. The native order on x86 and most ARM systems.

Magic number. A fixed sequence of bytes at the start of a file or message that identifies the format. Parquet's PAR1 is an example.

MD5 hash gate. ROS 1's mechanism for enforcing exact schema match between publisher and subscriber. Chapter 24.

MessagePack. A schemaless self-describing binary format, data-model-equivalent to JSON. Chapter 4.

NBT. Named Binary Tag. The format used by Minecraft. Chapter 23.

OER. Octet Encoding Rules. A modern ASN.1 encoding that trades some of PER's density for simpler implementation. Chapter 20.

Optional field. A field that may or may not be present in a record. Different formats represent absence differently: omission (MessagePack, CBOR), null union branch (Avro), zero discriminant byte (Borsh, Postcard), missing tag in vtable (FlatBuffers).

ORC. Optimized Row Columnar. A columnar storage format in the Hive lineage. Chapter 18.

Parquet. The dominant at-rest analytical storage format. Chapter 17.

PER. Packed Encoding Rules. The dense ASN.1 encoding used in cellular signaling protocols. Chapter 20.

Postcard. A no_std-friendly serde-binary format for Rust. Chapter 26.

Predicate pushdown. An optimization where a query's filter predicate is evaluated against column statistics before reading the actual data, allowing entire row groups or files to be skipped.

Protobuf. Protocol Buffers. Google's schema-first wire format. Chapter 8.

Repetition level. Parquet/Dremel-style metadata indicating which level of nesting a value belongs to. Used to encode repeated and optional nested fields.

Reserved. Protobuf's keyword for retiring a field number without deleting it from the schema, so that the number cannot be reused for a different field. Chapter 8.

rkyv. A zero-copy serde-adjacent format for Rust. Chapter 15.

RLE. Run-Length Encoding. A compression technique that replaces runs of identical values with a count and a single value. Used by Parquet and ORC.

ROS msgs. The wire formats of the Robot Operating System (ROS 1 custom; ROS 2 uses CDR). Chapter 24.

SBE. Simple Binary Encoding. The format for low-latency financial trading. Chapter 14.

SCALE. Simple Concatenated Aggregate Little-Endian. Substrate/Polkadot's deterministic binary format. Chapter 22.

Schema. A description of the structure and types of values that will be encoded. Schemas can live in the bytes (Avro container files), in a file (Protobuf .proto), or in source code (rkyv, postcard).

Schema registry. An external service that stores and distributes schemas, often with automated compatibility checking. The Confluent Schema Registry is the canonical example.

Schema resolution. Avro's algorithm for reconciling a reader's schema with a writer's schema at decode time.

Schemaless. A format that does not require an external schema; the bytes carry their own type and structural information. MessagePack, CBOR, BSON, NBT are schemaless.

Self-describing. A format whose bytes contain enough information to be decoded without external metadata. Note that self-describing and schemaless are not synonyms; Avro Object Container Files are self-describing but schema-required.

Smile. Jackson's binary JSON format with key sharing. Chapter 7.

Tagged-field evolution. A schema-evolution model where fields are identified by stable numeric tags on the wire, allowing additions, removals, and renames without breaking compatibility. Used by Protobuf, Thrift, Bond.

Thrift. Apache Thrift, the schema-first wire format from Facebook. Chapter 9.

TLV. Tag-Length-Value. A wire format pattern where each value is preceded by a type tag and a length. Used by ASN.1 BER.
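
A sketch with single-byte tags and lengths for simplicity; real TLV encodings such as BER allow multi-byte tags and long-form lengths:

```python
def write_tlv(tag: int, value: bytes) -> bytes:
    # Simplified: one tag byte, one length byte. BER's actual rules
    # (multi-byte tags, long-form lengths) are more involved.
    assert 0 <= tag <= 0xFF and len(value) <= 0x7F
    return bytes([tag, len(value)]) + value

def read_tlv(buf: bytes, offset: int = 0):
    tag, length = buf[offset], buf[offset + 1]
    start = offset + 2
    return tag, buf[start:start + length], start + length

record = write_tlv(0x0C, "hi".encode("utf-8"))  # 0x0C: BER's UTF8String tag
tag, value, _ = read_tlv(record)
assert (tag, value) == (0x0C, b"hi")
```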

UBJSON. Universal Binary JSON. Chapter 7.

Varint. A variable-length integer encoding. Multiple flavors exist (LEB128, Protobuf's varint, ZigZag-encoded signed varint); they differ in details but share the property that small values use few bytes.
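
A sketch of the unsigned base-128 flavor (LEB128-style, which is also what Protobuf uses for unsigned values): seven payload bits per byte, high bit as a continuation flag.

```python
def encode_uvarint(n: int) -> bytes:
    """Low 7 bits per byte, least-significant group first;
    the high bit of each byte means 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uvarint(buf: bytes, offset: int = 0):
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

# 300 encodes as AC 02, the example Protobuf's own docs use:
assert encode_uvarint(300) == b"\xac\x02"
assert decode_uvarint(b"\xac\x02") == (300, 2)
```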

Vtable. Virtual Table. FlatBuffers' per-instance metadata that records which fields are present in a particular instance. Chapter 12.

Wire format. The on-the-wire byte representation of a serialization format, distinct from the in-memory representation or the source-code IDL.

XCDR2. Extended CDR version 2. The CDR variant most commonly used by ROS 2. Chapter 24.

XDR. External Data Representation. RFC 4506. Chapter 21.

Zero-copy. A property of a format where decoded values are accessed directly from the byte buffer without parsing or copying. FlatBuffers, Cap'n Proto, and rkyv are zero-copy by design; Apache Arrow is zero-copy in its in-memory form.

ZigZag encoding. A mapping from signed integers to unsigned integers that places small absolute values close to zero, so that varint encoding produces compact bytes for negative numbers. Used by Protobuf's sint32/sint64, Thrift Compact, Avro, and several others.
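
The mapping itself is two expressions; a sketch for 64-bit values:

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Interleave signed values: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return (n << 1) ^ (n >> (bits - 1))

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

# -1 maps to 1 (one varint byte) instead of a sign-extended
# 64-bit value (ten varint bytes):
assert zigzag_encode(-1) == 1 and zigzag_decode(1) == -1
assert zigzag_encode(2) == 4 and zigzag_decode(4) == 2
```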

Appendix D: Further Reading

Annotated pointers to primary sources, secondary sources, and adjacent material that goes deeper than this book has space for. The book has paraphrased rather than quoted; the reading list is where to go for verbatim authority on each format.

Specifications and primary sources

Protobuf. The language reference at protobuf.dev and the encoding documentation at protobuf.dev/programming-guides/encoding. The .proto file syntax is documented separately for proto2 and proto3. The Buf documentation (buf.build/docs) is the best contemporary source for the schema-evolution discipline that proto3 requires.

Thrift. The Thrift specification on the Apache Thrift wiki, plus the BinaryProtocol and CompactProtocol specifications as separate documents. The fbthrift project's documentation (github.com/facebook/fbthrift) covers the extensions in Facebook's fork.

Avro. The Apache Avro specification at avro.apache.org/docs. The Confluent Schema Registry documentation (docs.confluent.io/platform/current/schema-registry) covers the operational pattern that most Avro deployments use.

MessagePack. The specification at msgpack.org, plus the GitHub repository (github.com/msgpack/msgpack) for the implementation list.

CBOR. RFC 8949 (the current spec, replacing RFC 7049). RFC 8610 for CDDL. RFC 8152 for COSE. The IANA tag registry at iana.org/assignments/cbor-tags is the authoritative list of semantic tag assignments.

BSON. The specification at bsonspec.org. The MongoDB manual covers the practical implications.

Smile. The specification at github.com/FasterXML/smile-format-specification. The Jackson data-format module is the canonical implementation.

UBJSON. The specification at ubjson.org.

Amazon Ion. The specification at amzn.github.io/ion-docs, including separate documents for the binary format, the text format, and the type system.

FlatBuffers. The documentation at flatbuffers.dev, including the Building With FlatBuffers tutorial. The reference grammar for .fbs files is in the source distribution. TensorFlow Lite's schema (tensorflow.org/lite/microcontrollers/library) is a substantial real-world example.

Cap'n Proto. The documentation at capnproto.org, including the encoding spec, the RPC protocol spec, and the design rationale. Kenton Varda's blog posts on the format's history are worth reading.

SBE. The specification at github.com/real-logic/simple-binary-encoding. The FIX Trading Community's FIX/SP1 documents cover the financial-trading deployment context.

rkyv. The book at rkyv.org/book and the API documentation at docs.rs/rkyv. The format is moving fast; pin to a specific version when reading.

Apache Arrow. The documentation at arrow.apache.org, including the columnar format specification and the IPC format specification. The Arrow Columnar Format document is the authoritative reference for the in-memory layout. Wes McKinney's blog posts at wesmckinney.com cover the format's history.

Parquet. The Apache Parquet specification at parquet.apache.org/docs. The original Dremel paper (Melnik et al., 2010) is essential reading for the repetition/definition-level encoding. Twitter's engineering blog has historical material on Parquet's development.

ORC. The specification at orc.apache.org/specification. The Hortonworks-era documentation covers the operational patterns most ORC deployments use.

Feather. The Apache Arrow documentation, since Feather V2 is Arrow IPC. The original Feather V1 specification is preserved at github.com/wesm/feather.

ASN.1. The ITU-T X.680 series of recommendations is the authoritative standard. X.680 covers the language; X.690 covers BER, CER, DER; X.691 covers PER; X.696 covers OER. ASN.1: Communication Between Heterogeneous Systems by Olivier Dubuisson is the most accessible textbook.

XDR. RFC 4506 (current; replaces RFC 1014). The Stellar documentation (developers.stellar.org) covers the modern blockchain use of XDR.

Borsh. The specification at borsh.io. The NEAR Protocol documentation covers the smart-contract usage.

SCALE. The specification in the Substrate documentation (docs.substrate.io). The parity-scale-codec Rust crate documentation covers the canonical implementation.

NBT. The specification on the Minecraft Wiki (minecraft.wiki/w/NBT_format). The wiki.vg site covers the network-protocol uses of NBT inside Minecraft.

ROS msgs. The ROS 1 documentation at wiki.ros.org/Messages, the ROS 2 documentation at docs.ros.org. The DDS specification (OMG document formal/2015-04-10) covers the underlying wire format.

Bond. The documentation at microsoft.github.io/bond, plus the GitHub repository at github.com/microsoft/bond.

Postcard and bincode. The crate documentation on docs.rs (docs.rs/postcard, docs.rs/bincode). The bincode 2.0 release notes cover the migration from 1.x.

Hessian. The Hessian 2.0 specification at caucho.com/resin-3.1/doc/hessian-serialization.xtp (Caucho's historical site; mirrored elsewhere). The Apache Dubbo documentation covers the modern deployment context.

Secondary sources and analyses

Several secondary sources are worth knowing about as context for the formats covered:

Designing Data-Intensive Applications by Martin Kleppmann (2017) includes a serialization-format chapter that complements this book's coverage; Kleppmann's framing is more concise and less opinionated than this book's.

The Dremel paper (Melnik et al., 2010, "Dremel: Interactive Analysis of Web-Scale Datasets") is the foundational paper for columnar nested-data encoding and is essential for understanding Parquet, Arrow, and ORC.

The Avro specification itself is unusually well-written for a specification document and is recommended reading even if you do not use Avro.

The Cap'n Proto encoding documentation is similarly well-written and conveys the design rationale better than most format specs.

The blog posts at wesmckinney.com cover the rationale for Arrow and the history of pandas, which is essential context for the analytical-data-format space.

The book RFC 8949 in 100 Pages (a community resource, not an official IETF publication) is a more accessible introduction to CBOR than the RFC itself.

Adjacent material

Several adjacent topics deserve mention:

Compression codecs. Most binary serialization deployments use compression on top. The standard references are: zstandard's documentation at facebook.github.io/zstd, Snappy's at google.github.io/snappy, gzip's RFC 1952. Yann Collet's blog covers the algorithmic side.

Cryptographic protocols. TLS, X.509, COSE, JOSE. The RFCs are authoritative; among them, RFC 8446 (TLS 1.3) is unusually readable for a standards document.

Database storage formats. RocksDB's SST file format, LevelDB's table format, MongoDB's WiredTiger format, PostgreSQL's heap format. These are not covered in this book but use design techniques (LSM-tree variants, B+-tree variants) that intersect with serialization-format design.

Distributed log formats. Kafka's log format, AWS Kinesis's record format. These wrap the formats covered in this book in their own framing layers.

Schema languages. JSON Schema, OpenAPI, GraphQL SDL. These are not binary serialization formats but cover schema-evolution ground that overlaps with the formats in this book.

A note on what this book is not

The book has been deliberately scoped to binary serialization formats. JSON, XML, YAML, TOML, CSV, MIME-type-encoded protocols, and other text formats are mostly out of scope, with mentions where they intersect (Protobuf's JSON encoding, Avro's JSON encoding, BSON's Extended JSON). A full survey of text formats would be a different book.

The book has also been scoped to single-record-and-stream formats. Database storage formats (LSM trees, B-trees, columnar storage layers underneath query engines), filesystem formats (ZFS, Btrfs, ext4 inode structures), and networking-protocol framing (TCP segments, IP packets, HTTP/2 frames) are all outside scope. They are interesting; they are not what this book covers.

A closing pointer

If you encounter a format not covered here, the most useful exercise is the one chapter 2 introduced: take the format's specification and locate, within an hour, where it sits on the seven axes. The exercise will tell you what kind of format it is, what trade-offs it makes, and where in this book's conceptual landscape it belongs. That is the skill the book has tried to teach. The formats themselves come and go; the skill is durable.