A Field Guide to Failure
This chapter is a catalog. Real bugs, in real software, that came down to exception-safety violations. Where I can link to a public post-mortem, I do; where the bug was reported in a bug tracker or paper, I cite that. The point is to make the abstractions of the previous chapters concrete: each of these is what the failure mode looks like when it ships.
I have ordered them roughly by impact, descending, starting with the most expensive single instance.
The DAO (June 2016): ~$50M
Already discussed at length in chapter 8. The summary in field-guide form:
- Class: Reentrancy / partial state update across opaque call.
- Mechanism: A function transferred Ether to an attacker-controlled address before updating the attacker’s recorded balance. The attacker’s address was a contract that, on receiving Ether, called back into the function, observing the un-updated balance and withdrawing again. Recursively.
- Fix shape: Checks-effects-interactions ordering. Update the balance before the external call (a sketch follows this list).
- Cost: ~3.6 million ETH, ~$50M at the time, ~$11B at 2024 prices. Resolved by hard-forking the Ethereum chain to reverse the transfer, which itself was the most divisive event in cryptocurrency history.
- Why it shipped: The pattern of “send funds, then update accounting” was widespread; the contract had been audited and the auditors did not flag this pattern. The reentrancy attack was known to specialists but not widely understood among Solidity developers in mid-2016.
- Reading: https://hackingdistributed.com/2016/06/18/analysis-of-the-dao-exploit/
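The fix shape fits in a few lines. What follows is a minimal C++ analog, not the actual Solidity: Ledger, withdraw_unsafe, and withdraw_safe are invented names, and the "external call" is modeled as a user-supplied callback that may re-enter before returning. The point is the ordering, not the language.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Minimal C++ analog of the vulnerable pattern. The "external call" is a
// user-supplied callback that may re-enter withdraw before it returns.
struct Ledger {
    std::unordered_map<std::string, long> balances;

    // Broken ordering: interact, then record the effect. A callback that
    // re-enters withdraw_unsafe still sees the old balance and can drain it.
    void withdraw_unsafe(const std::string& who, long amount,
                         const std::function<void(long)>& send) {
        if (balances.at(who) < amount) throw std::runtime_error("insufficient");
        send(amount);                // external call first: the re-entrancy window
        balances.at(who) -= amount;  // effect recorded too late
    }

    // Checks-effects-interactions: record the effect before the external call,
    // so any re-entrant call observes the already-updated balance.
    void withdraw_safe(const std::string& who, long amount,
                       const std::function<void(long)>& send) {
        long& bal = balances.at(who);
        if (bal < amount) throw std::runtime_error("insufficient");
        bal -= amount;               // effect first
        send(amount);                // interaction last
    }
};
```

In the C++ analog, a throw from send after the debit would still call for a scope guard to restore the balance; on a blockchain, reverting the whole transaction performs that rollback.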
Parity Multisig Wallet Self-Destruct (November 2017): ~$280M
A library contract used by many wallets exposed a kill function that was not protected by an access modifier. An attacker took ownership of the library and called kill, which executed selfdestruct on the library. All wallets that depended on it had their delegate-call targets removed; their funds became permanently inaccessible.
- Class: Logical-invariant violation due to insufficient access control around a state-destroying operation.
- Mechanism: The destroyed library contract was a shared dependency. Destroying it left all depending contracts in a state where their core functionality (signing transactions) failed. This is exception safety in the cross-system sense: an operation on one contract violated invariants of every contract that depended on it.
- Fix shape: Access modifiers, plus the structural lesson that state-destroying operations need stronger preconditions than state-modifying ones. In exception-safety language: rolling back a selfdestruct is impossible; the operation has no inverse; therefore it cannot participate in any saga that might fail (a sketch of this ordering discipline follows the list). The lesson the smart-contract community drew was that irreversible, destructive operations need their own access discipline.
- Cost: ~514,000 ETH frozen, ~$280M at the time, much more now.
- Reading: https://www.parity.io/blog/security-alert-2/
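The "cannot participate in any saga that might fail" point can be shown as an ordering discipline. A hedged C++ sketch, not Solidity: Resource, export_state, upload_archive, and decommission are invented names. Every step that can fail runs first; the step with no inverse runs last, after nothing further can throw.

```cpp
#include <cstdio>
#include <string>

// Hypothetical resource with an irreversible teardown, the analog of selfdestruct.
struct Resource {
    std::string export_state() const { return "archived-state"; }  // may throw in a real system
    void destroy() noexcept { std::puts("destroyed (no inverse)"); }
};

// May throw in a real system (network, disk, ...).
void upload_archive(const std::string& archive) {
    std::printf("uploaded %zu bytes\n", archive.size());
}

// Ordering discipline: every fallible step runs before the step that has no
// inverse. If export or upload throws, nothing irreversible has happened yet,
// so there is no state that would need an impossible rollback.
void decommission(Resource& r) {
    std::string archive = r.export_state();
    upload_archive(archive);
    r.destroy();   // irreversible commit point: last, and noexcept
}

int main() {
    Resource r;
    decommission(r);
    return 0;
}
```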
The Therac-25 (1985–1987): three deaths, more injuries
The Therac-25 was a radiation therapy machine whose software contained, among many bugs, a race condition between operator input and beam control state. A specific quick edit by an operator could leave the beam in a high-current configuration intended for a different therapy mode, resulting in massive radiation overdoses.
- Class: Concurrent partial-state update; race between two state machines.
- Mechanism: The operator’s input was processed while the beam controller was still mid-transition between modes. The resulting state was inconsistent: the software had accepted the operator’s corrected, low-current electron mode, while the beam hardware was still configured with the high current used for X-ray mode, where the beam is aimed at a target that attenuates it. The basic guarantee was violated at the system level: a partial state update was visible to a downstream component.
- Fix shape: Strong-guarantee state transitions enforced by interlocks. Don’t accept operator input while a mode transition is still in progress (sketched after this list). The pattern is identical to “don’t release the lock before completing the state update.”
- Cost: Three known deaths from radiation overdose; an unknown number of additional injuries.
- Reading: Nancy Leveson, Safeware, 1995; Leveson and Turner, “An Investigation of the Therac-25 Accidents,” IEEE Computer, July 1993.
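The interlock idea fits in a short sketch. This is a hedged illustration, not the Therac-25’s actual architecture: ModeController, request_mode, and transition_complete are invented names. The property it demonstrates is the one the fix shape asks for: operator edits are refused while a transition is in progress, so no downstream component can observe the half-updated state.

```cpp
#include <mutex>
#include <stdexcept>

enum class BeamMode { Electron, XRay };

// Sketch of a software interlock: operator edits are rejected while a mode
// transition is in progress, so the accepted mode and the hardware
// configuration can never be observed in disagreement.
class ModeController {
    std::mutex m_;
    BeamMode accepted_ = BeamMode::Electron;
    bool transitioning_ = false;

public:
    // Called from the operator-input path. Returns false (edit refused) if a
    // transition is still pending.
    bool request_mode(BeamMode m) {
        std::lock_guard<std::mutex> lock(m_);
        if (transitioning_) return false;   // interlock: refuse edits mid-transition
        transitioning_ = true;
        accepted_ = m;
        return true;
    }

    // Called by the beam-control path once the hardware has actually
    // reconfigured for the accepted mode. Only then do edits become possible.
    void transition_complete() {
        std::lock_guard<std::mutex> lock(m_);
        transitioning_ = false;
    }

    // Fires the beam only when no transition is pending.
    template <class Fire>
    void fire_if_consistent(Fire fire) {
        std::lock_guard<std::mutex> lock(m_);
        if (transitioning_) throw std::runtime_error("mode transition pending");
        fire(accepted_);
    }
};
```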
The Therac-25 predates much of the formal exception-safety literature. The post-mortem identifies exactly the mistake — a state transition was not atomic with respect to operator input — but does not use the vocabulary, because the vocabulary did not yet exist. This is the field at its worst: lessons learned on machines that kill people, in 1986, that we are still teaching with a different vocabulary in 2024.
The Ariane 5 Flight 501 (June 1996): ~$370M
A floating-point-to-integer conversion in the inertial reference system raised an overflow exception during the Ariane 5’s maiden flight. The exception, having no handler in that code path, propagated up and triggered the system’s diagnostic-mode response — which, in flight, was to send the diagnostic data instead of guidance data to the main computers. The main computers interpreted the data as guidance, commanded the rocket into an aerodynamic regime it could not survive, and the vehicle self-destructed about 39 seconds after launch.
- Class: Unhandled exception with no rollback; the diagnostic mode itself was the partial-state behavior, since it sent data that downstream code interpreted as guidance.
- Mechanism: The inertial reference software was reused from the Ariane 4. The conversion to 16-bit integer was safe within Ariane 4’s flight envelope; Ariane 5 had higher horizontal velocities, which overflowed. The exception handler had been written assuming “if this fires, we’re on the ground; send diagnostics to the panel.” In flight, the panel was the main computer’s data input.
- Fix shape: Strong guarantee at the system boundary: if the conversion fails, do nothing — do not send a different kind of data. Specifically, the fix would have been a no-throw fallback (saturate the integer at MAX_INT16, log the saturation), preserving the system invariant that “the data on the bus is always guidance data.” A sketch follows this list.
- Cost: $370M of payload destroyed; the second-largest single failure of an unmanned rocket up to that point.
- Reading: J.L. Lions et al., “Ariane 501 Inquiry Board Report,” July 1996. Available online from ESA. https://www.di.unito.it/~damiani/ariane5rep.html
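The no-throw fallback can be sketched directly. The original code was Ada; this is a C++ sketch of the shape of the fix, with to_int16_saturating as an invented name. It saturates instead of overflowing, and reports that it did so, so the event can be logged without ever putting a different kind of data on the bus.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// No-throw fallback for a float-to-int16 conversion: saturate instead of
// overflowing, and report whether saturation occurred so it can be logged.
// Downstream code keeps receiving a value of the expected kind.
inline std::int16_t to_int16_saturating(double value, bool& saturated) noexcept {
    constexpr double lo = std::numeric_limits<std::int16_t>::min();
    constexpr double hi = std::numeric_limits<std::int16_t>::max();
    saturated = false;
    if (std::isnan(value)) { saturated = true; return 0; }
    if (value < lo) { saturated = true; return std::numeric_limits<std::int16_t>::min(); }
    if (value > hi) { saturated = true; return std::numeric_limits<std::int16_t>::max(); }
    return static_cast<std::int16_t>(value);
}
```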
The Ariane 5 report is required reading. It is short, calm, and devastating. The closing recommendations are textbook exception-safety advice in different language: validate inputs at boundaries, do not silently substitute degraded behavior, exception handlers must understand the system context they will run in.
Knight Capital (August 2012): $440M in 45 minutes
Knight Capital deployed an updated trading system. The deployment partially completed: seven of eight servers were updated; one was not. The new code repurposed a flag that, in the old code, activated a long-retired test routine; when production orders carrying that flag reached the old-code server, the routine generated and sent orders into the live market, repeatedly.
- Class: Cross-version state inconsistency; partial deployment with no rollback.
- Mechanism: The seven new-code servers and the one old-code server interpreted the same wire-level flag differently. Production traffic was distributed across all eight; the portion that hit the old-code server triggered the retired code path, which flooded the live market with orders.
- Fix shape: Atomic deployment (either all servers update or none); explicit retirement of repurposed flags; integration tests that verify behavior across mixed versions during deployment. In exception-safety language: a saga step (server update) may fail or partially apply; the system must be designed so that the partial-application state is benign.
- Cost: $440M in 45 minutes. The firm did not survive as an independent company: it needed an emergency rescue within the week and was acquired months later.
- Reading: SEC release No. 70694, October 16, 2013. https://www.sec.gov/litigation/admin/2013/34-70694.pdf
The Knight Capital incident is, structurally, a saga compensation failure: the deployment “saga” had no compensation defined for the partial-deployment state. This is what chapter 9’s saga discussion is about. In a financial context, with high-throughput markets, the cost of the uncompensated partial state ran to roughly $10M per minute; a sketch of the missing compensation shape follows.
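A minimal sketch of that compensation shape, under the obvious assumptions: Step, apply, compensate, and run_saga are invented names, compensations are assumed not to throw, and a failing step is assumed to clean up after itself. The structure is the point: if any step fails, the steps that already applied are undone, so the mixed-version state never persists.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A saga step: an action plus the compensation that undoes it.
struct Step {
    std::function<void()> apply;
    std::function<void()> compensate;   // assumed not to throw
};

// Run the steps in order. If any step throws, compensate the steps that
// already applied, in reverse order, so the system never stays in a
// partially-applied state. (Knight's deployment had no such compensation.)
inline void run_saga(const std::vector<Step>& steps) {
    std::size_t done = 0;
    try {
        for (; done < steps.size(); ++done) steps[done].apply();
    } catch (...) {
        while (done > 0) steps[--done].compensate();   // best-effort rollback
        throw;                                         // report the failure upward
    }
}
```

For a deployment, apply would be “push the new binary to server i” and compensate would be “restore the previous binary on server i,” so a failure on any server leaves no mixed-version fleet behind.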
The Heartbleed bug (April 2014): credentials of unknown numbers of users
Different shape: not an exception-safety bug per se, but adjacent. OpenSSL’s heartbeat handler trusted a length field in the request without bounds-checking it, leading to an out-of-bounds read that could leak server memory.
I include it because the fix path is exception-safety-adjacent. The proper handling of a malformed request — one whose length field is implausible — is to not respond at all (rather than respond with whatever happens to be in memory). The basic guarantee at this layer would say “if input validation fails, no observable side effect,” which is what the fix achieved: validate the length before reading.
- Class: Input-validation failure; partial response sent on bad input.
- Mechanism: The bounds check was missing; the response code copied the requested number of bytes out of a much smaller buffer, including whatever lay beyond it.
- Fix shape: Validate inputs at the trust boundary; on validation failure, no response (or a fixed-shape error response). The strong guarantee at the protocol level: malformed input produces no observable side effect that depends on memory contents. A sketch follows this list.
- Cost: Unknown; estimates of compromised credentials run into the millions.
- Reading: https://heartbleed.com/
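A sketch of that fix shape, not the actual OpenSSL code: build_heartbeat_response is an invented name, and the claimed and received lengths are passed in explicitly. The claimed length is validated against what actually arrived; on a malformed request the function produces nothing at all, so the caller has nothing to send.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Fail-closed shape of the fix: the claimed payload length is validated
// against what actually arrived before anything is echoed back. A malformed
// request yields no response at all (std::nullopt), hence no observable
// side effect that depends on memory contents.
std::optional<std::vector<std::uint8_t>>
build_heartbeat_response(const std::uint8_t* payload,
                         std::size_t claimed_len,
                         std::size_t received_len) {
    if (payload == nullptr) return std::nullopt;
    if (claimed_len > received_len) return std::nullopt;   // silently drop
    std::vector<std::uint8_t> response(claimed_len);
    std::memcpy(response.data(), payload, claimed_len);    // copies only validated bytes
    return response;
}
```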
The Cloudflare 2017 cache leak (“Cloudbleed”)
Cloudflare’s HTML rewriter, on certain malformed HTML, ran past a buffer’s end and emitted memory contents into responses. Some of those responses were cached, by Cloudflare and by Google’s crawlers, including credentials and session tokens.
- Class: Like Heartbleed, an input-validation/buffer-handling bug. Adjacent to exception safety: the failure mode is “partial output that exposes internal state,” which is the basic-guarantee failure at the data-flow level.
- Mechanism: A parser’s end-of-buffer check used == where >= was needed; the buffer pointer could jump past the end, so the equality check never fired and the parser walked off the end of the buffer (see the sketch after this list).
- Fix shape: Mostly a code-correctness fix, but the meta-lesson — input handlers should fail closed, not produce degraded output — is exception-safety advice.
- Cost: Memory contents from millions of requests cached by third parties; required Google and others to purge caches.
- Reading: https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/
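The comparison detail is worth a few lines. This is an illustrative sketch, not Cloudflare’s generated parser: count_tokens and its token format are invented. It shows why a fail-closed bound matters when the pointer can advance by more than one step at a time: an equality check can be jumped over; an ordering check cannot.

```cpp
#include <cstddef>

// Why the end-of-buffer comparison matters when the pointer can advance by
// more than one: with an equality test (p != end), a step that jumps past
// `end` never trips the check and the loop keeps reading foreign memory.
// The ordering test (p < end) fails closed no matter how far p overshoots.
std::size_t count_tokens(const unsigned char* p, const unsigned char* end) {
    std::size_t tokens = 0;
    while (p < end) {                        // fail-closed bound: <, not !=
        std::size_t step = 1 + (*p & 0x07);  // hypothetical variable-length token
        p += step;                           // may overshoot end; the < check still stops us
        ++tokens;
    }
    return tokens;
}
```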
A short list of exception-safety bugs in well-known C++ libraries
Without going to the same depth, a few smaller bugs that are illustrative:
- std::vector::insert with a throwing copy constructor (pre-C++11): some implementations left the vector with only the basic guarantee where the standard’s wording had mistakenly suggested the strong. Resolved by the C++11 specification clarifications (probed in the sketch after this list).
- Boost.Asio handler exception leaks: various Boost.Asio versions had paths where an exception thrown from a completion handler could leak the allocator state for the operation. Patched around 1.50.
- Chromium’s base::SequencedTaskRunner: a historical bug where a task posted from within another task’s execution could observe the runner mid-shutdown, with locks held in a state the task was not designed for. The fix involved making shutdown a more carefully staged process. The bug shape is “concurrent shutdown is itself an exception-safety problem at the system level.”
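The first of these is easy to probe directly. A small sketch, assuming C++17, with ThrowingCopy as an invented test type: its copy operations throw on demand, and inserting into a std::vector then shows which guarantee a given implementation actually provides when T’s copy can throw (the standard promises only the basic guarantee in that case).

```cpp
#include <cstdio>
#include <stdexcept>
#include <vector>

// A type whose copy operations throw on demand, used to probe what state a
// vector is left in when insert() fails partway through.
struct ThrowingCopy {
    static inline int copies_until_throw = -1;   // -1 means "never throw"
    int value = 0;

    explicit ThrowingCopy(int v) : value(v) {}
    ThrowingCopy(const ThrowingCopy& other) : value(other.value) { tick(); }
    ThrowingCopy& operator=(const ThrowingCopy& other) {
        tick();
        value = other.value;
        return *this;
    }

    static void tick() {
        if (copies_until_throw == 0) throw std::runtime_error("copy failed");
        if (copies_until_throw > 0) --copies_until_throw;
    }
};

int main() {
    std::vector<ThrowingCopy> v;
    v.reserve(8);                                 // keep reallocation out of the picture
    for (int i = 0; i < 4; ++i) v.emplace_back(i);

    ThrowingCopy extra(99);
    ThrowingCopy::copies_until_throw = 2;         // fail while elements are being shifted
    try {
        v.insert(v.begin() + 1, extra);           // T's copy can throw: basic guarantee only
    } catch (const std::runtime_error&) {
        // The vector is still valid and destructible, but the standard does not
        // promise it is unchanged when T's copy can throw. Printing it shows
        // what this particular implementation left behind.
        for (const auto& e : v) std::printf("%d ", e.value);
        std::printf("\n");
    }
    return 0;
}
```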
These are not eight-figure incidents. They are a daily reality of programming in unsafe languages.
Recurring themes
Reading down the list, a few patterns recur:
- The bug is rarely in the throw site. Almost every entry is a bug at the call site of a throwing operation, where the caller did not arrange for state to remain consistent if control transferred. The throwing operation itself was working correctly.
- The fix is structural, not local. “Add a try/catch” is rarely the right answer. The right answers involve restructuring code so the partial state is not visible, or making operations atomic, or designing the system so partial-application states are benign.
- The vocabulary is missing. Most of the post-mortems do not use the words “basic guarantee” or “strong guarantee.” They describe the bugs accurately but in non-standard vocabulary, which makes pattern recognition harder. The Therac-25 report uses the language of “interlocks.” The Ariane 5 report uses the language of “exception handling and reuse.” These are correct descriptions; the field has not yet aligned on a shared vocabulary.
- The cost is borne by people other than the programmers. Each of the high-impact incidents on this list cost lives, money, or reputation, paid by users and operators, not by the engineers who wrote the buggy code. This is a fact about the political economy of software, not an excuse, and it is part of why the industry has moved slowly on this: the negative externalities are hard to internalize.
- The same bugs keep happening. Reentrancy was rediscovered in smart contracts two decades after exception safety was formalized in C++. Saga compensation failures recur as each new generation of distributed-systems engineers relearns the lessons. The Therac-25 race conditions are conceptually identical to bugs we still see in industrial control systems. We are not learning.
What to do with a field guide
Use it as input to your code review. When you see a function that mutates state, ask yourself: which of these incidents could happen in this code path, scaled down? Almost any function in any non-trivial codebase has the shape to cause a small version of one of these. The discipline of exception safety is, in part, the discipline of recognizing those shapes before they ship.
The final chapter is the practical guide: what to actually do, in working code, given that you do not have time to formally verify everything.
Further reading
- The post-mortem links above. Read at least one in full; they are short and humbling.
- Software Engineering Disasters — there is no single book by this name, but Nancy Leveson’s Engineering a Safer World (2011) is the closest thing, and is excellent.
- The Risks Forum (comp.risks) archive: http://catless.ncl.ac.uk/Risks. Decades of computing-related-failure case studies, many of which are exception-safety-shaped.
- “Lessons learned from the SoftBank outage,” “AWS S3 2017 outage,” any major cloud-provider post-mortem — all read as variations on the patterns in this chapter.