
Introduction

“We had thought… that exceptions would be a way to handle errors. We were wrong. They are a way to handle the inability to handle errors.” — paraphrased, with no apologies, from twenty-five years of C++ committee discussion

This is a book about a problem that the working software industry has, on the whole, decided not to think about.

The problem is small to state. When control flow leaves a function in a way that the function did not write code to handle, what is the state of the world? Are the invariants the function maintained still true? Are the resources the function acquired released? If a partial change was applied to a data structure, has it been undone, or has it been left half-applied for the next caller to discover the hard way?

The answers — for the code you ship, today, in production — are usually one of:

  1. I have never asked.
  2. I asked once and the answer was unsatisfying so I stopped asking.
  3. I’m pretty sure RAII or defer or with handles it.

None of these is wrong, exactly. They are just not enough. The discipline of exception safety is the discipline of asking these questions on purpose, with vocabulary precise enough to actually answer them, and then writing code that gives an answer you can defend in front of someone who knows what to look for.

Who this book is for

You know what try and catch are. You probably know what finally is. You may have written a destructor, or a defer statement, or a with block, or a using clause. You have heard the words “exception safe” and you have a hunch they mean something more than “doesn’t crash.”

You are, in other words, every working engineer.

This book is not a tutorial. It is closer in spirit to Exceptional C++ by Herb Sutter and Modern C++ Design by Andrei Alexandrescu — books that took the problem seriously and assumed you’d already met the basics. The difference is scope: those books were about C++. This one is about every language you might ship in, and every place outside of try/catch where the same shape of problem keeps showing up wearing a different hat.

What this book is not

It is not a defense of exceptions, and it is not an attack on them. The Go community has, more or less, decided exceptions are a mistake; the C++ community has, more or less, decided they are tolerable if handled with care; the Rust community has, more or less, decided that the unwinding mechanism should exist but be used as a last resort and ideally not in library code at all. These are reasonable, mutually contradictory positions held by intelligent people, and you will not find this book trying to settle the argument. You will find it pointing out that the underlying problem doesn’t go away just because you removed the keyword. If you panic in Go, or return an Err(_) from a Rust function that was halfway through a multi-step state mutation, or your message handler gets an EINTR mid-write, you have an exception-safety problem with a different name.

How this book is organized

Part I (chapters 1–4) does the foundational work: what exceptions actually are at the machine level, the formal vocabulary of the three guarantees, RAII as the canonical defense and its limits, and the small set of patterns that buy you the strong guarantee when you need it.

Part II (chapters 5–6) is the cross-language tour. C++, Java, Python, C#, Rust, Go, and JavaScript get one chapter; Common Lisp gets its own, because the condition system is, frankly, a different category of object than what those other languages call exception handling, and pretending otherwise would be dishonest.

Part III (chapters 7–9) is where the book earns its keep. Concurrent code. The DAO. Signal handlers. Saga compensations. Each one is the same problem dressed differently, and the failure modes are the same failure modes, and the fix patterns are the same fix patterns, and once you see the shape you cannot unsee it.

Part IV (chapters 10–12) is what to do about it. Tooling, real bug post-mortems, and a short list of disciplines that get you most of the way without rewriting your codebase.

A note on tone

I am going to be direct about how badly the industry has failed to grapple with this. That is not an aesthetic choice; it is the only honest framing of the situation. We collectively ship enormous amounts of code that handles errors in ways that nobody has thought through carefully, and it bites us — not theoretically, but in production: in money, in human time, and occasionally in dollars on a blockchain that someone took because we didn’t reorder three lines.

I will, at points, admit that I do not fully grok parts of this. You should not fully grok them either after the first read. The problem is genuinely subtle. If after this book you can argue precisely about which guarantee a function provides and what would have to change about the function to upgrade it, you have already made yourself more dangerous than 95% of the engineers the field has produced. That is the goal.

Let’s begin.

The Invisible GOTO

Edsger Dijkstra published “Go To Statement Considered Harmful” in 1968. Its core argument, stripped of the polemic that gave it the title, is this: control flow that jumps to an arbitrary label defeats the human reader’s ability to reason about program state. The reader has to consider, at every point in the program, the possibility that control just arrived from anywhere.

If you accept Dijkstra’s argument — and the working software industry, eventually, did — then you have to also accept the awkward corollary: throwing an exception is a goto. It is a goto to a label you cannot see, computed dynamically by walking up the call stack, possibly across many function boundaries, possibly out of code you did not write, into a handler you have never read. It is, by every metric Dijkstra cared about, worse than goto. A goto at least named its target.

We accepted this trade because we got something for it. We got separation of concerns: the place that detects the failure does not have to know how to handle it. The place that knows how to handle it does not have to thread error propagation through every intermediate function. The intermediate functions get to be written as if errors do not happen, with the actual handling pulled out to a higher level where it belongs. This is, genuinely, a good idea.

What we did not get — and what we mostly still do not have — is a clear-eyed view of what is happening at the moment the exception fires.

What actually happens

Let’s get concrete about a C++ throw, which is the canonical case (Java, C#, and Python all behave similarly enough that the differences don’t matter for this section).

void inner() {
    Widget w;
    do_something();           // (1) might throw
    register_widget(w);       // (2)
}

void outer() {
    Logger log;
    try {
        inner();              // (3)
    } catch (const Error& e) {
        log.write(e.what());  // (4)
    }
}

If do_something() at (1) throws, the runtime does the following, in roughly this order:

  1. Construct the exception object. The thrown value has to live somewhere outside the about-to-be-destroyed stack frames. Implementations vary; a common one (the Itanium ABI, used by GCC and Clang) heap-allocates a small structure with the exception object embedded.
  2. Begin stack unwinding. The runtime walks up the stack frame by frame. For each frame, it consults a table — generated at compile time, often called the .eh_frame or .gcc_except_table — that says: “in this address range, here are the destructors to run, and here is whether there is a catch handler.”
  3. Run destructors for in-scope automatic objects. In our example, when unwinding leaves inner, w’s destructor runs. When unwinding enters outer’s try block boundary, log is not destroyed (it’s still in scope at the catch point), but any temporaries inside the try are destroyed.
  4. Match the catch handler. The runtime checks each frame’s catch clauses in textual order against the type of the thrown object, which was fixed at the throw site. If a match is found, control transfers to the handler.
  5. Destroy the exception object after the handler returns (or rethrows, or terminates).

That table-driven walk is the “zero-cost” exception model: in the no-throw case, try blocks compile to nothing extra at runtime, because the address-range-to-handler mapping is in a side table consulted only when an exception fires. The cost is loaded into the throw path, which is dramatically slow — typically tens of microseconds per throw, vs. nanoseconds for a normal return. This is why the conventional wisdom “don’t use exceptions for control flow” exists. It is not aesthetic. It is that you bought the speed of the no-throw path by paying for it in the throw path, and if you throw a lot, you have made a poor trade.

(Older implementations used a different model — “setjmp/longjmp exceptions” — where every try block had a runtime cost but throw was cheaper. Almost nobody uses this anymore; it survives in some embedded toolchains. Visual C++ on x86 is a hybrid for historical reasons. On x86-64, everyone uses table-based.)
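
If you want to see the asymmetry on your own hardware, a micro-benchmark is a few lines. This is a sketch, not a rigorous measurement (the absolute numbers vary by compiler, flags, and unwinder implementation), but the ratio is typically several orders of magnitude:

#include <chrono>
#include <cstdio>

int via_return(int x) { return x < 0 ? -1 : x; }
int via_throw(int x)  { if (x < 0) throw -1; return x; }

template <class F>
long long ns_per_call(F f) {
    const int iters = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / iters;
}

int main() {
    volatile int sink = 0;
    std::printf("error by return: ~%lld ns/call\n",
                ns_per_call([&] { sink = via_return(-1); }));
    std::printf("error by throw:  ~%lld ns/call\n",
                ns_per_call([&] {
                    try { sink = via_throw(-1); } catch (int e) { sink = e; }
                }));
}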

The takeaway is that, machine-mechanically, an exception is two things glued together: a non-local control transfer, and a sequence of destructor calls produced by the unwinder. The first part is the GOTO. The second part is what makes the GOTO survivable, and we’ll spend much of this book on it.

“Exceptional” is a misnomer

Bjarne Stroustrup, who invented C++ exceptions, has written repeatedly that they are intended for “exceptional” conditions: things outside the normal flow of the program. This is true as a design intent. As a description of how exceptions are used, it is not.

Allocator failure (std::bad_alloc) is normal in any process that allocates. Parsing input that turns out to be malformed is normal in any program that consumes user input. End-of-file is normal in any program that reads files. Each of these is regularly modeled with exceptions. They are not exceptional; they are expected. The word “exceptional” has caused real harm here, because programmers, hearing it, conclude that they don’t need to think about exception paths very carefully — they’re rare, after all.

A more honest framing: an exception is a non-local error return. It is the error case of a function, communicated to the caller’s caller’s caller without each intermediate caller having to participate. This is, again, a good idea. But it is not rare. In a typical C++ codebase, the set of functions that can throw is approximately all of them, because almost any function can call something that calls something that allocates, and std::bad_alloc is always available.

C++ tried to constrain this with throw() exception specifications. This was, by widespread consensus, a failure — they were checked at runtime rather than statically, a violation called std::unexpected (which by default terminates the program), and the result was that nobody used them. C++11 deprecated them and introduced noexcept, which is a different thing: noexcept is a property the compiler reasons about for optimization (move operations, container exception safety), and violating it calls std::terminate rather than letting the exception propagate. noexcept is useful where it is true. It does not make the rest of the language any more predictable.

Java tried a different constraint with checked exceptions: the type system tracks which exceptions a method can throw, and callers must either catch them or declare them. This was more rigorous than C++ and arguably worse in practice — programmers responded to the inconvenience by wrapping checked exceptions in unchecked ones, or by declaring throws Exception everywhere, defeating the type system entirely. We’ll come back to this in chapter 5.

The point is that every serious attempt to bound which functions can throw has, in practice, foundered on the fact that very few functions actually qualify. Throwing is the rule, not the exception, and “exceptional” was always the wrong word for it.

What “the call stack” means at the moment of throw

Here is something almost no one teaches and almost everyone needs:

When do_something() throws, the call stack contains every function above it. Every one of those functions had local invariants that may not have been finished. Every one of those functions had partially-constructed data structures, half-modified state, locks held, files opened, network sockets in indeterminate states. The unwinder is going to walk through all of them, in reverse order, running destructors. This is the only mechanism the language provides to clean up after them.

If a function modified two related data structures and only got through one before the throw, the unwinder cannot help. It does not know about the relationship. It runs destructors. That is the entirety of its job.

So when you write code that lives in the middle of a call stack — code you expect other code to call — you are, every time, asking yourself a question whether you realize it or not: what does the world look like if I throw partway through? The destructors of my locals will run. Anything I new’d but haven’t yet handed to a smart pointer will leak. Anything I mutated through a non-owning pointer will stay mutated. Anything I committed to a file or socket has been observed.

The body of this book is what to do about that question. But the question itself is the precondition for everything that follows. If you don’t see the question — if your mental model of “an exception” is a magical thing that takes you to your handler — you will write code that, when it throws, leaves the world in a state nobody can describe.

A small concrete example

Here is a fragment that I have seen, in some shape, in approximately every codebase I have ever worked in:

void Account::transfer_to(Account& other, int amount) {
    balance_ -= amount;
    log_transfer(amount, &other);   // might throw
    other.balance_ += amount;
}

What is the state of the world if log_transfer throws?

balance_ has been decremented. other.balance_ has not been incremented. Money has vanished. The destructors of *this and other will run when those objects eventually go out of scope, and they will obediently destroy themselves with the wrong balances.

Notice what is not wrong: there is no leak, no use-after-free, no undefined behavior. The program is, by every check the language can apply, well-formed. The bug is that the invariant the programmer cared about — that money is conserved — was violated by a partial update across a possibly-throwing call. The compiler had no way to know about the invariant. The unwinder has no way to know about it. The exception fires, the locals go away, the program eventually crashes or doesn’t, and the books don’t balance.

This is the entirety of exception safety, in miniature. The next chapter gives you a vocabulary for talking about it.

Further reading

  • Bjarne Stroustrup, The Design and Evolution of C++ (1994), §16, on the history of exception specifications.
  • Itanium C++ ABI, “Exception Handling” — the actual specification of how the unwinder works, if you have a strong stomach: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
  • Edsger Dijkstra, “Go To Statement Considered Harmful,” Communications of the ACM, March 1968. Read it. It is two pages and it is right.

The Three Guarantees

The vocabulary in this chapter was largely codified by David Abrahams in the late 1990s, in a series of papers and standards-committee documents that became the basis for how the C++ Standard Library specifies exception behavior. Abrahams’s contribution was not the idea that operations might be exception-safe to varying degrees — that was already in the air — but the insistence that there are exactly three useful contractual levels, distinct enough that confusing them produces wrong code, and that every operation in a serious library should declare which level it provides.

The three levels, named:

  1. The no-throw guarantee: the operation will not throw, full stop.
  2. The basic guarantee: if the operation throws, no resources are leaked, all invariants are preserved, but the visible state of objects may have changed in unspecified ways.
  3. The strong guarantee: if the operation throws, the visible state is exactly as it was before the operation began. Either it did the thing entirely, or it did nothing.

There is a fourth level worth naming explicitly even though Abrahams did not formalize it:

  4. No guarantee (sometimes called no exception safety): if the operation throws, anything may have happened. Resources may have leaked, invariants may be broken, the program may be in a state from which no further operation is meaningful. Code that provides no guarantee is broken code. This used to be common; it is still common; nobody likes to admit it.

Most of the rest of this book is about how to recognize each level, how to upgrade between them, and what they cost.

The no-throw guarantee

This is the easiest to define and the easiest to lie about.

A no-throw operation will not, under any circumstances, throw an exception. Not just “won’t throw normally” — won’t throw at all. Calls to it can be made unconditionally; their failure mode, if any, must be communicated by some other channel (return value, side effect, abort).

In C++, no-throw operations are the building blocks of everything else. If you cannot rely on at least some operations to not throw, you cannot recover from an exception, because the recovery code itself might throw. Specifically:

  • Destructors must not throw. The standard does not literally forbid it (since C++11, destructors are noexcept by default, so an escaping exception terminates the program outright), but throwing from a destructor during stack unwinding from another exception calls std::terminate and ends the program.
  • swap operations on standard-library types are required to not throw. This is structurally important, as we’ll see in chapter 4: copy-and-swap depends on it.
  • Move constructors and move assignment, if marked noexcept, allow standard containers to use them in operations that need the strong guarantee. If they aren’t noexcept, containers fall back to copies, which is slower but exception-safer. This is the famous std::vector reallocation behavior.
  • Primitive-type assignment, integer arithmetic on built-ins, pointer assignment, and similar elemental operations.
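
All of the properties in the list above are queryable with type traits, and checking them with static_assert is a cheap way to keep a load-bearing assumption from silently rotting. A small sketch (C++17):

#include <string>
#include <type_traits>
#include <vector>

struct Gadget { std::string name; };  // destructor is implicitly noexcept

static_assert(std::is_nothrow_destructible_v<Gadget>);
static_assert(std::is_nothrow_swappable_v<std::string>);           // swap: no-throw
static_assert(std::is_nothrow_move_constructible_v<std::vector<int>>);
static_assert(!std::is_nothrow_copy_constructible_v<std::string>); // copies allocate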

Examples of operations that look no-throw but are not, in real C++:

// LOOKS no-throw. ISN'T.
int compute(int x) {
    return x * 2;  // can't throw...
}

// ...except this:
struct Counter {
    int n;
    Counter(int n_) : n(n_) {}
};

int compute(int x) {
    return Counter(x * 2).n;  // Counter's ctor is implicitly noexcept(false)
                              // unless declared otherwise.
}

The second compute does not throw in practice, but the type system does not know that. If you use it in a context that requires noexcept (move operations, certain container operations), the compiler will reject it or fall back to a slower path. In modern C++, the noexcept keyword on a function is the contract:

int compute(int x) noexcept { return x * 2; }

If a noexcept function tries to throw, the runtime calls std::terminate. This is a deliberate, blunt design: noexcept is a load-bearing property in the type system, used by container code to choose between code paths, and a function that lies about it must be punished severely enough that the lie cannot survive testing.

The thing to internalize about the no-throw guarantee is that it is contagious in one direction: a no-throw operation can only call other no-throw operations, unless it catches everything its callees might throw. As soon as you call something that can throw, without absorbing the result, you can throw, regardless of whether the throw is “likely.” The contract is binary.
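
The one escape hatch is to absorb exceptions yourself: a function can be honestly noexcept while calling throwing code, provided it catches everything. A minimal sketch (flush_logs is a hypothetical throwing callee):

void flush_logs();  // assumed: may throw

void shutdown() noexcept {
    try {
        flush_logs();
    } catch (...) {
        // Swallowing the error is the price of the no-throw contract.
        // Log locally, set a flag, or abort; do not let it escape.
    }
}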

In Java, this guarantee is harder to state because almost everything can throw RuntimeException. The closest analog is “won’t throw a checked exception,” which is partial. Python, Ruby, and JavaScript don’t really have a meaningful no-throw concept; in Rust, panic-free is approximately the same idea, with the same contagion property. Go’s analog is “won’t panic,” which most code does not bother to verify.

The basic guarantee

The basic guarantee says: if an operation throws, no resources are leaked and all invariants are preserved, but the observable state of the objects involved may have changed in unspecified ways.

The two clauses do different work. “No resources are leaked” is about memory, file handles, locks, network connections — the things RAII (chapter 3) directly addresses. “Invariants are preserved” is the harder part. An invariant is anything that must always be true about an object: a vector’s size() <= capacity(), a hash table’s bucket count being a power of two, a binary tree’s order property, an account’s balance being non-negative. Invariants are properties the type promises to maintain, and the basic guarantee says throwing an exception cannot leave the type in a state that violates them.

Here is transfer_to from chapter 1, modified to provide the basic guarantee:

void Account::transfer_to(Account& other, int amount) {
    balance_ -= amount;
    try {
        log_transfer(amount, &other);
        other.balance_ += amount;
    } catch (...) {
        balance_ += amount;  // restore
        throw;
    }
}

This is better. If log_transfer throws, we restore balance_ and re-throw. The invariant “money is conserved across the two accounts” is preserved — we end with the same balances we started with. But notice what we still don’t have: if other.balance_ += amount succeeds, but then a hypothetical operation between it and the end of the function throws, we have a partial state visible. (In this exact code there’s no such operation, but in real systems there often is.)

The basic guarantee is achievable for most code, most of the time, with reasonable discipline. RAII handles the leak side; thoughtful ordering handles the invariant side. This is the level the C++ Standard Library generally provides for operations that aren’t trivially no-throw, and it is the realistic floor for production code.

A subtlety: the basic guarantee allows the visible state to change. If a multi-element vector::insert throws because an element’s copy constructor threw partway through, the standard only promises the vector is left in a “valid state” — it does not promise the size or contents are unchanged. (Implementations differ in how much gets inserted before the throw; none of it is portable contract. Single-element push_back is actually stronger, as we dissect in Example 1 below.) The user has to either query and re-establish state explicitly, or use a strong-guarantee primitive instead.

The strong guarantee

The strong guarantee says: the operation either succeeds completely, or it throws and the visible state is bit-for-bit indistinguishable from what it was before the operation started.

This is the transactional guarantee: commit-or-rollback at the level of a single operation. It is the closest thing C++ provides to a database transaction’s atomicity property, and the analogy is exact.

Here is an account transfer with the strong guarantee, written awkwardly to make the structure visible:

void Account::transfer_to(Account& other, int amount) {
    // Phase 1: do everything that might throw, on a side copy.
    int new_self_balance = balance_ - amount;
    int new_other_balance = other.balance_ + amount;
    log_transfer(amount, &other);  // might throw — fine, no state changed yet

    // Phase 2: commit. These operations must not throw.
    balance_ = new_self_balance;
    other.balance_ = new_other_balance;
}

This works because int assignment is noexcept. Once the throwing operation (log_transfer) is past, the rest of the function is no-throw, so we can guarantee that either we never touched balance_ and other.balance_, or we touched both successfully.

The pattern generalizes: do the work that might throw on a side copy first, then swap or assign the results into place using only no-throw operations. This is the heart of copy-and-swap, of pimpl swapping, of two-phase commit. Chapter 4 develops it in detail.

The strong guarantee is the most expensive level. It typically requires extra copies, or a careful split of the operation into “preparation” and “commit” phases, and many operations cannot be cheaply written this way. The C++ Standard Library only provides the strong guarantee for a subset of operations, and where it does, the documentation generally calls it out. (For example, std::vector::push_back provides it when the element type is copyable or has a noexcept move; single-element std::map::insert provides it.)

The honest reality of the strong guarantee in production: most code does not need it, the basic guarantee is enough, and trying to provide the strong guarantee where the basic one suffices is a real source of complexity and slowness. The places where the strong guarantee actually matters are usually places where some external observer (a user, a database, another service, the file system) might see the half-completed state and act on it. Inside a single object, in a single thread, between two member-function calls — basic is generally fine.

Examples of code that claims one and provides another

This is the part that working engineers most need to internalize, because in real codebases the gap between claimed and provided guarantee is enormous.

Example 1: vector::push_back, almost-but-not-quite strong

template<class T>
void vector<T>::push_back(const T& v) {
    if (size_ == capacity_) reallocate(capacity_ * 2);
    new (data_ + size_) T(v);
    ++size_;
}

Where can this throw?

  • reallocate allocates memory, which can throw bad_alloc. If it does, we haven’t touched anything observable yet. Fine.
  • T(v) (copy construction of the new element). If this throws, size_ has not been incremented, so we haven’t observably added an element. The new memory we allocated is gone (well, leaked, in this snippet — fix that with RAII).
  • After the ++size_, nothing. We’re done.

So this snippet is almost strong. The hole: if reallocate succeeded — moved or copied old elements into the new buffer, freed the old buffer — and then the new-element construction throws, we’ve already irreversibly changed the underlying buffer pointer. The visible vector still has the same elements, but an implementation that moves during reallocation has already executed those moves, which might have side effects on the source elements.

The standard’s resolution: if T’s move constructor is noexcept, the implementation is allowed to move during reallocation; otherwise it must copy. With copies, if the new-element copy then throws, we can free the new buffer and keep using the old one — strong guarantee. With moves, the implementation has already moved-from the old elements, can’t get them back, and must commit to the new buffer — basic guarantee.

This is a real, documented, intentional trade-off in the standard. The lesson is that “what guarantee does this provide” can depend on properties of the type parameter. There is no shortcut to actually thinking it through.
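
You can watch a vector make this choice. A sketch you can run (the LoudMove and TimidMove names are mine; the only difference between the two types is the noexcept on the move constructor):

#include <cstdio>
#include <vector>

struct LoudMove {   // noexcept move: the vector is allowed to move on reallocation
    LoudMove() = default;
    LoudMove(const LoudMove&) { std::puts("copy"); }
    LoudMove(LoudMove&&) noexcept { std::puts("move"); }
};

struct TimidMove {  // throwing move: the vector must copy to keep the strong guarantee
    TimidMove() = default;
    TimidMove(const TimidMove&) { std::puts("copy"); }
    TimidMove(TimidMove&&) { std::puts("move"); }
};

template <class T>
void grow_past_capacity(std::vector<T>& v) {
    v.emplace_back();
    while (v.size() < v.capacity()) v.emplace_back();
    v.emplace_back();  // size == capacity, so this one reallocates:
                       // one line printed per relocated element
}

int main() {
    std::vector<LoudMove> a;
    std::puts("-- noexcept move:");
    grow_past_capacity(a);   // prints "move"

    std::vector<TimidMove> b;
    std::puts("-- throwing move:");
    grow_past_capacity(b);   // prints "copy"
}

Implementations typically make the decision through std::move_if_noexcept, which is the standard’s name for exactly this choice.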

Example 2: assignment operator, claiming strong, providing none

class Image {
    char* data_;
    size_t size_;
public:
    Image& operator=(const Image& other) {
        delete[] data_;                    // (1)
        data_ = new char[other.size_];     // (2) might throw bad_alloc
        size_ = other.size_;               // (3)
        std::memcpy(data_, other.data_, size_);
        return *this;
    }
};

If new char[other.size_] at (2) throws, what state is the object in?

The old buffer has been deleted and data_ now dangles. size_ is unchanged. The destructor will eventually delete[] data_, which is undefined behavior. We have provided no guarantee. This code is broken.

The fix: do all the throwing work into a local first, then commit with operations that cannot throw.

Image& operator=(const Image& other) {
    char* new_data = new char[other.size_];           // might throw, fine
    std::memcpy(new_data, other.data_, other.size_);  // cannot throw
    delete[] data_;
    data_ = new_data;
    size_ = other.size_;
    return *this;
}

This provides the strong guarantee, because all the throwing operations happen before any state mutation. (We could go further with copy-and-swap, but this is enough.)

Example 3: a Java put that looks strong and provides basic

class Cache {
    private Map<String, byte[]> entries = new HashMap<>();
    private long totalBytes = 0;

    public void put(String key, byte[] value) {
        byte[] old = entries.put(key, value);
        if (old != null) totalBytes -= old.length;
        totalBytes += value.length;
    }
}

If entries.put throws (it can: HashMap can resize, allocation can fail), totalBytes has not been touched yet. Good. If value.length throws… it can’t; an array’s length is a final field (though reading it from a null value would throw NullPointerException). Good.

But what if entries.put succeeds, then we update totalBytes, then much later something else in the call chain throws? The Cache is now in a state where entries and totalBytes are consistent. So this is, somewhat surprisingly, fine — the basic guarantee, possibly the strong guarantee, depending on what put does on resize failure. Java’s HashMap makes no formal promise about its state if put fails with an OutOfMemoryError, or about what got inserted; you would need to check by reading the source.

The point is that even this trivial code requires you to read the documentation of every called function to know what guarantee you’re providing. There is no shortcut.

The cost gradient

  • No-throw. Typical cost: free, but it constrains the implementation. When you need it: as the building blocks for everything else (destructors, move ops, swap).
  • Basic. Typical cost: RAII discipline; thoughtful ordering. When you need it: as the default for most production code.
  • Strong. Typical cost: an extra copy, or the two-phase commit pattern. When you need it: when external observers might see partial state; transactional code.

A useful rule of thumb: aim for the basic guarantee everywhere, the strong guarantee at API boundaries that mutate observable state, and the no-throw guarantee for the small set of primitives you build the strong guarantee out of. Everything else is over- or under-engineering.

The next chapter is about the discipline that makes the basic guarantee mostly mechanical: RAII. The chapter after that is about the patterns that buy you the strong guarantee on top of it.

Further reading

  • David Abrahams, “Exception-Safety in Generic Components,” Generic Programming: Proceedings of a Dagstuhl Seminar, 2000. The foundational paper. Find it online; it is free and short.
  • Herb Sutter, Exceptional C++ (1999), items 8–19. Long out, never out of date.
  • Bjarne Stroustrup, The C++ Programming Language, 4th ed., Appendix E (“Standard-Library Exception Safety”). The closest thing to an official statement of guarantees by the language’s designer.
  • Andrei Alexandrescu and Petru Marginean, “Generic: Change the Way You Write Exception-Safe Code — Forever,” C/C++ Users Journal, December 2000. Introduces ScopeGuard, which we’ll meet in chapter 4.

RAII (and What It Doesn’t Solve)

Resource Acquisition Is Initialization is the worst-named good idea in programming. Bjarne Stroustrup coined the phrase in the early 1990s, and almost everyone who hears it for the first time concludes the name means something other than what it means. The name describes the mechanism (resources are acquired in constructors and released in destructors), not the purpose (deterministic cleanup at scope exit, including exception-induced scope exit).

A better name, retroactively, would be Scope-Bound Resource Management — SBRM, occasionally seen in C++ literature. But the field stuck with RAII, so we will too.

This chapter does three things. First, restate RAII precisely enough to argue about. Second, walk through the canonical applications. Third — and this is the chapter’s actual reason for existing — enumerate what RAII does not solve, because the working assumption that “we have RAII, so we’re exception-safe” is one of the most common and most incorrect beliefs in production C++.

RAII, precisely

The mechanism is: an object’s constructor acquires a resource, and its destructor releases it. The destructor will run at scope exit, regardless of how scope is exited (return, fall-through, throw). Therefore, if you wrap a resource in such an object and only manipulate it through that object, the resource will be released exactly once on every control-flow path, including the exception path.

Three properties matter:

  1. Determinism. The destructor runs at a precisely known point — when the object’s lifetime ends. In a stack-allocated case, that’s scope exit. There is no garbage collector latency, no finalizer queue, no maybe-eventually. This is what distinguishes RAII from try { } finally { }-style cleanup in garbage-collected languages: the cleanup is structural, not procedural.

  2. Exception integration. The unwinder runs destructors as it walks up the stack. This is the mechanism by which RAII is exception-safe: the destructor doesn’t need to know about exceptions, and the throw site doesn’t need to know about the destructor. They are connected only by being in the same scope, which is exactly the connection the language already has machinery for.

  3. Composition. RAII objects compose: an RAII object that owns another RAII object cleans up in the right order automatically, because destruction runs in reverse declaration order. You can build arbitrarily complex resource graphs and the cleanup is mechanical.

C++ leans hard on this. The Standard Library provides std::unique_ptr (single owner), std::shared_ptr (refcounted), std::lock_guard and std::unique_lock (mutex acquisition), std::ifstream/std::ofstream (files), std::thread (joining or detaching on destruction), std::vector (memory), std::scoped_lock (multi-mutex, deadlock-free acquisition order), and many more. Almost everything in modern C++ that owns a resource is an RAII type.
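
And where the Standard Library doesn’t already have a wrapper, std::unique_ptr with a custom deleter will usually manufacture one in two lines. A sketch for a C stdio FILE* (the file name and usage are illustrative):

#include <cstdio>
#include <memory>

// A function-pointer deleter turns a raw C resource into an RAII type.
using FilePtr = std::unique_ptr<std::FILE, int (*)(std::FILE*)>;

FilePtr open_file(const char* path, const char* mode) {
    return FilePtr(std::fopen(path, mode), &std::fclose);
}

void use() {
    FilePtr f = open_file("data.txt", "r");
    if (!f) return;  // fopen failed; there is nothing to release
    // ... read via f.get(); fclose runs at every scope exit, throw included
}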

Where RAII works

The pattern is at its best when:

  • The resource has a single owner at any time.
  • The resource’s release is itself no-throw (or, at worst, can be safely ignored on failure).
  • The release operation is idempotent or reliably called exactly once.

Memory, file descriptors, mutexes, scoped database transactions, scoped logging contexts, scoped feature flags — all are good fits. Here’s a scoped-mutex example to fix the pattern visually:

class BankAccount {
    std::mutex mu_;
    int balance_;
public:
    void deposit(int amount) {
        std::lock_guard<std::mutex> lock(mu_);
        balance_ += amount;
    }  // lock released here, including via exception
};

If balance_ += amount throws (it can’t, but if it could) the lock is released. If we returned normally, the lock is released. There is exactly one cleanup site, which is the destructor of lock_guard, which the language calls automatically.

A more interesting example: scope guards. Andrei Alexandrescu’s ScopeGuard (and Boost’s BOOST_SCOPE_EXIT, and the Library Fundamentals TS’s std::experimental::scope_exit, proposed in P0052) generalizes RAII to arbitrary cleanup actions:

void update_two_files() {
    write_to_file_A();
    auto rollback_A = make_scope_guard([] { restore_file_A(); });

    write_to_file_B();
    auto rollback_B = make_scope_guard([] { restore_file_B(); });

    // both succeeded; commit by dismissing the guards
    rollback_A.dismiss();
    rollback_B.dismiss();
}

If write_to_file_B() throws, rollback_A runs in its destructor and undoes the change to file A. (rollback_B was never constructed, since the throw happened before the assignment.) If both succeed, both guards are dismissed and neither rollback runs. This is, structurally, a transaction implemented with RAII. We’ll see more of this pattern in chapter 4.

Where RAII leaks

Now the harder part. RAII, despite the surrounding evangelism, does not solve all of exception safety. Specifically:

1. RAII solves resource problems, not invariant problems.

Recall Account::transfer_to from chapter 1:

void Account::transfer_to(Account& other, int amount) {
    balance_ -= amount;
    log_transfer(amount, &other);   // throws
    other.balance_ += amount;
}

There is no resource leak here. There are no constructors-and-destructors that could help. The bug is that the invariant “money is conserved” is violated by a partial mutation. RAII has nothing to say about this. You can wrap every object in unique_ptr and add lock_guards on every mutex and the bug remains, because the bug is not a leaked resource.

The fix is not RAII; it is ordering — do the throwing work first, the no-throw mutations last — or scope guards, or copy-and-swap, all of which are RAII-adjacent but not RAII per se. The destructor mechanism is doing structural work for you, but the work it’s doing is “run this rollback”, which you had to write yourself.

2. RAII does not protect partial construction across multiple sub-objects.

class Connection {
    Socket sock_;        // (a)
    Buffer buf_;         // (b)
    std::vector<int> q_; // (c)
public:
    Connection() : sock_(open_socket()), buf_(allocate_buffer()), q_() {}
};

If open_socket() succeeds (so sock_ is constructed) and then allocate_buffer() throws, what happens? The language unwinds: sock_’s destructor runs, buf_ was never constructed (throw happened in its initialization), q_ was never constructed. So sock_ is cleaned up. Good.

But consider:

class Connection {
    int fd_ = -1;
    char* buf_ = nullptr;
public:
    Connection() {
        fd_ = open_socket_raw();      // (a)
        buf_ = allocate_buffer_raw(); // (b) throws
    }
    ~Connection() {
        if (fd_ != -1) close_socket_raw(fd_);
        if (buf_) free_buffer_raw(buf_);
    }
};

This also looks fine — the destructor cleans up, right? Wrong. The destructor of an object only runs if its constructor completed successfully. If the constructor throws, the language considers the object never to have existed; only fully-constructed sub-objects (data members) get destructed. So fd_ leaks: we set it, but our ~Connection never runs.

The fix is to wrap each raw resource in its own RAII type:

class Connection {
    FileDescriptor fd_;   // RAII wrapper
    OwnedBuffer buf_;     // RAII wrapper
};

Now if OwnedBuffer’s construction throws, FileDescriptor’s destructor runs because it is a fully-constructed sub-object. This is what the C++ Core Guidelines mean when they say “use RAII consistently”: one resource, one RAII type. Mixing raw and managed in the same class is a common mistake.
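
For completeness, here is the shape of the FileDescriptor wrapper assumed above, built on the same hypothetical open_socket_raw/close_socket_raw calls as the broken version:

class FileDescriptor {
    int fd_ = -1;
public:
    FileDescriptor() : fd_(open_socket_raw()) {}  // acquire; may throw
    ~FileDescriptor() { if (fd_ != -1) close_socket_raw(fd_); }

    FileDescriptor(const FileDescriptor&) = delete;             // one owner
    FileDescriptor& operator=(const FileDescriptor&) = delete;

    int get() const noexcept { return fd_; }
};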

3. RAII does not save you from bad destructors.

A throwing destructor during stack unwinding from another exception calls std::terminate. This is not a theoretical issue; it shows up in real code, particularly when the destructor logs to a remote service that can fail, or commits a transaction that can fail.

class FileTransaction {
    std::ofstream f_;
public:
    ~FileTransaction() {
        f_.close();
        commit_metadata();  // can throw
    }
};

If FileTransaction’s scope is being unwound due to another exception, and commit_metadata() throws, the program terminates. The fix is either to absorb the exception in the destructor (and log/swallow), or to expose an explicit commit() method that can throw, leaving the destructor as a rollback.

class FileTransaction {
    std::ofstream f_;
    bool committed_ = false;
public:
    void commit() {       // explicit, can throw
        f_.close();
        commit_metadata();
        committed_ = true;
    }
    ~FileTransaction() {
        if (!committed_) {
            try { rollback(); } catch (...) {}
        }
    }
};

This pattern — explicit commit, automatic rollback — is the right shape for any RAII type that performs a non-trivial finalization. Boost.ScopeGuard’s dismiss() is the same idea.

4. RAII does not help across object boundaries when ownership is shared.

std::shared_ptr releases when the last reference dies. If two shared_ptrs reference each other (a cycle), neither dies. Memory leaks. This is not, strictly, an exception-safety problem — but the cycle often forms during exceptional code paths, when the cleanup that would have broken the cycle didn’t run because the throw happened first.

The standard answer is std::weak_ptr for the back-edge, with a discipline of identifying which direction “owns” the relationship and using weak_ptr for the other. In practice this discipline is honored unevenly, and shared-ownership cycle leaks are a perennial bug in production C++.

5. RAII does not address logical atomicity.

Suppose you have three updates that must be applied as a unit: write to disk, update an in-memory index, send a network message. RAII can ensure each individual resource is cleaned up. It cannot ensure the meaning of “all three or none.” If the disk write succeeds and the network send throws, you have a written-but-unreplicated state on disk. A scope guard could roll back the disk write — but rollback is itself fallible (what if the disk has filled up since?), and now you’re writing the rollback’s rollback.

This is the same problem databases solved with write-ahead logging and two-phase commit. Scope guards approximate it for the in-process case. Distributed systems use sagas. Smart contracts use the checks-effects-interactions pattern. We will see all of these in later chapters; for now, register the point that RAII handles cleanup of known resources, and leaves cross-cutting atomicity as an exercise.

6. RAII does not exist in most languages.

Java, Python, JavaScript, Go, Ruby — none of them have destructors that run deterministically at scope exit. They have approximations: Java’s try-with-resources, Python’s with, C#’s using, JavaScript’s using declarations (recently), Go’s defer. Each of these is a procedural form of RAII: instead of “construct an object, register cleanup,” it’s “explicitly say, at this scope, run this cleanup at exit.”

This is almost equivalent — and where it isn’t, the difference is exactly the asymmetry that bites you. Specifically:

  • They don’t compose through ownership chains. A Python with block runs __exit__ when the block exits; if the object escapes the block (stored in a list you return, say), nothing ties its cleanup to the list’s later lifetime. There is no automatic propagation of “this thing owns that thing, so cleaning up this thing must clean up that thing.”
  • They are syntactic, so users can forget them. RAII in C++ enforces ownership at the type level: if you accept a std::unique_lock<std::mutex> by value, you hold the lock; if you accept an int, you do not. In Python, open(path) and with open(path) as f: look identical to a type checker.

We’ll come back to this in chapter 5. For now, accept that RAII as I’ve described it is a C++-shaped solution, and other languages either approximate it or just live with the consequences.

RAII is necessary and not sufficient

Consider this the chapter’s thesis. RAII is necessary because without it the basic guarantee for resource cleanup becomes a manual discipline maintained by every author of every function — and maintained inconsistently, as the catalog of historical CVEs demonstrates. With RAII, resource cleanup becomes a property of the type system, enforced structurally, and the basic guarantee for resource cleanup becomes free.

RAII is not sufficient because exception safety is more than resource cleanup. It is also invariant preservation, atomicity, and the maintenance of meaning across mutations that may be interrupted. RAII cannot help with these unless you do the work to encode them as resources. Sometimes that’s natural (scope guards, lock guards). Sometimes it isn’t (multi-object updates, cross-system invariants).

The next chapter is about the patterns that take you the rest of the way: from “no leak” to “no observable damage,” which is the strong guarantee.

Further reading

  • Bjarne Stroustrup, “Why doesn’t C++ provide a ‘finally’ construct?” — http://www.stroustrup.com/bs_faq2.html#finally. Stroustrup’s argument that RAII obviates finally. He is right, conditionally on having destructors run deterministically.
  • C++ Core Guidelines, §R: Resource Management — https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-resource.
  • Eric Niebler, “C++ Coroutines: Under the covers”, on how RAII interacts with coroutines (and where it doesn’t): coroutines suspend mid-function, which is exactly the kind of partial-state-with-resources-held situation RAII was designed to prevent. Worth reading after chapter 7.

Strong Guarantee Patterns

The strong guarantee — commit-or-rollback at the function level — is not free, and most code does not need it. But for the operations that do need it (anything that mutates observable state in a way that another thread, process, or user might see partway through), there is a small, well-known set of patterns that achieve it. This chapter walks through them.

The unifying idea behind every pattern in this chapter is the same trick: separate the parts of the operation that can throw from the parts that mutate observable state, and arrange for the throwing parts to happen first, on a side copy. Once everything that can throw is past, the mutation is reduced to a sequence of no-throw operations, and the strong guarantee falls out for free.

The patterns differ in how they arrange for the side copy and the eventual swap. They are:

  1. Copy-and-swap. The classic. Mostly used for assignment operators.
  2. Pimpl with swap. Same idea, applied at the object level via opaque pointer.
  3. Two-phase commit at the function level. Generalization. Build the result, then commit.
  4. Scope guards (commit-or-rollback). When you can’t sensibly build a side copy.
  5. Persistent / functional data structures. When you can avoid mutation entirely.

Copy-and-swap

The pattern is most often shown as an assignment operator:

class Image {
    char* data_;
    size_t size_;
public:
    void swap(Image& other) noexcept {
        std::swap(data_, other.data_);
        std::swap(size_, other.size_);
    }

    Image(const Image& other);  // copy ctor, normal
    ~Image();                   // delete[] data_

    Image& operator=(Image other) {  // by value!
        swap(other);
        return *this;
    }
};

The key moves:

  1. operator= takes its argument by value, not by reference. The parameter other is constructed at the call site via the copy constructor (or move constructor). If that throws, the assignment never runs — *this is untouched.
  2. Inside operator=, all we do is swap. swap is noexcept. We are now guaranteed to not throw.
  3. The original *this state is now in other, which is a function parameter and goes out of scope at the function’s end. Its destructor cleans up.

This achieves the strong guarantee, plus it covers the self-assignment case naturally (assigning x = x makes a copy first), plus it unifies copy-assignment and move-assignment if swap works on both. The cost: a copy. For some types the copy is expensive, in which case you pay it and you provide the strong guarantee, or you don’t and you provide the basic guarantee. Pick.

A subtlety: swap must be noexcept. If swap throws, the whole pattern collapses, because we end up in a half-swapped state with no recovery path. This is why the C++ Standard Library specifies that std::swap is noexcept for types whose move constructors and move assignment operators are noexcept, and why writing your own swap for any non-trivial type means thinking about whether your member-wise swap can throw. Memberwise swap of pointer/integer fields cannot. Memberwise swap of std::string cannot (string’s swap is noexcept). Memberwise swap involving a type with a throwing swap… can. Avoid that.
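
Concretely, a member swap built out of no-throw pieces can honestly carry the noexcept itself. A small sketch:

#include <string>
#include <utility>
#include <vector>

class Document {
    std::string title_;
    std::vector<int> pages_;
public:
    void swap(Document& other) noexcept {
        using std::swap;
        swap(title_, other.title_);  // std::string::swap is noexcept
        swap(pages_, other.pages_);  // std::vector::swap is noexcept
    }
};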

Pimpl with swap

Pimpl is “pointer to implementation.” The class declares an opaque unique_ptr<Impl>, and all the actual data and member functions live in Impl. This is most often discussed as a compile-time-firewall pattern (changes to Impl don’t recompile users), but it also has an exception-safety property: making a copy of an object means making a copy of its Impl, which is one allocation, and swapping is then a pointer swap, which is noexcept.

// header
class Widget {
    struct Impl;
    std::unique_ptr<Impl> p_;
public:
    Widget(/*...*/);
    Widget(const Widget& other);
    Widget& operator=(Widget other) noexcept {
        std::swap(p_, other.p_);
        return *this;
    }
    ~Widget();
    // public API forwards to p_->...
};

// .cpp
struct Widget::Impl { /* lots of fields */ };
Widget::Widget(const Widget& other) : p_(std::make_unique<Impl>(*other.p_)) {}
Widget::~Widget() = default;

Same shape as copy-and-swap. The win is that the swap of an arbitrarily complex Impl reduces to swapping one pointer, which is unambiguously noexcept. The cost is one heap allocation per Widget and one indirection on every member access. For most code this is invisible. For tight inner-loop types, it is unacceptable; pimpl is not free.

Two-phase commit at the function level

The earlier transfer_to rewrite was an instance of this:

void Account::transfer_to(Account& other, int amount) {
    int new_self = balance_ - amount;
    int new_other = other.balance_ + amount;
    log_transfer(amount, &other);     // throwing work
    balance_ = new_self;              // commit (no-throw)
    other.balance_ = new_other;       // commit (no-throw)
}

The structure is: compute the new state into local variables (might throw, fine), do all the throwing work (might throw, fine), then commit the new state with a sequence of no-throw operations (cannot throw).

This is a pattern, not a syntactic construction. You apply it by reading the function and asking: “where are the throwing operations? where are the mutations? can I move all the throwing operations above all the mutations?” If yes, you can provide the strong guarantee. If no, you need a different pattern.

The pattern fails when:

  • The mutation must precede the throwing operation. (Some kinds of inserts into trees: you have to allocate the node and link it in before you can validate the rebalance.)
  • The mutations are themselves throwing. (Inserting into a vector while holding the strong guarantee for the whole batch.)
  • The “side copy” is prohibitively expensive. (The object holds a million elements and you want to modify two.)

For these cases, you have to use scope guards or accept the basic guarantee.

Scope guards (commit-or-rollback)

Scope guards generalize RAII to arbitrary cleanup actions. The pattern is: register an undo operation; if you reach the end of the operation, dismiss the guards (the undos do not run); if you throw before that point, the guards run their undos in reverse order.

template<class F>
class ScopeGuard {
    F f_;
    bool dismissed_ = false;
public:
    ScopeGuard(F f) : f_(std::move(f)) {}
    ~ScopeGuard() { if (!dismissed_) try { f_(); } catch(...) {} }
    void dismiss() noexcept { dismissed_ = true; }
};

template<class F>
ScopeGuard<F> make_guard(F f) { return ScopeGuard<F>(std::move(f)); }

Used:

void Inventory::move_item(Item& it, Bin& from, Bin& to) {
    from.remove(it);
    auto undo = make_guard([&] { from.add(it); });

    to.add(it);  // throws? undo runs, item back in `from`.
    undo.dismiss();
}

This is the general-purpose pattern when you cannot avoid in-place mutation. It is also the pattern you reach for when you have to coordinate cleanup across multiple resources where each cleanup action has its own subtlety. It composes well: each guard’s lambda captures whatever it needs, and the destructor order takes care of running them in reverse.

Caveats:

  • The undo itself must not throw, or if it does, you must absorb the throw (the example does, with try/catch(...) in the destructor). A throwing undo during normal stack unwinding from another exception calls terminate. This is the same constraint as any destructor.
  • The undo must actually undo. Writing the rollback for a complex operation is itself error-prone; in many cases you discover the rollback is approximately as much work as the original action. This is a real cost, not a syntactic one.
  • The undo must be effective even given partial state. If from.remove(it) had partial side effects that from.add(it) doesn’t restore, you didn’t actually roll back. Test the rollback path; this is precisely the part of the code that production almost never exercises.

Scope guards are in the Library Fundamentals TS (std::experimental::scope_exit, proposed for standardization in P0052), Boost (BOOST_SCOPE_EXIT), and folly (folly::ScopeGuard). D has them as a language feature (scope(exit), scope(success), scope(failure)). If your language doesn’t have them, write a five-line version; you’ll use it daily.

Persistent data structures

If you can avoid mutation entirely, the strong guarantee is automatic. A persistent data structure is one where operations return a new structure rather than mutating the existing one, with the new structure sharing as much memory with the old as possible. If the operation throws halfway through, the old structure is unaffected — you never had a reference to the half-built new one.

Clojure’s persistent maps and vectors are the canonical example; Scala’s immutable collections, Haskell’s everything, and (in C++) immer provide similar primitives. The cost is constant-factor memory and time overhead from the structural sharing — usually a small multiplier, sometimes worth it.
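
You don’t need a library to see the principle. A minimal sketch of a persistent singly linked list in C++, where “adding” an element builds a new list sharing every node of the old one:

#include <memory>

template <class T>
struct PList {
    struct Node {
        T head;
        std::shared_ptr<const Node> tail;
    };
    std::shared_ptr<const Node> root;

    // The allocation or T's copy may throw; either way *this is untouched,
    // because nothing here mutates. We only build a new head node.
    PList push_front(T v) const {
        return PList{ std::make_shared<const Node>(Node{ std::move(v), root }) };
    }
};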

This is not a pattern you apply to existing code; it’s a choice you make about your data structures. Where you make it, exception safety becomes a non-issue, because there is no observable mutation to worry about. Where you don’t, the other patterns in this chapter still apply.

When the strong guarantee is achievable, when it’s prohibitive

A pragmatic taxonomy of operations:

  • Pure function (no mutation). Strong guarantee: trivially yes. Cost: free.
  • Single-field assignment. Strong guarantee: yes. Cost: free.
  • Multi-field update of one object. Strong guarantee: usually yes, via two-phase commit. Cost: small; computing on the side.
  • Update of many fields where some computations require partially-built state. Strong guarantee: usually no; use scope guards for partial undos, accept basic. Cost: modest.
  • Cross-object update (e.g. transfer between two accounts). Strong guarantee: yes, via two-phase commit with no-throw assignments at the end. Cost: small.
  • Container insertion with the strong guarantee. The standard library offers this (e.g. vector::push_back if the move is noexcept). Cost: a copy if the move isn’t noexcept.
  • Container insertion that breaks invariants on exception (e.g. a partial sort). No general technique; sort, then commit, if you can. Cost: a full copy of the data being sorted.
  • External effect (file write, network send, syscall with side effects). Approximate strong guarantee via rollback-by-compensation. Cost: variable; the rollback is itself fallible.
  • Distributed state mutation. Strong guarantee impossible without consensus or sagas. Cost: large. See chapters 8 and 9.

The hardest cases are the bottom two. There is no in-process technique that buys you the strong guarantee for an operation that has irreversibly committed state outside the process. You can compensate (saga), you can use a coordinator (two-phase commit at the distributed level), you can shrink the window (write-ahead log + idempotency), but you cannot, in general, make a network send un-happen.

How to recognize the difference

A useful exercise on a function you’ve written: read it line by line, and at each line, ask “what is the visible state of every observable object if I throw right here?” If the answer is “the same as before the call started,” you have the strong guarantee at that point. If the answer is “the same as before, plus some bookkeeping in *this,” you have the basic guarantee at that point and need to either accept it or restructure to push the throwing line earlier.

Most production C++ code, walked through this exercise, reveals that:

  • The first 60% of functions are accidentally strong-guarantee, because they happen to do all their throwing operations first.
  • The next 30% are basic-guarantee, with one or two specific lines that need either reordering (cheap) or a scope guard (medium).
  • The remaining 10% are interesting — they involve cross-cutting state that is hard to roll back, and the right answer is usually “redesign so the operation is no longer transactional,” not “make it transactional.”

That last category is where the lessons of the rest of this book apply most. The smart contract problem (chapter 8) is exactly this: cross-cutting mutable state, with throws (or reentrancy, which is the same shape) capable of interleaving the mutations. The fix is structural, not local.

A worked example: a safe_replace for a vector

Here is a small problem to make the patterns concrete: replace the ith element of a vector with a new value, providing the strong guarantee. Sounds trivial. Let’s see.

template<class T>
void safe_replace(std::vector<T>& v, size_t i, const T& new_val) {
    v[i] = new_val;  // basic guarantee
}

This is basic, not strong. If T’s assignment operator throws partway through copying, v[i] is in some unspecified valid state — possibly different from both the old and new values.

Two-phase commit version:

template<class T>
void safe_replace(std::vector<T>& v, size_t i, const T& new_val) {
    T tmp = new_val;            // copy might throw, fine
    using std::swap;
    swap(v[i], tmp);            // must be noexcept
}

If T’s swap is no-throw (true for almost all standard library types), this provides the strong guarantee. We made a copy of new_val outside the vector; if that throws, the vector is untouched. Once the copy succeeds, we swap, which can’t throw, so we either swap or we never tried.

Move-aware version:

template<class T>
void safe_replace(std::vector<T>& v, size_t i, T new_val) {  // by value
    using std::swap;
    swap(v[i], new_val);
}

Now the caller can pass an rvalue and we move-construct new_val from it. If T’s move constructor is noexcept, this is one move and one swap, both no-throw, and the only operation that could throw is the caller’s expression that produced the rvalue.

The lesson: even a one-line operation has nuances under exception safety, and the right answer is sensitive to the type being operated on. There are no universal rules, only patterns.

What to remember

  • The strong guarantee is achieved by separating throwing operations from mutating operations, and arranging for all throws to happen before any mutation.
  • Copy-and-swap, pimpl-and-swap, and two-phase commit are syntactic encodings of this idea, suited to different shapes of problem.
  • Scope guards generalize to “register an undo, dismiss on success.” They are necessary when you can’t avoid in-place mutation but still want commit-or-rollback semantics.
  • Persistent data structures sidestep the problem entirely by removing mutation.
  • The strong guarantee is not always achievable, especially for operations with external side effects. The next chapters look at the cases where it isn’t, and what to do instead.

Further reading

  • Andrei Alexandrescu and Petru Marginean, “Generic&lt;Programming&gt;: Change the Way You Write Exception-Safe Code — Forever,” C/C++ Users Journal, December 2000. Introduces ScopeGuard.
  • Herb Sutter, Exceptional C++, Items 17–19, on the canonical assignment operator and copy-and-swap.
  • P0052, “Generic Scope Guard and RAII Wrapper for the Standard Library” — adopted into the Library Fundamentals TS (std::experimental::scope_exit and friends).
  • Phil Bagwell, “Ideal Hash Trees,” Tech. Rep. EPFL, 2001 — the underlying paper for Clojure’s persistent maps. Worth reading for the mental model of immutable structural sharing, even if you never implement one.

Exception Safety Across Languages

The previous chapters used C++ as the canonical battleground for two reasons. First, C++ is where the formal vocabulary was developed, and where the design trade-offs are most explicit. Second, C++’s deterministic destruction makes the mechanics visible: you can see, line by line, what runs when. In every other mainstream language, the same problem exists, but the mechanism is hidden behind syntactic sugar or absent entirely, which makes the problem easier to ignore and harder to fix.

This chapter is a tour. Each language gets a section: how exception handling actually works there, what guarantees the language and standard library promise, and where the local culture has decided to look the other way.

Java: checked exceptions and the schism

Java’s distinguishing feature, and the source of its exception-safety neurosis, is checked exceptions. The type system tracks the set of exceptions a method can throw, and callers must either catch them or declare them in their own signatures. The intent was to make error paths visible at the type-system level, the way Haskell makes effects visible.

The reality, after twenty-five years, is that checked exceptions:

  • Force callers to handle errors at the wrong abstraction level (the immediate caller is rarely the right place to handle a SQLException).
  • Bleed implementation details through abstractions (changing a method to use a database means changing every method up the call stack to declare throws SQLException).
  • Drive widespread throws Exception in signatures, which defeats the type check.
  • Drive widespread wrapping of checked exceptions in RuntimeException subclasses, which also defeats the type check.

C# explicitly chose not to have checked exceptions, citing this experience. Newer JVM languages (Kotlin, Scala) have abandoned them. Java itself has effectively abandoned them in idiomatic code: Stream, CompletableFuture, and the java.util.function interfaces all wrap checked exceptions, because they had no choice — Function<T, R> cannot have a throws clause without being parameterized over the exception type, which Java’s generics cannot easily express.
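
For the record, the parameterized form is expressible; it just composes badly, which is part of why the standard interfaces declared nothing and forced wrapping. A hypothetical sketch:

@FunctionalInterface
interface ThrowingFunction<T, R, E extends Exception> {
    // Every combinator (compose, andThen, ...) would need its own
    // exception parameter, and exception types don't union cleanly
    // under generics; hence the wrapping convention instead.
    R apply(T t) throws E;
}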

The exception-safety question in Java is therefore: given that nearly all exceptions are unchecked, how do you reason about partial state on throw?

The honest answer is that Java code is usually basic-guarantee by accident. The garbage collector handles the leak side: any object allocated and then orphaned by a throw will be reclaimed, eventually. There are no destructors, so resources held outside memory require try-with-resources (Java 7+) or explicit try/finally:

try (Connection conn = pool.acquire();
     PreparedStatement stmt = conn.prepareStatement(sql)) {
    // use stmt
}  // close() called on stmt and conn in reverse order, even on throw

This is procedural RAII: a syntactic form that the compiler turns into try/finally. It works for resources that implement AutoCloseable. It does not work for invariants that aren’t tied to a closeable.

For invariant preservation, Java code typically follows the same patterns we covered in C++:

public void transferTo(Account other, long amount) {
    long newSelf = this.balance - amount;
    long newOther = other.balance + amount;
    audit.logTransfer(amount, other);   // can throw
    this.balance = newSelf;             // primitive assignment, can't throw
    other.balance = newOther;
}

Two-phase commit translates directly: a primitive assignment cannot throw, so once the throwing operation is past, the rest is no-throw. (One caveat: a non-volatile long write is not guaranteed atomic with respect to other threads; within a single thread, though, it cannot fail partway.)

Where Java differs from C++ in important ways:

  • No deterministic destruction means you cannot encode “I own this resource” at the type level. The compiler will not enforce that you close() a FileInputStream. try-with-resources mitigates this for narrow scopes, but does not for fields of a long-lived object.
  • Exception chaining (Throwable.getCause()) is good and used widely. Wrapping a low-level exception in a higher-level one without losing the trace is idiomatic.
  • finally runs even when the exception propagates out of the method; code placed after the try statement does not. This is a small thing that bites people: cleanup that must happen before a higher-level handler decides what to do belongs in finally, not in code after the catch block.
  • Throwing from finally shadows the original exception. This is a known footgun; some teams ban any non-cleanup logic in finally for this reason.
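
A minimal sketch of that last footgun (open and process are hypothetical helpers):

static String read() throws IOException {
    InputStream in = open();   // hypothetical
    try {
        return process(in);    // suppose this throws the exception we care about
    } finally {
        in.close();            // if close() also throws, its exception replaces
                               // the one from process(), which is lost
    }
}

try-with-resources avoids this particular trap: an exception thrown by close() during propagation is attached to the original as a suppressed exception rather than replacing it.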

The local culture is to mostly trust that the basic guarantee is provided by GC + try-with-resources, to provide the strong guarantee where it matters by two-phase-commit on primitive fields, and to mostly not think about it otherwise. This works most of the time. The cases where it doesn’t work tend to involve mutable shared state across threads — see chapter 7.

Python: EAFP and the cost of pretending

Python’s culture summarizes itself as Easier to Ask Forgiveness than Permission. The idiomatic pattern is to attempt the operation and catch the exception if it fails, rather than checking preconditions in advance.

try:
    value = d[key]
except KeyError:
    value = default

vs.

if key in d:
    value = d[key]
else:
    value = default

The EAFP version is preferred in Python style guides. It’s also slightly more correct under concurrent mutation (no TOCTOU between the check and the use), and slightly faster in the common case. The point is that exceptions are not exceptional in Python — they are routine flow control for a wide class of operations: dict lookups, attribute access, file I/O, type coercion.

This has consequences for exception safety:

  • Python code throws constantly. AttributeError, KeyError, TypeError, ValueError, StopIteration are all routine. Every line of Python is potentially a throw site.
  • The basic guarantee is the floor and the ceiling. GC handles memory. with handles closeable resources:

    with open(path) as f:
        data = f.read()

  • Beyond this, you are on your own. Python provides no destructors with deterministic timing (__del__ exists, but it runs whenever the GC gets around to it, and is unreliable during interpreter shutdown).
  • The strong guarantee is the user’s problem. There is no syntactic support, no library convention, no idiom. If you want it, you write two-phase commit by hand.

A specific Python pitfall: an __init__ that fails partway leaves the object partially constructed, and if __init__ published self anywhere (a registry, a callback, an attribute of another object) before the failure, the half-built object remains observable. The standard pattern is to do all validation before any mutation or publication, which is good advice in any language.
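
A hypothetical sketch (REGISTRY, Job, and validate are invented for illustration):

REGISTRY = []

def validate(spec):
    if not spec:
        raise ValueError("empty spec")
    return spec

class Job:
    def __init__(self, spec):
        REGISTRY.append(self)        # BAD: published before fully built
        self.spec = validate(spec)   # may raise; REGISTRY now holds a
                                     # Job with no .spec attribute

class SafeJob:
    def __init__(self, spec):
        self.spec = validate(spec)   # validate first...
        REGISTRY.append(self)        # ...publish last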

Python’s contextlib provides ExitStack, which is the closest thing the language has to scope guards:

from contextlib import ExitStack

def update_two_files():
    with ExitStack() as rollback:
        with open('a', 'w') as a:
            a.write('new content')
        rollback.callback(restore_a)   # undo 'a' if a later step fails

        with open('b', 'w') as b:
            b.write('new content')
        rollback.callback(restore_b)

        # Both succeeded: dismiss. pop_all() moves the callbacks to a
        # fresh ExitStack without invoking them; discarding that stack
        # unclosed means the rollbacks never run.
        rollback.pop_all()

This is good and underused. The standard library also provides contextlib.suppress, contextlib.contextmanager, and contextlib.closing — the building blocks of exception-safe Python — and you should know them by heart if you write Python in production.

The specific cost of Python’s culture: because exceptions are routine, every Python function is implicitly the middle of an exception path. The basic guarantee for arbitrary Python code is thus a stronger claim than the basic guarantee for, say, Java code that uses checked exceptions narrowly. Python programmers pay this cost in defects that look like “the cache got into an inconsistent state somehow.” The fix is the same as everywhere else: identify the throwing operations, identify the mutations, and order them.

C#: using and IDisposable, and a more honest exception story

C# took the lessons of Java’s checked-exceptions experiment and chose not to repeat them. All exceptions in C# are unchecked. The language compensates with three things:

  1. IDisposable and using. Same procedural-RAII pattern as Java’s try-with-resources, but introduced earlier and more deeply embedded:
using (var conn = pool.Acquire())
using (var stmt = conn.PrepareStatement(sql)) {
    stmt.Execute();
}

C# 8 introduced using declarations (without the parenthesized scope): disposal happens at the end of the enclosing block, much as Go’s defer runs cleanup at function exit:

public void DoStuff() {
    using var conn = pool.Acquire();
    using var stmt = conn.PrepareStatement(sql);
    stmt.Execute();
}  // both disposed here, in reverse order

This is the closest mainstream syntactic match to C++’s RAII, and it’s used heavily.

  2. Async-friendly exception handling. try/catch works across await points, with the runtime handling the trampolining. This is non-trivial — the call stack at the catch site is a logical async stack, not the physical one — and C# does it transparently. Other languages have struggled with this; we’ll come back to it in chapter 7.

  3. finally clauses that aren’t first-class but are well-integrated. C#’s try/finally does what you expect, and the language has avoided most of Java’s footguns around finally-shadowing.

C# code’s exception-safety story is otherwise close to Java’s: GC handles leaks, using handles closeable resources, two-phase commit handles invariants where it matters, and most code provides the basic guarantee accidentally. The cultural difference is that C# shops more often have explicit conventions about asynchronous cancellation (which is exception-shaped, even when modeled as OperationCanceledException), because the .NET ecosystem has been more rigorous about cancellation tokens than the JVM has been about its various interruption mechanisms.

Rust: the deliberate avoidance of unwinding

Rust has unwinding. Rust does not want you to use it. This is not a contradiction; it is a design choice.

panic! is Rust’s analog of throwing. It triggers stack unwinding (by default; you can compile with panic = "abort" to skip unwinding entirely), which runs Drop implementations on the way up — Rust’s RAII. So the mechanism is there, and Drop provides exception-safe cleanup of resources by exactly the same mechanism as C++ destructors.

But: idiomatic Rust does not panic for recoverable errors. It returns Result<T, E>. The type system enforces this — Result is the conventional error-return type, the ? operator propagates it cheaply, and panic is reserved for “the program has reached a state it cannot meaningfully continue from.”

What does this buy you?

  1. Errors are visible in signatures. A function that can fail returns Result. A function that returns a non-Result cannot fail (modulo panics, which are reserved for programmer error). This is closer to Haskell’s effect system than to Java’s checked exceptions, because Result is just a value, not a parallel control flow.

  2. ? makes propagation ergonomic. let x = foo()?; is almost as concise as exception-throwing equivalents, but the propagation is type-checked.

  3. No invisible throw sites. A line of Rust code can panic only if it does an explicit panic!, an unwrap, an indexed access (which can panic on out-of-bounds), an arithmetic operation that overflows in debug mode, or calls a function that does one of those. The first four are visually obvious; the fifth requires reading.

  4. std::panic::catch_unwind exists for the cases where you must contain a panic — typically at FFI boundaries, where a panic crossing into C code is undefined behavior. It is intentionally awkward, because you should not use it as a general exception-handling mechanism.

  5. Library code is generally written assuming panic = "abort". That is, library authors are encouraged to provide the basic guarantee (no double-frees, no use-after-free, no UB) under panic, but not the strong guarantee for operation atomicity. The expectation is that users who want strong-guarantee semantics build them out of explicit transaction types, not out of panic-recovery.

The interesting case in Rust is poisoning. If a thread panics while holding a Mutex, the mutex is poisoned — subsequent attempts to lock it return Err(PoisonError). This is Rust admitting, in the type system, that exception safety in the presence of shared state is hard, and forcing the user to acknowledge the possibility that the protected state is inconsistent. You can recover from poison (.into_inner()), but the language makes you say so explicitly.
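
A minimal sketch of poisoning in practice:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));

    // A thread that panics while holding the lock poisons it.
    let s = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("died while holding the lock");
    })
    .join();

    // Every subsequent lock() returns Err(PoisonError); recovering the
    // guard requires saying so explicitly.
    let guard = match shared.lock() {
        Ok(g) => g,
        Err(poisoned) => poisoned.into_inner(), // "the state may be inconsistent; I accept that"
    };
    println!("{:?}", *guard);
}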

This is, in the author’s view, the most honest design choice any mainstream language has made about exception safety. Most languages let you silently ignore the possibility of inconsistent state after a panic. Rust makes you write code that names it.

Go: panic, recover, and pretending

Go has panic and recover. Go programmers, by strong convention, do not use them.

The official line — restated repeatedly by the Go team — is that panic is for unrecoverable errors and programmer mistakes. Recoverable errors are returned as values, with the famous if err != nil ceremony at every step. recover is provided so that a server’s request handler can survive a panic in user code without crashing the whole process, and basically not for any other purpose.

This works, in a Go-shaped way. The if err != nil pattern is verbose but makes every error path visible. The defer statement provides procedural RAII: any defer’d call runs at function exit, including on panic.

func transferTo(self *Account, other *Account, amount int) (err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("transfer panicked: %v", r)
        }
    }()
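    // Note: the recover above converts a panic into an error, but it does
    // not undo the decrement below; the manual rollback only covers
    // logTransfer returning an error, not panicking.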
    self.balance -= amount
    if err := logTransfer(amount, other); err != nil {
        self.balance += amount  // manual rollback
        return err
    }
    other.balance += amount
    return nil
}

A few things to notice:

  • The exception-safety problem from chapter 1 is exactly the same in Go. Returning err instead of throwing does not change the fact that self.balance was decremented before logTransfer failed.
  • The fix is exactly the same: either reorder so the throwing operation is first, or do manual rollback. Go’s lack of exceptions does not make this disappear; it just makes the err != nil checks visible.
  • defer provides RAII for the cases where it works (file close, mutex unlock). Go’s idiomatic use of defer immediately after acquiring the resource is the syntactic equivalent of constructing an RAII object.
  • The Go community’s relative indifference to exception-safety vocabulary is, in my view, a side effect of believing that “no exceptions” means “no exception-safety problem.” It does not.

A specific Go pitfall: panic in a deferred function will replace the original panic. This is sometimes useful, often dangerous, and the rules for what recover returns in nested deferreds are subtle enough that production Go code generally avoids any panic logic that can’t be expressed as the canonical “wrap a request handler in a recover.”

JavaScript: a cautionary tale

JavaScript has try/catch/finally, throws everything from any line, has no destructors, and no real concept of resource ownership. The story for exception safety is roughly:

  • Use try/finally for cleanup. There is no using, though there is a TC39 proposal for explicit resource management (using declarations) that is at Stage 3 as of this writing — so by the time you read this, modern JS may have something equivalent to C#’s using.
  • The garbage collector handles memory.
  • For everything else, you are on your own, and your one tool is try/finally.
  • async/await propagates throws across await points, similar to C#. Exceptions thrown in async functions become rejected promises, which is occasionally confusing.
  • Unhandled promise rejections used to silently fail; modern runtimes warn loudly about them.

The single most important pattern in async JavaScript for exception safety is the choice between Promise.all and Promise.allSettled: the former rejects as soon as any input rejects, the latter waits for every input to settle. If you launch parallel async operations and one throws, you almost always want Promise.allSettled and explicit handling, because Promise.all’s behavior (the first rejection wins; the other operations keep running, but their results are discarded) is rarely what you wanted, and the discarded results may have side effects you weren’t planning to deal with.
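
A sketch of the allSettled shape, inside an async function (writeA, writeB, and rollback are hypothetical):

const results = await Promise.allSettled([writeA(), writeB()]);
const failures = results.filter((r) => r.status === 'rejected');
if (failures.length > 0) {
  // Every operation has settled by now, success or failure, so we can
  // compensate knowing exactly which side effects actually happened.
  await rollback(results);
}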

JavaScript’s culture around this is, charitably, casual. The basic guarantee is provided by the GC and try/finally. The strong guarantee is rare enough that most code does not pretend to provide it. Production bugs that look like “the cache state got weird after that one error in 2017” are common, and almost always trace to half-completed mutations on an exception path.

A small comparison table

| Language | Resource cleanup mechanism | Exception type-checking | Cultural posture |
|---|---|---|---|
| C++ | RAII (deterministic destructors) | noexcept opt-in | Exception safety formalized by Abrahams; widely understood, unevenly applied |
| Java | try-with-resources | Checked exceptions (declining) | Mostly trust GC; two-phase commit on primitives where it matters |
| C# | using (block and declaration) | All unchecked | Similar to Java, more rigorous about async cancellation |
| Python | with, contextlib.ExitStack | None | EAFP — exceptions are routine; basic guarantee is the floor |
| Rust | Drop | Result<T, E> for errors | Panics are programmer-error; mutex poisoning forces acknowledgment |
| Go | defer | None; errors are values | “We don’t use panic” — but the underlying problem is identical |
| JavaScript | try/finally, using (proposed) | None | Casual; relies heavily on GC |

What’s common across all of them

Stripped of the syntactic differences, the same problem appears in every language:

  1. A mutation may be partially applied when a non-local control transfer happens.
  2. The mechanism that runs cleanup (destructor, defer, finally, Drop, using, with) only addresses cleanup of registered resources, not preservation of invariants.
  3. The strong guarantee, where you want it, requires the same patterns: separate the throwing work from the mutating work, and arrange for the mutation to be no-throw at the moment of commit.

If you internalize the patterns from chapters 1–4, you can apply them in any of these languages. The syntax changes; the structure does not. This is the value of the formal vocabulary: it is portable.

The next chapter introduces a system that, I will argue, is genuinely different — strictly more powerful than any of the above — and that almost nobody uses.

Further reading

  • Anders Hejlsberg interview, “The Trouble with Checked Exceptions,” 2003 — the C# designer’s case against Java’s choice. https://www.artima.com/articles/the-trouble-with-checked-exceptions
  • Effective Java by Joshua Bloch, items 49–77 (the exceptions chapter). The standard treatment for Java idioms.
  • The Rust Programming Language, chapter 9 (“Error Handling”). Specifically the panic-vs-Result section.
  • Steve Klabnik, “The Rust Panic Hooks,” for the recovery-at-FFI-boundary use case. Also the standard library documentation for std::panic::catch_unwind.
  • “Errors are values,” Rob Pike, https://go.dev/blog/errors-are-values. The Go team’s stated position.

The Common Lisp Condition System

I want to be careful in this chapter. Lisp partisans have been telling the rest of the industry that Lisp solved exception handling forty years ago, and the rest of the industry has been declining to listen, and the result has been a bad-faith argument on both sides. Lisp partisans usually overstate their case (the condition system is not a silver bullet, and the things it enables are usable but not always wanted). The non-Lisp world usually understates it (the condition system is not a fancy try/catch; it does things that try/catch cannot do, no matter how it’s dressed up).

What I’m going to argue here is the narrower claim: the Common Lisp condition system is strictly more expressive than any mainstream exception system, the additional expressiveness solves a specific class of exception-safety problem that other systems cannot, and the reason it is not more widely adopted is sociological rather than technical.

You do not have to like Lisp to learn from this chapter. You do have to be willing to read parenthesized code for ten minutes.

What try/catch actually is

Before showing what’s different, let’s nail down what mainstream exception handling does.

When you throw (in any of C++/Java/C#/Python/JavaScript), the call stack between the throw site and the catch site is unwound. The frames are gone. Whatever local state existed in those frames is destroyed (in C++, via destructors; in GC languages, made unreachable). Control resumes at the catch handler, which now stands in a fresh frame above whatever frame the catch was in.

This is terminating exception handling. The thrown error terminates everything between the throw and the catch.

Two consequences:

  1. The handler cannot fix the problem at the site of the failure. It can only pick up the pieces at a higher level. If a parser at depth 12 of the call stack hits a malformed token and you’ve got the catch at depth 1, you can decide what to do at depth 1, but you cannot decide to keep parsing from depth 12 — frame 12 is gone.

  2. The handler has no information beyond what the throw site put in the exception object. If the throw site forgot to include some piece of context, you can’t go back and ask. The frames that had that context are destroyed.

These are not limitations of any particular implementation; they are constitutive of what termination means. Every try/catch system in mainstream use is a termination system.

What the condition system adds

Common Lisp’s condition system separates signaling a condition from deciding what to do about it. When code signals, the stack is not immediately unwound. Instead, the runtime walks up the stack looking for handlers, calls each handler with the condition, and the handler can choose:

  1. Ignore. Decline; the next handler up the stack gets a chance.
  2. Handle by transferring control. This is the equivalent of catch: the handler unwinds the stack to its own frame and runs.
  3. Handle by invoking a restart. This is the new thing. The handler can invoke a named recovery action that the signaling code registered at or below the signal point. The signaling frame is still alive — it has not been unwound. The restart runs in that frame, can use its locals, and returns a value to the signal site, after which execution continues normally.

That last option is what termination systems cannot do. The try/catch model destroys the signaling frame; the condition system preserves it, and lets recovery happen in place.

A concrete example

Here is a small example, in Common Lisp. We have a function that parses lines from a log file, with the option of letting the user fix bad lines and try again, skip the line, or use a default value.

(define-condition malformed-log-line (error)
  ((line :initarg :line :reader bad-line)))

(defun parse-line (line)
  (restart-case
      (if (well-formed-p line)
          (parse-it line)
          (error 'malformed-log-line :line line))
    (use-default ()
      :report "Use a default empty entry."
      (make-empty-entry))
    (skip-line ()
      :report "Skip this line and continue."
      nil)
    (use-value (new-line)
      :report "Use a corrected version of the line."
      :interactive (lambda () (list (read-line)))
      (parse-line new-line))))

(defun parse-log (file)
  (with-open-file (s file)
    (loop for line = (read-line s nil :eof)
          until (eq line :eof)
          for entry = (parse-line line)
          when entry collect entry)))

restart-case registers three named restarts: use-default, skip-line, and use-value. Each has a name and a handler body. When something signals 'malformed-log-line inside the dynamic extent of restart-case, the restarts are available but not yet invoked.

Now the caller can choose how to handle:

;; (1) Skip all malformed lines silently.
(handler-bind ((malformed-log-line
                (lambda (c)
                  (declare (ignore c))
                  (invoke-restart 'skip-line))))
  (parse-log "/var/log/messy.log"))

;; (2) Substitute a default for malformed lines.
(handler-bind ((malformed-log-line
                (lambda (c)
                  (declare (ignore c))
                  (invoke-restart 'use-default))))
  (parse-log "/var/log/messy.log"))

;; (3) Re-throw to the next handler up.
(handler-bind ((malformed-log-line
                (lambda (c)
                  (declare (ignore c))
                  ;; do nothing; next handler up will see it
                  )))
  (parse-log "/var/log/messy.log"))

;; (4) Drop into the debugger interactively.
(parse-log "/var/log/messy.log")
;; -> if no handler is bound, the user gets a prompt:
;;    "1: USE-DEFAULT — Use a default empty entry.
;;     2: SKIP-LINE   — Skip this line and continue.
;;     3: USE-VALUE   — Use a corrected version of the line."

Notice (1) through (3): the same parser code, with the handling policy set by the caller, declaratively, without modification to the parser. This is the equivalent of catching the exception — but the parser is not unwound. After invoke-restart 'skip-line fires, control returns to inside parse-line, which returns nil from its restart-case, and the loop in parse-log continues to the next line.

Notice (4): with no handler bound, the system drops into the debugger and asks the user, at runtime, which restart to invoke. This is not a stack trace. It is a live, interactive choice presented to a human, with the program paused, and the human’s answer determining how the program continues. After the choice, execution resumes from the signal site.

This is, frankly, magic. It is also old (the design dates to the early 1980s, mostly attributed to Kent Pitman) and has been stable for thirty-plus years.

Why this is more powerful than try/catch

Termination systems can model restart-style recovery only by encoding the recovery as a return value or callback parameter, threaded through every intermediate function. The condition system makes it part of the dynamic environment, like exception handling itself.

Concretely:

  • In a try/catch system, if you want the parser to retry on a corrected line, you must either: (a) make parse-line take a callback that returns Either<ParsedLine, NeedsRetry> and write the retry logic at the call site, or (b) catch the exception, fix it, and call parse-line again — but the call has to be at the catch site, which is at the outer level, so you’ve lost the parser’s state.
  • In the condition system, the parser’s restart-case declares “here are the recovery actions I support.” The caller’s handler-bind says “for this kind of condition, invoke this recovery.” The two are decoupled, the parser keeps its state, and the recovery runs inside the parser.

This is exception-safety relevant in the following way: a strong-guarantee operation in the condition-system world can sometimes avoid the strong guarantee entirely by recovering in place. The whole problem of “unwind, partial state, atomicity” only arises because we chose to unwind. The condition system gives us a way not to.

Where the condition system genuinely shines

Three classes of problem:

  1. Validation with user interaction. The parsing example above. A long-running batch process that hits malformed input can pause, ask the operator what to do, and continue from that point, rather than aborting the batch or writing logic to checkpoint and resume.

  2. Library code with policy choices that belong to the caller. A network library that hits a slow response can signal a slow-response condition, with restarts wait-longer, use-cached, give-up. The library doesn’t have to guess what the caller wants. The caller doesn’t have to write a parameter for every possible policy.

  3. Recoverable resource exhaustion. Out of memory? Signal memory-low with restarts use-disk-cache, evict-oldest, give-up. Memory pressure handlers in the condition-system world can be policies registered higher up the stack, transparent to allocation sites.

Note the pattern: each of these is a place where, in mainstream languages, the choice between “fail” and “recover” happens at the call site of the failure-detection function, requiring those call sites to know what the policy is. The condition system pushes the policy out to the caller, where it belongs, without changing the API of the failure-detection function.

Where the condition system does not help

  • Resource cleanup on unwinding. When a handler chooses to terminate (transfer control out, unwinding the stack), the cleanup story is the same as anywhere else. Common Lisp has unwind-protect, which is the Lisp try/finally. There is no automatic destructor mechanism — Lisp objects are GC’d, like Java. So RAII-style scoped cleanup is not the strength here; the condition system addresses the control-flow part of exception safety, not the resource-cleanup part.

  • Atomicity across multiple mutations. Recovering in place doesn’t help if you’ve already mutated half the world. The use-value restart in the example works because parsing a line is mostly pure; if it weren’t, you’d still need the strong-guarantee patterns from chapter 4.

  • Performance. Restart machinery has runtime cost. SBCL and CCL pay for it on every signal, not on every potential signal site, but the cost is non-zero.

The condition system is a complement to, not a replacement for, the disciplines we’ve been discussing.

Why almost nobody uses it

If the condition system is so powerful, why is it not in every modern language?

The honest answers are partly technical and partly social:

  1. Most languages chose termination handling first, and the design is not retrofittable. Once your runtime unwinds the stack on throw, you cannot offer in-place recovery without redesigning the runtime. Some languages (Smalltalk, Dylan) have condition-system-like mechanisms; they came from the same lineage.

  2. The condition system is genuinely more complex to learn than try/catch. The trade-off — separating signaling, handling, and restart — adds a vocabulary and a discipline that most programmers will not learn unless they are forced to. Termination handling, for all its limitations, fits in your head in five minutes.

  3. The Lisp ecosystem is small. The condition system is the most expressive feature of a language whose user base never reached the critical mass that would make adopting its ideas obviously profitable. The mainstream languages that have borrowed from Lisp (lambdas, closures, garbage collection, REPLs) borrowed things easier to retrofit.

  4. Restartable signals make compiler optimization harder. A function that may signal — and continue — is harder to inline, to reason about for parallelism, to hoist invariant computations out of. C++ chose, deliberately, to make the no-throw path zero-cost; the condition system implies a small but pervasive cost on the signaling path.

  5. The interactive debugger is a hard sell to teams that ship to production. Common Lisp’s “drop the user into the debugger and offer restarts” is a development-time superpower and a production-time horror. Production code typically binds the top-level handler to log-and-exit, which throws away most of the value. The condition system shines most in the development loop, which is where it began.

None of these are good arguments for the design choice the rest of the industry made. They are explanations.

A worked example: a strong-guarantee batch operation in Common Lisp

Here is the condition-system equivalent of the Inventory::move_item from chapter 4:

(define-condition move-failed (error)
  ((item :initarg :item :reader failed-item)
   (reason :initarg :reason :reader failed-reason)))

(defun move-item (item from to)
  (let ((removed-from-from nil)
        (added-to-to nil))
    (unwind-protect
         (handler-case
             (progn
               (bin-remove from item)
               (setf removed-from-from t)
               (bin-add to item)
               (setf added-to-to t))
           (error (c)
             ;; rollback
             (when (and removed-from-from (not added-to-to))
               (bin-add from item))
             (error 'move-failed
                    :item item :reason c)))
      ;; cleanup that runs whether we succeeded or threw
      nil)))

Two things to notice:

  1. The structure is the same as the C++ scope-guard version. The condition system does not magically eliminate the need for two-phase commit or rollback; the resource-cleanup and invariant-preservation problem is independent of the control-flow problem.

  2. The condition system additionally allows us to expose recovery options to callers, which try/catch cannot:

(defun move-item (item from to)
  (restart-case
      (progn
        (bin-remove from item)
        (handler-case (bin-add to item)
          (bin-full (c)
            (declare (ignore c))
            ;; the item is out of FROM and not yet in TO; signal and
            ;; let the caller pick a restart to resolve the limbo
            (error 'move-failed :item item :reason :destination-full))))
    (force-into-overflow ()
      :report "Put the item in the overflow bin and continue."
      (bin-add (overflow-bin) item))
    (return-to-source ()
      :report "Put the item back in the source bin."
      ;; we know we removed it; put it back
      (bin-add from item))))

A caller that binds a handler for move-failed and invokes the force-into-overflow restart can move items in bulk without aborting the batch on the first full destination, and without changing move-item’s signature (a sketch follows below). The recovery policy lives at the call site of the bulk operation, where the policy belongs; the per-item function exposes the choice and lets the caller bind it.
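
;; Bulk move: when a destination fills up, spill to the overflow bin
;; and keep going. move-item and the bins are from the example above.
(defun move-all (items from to)
  (handler-bind ((move-failed
                  (lambda (c)
                    (declare (ignore c))
                    (invoke-restart 'force-into-overflow))))
    (dolist (item items)
      (move-item item from to))))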

The same pattern can be approximated in mainstream languages by passing a callback. The difference is that the condition-system version makes the callback dynamic-scoped and named, which means callers many frames up the stack can bind it once and have it picked up by every nested call to move-item, without intermediate functions having to forward it.

This is what dynamic scope is good for, and most languages do not have it because of well-known issues with dynamic scope (action at a distance, hard to type-check). The condition system uses dynamic scope exactly for the case where it is most useful and least dangerous: error handling, where the “signal sender doesn’t know the receiver” pattern is structural.

What I want you to take from this chapter

I do not expect you to switch to Common Lisp. I want you to know that:

  1. Termination is a choice, not a fact about exception handling. Choosing termination buys simplicity and loses expressiveness. Most languages made this choice silently.

  2. Recovery in place is possible, and where it’s possible, it sidesteps the strong-guarantee problem entirely — there’s no atomicity question if the operation never partially completed.

  3. Dynamic-scoped policies are a powerful idea that has been mostly forgotten outside Lisp. Whenever you find yourself threading a “what to do on failure” callback through twelve layers of API, you are reinventing condition handlers, badly.

  4. The interactive-debugger-with-restarts workflow changes what “encountering a bug” means in development. If you have never had a bug pause your program, drop you into a REPL with full access to the live state, let you inspect and fix the problem, and then resume from where the bug occurred, you do not know how dehumanizing it is to write code without that capability. Try it once. The Common Lisp implementations SBCL and CCL are free and easy to install; Practical Common Lisp by Peter Seibel is online and free; an afternoon will show you what the rest of us have lost by not having this.

The next chapter goes back to the mainstream world, where the condition system does not exist, and looks at what happens to exception safety when you add concurrency on top of it.

Exception Safety in Concurrent Code

Exception-safe sequential code is hard. Exception-safe concurrent code is roughly an order of magnitude harder, and the reason is not subtle: in sequential code, the time between “an exception fires” and “control reaches a handler” is short and involves only one thread of control. In concurrent code, that interval may be observable by other threads, which may act on the partially-mutated state before the throwing thread has cleaned up, before any handler has run, before any rollback could conceivably take effect.

This chapter walks through the specific problems that emerge when exceptions and concurrency interact: lock-holding under throw, partial state visible across threads, the interaction with thread cancellation, and the surprisingly subtle question of what “the strong guarantee” even means in a multi-threaded context.

The basic problem: locks held under throw

class Cache {
    std::mutex mu_;
    std::unordered_map<std::string, std::string> entries_;
public:
    void put(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(mu_);
        entries_[key] = value;       // (1) might throw bad_alloc
        update_metadata(key);        // (2) might throw
        notify_observers(key);       // (3) might throw
    }
};

If update_metadata at (2) throws, the lock is released by lock_guard’s destructor — that part is fine. But what state is entries_ in? entries_[key] = value succeeded. Whatever invariant update_metadata was supposed to maintain in another part of the cache (perhaps a separate lookup table, an LRU list, something) has not been maintained. The cache is now internally inconsistent, the lock is released, and another thread is free to observe the half-updated state.

This is the internal version of the problem from chapter 1, with one new wrinkle: in sequential code, the throwing thread will eventually unwind to a handler that can decide to fix or ignore the inconsistency. In concurrent code, any other thread can call Cache::get between the throw and the handler, and will see whatever state the throw left.

Rust calls this poisoning and bakes it into std::sync::Mutex. If a thread panics while holding the lock, the next attempt to acquire the lock returns Err(PoisonError) instead of granting access. Code that wants to access the (potentially inconsistent) state has to call .into_inner() on the error explicitly, acknowledging that “yes, I know the state may be wrong, I’m taking responsibility.”

C++ has no equivalent. C++’s std::mutex happily releases on the throw and grants the next acquisition without comment. Java and Go are the same. The discipline of “leave the protected state consistent before you throw” is, in the non-Rust languages, entirely on the programmer’s shoulders, and the language has no way to remind you that you forgot.

The fix for the snippet above:

void put(const std::string& key, const std::string& value) {
    // do throwing work outside the lock if possible
    auto new_entry = make_entry(key, value);  // might throw
    auto new_metadata = compute_metadata(key); // might throw

    std::lock_guard<std::mutex> lock(mu_);
    // critical section is now no-throw
    entries_[key] = std::move(new_entry);
    metadata_[key] = std::move(new_metadata);
}

Reorder so all throwing operations happen outside the lock, and the critical section becomes a sequence of no-throw moves and assignments. This is two-phase commit at the concurrent level: the “side copy” is constructed under no-lock, then committed under-lock with no-throw operations.

This pattern is not always achievable. If you need to read protected state in order to compute the new state (read-modify-write), you can’t do the computation outside the lock. The standard answer here is atomic compare-and-swap: read the current state, compute the new state without holding the lock, then atomically swap if the state hasn’t changed in the meantime, retrying if it has. Lock-free data structures generalize this. But that is a different chapter, in a different book.
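
A minimal sketch of that read-modify-write shape, assuming the state fits in a single atomic (add_clamped and the clamping policy are invented for illustration):

#include <algorithm>
#include <atomic>

std::atomic<int> counter{0};

void add_clamped(int delta, int max) {
    int cur = counter.load();
    for (;;) {
        int next = std::min(cur + delta, max);  // computed off to the side
        // Commit only if nobody changed counter in the meantime; on
        // failure, compare_exchange_weak reloads cur and we retry.
        if (counter.compare_exchange_weak(cur, next)) break;
    }
}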

Strong guarantee under concurrency: redefined

Recall the strong guarantee: “either the operation completes, or the visible state is exactly as before.”

Under concurrency, visible to whom? Three answers, increasingly demanding:

  1. Visible to the same thread. After the throw, this thread sees the same state it would have seen without calling the operation. Easy, satisfied by careful sequential design.

  2. Visible to other threads after the throw is fully handled. Once this thread’s exception has propagated to its handler and any cleanup has run, other threads see consistent state. Achievable with two-phase commit under a lock, as above.

  3. Visible to other threads at every moment, including during the operation. Other threads may observe the operation in progress, but they see the state as either fully-before or fully-after, never partially-applied. Strict strong guarantee. Requires either no observable intermediate state (atomic update) or a coordination mechanism that makes the intermediate state invisible (RCU, copy-on-write).

Most real concurrent code provides (1) by accident, (2) by careful design, and (3) only when the data structure is explicitly designed for it. The C++ standard library’s concurrency primitives (std::shared_mutex, the various atomic types, etc.) provide (2) under their stated contracts and (3) only for individual atomic operations.

The interesting cases are when (3) should be required and isn’t. Example: a logging system where a log line is written to a buffer and a sequence number is incremented. If the buffer write throws after the sequence number is incremented, an external observer reading “current sequence number” sees a number that doesn’t correspond to any actual log line. Subtle, occasionally consequential.

Thread cancellation: the special case

Some platforms support thread cancellation: another thread requests this thread to stop, and at well-defined cancellation points, the thread is interrupted. POSIX threads have pthread_cancel; Java has Thread.interrupt(); .NET has cooperative cancellation tokens.

In C++, on most implementations (glibc’s NPTL in particular), cancellation is implemented as a special exception: a “forced unwind” that runs destructors normally. The intent is that cleanup runs but the cancellation itself cannot be suppressed: the exception can be caught by catch (...), but it must be re-thrown, and implementations that detect a swallowed forced unwind terminate the process. In practice, this means that any code that is exception-safe in the ordinary sense is also cancellation-safe, modulo the rule that every catch (...) must let the cancellation exception propagate.

This is mostly fine, but it introduces a subtle constraint: any catch (...) clause in C++ might be catching a cancellation exception, and re-throwing it (or letting it propagate) is essential to allowing the cancellation to actually take effect. The idiom:

try {
    do_work();
} catch (...) {
    cleanup();
    throw;  // critical: re-throw to propagate cancellation
}

Code that does catch (...) {} (catch all and swallow) defeats the mechanism: depending on the implementation, the cancellation is either silently absorbed or, where the runtime detects the swallowed forced unwind, the process is terminated. This has caused real bugs.

Java’s interruption is similar in spirit: InterruptedException is checked, and the convention is that catching it without re-throwing or restoring the interrupted status leaves the thread in a state where the interruption has been silently lost. Java Concurrency in Practice covers this in detail (see this chapter’s further reading); the short version is “call Thread.currentThread().interrupt() after catching InterruptedException if you don’t re-throw.”
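
The idiom in code (queue, Task, and process are hypothetical):

void drainOne(BlockingQueue<Task> queue) {
    try {
        process(queue.take());   // take() blocks and is interruptible
    } catch (InterruptedException e) {
        // We are not re-throwing, so restore the flag: code further up
        // the stack can still observe that cancellation was requested.
        Thread.currentThread().interrupt();
    }
}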

In Go, context.Context is the cancellation mechanism, and it is not exception-shaped: it is a value-based signal, with code expected to check ctx.Done() at strategic points. This avoids the silent-swallow problem at the cost of explicit checks. Whether you prefer this trade-off depends on your views about visibility-versus-ergonomics in error handling.

Async cancellation

Async/await frameworks (C# Task, Python asyncio, JavaScript Promise/async, Rust async) all have to deal with cancellation across await points, where the call stack at the cancellation point is logical rather than physical. The languages handle this in different ways:

  • C# / .NET: CancellationToken passed explicitly. Awaited operations check the token; on cancellation, throw OperationCanceledException. The exception unwinds through await boundaries normally.
  • Python asyncio: Task.cancel() causes the next await to raise CancelledError at the awaiting site. The stack is logical; the unwinding happens as the awaited tasks return errors that propagate up.
  • JavaScript: No built-in cancellation. AbortController provides an out-of-band signal; libraries are expected to check it. There is no exception-shaped cancellation mechanism in the language.
  • Rust async: Cancellation is “drop the future.” Drop runs destructors. The future cannot run any more. This is, in some ways, the cleanest model — cancellation is just resource cleanup of an unfinished computation — but it has surprising consequences for code that “must complete” parts of an operation.

The Rust model deserves a second look, because it interacts with exception safety in a way that is not obvious. If a Rust async function has done step_1() and is partway through await step_2(), and the future is dropped, step_2’s Drop runs, and the function’s local state is dropped, but step 1’s effects on external state remain. The basic guarantee for resource cleanup is preserved (the Drop chain runs), but the strong guarantee for the operation’s logical atomicity is not, because Rust can’t synthesize a rollback of step_1. As of this writing, the async-Rust community is still building patterns for this — the cancellation-safety literature is recent and ongoing.
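
A minimal sketch of drop-as-cancellation, using tokio’s select! (step_one and step_two are invented; tokio is the assumed runtime):

use tokio::time::{sleep, Duration};

fn step_one() {
    println!("external effect committed"); // stands in for a real side effect
}

async fn step_two() {
    sleep(Duration::from_secs(10)).await;
}

async fn operation() {
    step_one();       // the effect happens
    step_two().await; // ...and if the future is dropped while parked here,
                      // nothing below ever runs, and step_one is not undone
}

#[tokio::main]
async fn main() {
    tokio::select! {
        _ = operation() => {}
        _ = sleep(Duration::from_millis(10)) => {
            // select! drops the losing future: its locals run Drop,
            // but step_one's external effect remains
        }
    }
}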

Lock ordering and deadlock under exception

A specific concurrent failure mode worth flagging: if you acquire two locks and an exception fires between them, the cleanup must release them in the right order. RAII gets this right automatically (destructors run in reverse construction order). But if you acquire locks in a function and re-throw to the caller, the caller may then try to acquire the same locks in a different order, deadlocking.

void Service::transfer(Account& a, Account& b, int amount) {
    std::scoped_lock lk(a.mu_, b.mu_);  // deadlock-free acquisition
    // ...
}

std::scoped_lock (C++17) is the right answer: it acquires multiple mutexes in a deadlock-free order using std::lock, releases them in reverse on destruction, and is exception-safe by construction. Older code using std::lock_guard paired with std::lock separately is more error-prone. The footgun: if you lock a.mu_ first and then throw before acquiring b.mu_, you’re safe. But if a different code path locks b.mu_ first and then a.mu_, you have lock-ordering inconsistency, and that can deadlock with the first path. Exception cleanup never causes the deadlock; the deadlock was structural, just exposed by the exception path having different timing.

std::scoped_lock and the equivalent in other languages exist because the manual version is so easy to get wrong. Use them.

A small disaster: the destructor-throws-during-unwinding interaction

Combined with concurrency, the rule “destructors must not throw” becomes harder to satisfy. A destructor that, say, releases a lock and then logs to a remote service is fine in normal operation. But during stack unwinding (triggered by a failure somewhere else in the system), the remote service may itself have become unreachable. The destructor’s logging call throws, the runtime sees a throw during unwinding, and std::terminate is called.

The fix is, again, to absorb exceptions in destructors:

~RemoteLogTransaction() {
    try {
        if (!committed_) remote_log_.send_rollback();
    } catch (...) {
        // swallow; we're already unwinding
        local_log_.write("rollback failed; remote service unreachable");
    }
}

This is ugly. It is also necessary. Every destructor in concurrent C++ code that does anything beyond pointer cleanup needs to consider this case.

What concurrent exception safety looks like, in practice

A short list of the disciplines that work:

  1. Do throwing work outside critical sections. Compute the new state with no lock held; commit under-lock with no-throw operations. Two-phase commit at the concurrent level.

  2. Use lock-free or atomic primitives where the strong concurrent guarantee is required. std::atomic<T>::compare_exchange for read-modify-write. std::shared_mutex for read-mostly state. These are tools designed for the case where the state-consistency window must be zero.

  3. Use std::scoped_lock (C++17) or equivalents for multi-mutex acquisition. The deadlock-avoidance and exception-safe release are built in.

  4. Handle thread cancellation explicitly. Re-throw cancellation exceptions in C++. Restore interrupt status in Java. Honor cancellation tokens in C#. Treat cancellation as a first-class case, not as “exceptions are exceptional.”

  5. Audit destructors for exception throws, especially in concurrent code where the calling environment may be more error-prone than the development environment. A destructor that throws in production but not in test is the worst kind of production bug.

  6. Recognize that mutex poisoning (Rust) is a feature, not a bug. Other languages should learn from it. Where the language doesn’t help, leave a comment that the protected state may be inconsistent after a throw, and treat that as a first-class failure mode.

  7. Prefer immutable data with atomic pointer swaps over mutable data with mutexes. The exception-safety argument is: an immutable structure cannot be partially mutated, because it cannot be mutated at all. A swap of a pointer to it is atomic. The throw-during-mutation problem disappears. The cost is a copy.
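
A minimal C++ sketch of pattern 7, publishing an immutable snapshot behind an atomic pointer (C++20’s std::atomic<std::shared_ptr>; Config and the function names are invented):

#include <atomic>
#include <memory>

struct Config {              // treated as immutable after construction
    int timeout_ms;
};

std::atomic<std::shared_ptr<const Config>> current_config;

void update(int timeout_ms) {
    // Build the new snapshot off to the side; if this throws, the
    // published config is untouched.
    auto next = std::make_shared<const Config>(Config{timeout_ms});
    current_config.store(std::move(next));  // no-throw, atomic commit
}

std::shared_ptr<const Config> snapshot() {
    return current_config.load();  // either the old or the new, never partial
}

(Pre-C++20 code can use the std::atomic_load / std::atomic_store overloads for shared_ptr instead.)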

The honest summary

Concurrent exception safety is not a discipline anyone gets right by intuition. It is a set of patterns, applied with care, and continually undermined by the ordinary engineering pressures of “this code worked, let’s not touch it” and “we don’t have time to audit every destructor.” The result, in practice, is that almost every long-lived C++ codebase contains some number of latent exception-safety bugs in its concurrent code, and these manifest occasionally as production incidents whose post-mortem says “we shipped a fix to handle the case where x throws while y was in state z.”

The next chapter is about how this exact problem reappears in a place that does not involve try/catch at all, but is nonetheless the same problem in different costume.

Further reading

  • Hans-J. Boehm, “Threads Cannot Be Implemented as a Library,” PLDI 2005. Foundational paper on why concurrency must be a language-level concept, with implications for exception interaction.
  • Java Concurrency in Practice, Goetz et al., chapter 7 (“Cancellation and Shutdown”). The clearest treatment of the interrupt-and-cancellation interaction in any language.
  • “Cancellation safety in async Rust” — see the tokio documentation and the Rust async working group’s ongoing discussion. As of this writing, the formal definition is still being refined.
  • The Rustonomicon, “Poisoning” section: https://doc.rust-lang.org/nomicon/poisoning.html

The Reentrancy Connection

Here is the central thesis of this chapter: Ethereum smart-contract reentrancy attacks are exception-safety bugs. Not analogous to exception-safety bugs. Not metaphorically related. Structurally identical. The same shape of mistake, with the same fix. The only thing different is the costume.

If this seems like a stretch, stay with me. By the end of the chapter I want this to feel obvious.

The shape of the problem, abstracted

In any programming language with the following properties, the bug exists:

  1. There is a function F that mutates state across multiple steps.
  2. Between two of those steps, F calls another function G.
  3. G’s execution is, from F’s perspective, opaque — F does not know what G will do.
  4. G is capable of re-entering F’s context: calling back into F, observing F’s partially-mutated state, or transferring control such that the rest of F is not yet executed.

Substitute concrete things for F, G, and “re-enter”:

| F | G | “re-enter” |
|---|---|---|
| C++ function with multi-step state update | A throwing operation | The throw skips the rest of F |
| Method holding a mutex while modifying shared state | Code that releases the lock | Another thread sees partial state |
| Smart contract function | An external contract call | The external contract calls back into F |
| Signal handler | A non-async-signal-safe library call | A second signal arrives |
| Database transaction | Application code that reads | Application reads partially-committed state |

These are, again, not analogous. They are the same problem. In each case, F made a mistake by assuming control would return to it linearly after G returned, and that assumption was wrong. Whether the assumption was wrong because G threw, or because G blocked while another thread interfered, or because G was an external contract that called back, is a matter of detail. The structural error is the same.

A brief refresher: how Ethereum reentrancy works

Smart contracts on Ethereum are programs that hold balances of Ether (or tokens) and execute when called. A contract function may call another contract — and that other contract is, from the caller’s perspective, opaque, since contracts are deployed independently and the caller may not know what the callee does.

The classic vulnerable pattern, in Solidity:

contract Vulnerable {
    mapping(address => uint) public balances;

    function withdraw(uint amount) public {
        require(balances[msg.sender] >= amount);

        // (1) Send the Ether
        (bool success,) = msg.sender.call{value: amount}("");
        require(success);

        // (2) Update the balance
        balances[msg.sender] -= amount;
    }
}

The flaw: at step (1), Ether is sent to msg.sender. If msg.sender is a contract, that contract’s code runs as part of receiving the Ether (Ethereum allows this; receiver code runs synchronously within the sender’s transaction). That code can call withdraw again, recursively, before step (2) has run, which means balances[msg.sender] still holds its original value. The check at the top still passes. Money flows out a second time, and a third, until the gas runs out, the contract is drained, or the attacker chooses to stop.

The DAO hack of 2016 was exactly this. Approximately $50 million in Ether was drained from a contract that held investor funds. A community-divisive hard fork was performed to recover most of the funds, splitting the chain into Ethereum and Ethereum Classic. It is the most consequential exception-safety-shaped bug in the history of computing, by dollar value.

Why this is exception safety

Re-read the Solidity code with chapter 1’s vocabulary in mind.

withdraw is a function that mutates state across multiple steps. Between checking the balance and decrementing it, it makes an external call. The external call is opaque — withdraw does not know what msg.sender.call will do. The external call is capable of re-entering withdraw — by calling back into the same contract — and observing the partially-mutated state (specifically, the state where the balance has not yet been decremented).

This is the same mistake as Account::transfer_to from chapter 1, with one variable renamed:

void Account::transfer_to(Account& other, int amount) {
    balance_ -= amount;
    log_transfer(amount, &other);   // throws — control transfers out
    other.balance_ += amount;
}

In Account::transfer_to, control transferred out via throw. The state at that moment: balance_ decremented, other.balance_ not yet incremented. Whoever runs next sees inconsistent state.

In Vulnerable::withdraw, control transferred out via external call. The state at that moment: Ether sent, balances[msg.sender] not yet decremented. Whoever runs next — including a re-entrant call from the very contract being called — sees inconsistent state.

The same partial-mutation, the same “control unexpectedly leaves the function,” the same observable inconsistency. In one case the mechanism is throw, in the other it’s an external call. In one case the observer is the catch handler, in the other it’s a re-entering caller. The bug is the same.

The fix is the same too

Recall the fix for transfer_to from chapter 4: do all the throwing work first, then commit with no-throw operations.

void Account::transfer_to(Account& other, int amount) {
    int new_self = balance_ - amount;
    int new_other = other.balance_ + amount;
    log_transfer(amount, &other);     // throwing work first
    balance_ = new_self;              // no-throw commit
    other.balance_ = new_other;
}

Now look at the fix for the smart contract, by Solidity convention called “checks-effects-interactions”:

function withdraw(uint amount) public {
    // CHECKS
    require(balances[msg.sender] >= amount);

    // EFFECTS (commit state changes first)
    balances[msg.sender] -= amount;

    // INTERACTIONS (external calls last)
    (bool success,) = msg.sender.call{value: amount}("");
    require(success);
}

The order is: validate, then mutate state, then make external calls. By the time the external call runs, the state is fully consistent — if msg.sender’s code re-enters and reads balances, it sees the updated value. The re-entrant call’s require(balances[msg.sender] >= amount) will fail (or correctly succeed against the new, lower balance). No double-spending.

This is exactly the two-phase commit pattern from chapter 4. Compute and commit the no-throw mutations first; do the operations that may transfer control (throw, external call, send Ether) last. The order matters because what comes after the maybe-transferring operation may not happen, and the state must be self-consistent at the moment control could leave.

The Solidity community, having lost $50 million, codified “checks-effects-interactions” as a best practice. The C++ community, having dealt with exception safety for thirty years, codified “two-phase commit” as a best practice. Different communities, different vocabularies, identical pattern.

Other smart-contract incidents that are exception-safety bugs

The DAO is the famous one. Several others are worth knowing:

The Parity multisig wallet, 2017

A library contract used by many wallets had a function that, because of a missing access modifier, could be called by anyone. An attacker called it with arguments that re-initialized the library, then called another function that destroyed it, freezing approximately $300 million in Ether across all wallets that depended on it. The bug was not technically reentrancy, but the shape — a function leaves observable state in a configuration its callers did not expect — is the same family.

The bZx incidents, 2020

Multiple separate attacks on the bZx lending protocol exploited reentrancy-adjacent issues: a borrowed asset’s price could be manipulated by the borrower (using flash loans) before the loan-issuance code re-checked solvency. The attacker took advantage of the gap between “loan issued” and “solvency re-checked,” which is the same gap exception safety exists to close.

Cream Finance, 2021

A reentrancy attack on a lending protocol drained roughly $19 million in August 2021 (a later, unrelated oracle-manipulation attack on the same protocol took about $130 million that October). The contract called a token’s transferFrom function before updating its internal accounting; the token implemented ERC-777-style receiver hooks, which hand control to the recipient during a transfer, and the attacker’s recipient contract used that hook to call back into the lending contract, observe stale accounting, and borrow against the same collateral repeatedly.

The pattern is: ERC-777 specifically allows token transfers to invoke receiver hooks. A contract written assuming ERC-20 semantics (transfers don’t trigger receiver code) is vulnerable when used with ERC-777 tokens. The shape of the assumption that broke is “the call I’m making won’t transfer control back to me.” This is the exception-safety assumption again, just at a different protocol layer.

The deeper insight: control-flow assumptions

Here is the abstraction worth carrying forward: every function has implicit assumptions about what the calls it makes will do with control.

In a “safe” call, control returns linearly to the caller, with the callee having done its job. The caller’s reasoning is local: “after this line, the state is X, because that’s what this function does.”

In an “unsafe” call, control may not return linearly:

  • It may not return at all (the callee throws, terminating up the stack).
  • It may return after re-entering the caller (the callee calls back, possibly calling the very function the caller is in).
  • It may be interrupted between the call and the return (a signal arrives, a thread is preempted, a transaction is aborted).

In all three cases, the caller’s local reasoning — “after this line, the state is X” — is wrong, because whatever ran between the call and that line may have changed the state. The caller’s job is to write code that is correct under all of these possibilities, which means not assuming linear control flow across any non-trivial call.

This is exhausting to apply to every line of every function. So we develop discipline: the two-phase commit pattern, checks-effects-interactions, the strong guarantee. Each is a structural answer to “how do I make my code correct without proving the linearity of every call?”

The answer is always the same: finish your state mutations before you make a call whose effect on control flow you don’t fully understand.
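
As a minimal sketch of that rule — the Counter type and its observer mechanism are hypothetical, invented for this illustration — consider a counter that notifies opaque callbacks:

#include <functional>
#include <vector>

// Hypothetical: a counter that notifies observer callbacks on change.
// The callbacks are opaque: they may throw, or call increment() again.
struct Counter {
    int value = 0;
    std::vector<std::function<void(int)>> observers;

    void increment() {
        value += 1;                    // finish the state mutation first
        for (auto& notify : observers)
            notify(value);             // opaque calls last: if one throws or
    }                                  // re-enters, value is already consistent
};

Swap the order — notify, then mutate — and every pathology in the table above becomes reachable: a throwing observer skips the mutation, a re-entering observer reads the stale value.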

A general checklist for “is this code reentrancy/exception-safe?”

For any function F, walk through:

  1. Identify every line where F calls something that might transfer control out of F and back in. This includes: any call that might throw; any call to external code, contracts, or callbacks; any call that releases a held lock or signals other threads; any call that yields in cooperative multitasking.

  2. For each such line, identify the state of F’s observable variables at that moment. “Observable” means visible to anything that could see the state during the period control is out of F: other threads, re-entrant callers, external observers.

  3. Ask: if control returns to F after some other code has run in between, is F’s state still consistent? Are the invariants still true?

  4. If not, restructure: either move the partial mutation after the transfer-of-control point (so the transfer sees pre-mutation state), or use a lock or guard to make the partial state invisible.

This checklist applies, unchanged, to:

  • C++ functions that throw
  • Smart contract functions that call external contracts
  • Concurrent functions that release locks
  • Signal handlers that call non-reentrant code
  • Database transactions that yield to other transactions

It is the same checklist. The reason exception-safety vocabulary is not used in the smart-contract world is that the smart-contract world inherited its vocabulary from a different family of disciplines — fault-tolerance, distributed systems, security — and arrived at the same place by a different road. The C++ world inherited its vocabulary from systems-programming-and-RAII. They are talking about the same thing.
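
To make step 4 of the checklist concrete for the concurrent row, a sketch (the inventory example is hypothetical): complete the multi-step mutation while the lock is held, and keep the call that may throw outside the critical section.

#include <mutex>
#include <string>

struct Inventory {
    std::mutex m;
    int reserved = 0;
    int available = 100;     // invariant: reserved + available == 100

    void log_reservation(const std::string&) { /* stand-in for real logging */ }

    bool reserve_one(const std::string& order_id) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (available == 0) return false;
            available -= 1;            // both halves of the mutation complete
            reserved  += 1;            // before the lock is released
        }
        log_reservation(order_id);     // may throw; no lock held, and the
        return true;                   // invariant already holds
    }
};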

What the smart-contract community got right that the C++ community didn’t

I want to give credit. The smart-contract community has done two things better than the C++ community in this area:

  1. Static analyzers for reentrancy are widely deployed. Tools like Slither, MythX, and Securify warn on potentially-reentrant patterns out of the box. They are not perfect, but they catch the obvious cases, and they run by default in many CI pipelines. The C++ analog — static analyzers for exception safety — is in nowhere near as healthy a state, as we’ll see in chapter 10.

  2. The “checks-effects-interactions” pattern is part of basic Solidity education. Every introductory Solidity tutorial in 2024 covers it. By comparison, “the strong guarantee” is not part of the C++ curriculum at most universities, even in advanced courses. Engineers can finish a CS degree and a job interview without hearing the term.

The cost differential is part of why. A reentrancy bug in a smart contract costs eight figures, by the time the dust settles. An exception-safety bug in C++ costs a production incident and a post-mortem. Both are bad; one is more concentrated.

What the C++ community got right that the smart-contract community didn’t

In return:

  1. The strong guarantee is a sharper concept than checks-effects-interactions. “Checks-effects-interactions” is a pattern; “the strong guarantee” is a contract. The C++ vocabulary lets you say “this function provides the strong guarantee” as part of an API contract, with implications for callers. Solidity has nothing comparable. A Solidity function that follows checks-effects-interactions does not, by following it, expose any invariant to its callers — the caller has to inspect the implementation.

  2. The patterns generalize. Two-phase commit, scope guards, copy-and-swap apply to operations beyond the simple “external call last” case. The smart-contract community has the simple case well-handled but tends to flounder on more complex cases (multi-step operations with intermediate external calls, cross-contract atomicity).

If the two communities talked to each other more, both would benefit. The vocabulary on the C++ side is more precise; the tooling on the Solidity side is more deployed. Combining them would produce a better state of the art than either has on its own.

What this chapter wanted to leave you with

  1. Reentrancy is exception safety in another costume. Same problem, same fix, same reasoning patterns. The fact that the two communities developed independent vocabularies for it is a fact about software engineering’s limited self-awareness, not about the problem.

  2. Any function that calls something with non-trivial control-flow semantics is in the same situation. Whether the non-trivial semantics is “might throw,” “might recurse,” “might block,” or “yields to another fiber” is detail. The structural lesson is: finish your state mutations before you do that.

  3. The two-phase commit pattern is universal. You will see it again in databases, in distributed sagas, in lock-free programming, in signal handlers. Once you recognize it, the variations across domains are just costume.

The next chapter pulls more places out of the woodwork where this same problem hides under different names.

Other Places This Problem Hides

Now that we have the shape of the problem clearly in mind — partial state mutation interrupted by an unexpected control transfer — this chapter is a tour of other places it hides under different names. Each of these is a real, distinct domain, with its own vocabulary and its own folklore, but each is structurally exception safety.

Signal handlers and async-signal-safety

A POSIX signal can interrupt a process at almost any instruction. The signal handler runs on the same thread, in the same address space, with the original code’s state in whatever condition it happened to be in.

This is exactly the exception-safety problem, with extra hostility. The interruption is at the instruction level rather than at function-call boundaries. The original code was not written with the signal handler in mind; the signal handler may not know what state the original code was in.

POSIX defines a list of async-signal-safe functions — those that may be safely called from a signal handler. The list is short: write, read, _exit, signal, sigaction, a handful of others. Conspicuously absent: malloc, free, printf, fprintf, anything that touches stdio, anything that allocates memory, anything that takes a lock. Why? Because the signal could have interrupted the original code in the middle of any of those operations. If the original code was halfway through malloc and held malloc’s internal lock, and the signal handler calls malloc, the handler deadlocks (or worse, depending on how malloc is implemented).

The structural correspondence:

Exception safety                               Async-signal-safety
Function might throw                           Signal might be delivered
Stack unwinds, partial state visible           Signal handler runs, partial state visible
Use no-throw operations in critical sections   Use only async-signal-safe operations in handlers
Two-phase commit: do throwing work first       Do everything in main code; handler only signals

The standard pattern for signal handlers in production C: do as little as possible in the handler. Set a flag (specifically, a volatile sig_atomic_t), and let the main code poll it. This is the inverse of the strong guarantee — the handler defers all real work to the main code, which can do it safely.

#include <signal.h>

void handle_signal(void);   // defined elsewhere
void do_work(void);         // defined elsewhere

volatile sig_atomic_t signal_received = 0;
volatile sig_atomic_t running = 1;

void handler(int sig) {
    (void)sig;
    signal_received = 1;
    // that's it. Do nothing else.
}

int main(void) {
    signal(SIGINT, handler);
    while (running) {
        if (signal_received) {
            handle_signal();    // safe; we're in main code
            signal_received = 0;
        }
        do_work();
    }
}

signalfd (Linux), kqueue/EVFILT_SIGNAL (BSD), and pthread_sigmask plus a dedicated signal-handling thread are all variations on the same idea: get the signal out of the asynchronous interrupt context and into a context where you can do real work safely.

Interrupt handlers in kernel code

Move down the stack to the kernel, and the same problem reappears with even harder constraints.

A hardware interrupt — disk I/O completion, timer tick, network card receive — runs the interrupt handler at high CPU priority, possibly with other interrupts disabled, possibly on a CPU stack that is not the kernel’s normal stack. The handler must:

  • Not block on locks held by the code it interrupted (deadlock).
  • Not allocate memory in ways that might block (memory allocators are themselves interruptible).
  • Not call most of the kernel’s services, because those services may have been mid-operation.
  • Run quickly, because the interrupted code is waiting and other interrupts are queued.

The Linux kernel’s pattern is a split handler: a small “top half” that runs in interrupt context and acknowledges the hardware, and a larger “bottom half” (variously: tasklets, softirqs, workqueues) that runs at lower priority and does the actual work. The same shape as the userspace signal handler pattern: do as little as possible at high-disruption priority, defer the rest to a context where the constraints are weaker.

The exception-safety language for this: the top half must provide the no-throw guarantee to everything below it. It cannot leave shared state inconsistent, cannot fail in ways that propagate, cannot release locks it didn’t take. The “no-throw” here is at the level of “no failure is permitted to escape,” because the kernel above has no way to handle one if it did.

If you have written a kernel driver and dealt with the constraints on IRQ handlers, the rules in the C++ exception-safety chapters of this book should feel familiar. They are the same rules, applied at a different system layer.

Hardware exceptions: page faults and divide-by-zero

A hardware exception — page fault, divide-by-zero, illegal instruction, alignment violation — fires synchronously when an instruction can’t complete. The hardware transfers control to the kernel’s exception handler.

This is literally a CPU-level GOTO to an unguessable destination, with no software intermediary. The instruction was halfway through some operation; the hardware preserves enough state to return to the next instruction (or retry the failing one); the kernel decides what to do.

The interesting case for this book is page faults. A page fault is not a bug — it is the mechanism by which demand paging, copy-on-write, mmap’d files, and stack growth are implemented. Almost every memory access in a userspace program could, in principle, fault. The kernel handles the fault by allocating a page and resuming.

But: the original instruction is in some intermediate state when the fault occurs. On most architectures, the CPU re-runs the instruction after the kernel handles the fault, with the page now mapped. The instruction’s effects so far are either fully rolled back by hardware (good architectures) or partially preserved (some bad cases on older or stranger hardware). Most modern architectures (x86-64, ARM64) make memory accesses precise — the instruction is either fully complete or fully not — but this is itself a deliberate design property, paid for in microarchitectural complexity.

The exception-safety lens: hardware exceptions are the strong guarantee implemented in silicon. The instruction either committed its effects entirely or did not. The CPU spends real complexity to provide this; the alternative (basic guarantee at the instruction level) would make demand paging unimplementable, because no software fault handler could reason about what state the instruction left things in.

This is one of the deepest examples in the book of how the strong guarantee is the foundation on which other things rest. Strip it away at the hardware level and most of the operating system becomes incoherent.

Database transaction boundaries

A database transaction is commit-or-rollback. The strong guarantee, with industrial-strength implementation. ACID’s “A” is “atomicity,” which is exactly the property the strong guarantee names.

Most working programmers think of transactions as a database thing rather than an exception-safety thing. Look at it from the other direction:

  • A transaction commits or rolls back. The strong guarantee.
  • Rolled-back transactions leave no observable state. The strong guarantee’s contract.
  • The transaction log (write-ahead log, undo log) is the database’s mechanism for making rollback possible even after partial application — it is the system-level analog of a scope guard’s stored undo action.
  • Two-phase commit (the database protocol) is two-phase commit (the C++ pattern) at a different scale: prepare in a way that is reversible, then commit in a way that is not.

The vocabulary is older than C++’s. The mechanisms are the same.
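
As a reminder of what that analogy refers to, a minimal scope guard in the spirit of chapter 4 (this sketch is mine, not the chapter’s full machinery): the stored undo action is the function-local undo log.

// A stored undo action, run on scope exit unless explicitly dismissed.
template <class Undo>
struct ScopeGuard {
    Undo undo;
    bool active = true;
    ~ScopeGuard() { if (active) undo(); }   // rollback path
    void dismiss() { active = false; }      // "commit": discard the undo
};
template <class Undo> ScopeGuard(Undo) -> ScopeGuard<Undo>;

// usage:
//   auto g = ScopeGuard{[&] { balance += amount; }};  // the undo action
//   risky_operation();                                // may throw
//   g.dismiss();                                      // commit point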

A specific exception-safety problem at the transaction boundary: what does your application code do if the database driver throws after the transaction has been committed but before the commit’s confirmation reaches the application? Specifically:

try:
    cursor.execute("INSERT ...")
    conn.commit()                    # network call to DB
    update_local_cache(item)         # local mutation
except DatabaseError:
    rollback_local_cache(item)
    raise

If conn.commit() succeeds on the database side but the connection is dropped before the success is reported, the driver throws. The exception handler runs. update_local_cache was never called. The application’s local cache says “no item,” the database says “yes item.” The application has desynchronized from the system of record, in a way that no rollback can fix — the database commit did happen, irreversibly, before the failure was reported.

This is the Two Generals’ Problem applied to a single client and a single server, and it has no in-process solution. The only way forward is idempotency on retry: design the operation so that re-trying it is safe whether the previous attempt committed or not. Exactly-once delivery does not exist in the network model; at-least-once with idempotency does.
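
A sketch of that retry discipline, with the database stubbed out as an in-memory set (the names and the three-attempt policy are illustrative, not a real driver API):

#include <set>
#include <stdexcept>
#include <string>

std::set<std::string> payments;   // stand-in for a table with a UNIQUE key

// True if inserted now; false if a previous attempt already committed
// this key. In real code: an INSERT against a unique idempotency key.
bool insert_payment(const std::string& idempotency_key) {
    return payments.insert(idempotency_key).second;
}

void record_payment(const std::string& key) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        try {
            insert_payment(key);   // duplicate key == already done == success
            return;
        } catch (const std::exception&) {
            // commit outcome unknown (e.g., connection dropped); retrying
            // is safe: if the earlier attempt landed, the key makes this
            // attempt a no-op rather than a double payment
        }
    }
    throw std::runtime_error("payment unconfirmed after retries");
}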

The exception-safety lesson: the strong guarantee, end-to-end, requires more than two-phase commit at the function level when there’s a network in the middle. You can have the strong guarantee for the database. You can have the strong guarantee for the local cache. You cannot have it for “both atomically” without a distributed coordinator, and even then with caveats. This is the territory of consensus algorithms, the subject of which is its own book.

Saga compensations in distributed systems

When you can’t have transactions across services — and you usually can’t, in a microservices architecture — you have sagas. A saga is a sequence of local transactions, each in a different service, with compensating transactions defined for rollback.

Order Service:    create order        (compensate: cancel order)
Payment Service:  charge card         (compensate: refund card)
Inventory:        reserve stock       (compensate: release stock)
Shipping:         schedule delivery   (compensate: cancel delivery)

If step 3 fails, the saga runs the compensations for steps 1 and 2 in reverse: release nothing (step 3 didn’t run), refund the card, cancel the order.

This is literally the scope-guard pattern from chapter 4, scaled up to multiple services and made durable. Each “compensation” is the rollback action a scope guard would run. The “dismiss on success” is the saga’s normal completion. The orchestrator (or choreographed event flow) is what guarantees that compensations actually run.

The way sagas differ from in-process scope guards:

  • Compensations may themselves fail. If the refund call fails, what now? You have a sequence of nested rollbacks, each of which can fail, and at some point a human has to look at the system and decide.
  • Compensations are not always exact inverses. “Send a marketing email” has no compensation: you can’t unsend the email. The saga either declines to include un-compensable steps, or accepts that some compensations are best-effort.
  • Compensations may not be valid if too much time has passed. Refunding a payment six months after the charge is harder than refunding it immediately. State drifts.

These are all reasons why distributed sagas are harder than they look, and why most production “distributed transaction” systems quietly accept that they are best-effort and have manual reconciliation processes for the cases where compensation fails.

The exception-safety language: a saga provides the basic guarantee at the system level (no resources leaked) and approximates the strong guarantee (rollback to before-state) but cannot generally provide it. “Approximates” is doing a lot of work in that sentence; it is in the gap between approximation and provision that real money lives.
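
The shape, as an in-process sketch (a hypothetical orchestrator; real sagas are durable and survive process crashes, which this elides entirely):

#include <functional>
#include <vector>

struct Saga {
    std::vector<std::function<void()>> compensations;

    template <class Step, class Compensate>
    void run(Step step, Compensate comp) {
        step();                          // local transaction; may throw
        compensations.push_back(comp);   // record its undo for later
    }

    void abort() noexcept {
        // run compensations in reverse order, like destructors
        for (auto it = compensations.rbegin(); it != compensations.rend(); ++it) {
            try { (*it)(); }
            catch (...) { /* compensation failed: flag for manual reconciliation */ }
        }
    }
};

The catch (...) in abort is where the bullet list above lives: a compensation that fails has no automatic recourse, and the honest implementation records the failure for a human.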

Cooperative multitasking yields

A coroutine that yields to another coroutine has, from its own perspective, paused. From the perspective of any code that observes the yielded coroutine’s mutable state, the coroutine has left state in whatever shape it was in at the yield.

This is the same problem as a lock release in concurrent code: the yielded coroutine has not finished its mutation. Whatever state it was modifying may be partially modified. Other coroutines that read that state see the half-modified version.

In strict cooperative multitasking, the yielding coroutine controls when it yields, so it can ensure it yields only at points where its state is consistent. This is the cooperative version of “two-phase commit”: don’t yield in the middle of a multi-step state mutation. Modern async/await is closer to this — await is an explicit yield point, and the programmer can structure code so awaits happen between consistent states.
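
A sketch of the discipline, with std::this_thread::yield standing in for an await point (the Pair type and both functions are illustrative):

#include <thread>

struct Pair { int a = 0, b = 0; };   // invariant: a == b

// BAD: suspends between the two halves of the mutation. Anything that
// runs at the yield can observe a != b.
void bump_bad(Pair& p) {
    p.a += 1;
    std::this_thread::yield();       // "await" in the middle of the mutation
    p.b += 1;
}

// GOOD: suspend only while the state is consistent; commit both halves
// with no suspension point between them.
void bump_good(Pair& p) {
    int next = p.a + 1;
    std::this_thread::yield();       // "await" before any mutation
    p.a = next;
    p.b = next;
}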

But under preemptive scheduling (Erlang processes, Go’s goroutines, preemptible green-thread runtimes), the runtime can interrupt at points the code did not choose. In those systems, the same care that applies to multi-threaded code applies, because preemption is exception-shaped: an unexpected control transfer that may leave state half-modified.

Memory allocation under low-memory conditions

A subtle one. A program that allocates memory under low-memory conditions may have an allocation throw (bad_alloc in C++) or return null (C, some other languages). The point at which the allocation fires is not predictable in advance — it depends on the running state of the entire process. Any allocation, in principle, could fail.

Code written to handle allocation failure gracefully is rare. Most code assumes allocations succeed, with bad_alloc propagated up to a top-level handler that logs and aborts. This is a defensible position — if you’re out of memory, your options are limited — but it means any function that allocates may be the source of a throw, which is most functions in any language with dynamic allocation, which is most languages.

The exception-safety implication: in a code base that wants to provide guarantees under allocation failure, the allocation discipline becomes pervasive. Every append to a vector, every use of std::string, every container operation is a potential allocation site. Certifying “this code is exception-safe under allocation failure” requires either avoiding allocations in critical paths (which is the embedded-systems and high-frequency-trading discipline) or accepting that the strong guarantee under allocation failure is not on offer.
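
At its smallest, the discipline is “allocate first, mutate after,” as in this sketch: do the throwing (allocating) work before the first mutation, so the commit steps cannot fail.

#include <vector>

// Invariant: keys.size() == values.size(). push_back may allocate and
// therefore throw; reserving up front moves all throwing work before
// the first mutation, so the two commit steps cannot fail midway.
void append(std::vector<int>& keys, std::vector<int>& values, int k, int v) {
    keys.reserve(keys.size() + 1);       // may throw; nothing mutated yet
    values.reserve(values.size() + 1);   // may throw; nothing mutated yet
    keys.push_back(k);                   // cannot throw: capacity reserved,
    values.push_back(v);                 //   and int copies never throw
}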

The Linux kernel, famously, allocates with GFP_ATOMIC in interrupt handlers, where allocation may simply return null rather than wait. Driver code is full of conditional logic for this: try to allocate, fall back to a slower path if the allocation fails, ultimately drop a packet rather than block. This is exception safety, written as null-checks instead of try/catch, with the same structure.

Cancellation in async runtimes

We touched on this in chapter 7. Cancellation in async runtimes — Tokio, asyncio, .NET Tasks — is exception-shaped: a request to abort an in-progress operation, propagated either as an exception (most runtimes) or by dropping the future (Rust). The cancelled operation has, by definition, not finished. Its in-progress state may be half-mutated.

The cancellation-safety question is: if your async function is cancelled at any await point, will it leave behind a consistent world?

This is the same question as exception safety. The framing is younger, the literature is thinner, and the patterns are still being worked out, but the underlying problem is the same. The Rust async community in particular has been building vocabulary around cancellation safety; expect the next decade of literature on it to repeat, in different terms, the patterns the C++ community established for exceptions in the 1990s.

A general pattern emerges

Across all of these:

Domain                  The “control transfer”         The “partial state”                 The pattern
C++ exceptions          throw                          Local state when throw fires        Two-phase commit, RAII
Smart contracts         External call                  Storage state during call           Checks-effects-interactions
Concurrent code         Lock release / thread switch   Shared state at switch              Two-phase commit under lock
Signal handlers         Signal delivery                Process state when signal arrives   Async-signal-safe handlers
Hardware exceptions     Trap                           Instruction state at trap           Architectural precise exceptions
Database transactions   Transaction abort              Tuples mid-update                   Write-ahead log + rollback
Sagas                   Step failure                   Cross-service state                 Compensating transactions
Async/cancellation      Cancel signal                  Await-suspended state               Cancellation-safe code
Allocation failure      bad_alloc / null               Local state at allocation           Allocate first, mutate after

In each row, the same shape: something might transfer control out of the function unexpectedly; the function must be written so that whatever state it’s in at the moment of transfer is recoverable or compensatable. The local mechanism differs. The structural pattern is identical.

This is, I think, the most useful thing to take away from the book. Once you see the shape, you see it everywhere. And the patterns the C++ community developed for the exception case — RAII, two-phase commit, scope guards, the strong guarantee — turn out to be the patterns everyone needs, in every domain. They are sometimes rediscovered with different names. Sometimes they are not rediscovered at all, and the bugs that result are reported as “logic errors” or “race conditions” or “reentrancy attacks,” and the post-mortems treat them as separate phenomena. They are not separate phenomena.

The next chapter is about the tooling that exists, mostly inadequate, for catching these.

Tooling

Static analyzers for exception safety, as a general property, mostly do not exist. This is a stronger statement than I want to make, so let me qualify it: there are tools that catch some exception-safety bugs, in some languages, by looking for specific patterns. There is no tool, in any language I know of, that takes a function and tells you which of the three guarantees it provides. The general problem is approximately as hard as program verification, and we have not solved program verification.

This chapter is an honest tour of what tooling exists, what it catches, and where the gaps are.

What “static analysis for exception safety” would have to do

Imagine the ideal tool. You give it a function and ask “what guarantee does this provide?” To answer, the tool must:

  1. Identify every operation in the function that might throw.
  2. For each potential throw point, identify the state of every variable, every member of *this, every observable side effect that has been performed up to that point.
  3. Determine whether that state is consistent with the function’s preconditions — i.e., whether the function’s invariants hold.
  4. Determine whether any resources have been acquired but not yet released, accounting for RAII.
  5. Aggregate these results across all potential throw points to determine the weakest guarantee provided.

Each of these steps is a hard problem.

(1) requires knowing the throw set of every called function. This is transitive: a function that calls a function that calls a function that throws can throw. C++’s lack of a meaningful “throws what” annotation at the type-system level means this analysis must be inferred from source, which means it depends on having source for every called function, which is often not true at library boundaries.

(2) requires whole-function dataflow with abstract interpretation. Doable for small functions; hard for big ones, especially with branching, loops, and pointer aliasing.

(3) requires a specification of what the function’s invariants are. The compiler does not know what invariants you intend. This is the part that demands either annotations from the programmer or an after-the-fact specification language.

(4) is mostly tractable for languages with deterministic destruction and an “owns” relationship encoded in types. RAII makes this almost easy in C++. In garbage-collected languages it’s harder, because resource ownership is not visible in types.

(5) requires combining all the above into a guarantee classification, which is at least the join of a lattice of state-shapes, which is, well, expensive.

This is what would be required. Unsurprisingly, no tool does all of this. Most do parts of (1) and (4); a few do parts of (2); essentially none address (3) or (5) in generality.

What actually exists, in C++

noexcept and friends

The most useful “tool” in C++ for exception safety is the noexcept specifier itself, queried statically through the nothrow type traits and the noexcept(expr) operator:

#include <type_traits>
#include <utility>

template<class T>
void Container<T>::move_or_copy(T& dst, T& src) {
    if constexpr (std::is_nothrow_move_assignable_v<T>) {
        dst = std::move(src);   // move: fast, and cannot throw for this T
    } else {
        dst = src;              // copy: slower, but src is intact if the copy throws
    }
}

This is the standard library’s pattern (compare std::move_if_noexcept): branch on whether the move is non-throwing to choose between move (faster, basic guarantee under throw) and copy (slower, strong guarantee under throw). It’s a tool in the sense that the type system enforces correctness here — if T’s move assignment lies and throws despite being marked noexcept, std::terminate is called.

It is not a tool for checking exception safety; it is a tool for expressing a guarantee in code that other code can branch on. The check is local and shallow.
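
For completeness, the query form that the traits package up. noexcept(expr) is a compile-time predicate over an unevaluated expression:

#include <string>
#include <utility>

// noexcept(expr) asks: can evaluating expr throw, according to the
// noexcept-ness of everything it calls? The expression is unevaluated.
static_assert(noexcept(std::declval<int&>() = 1),
              "built-in int assignment cannot throw");
static_assert(!noexcept(std::declval<std::string&>() + "x"),
              "string concatenation may allocate, hence may throw");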

clang-tidy and cppcheck

clang-tidy has a handful of checks relevant to exception safety:

  • bugprone-throw-keyword-missing — flags Exception(...) constructed without throw.
  • bugprone-exception-escape — flags functions marked noexcept (or destructors, which are implicitly noexcept) that may, transitively, throw.
  • cppcoreguidelines-avoid-goto — irrelevant, but I include it because the chapter is about GOTOs.
  • cert-err58-cpp — flags static and thread_local initializations that can throw before main() begins, where no handler can exist.
  • cert-err54-cpp — flags handlers ordered such that base classes catch derived exceptions. (Trivial.)

cppcheck has similar coverage. None of these tools tell you whether a function provides the basic or strong guarantee. They catch the stupid cases — destructors that throw, noexcept functions that throw — and that is the limit of their ambition.

clang-static-analyzer and the path-sensitive cousins

The path-sensitive analyzers (Clang Static Analyzer, Coverity, PVS-Studio) can sometimes find more. Coverity has a “RESOURCE_LEAK” checker that catches some leaked resources on exception paths. PVS-Studio has a checker for “object is not destroyed before throwing.” These are real and useful, and they catch a class of bugs that the simpler clang-tidy checks miss.

But: they are all looking for leaks of registered resources, not for invariant violations. The Account::transfer_to bug from chapter 1 — where money disappears because the throw happens between two writes — would not be caught by any of these. There is no leak; there is no resource acquired and not released. The bug is invariant-shaped, and the tool has no specification of the invariant.

Concurrency-specific tools

ThreadSanitizer (TSan) catches data races, which is a different class of bug but interacts with exception safety. ThreadSanitizer will tell you that two threads accessed the same memory without synchronization, including when one of those accesses was on an exception path. This is genuinely useful: a common exception-safety bug in concurrent code is “I released the lock before mutating something,” which ThreadSanitizer can catch as a race.

AddressSanitizer (ASan) catches use-after-free, which is the dominant manifestation of “destructor of object X freed memory while object Y was still using it.” Exception-safety bugs that involve dangling pointers after partial cleanup show up here.

Neither of these is exception-safety-specific, but they catch downstream effects of exception-safety bugs, which is half a loaf and still useful.

What exists in other languages

Java

SpotBugs (formerly FindBugs) has checks for RV_RETURN_VALUE_IGNORED_BAD_PRACTICE (ignoring the return value of a method that might fail) and OBL_UNSATISFIED_OBLIGATION (a Closeable resource not closed on all paths). The latter is a genuine exception-safety check for resource leaks. It is also widely deployed: most large Java code bases have some exposure to it, either via SpotBugs directly or via tools that wrap it (SonarQube, etc.).

ErrorProne (Google’s checker) has checks for MustBeClosedChecker and similar. The Java ecosystem’s ability to express ownership at the type level is limited, so these checks rely on annotations and convention rather than on type-level enforcement.

Rust

The Rust compiler enforces exception safety, in a narrow sense, automatically. The borrow checker ensures that if a function panics, any references it took are still valid (because they’re either copied or moved-out, and the move is irreversible). Drop runs deterministically, including during unwinding, so resources are cleaned up.

What the compiler does not enforce is that invariants are restored after a panic. If you have a struct with two fields that must be kept in sync, and you panic between updating one and updating the other, the compiler does not notice. The discipline of “don’t leave the struct invalid” is on the programmer.

The mutex-poisoning mechanism is the closest Rust gets to a runtime check: it forces callers to acknowledge the possibility of inconsistent state.

cargo-audit, clippy, and various lints catch simpler cases. None of them check for the strong guarantee directly.

Go

golangci-lint’s default checks include errcheck (you ignored an error return) and various nilness analyses. Go’s language-level argument is that exception safety is replaced by “errors are values” and explicit err != nil checks, so the type system is supposed to catch propagation issues. The errcheck lint enforces that you actually check.

What this misses, of course, is the structural problem from chapter 1: even if you check every error, the partial-mutation problem is the same. Go’s tooling does not catch this. The community’s response, when prompted, is usually that the Go style guide implies a discipline that handles this — return early on errors, don’t mutate before validating — which is true in well-written Go and not enforced anywhere.

Solidity

The smart-contract space is the place where exception-safety-equivalent tooling is most developed:

  • Slither (Trail of Bits): static analyzer that catches reentrancy patterns, uninitialized state, dangerous external calls. Default-on in many CI pipelines.
  • MythX: SaaS product, multiple analyzers, including symbolic execution.
  • Securify (ETH Zurich): automated formal verification for a subset of properties.
  • Manticore: dynamic symbolic execution.
  • Echidna: property-based fuzzing for smart contracts.

The Solidity world’s tooling is — and I want to be clear about this — better than the C++ world’s tooling for the equivalent problem. The reason is incentive. A reentrancy bug in Solidity costs eight figures, payable in cryptocurrency; an exception-safety bug in C++ costs a production incident. The market has paid for tooling proportional to the cost.

If you take only one thing from this chapter, take this: the smart-contract world has shown that tooling for the structural pattern is possible to build. The C++ world’s tooling is bad not because the problem is fundamentally harder, but because we did not pay for better.

Property-based testing as a partial substitute

If static analysis can’t tell you which guarantee a function provides, dynamic checking can sometimes substitute. The technique:

  1. Wrap throwing operations to probabilistically throw at every call site. This converts the question “what does the function do under throw?” to a runtime check.
  2. Run the function under such conditions, observe the resulting state, and check the invariants.

This is approximately what Boost.Test’s BOOST_CHECK_NO_THROW, etc., are gesturing at, but the more interesting application is property-based testing: define a precondition, run the operation with throws injected, check the postcondition.

// pseudo-code
property("transfer_to is exception-safe", [](Account a, Account b, int amt) {
    Account a_orig = a, b_orig = b;   // a strong-guarantee variant would also
                                      // check a == a_orig && b == b_orig on throw
    int total_orig = a.balance() + b.balance();
    try {
        a.transfer_to(b, amt);  // some calls are randomly throwing
    } catch (...) {}
    int total_now = a.balance() + b.balance();
    return total_now == total_orig;  // money conserved
});

If the test ever fails, you have an exception-safety bug. The test will not find every bug — the throw-injection is probabilistic, and some bugs require very specific throw locations — but in practice, run for a few thousand iterations, it finds a lot.
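
The injection wrapper itself can be tiny. A sketch (the name, default probability, and exception type are all arbitrary choices):

#include <random>
#include <stdexcept>

// In test builds, route operations that are allowed to throw through
// this; every call site becomes a potential throw site the test explores.
inline void maybe_throw(double p = 0.1) {
    static std::mt19937 rng{std::random_device{}()};
    std::bernoulli_distribution fail(p);
    if (fail(rng)) throw std::runtime_error("injected failure");
}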

Hypothesis (Python), QuickCheck (Haskell), proptest (Rust), Hedgehog (multiple languages), rapidcheck (C++) all support this style. The discipline of writing the property — of explicitly stating the invariant — is itself worthwhile, even if the tool catches nothing. The act of writing “money is conserved” forces you to think about what your invariants are, which is exactly what the function’s exception safety hinges on.

Fuzzing approaches

Fuzzing is property-based testing’s blunter cousin: throw random inputs at the function and watch for crashes. Modern fuzzers (libFuzzer, AFL++, honggfuzz) are coverage-guided, which makes them effective at exploring code paths that simple random testing would miss.

For exception safety specifically, coverage-guided fuzzing combined with throw injection is one of the most effective tools available. The fuzzer explores reachable code; the throw injection converts each reached call into a potential throw site; assertions in the test code check the invariants.
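
A sketch of that combination as a libFuzzer harness (build with clang++ -fsanitize=fuzzer,address; append is the sketch from chapter 9’s allocation-failure section, declared here):

#include <cstddef>
#include <cstdint>
#include <vector>

// maintains the invariant keys.size() == values.size()
void append(std::vector<int>& keys, std::vector<int>& values, int k, int v);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    std::vector<int> keys, values;
    for (size_t i = 0; i + 1 < size; i += 2) {
        try {
            append(keys, values, data[i], data[i + 1]);
        } catch (...) {
            // an exception is acceptable; a broken invariant is not
        }
        if (keys.size() != values.size()) __builtin_trap();   // invariant check
    }
    return 0;
}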

The libFuzzer + ASan + TSan + UBSan stack — run continuously at scale by Google’s OSS-Fuzz — catches a lot of exception-safety-adjacent bugs, even though none of the components is specifically about exception safety. The bugs manifest as use-after-free (cleanup partially ran) or as race conditions (locks released early) or as failed assertions (invariants violated). The fuzzer doesn’t know it’s looking for exception-safety bugs; it just finds them.

This is, in the absence of better static analysis, the most practical tool a working engineer can apply: fuzz with sanitizers, and make sure your test suite has assertions about invariants. The combination catches things human review misses.

Why the tooling is bad

A few reasons, in roughly increasing order of fundamental:

  1. The market has not paid for it. As above. The cost of an exception-safety bug is diffuse and hard to attribute. The cost of a reentrancy bug in a smart contract is concentrated and easy to attribute. The market has paid for tooling proportional to the attributable cost.

  2. The specification problem. Static analysis for exception safety needs invariant specifications, and we do not have a standard way to write them in mainstream languages. Languages with explicit specification (Eiffel, SPARK, Dafny) can do better; languages without (everything you actually use) cannot, beyond the syntactic check.

  3. The whole-program problem. Exception safety is a transitive property, propagating up call stacks across module boundaries. Effective analysis requires cross-module knowledge, which compilers and linkers do not, by default, share.

  4. The genuine difficulty of the analysis. Even given specifications and whole-program access, the analysis is exponential in the number of throw sites. Approximations exist; they have false positives, which programmers reject; or they have false negatives, which makes them useless.

The honest summary is that for the foreseeable future, exception safety is mostly maintained by discipline and code review, with tooling helping at the margins. This is unsatisfying. It is also true.

What to actually use, in practice

If you have a budget of one tool to deploy in your codebase, deploy a sanitizer-augmented fuzzer. Property-based tests are second. Static analyzers (clang-tidy, ErrorProne, etc.) are the easy default that catches a lot of low-hanging fruit. Code review with a checklist (chapter 12) is the universal floor.

If you are working in Solidity, deploy Slither. Run it in CI. The cost is trivial; the bugs it catches are not.

If you are working in Rust, you are living in the language with the strongest static guarantees on exception-safety-adjacent issues already. Lean on the borrow checker, take mutex poisoning seriously, and write code that handles Result deliberately.

If you are working in Common Lisp, your compiler is unlikely to help you, but your runtime debugger is the best in the industry. Use it.

The next chapter walks through real bugs that real engineers shipped, and what each of them came down to.

Further reading

  • The art of clang-tidy: clang-tidy’s check list is at https://clang.llvm.org/extra/clang-tidy/checks/list.html. Read the bugprone-* and cert-* sections.
  • John Regehr, “A Guide to Undefined Behavior in C and C++,” Embedded in Academia (blog.regehr.org), 2010. Tangential, but the same epistemic situation: tools catch some, miss most.
  • AFL++ documentation, on integration with sanitizers. https://aflplus.plus/
  • “Hypothesis: Test faster, fix more,” David MacIver. The Hypothesis documentation has a clearer explanation of property-based testing than any other source I know.
  • Trail of Bits, “Building secure smart contracts,” https://github.com/crytic/building-secure-contracts. Especially the Slither rule descriptions.

A Field Guide to Failure

This chapter is a catalog. Real bugs, in real software, that came down to exception-safety violations. Where I can link to a public post-mortem, I do; where the bug was reported in a bug tracker or paper, I cite that. The point is to make the abstractions of the previous chapters concrete: each of these is what the failure mode looks like when it ships.

I have ordered them roughly by impact, starting with the most famous single instance.

The DAO (June 2016): ~$50M

Already discussed at length in chapter 8. The summary in field-guide form:

  • Class: Reentrancy / partial state update across opaque call.
  • Mechanism: A function transferred Ether to an attacker-controlled address before updating the attacker’s recorded balance. The attacker’s address was a contract that, on receiving Ether, called back into the function, observing the un-updated balance and withdrawing again. Recursively.
  • Fix shape: Checks-effects-interactions ordering. Update the balance before the external call.
  • Cost: ~3.6 million ETH, ~$50M at the time, ~$11B at 2024 prices. Resolved by hard-forking the Ethereum chain to reverse the transfer, which itself was the most divisive event in cryptocurrency history.
  • Why it shipped: The pattern of “send funds, then update accounting” was widespread; the contract had been audited and the auditors did not flag this pattern. The reentrancy attack was known to specialists but not widely understood among Solidity developers in mid-2016.
  • Reading: https://hackingdistributed.com/2016/06/18/analysis-of-the-dao-exploit/

Parity Multisig Wallet Self-Destruct (November 2017): ~$280M

A library contract used by many wallets left its initialization function unprotected. An attacker called it to take ownership of the library, then called its kill function, which selfdestruct’d the library. All wallets that depended on it had their delegate-call targets removed; their funds became permanently inaccessible.

  • Class: Logical-invariant violation due to insufficient access control around a state-destroying operation.
  • Mechanism: The destroyed library contract was a shared dependency. Destroying it left all depending contracts in a state where their core functionality (signing transactions) failed. This is exception safety in the cross-system sense: an operation on one contract violated invariants of every contract that depended on it.
  • Fix shape: Access modifiers, plus the structural lesson that state-destroying operations need stronger preconditions than state-modifying ones. In exception-safety language: rolling back a selfdestruct is impossible; the operation has no inverse; therefore it cannot participate in any saga that might fail. The lesson the smart-contract community drew was that irreversible, destructive operations need their own access discipline.
  • Cost: ~514,000 ETH frozen, ~$280M at the time, much more now.
  • Reading: https://www.parity.io/blog/security-alert-2/

The Therac-25 (1985–1987): six massive overdoses, at least three deaths

The Therac-25 was a radiation therapy machine whose software contained, among many bugs, a race condition between operator input and beam control state. A specific quick edit by an operator could leave the beam in a high-current configuration intended for a different therapy mode, resulting in massive radiation overdoses.

  • Class: Concurrent partial-state update; race between two state machines.
  • Mechanism: The operator’s input was processed while the beam controller was still mid-transition between modes. The resulting state was inconsistent: the software had accepted the operator’s quick edit to electron mode, while the beam current was still set at the far higher level used for X-ray mode — with the X-ray target that would have attenuated the beam no longer in place. The basic guarantee was violated at the system level: a partial state update was visible to a downstream component.
  • Fix shape: Strong-guarantee state transitions enforced by interlocks. Don’t let operator input transition modes that are mid-transition. The pattern is identical to “don’t release the lock before completing the state update.”
  • Cost: At least three known deaths from radiation overdose across six documented accidents; additional serious injuries.
  • Reading: Nancy Leveson, Safeware, 1995; Leveson and Turner, “An Investigation of the Therac-25 Accidents,” IEEE Computer, July 1993.

The Therac-25 predates much of the formal exception-safety literature. The post-mortem identifies exactly the mistake — a state transition was not atomic with respect to operator input — but does not use the vocabulary, because the vocabulary did not yet exist. This is the field at its worst: lessons learned on machines that kill people, in 1986, that we are still teaching with a different vocabulary in 2024.

The Ariane 5 Flight 501 (June 1996): ~$370M

A floating-point-to-integer conversion in the inertial reference system threw an overflow exception during the Ariane 5’s first launch. The exception, having no handler in that code path, propagated up and triggered the system’s diagnostic-mode response — which, in flight, was to send diagnostic data instead of guidance data to the main computers. The main computers interpreted the data as guidance, commanded the rocket into an aerodynamic regime it could not survive, and the vehicle self-destructed 39 seconds after launch.

  • Class: Unhandled exception with no rollback; the diagnostic mode itself was the partial-state behavior, since it sent data that downstream code interpreted as guidance.
  • Mechanism: The inertial reference software was reused from the Ariane 4. The conversion to 16-bit integer was safe within Ariane 4’s flight envelope; Ariane 5 had higher horizontal velocities, which overflowed. The exception handler had been written assuming “if this fires, we’re on the ground; send diagnostics to the panel.” In flight, the panel was the main computer’s data input.
  • Fix shape: Strong guarantee at the system boundary: if the conversion fails, do nothing — do not send a different kind of data. Specifically, the fix would have been a no-throw fallback (saturate the conversion at the 16-bit maximum and log the saturation), preserving the system invariant that “the data on the bus is always guidance data.”
  • Cost: ~$370M of payload destroyed; among the most expensive failures of an unmanned rocket to that point.
  • Reading: J.L. Lions et al., “Ariane 501 Inquiry Board Report,” July 1996. Available online from ESA. https://www.di.unito.it/~damiani/ariane5rep.html

The Ariane 5 report is required reading. It is short, calm, and devastating. The closing recommendations are textbook exception-safety advice in different language: validate inputs at boundaries, do not silently substitute degraded behavior, exception handlers must understand the system context they will run in.

Knight Capital (August 2012): $440M in 45 minutes

Knight Capital deployed an updated trading system. The deployment partially completed: seven of eight servers were updated; one was not. The new code repurposed a flag that the old code interpreted as “execute parent orders” (test mode); when the old code received production orders flagged this way, it executed them as live, repeatedly.

  • Class: Cross-version state inconsistency; partial deployment with no rollback.
  • Mechanism: The seven new-code servers and the one old-code server interpreted the same wire-level flag differently. Production traffic, distributed across all eight, hit the old-code server and was treated as a flood of test orders; the old “test” code path turned out to send them to the live market.
  • Fix shape: Atomic deployment (either all servers update or none); explicit retirement of repurposed flags; integration tests that verify behavior across mixed versions during deployment. In exception-safety language: a saga step (server update) may fail or partially apply; the system must be designed so that the partial-application state is benign.
  • Cost: $440M in 45 minutes. The firm did not survive as an independent company: it required an emergency ~$400M rescue investment within the week and was acquired by a competitor months later.
  • Reading: SEC release No. 70694, October 16, 2013. https://www.sec.gov/litigation/admin/2013/34-70694.pdf

The Knight Capital incident is, structurally, a saga compensation failure: the deployment “saga” had no compensation defined for the partial-deployment state. This is what chapter 9’s saga discussion is about. In a financial context, with high-throughput markets, the consequence of an uncompensated partial state ran to roughly ten million dollars a minute.

The Heartbleed bug (April 2014): credentials of unknown numbers of users

Different shape: not an exception-safety bug per se, but adjacent. OpenSSL’s heartbeat handler trusted a length field in the request without bounds-checking, leading to a read-out-of-bounds that could leak server memory.

I include it because the fix path is exception-safety-adjacent. The proper handling of a malformed request — one whose length field is implausible — is to not respond at all (rather than respond with whatever happens to be in memory). The basic guarantee at this layer would say “if input validation fails, no observable side effect,” which is what the fix achieved: validate the length before reading.

  • Class: Input-validation failure; partial response sent on bad input.
  • Mechanism: Bounds-check missing; the response code happily wrote the requested length of bytes from a smaller buffer, including everything after the buffer.
  • Fix shape: Validate inputs at the trust boundary; on validation failure, no response (or a fixed-shape error response). The strong guarantee at the protocol level: malformed input produces no observable side effect that depends on memory contents.
  • Cost: Unknown; estimates of compromised credentials run into the millions.
  • Reading: https://heartbleed.com/

The Cloudflare 2017 cache leak (“Cloudbleed”)

Cloudflare’s HTML rewriter, on certain malformed HTML, ran past a buffer’s end and emitted memory contents into responses. Some of those responses were cached, by Cloudflare and by Google’s crawlers, including credentials and session tokens.

  • Class: Like Heartbleed, an input-validation/buffer-handling bug. Adjacent to exception safety: the failure mode is “partial output that exposes internal state,” which is the basic-guarantee failure at the data-flow level.
  • Mechanism: A generated parser’s end-of-buffer check tested for pointer equality (==) where >= was needed; a bug let the pointer jump past the end of the buffer, so the equality check never fired and the parser read beyond it.
  • Fix shape: Mostly a code-correctness fix, but the meta-lesson — input handlers should fail closed, not produce degraded output — is exception-safety advice.
  • Cost: Memory contents from millions of requests cached by third parties; required Google and others to purge caches.
  • Reading: https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/

A short list of exception-safety bugs in well-known C++ libraries

Without going to the same depth, a few smaller bugs that are illustrative:

  • std::vector::insert with a throwing copy constructor (pre-C++11): Some implementations would leave the vector with the basic guarantee where the standard had mistakenly suggested the strong. Resolved with the C++11 specification clarifications.
  • Boost.Asio handler exception leaks: Various Boost.Asio versions had paths where an exception thrown from a completion handler could leak the allocator state for the operation. Patched in 1.50-ish.
  • Chromium’s base::SequencedTaskRunner: A historical bug where a task posted from within another task’s execution could observe the runner mid-shutdown, with locks held in a state the task was not designed for. The fix involved making shutdown a more carefully staged process. The bug shape is “concurrent shutdown is itself an exception-safety problem at the system level.”

These are not eight-figure incidents. They are a daily reality of programming in unsafe languages.

Recurring themes

Reading down the list, a few patterns recur:

  1. The bug is rarely in the throw site. Almost every entry is a bug at the call site of a throwing operation, where the caller did not arrange for state to be consistent if control transferred. The throwing operation itself was working correctly.

  2. The fix is structural, not local. “Add a try/catch” is rarely the right answer. The right answers involve restructuring code so the partial state is not visible, or making operations atomic, or designing the system so partial-application states are benign.

  3. The vocabulary is missing. Most of the post-mortems do not use the words “basic guarantee” or “strong guarantee.” They describe the bugs accurately but with non-standard vocabulary, which makes pattern recognition harder. The Therac-25 report uses the language of “interlocks.” The Ariane 5 report uses the language of “exception handling and reuse.” These are correct descriptions; the field has not yet aligned on a shared vocabulary.

  4. The cost is borne by people other than the programmers. Each of the high-impact incidents on this list cost lives, money, or reputation, paid by users and operators, not by the engineers who wrote the buggy code. This is a fact about the political economy of software, not an excuse, and it is part of why the industry has moved slowly on this. The negative externalities are hard to internalize.

  5. The same bugs keep happening. Reentrancy was rediscovered in smart contracts ten years after exception safety was formalized in C++. Saga compensation failures recur as each new generation of distributed-systems engineers relearns the lessons. The Therac-25 race conditions are conceptually identical to bugs we still see in industrial control systems. We are not learning.

What to do with a field guide

Use it as input to your code review. When you see a function that mutates state, ask yourself: which of these incidents could happen in this code path, scaled down? Almost any function in any non-trivial codebase has the shape to cause a small version of one of these. The discipline of exception safety is, in part, the discipline of recognizing those shapes before they ship.

The final chapter is the practical guide: what to actually do, in working code, given that you do not have time to formally verify everything.

Further reading

  • The post-mortem links above. Read at least one in full; they are short and humbling.
  • Software Engineering Disasters — there is no single book by this name, but Nancy Leveson’s Engineering a Safer World (2011) is the closest thing, and is excellent.
  • The Risks Forum (comp.risks) archive: http://catless.ncl.ac.uk/Risks. Decades of computing-related-failure case studies, many of which are exception-safety-shaped.
  • “Lessons learned from the SoftBank outage,” “AWS S3 2017 outage,” any major cloud-provider post-mortem — all read as variations on the patterns in this chapter.

What To Actually Do

You have read eleven chapters about a problem that the industry mostly does not think about, in code bases mostly not written with exception safety in mind, in languages mostly without good tooling for it. You presumably cannot rewrite your codebase in Rust tomorrow. You may not even be able to convince your team to read this book. What should you actually do?

This chapter is the practical answer: a short list of disciplines, ordered from “highest impact and lowest cost” to “high impact but harder to deploy.” If you do the first three, you have already moved your codebase ahead of 90% of production code.

1. Internalize the vocabulary

This is the cheapest thing on the list and the highest-leverage. You don’t have to write any code. You don’t have to deploy any tooling. You just have to be able to say which guarantee a function provides when you read or write it, with the words “no-throw,” “basic,” “strong,” “no guarantee.”

The discipline shows up in code review as a question: what does this function leave behind if it throws partway through? Once you train yourself to ask this on every PR, you find bugs.

void Cache::evict_lru() {
    auto victim = lru_list_.back();   // (1)
    storage_.erase(victim.key);       // (2) might throw
    lru_list_.pop_back();             // (3)
    --size_;                          // (4)
}

What does this function leave behind if (2) throws? victim is constructed (a copy of the back element). storage_ may have partially mutated. lru_list_ is unchanged. size_ is unchanged.

If storage_.erase provides the strong guarantee (it does, in std::unordered_map), then on its throw, nothing changed in storage_. The cache is consistent — same as before, no eviction happened. This function provides the strong guarantee. Good.

If storage_.erase provided only the basic guarantee, the cache could be left with storage_ partially modified, but lru_list_ and size_ unchanged. The invariant “every key in lru_list_ has a corresponding entry in storage_” might be violated. This function would provide only the basic guarantee. Then the question is: is that good enough? Sometimes yes. Sometimes no.

The vocabulary forces you to ask. The answer is what you wanted; the asking is what was missing.

2. Order operations: throw first, mutate last

The single most useful pattern in this entire book is: if you have a sequence of operations where some can throw and some mutate observable state, put all the throwing operations first.

// BAD
balance_ -= amount;
notify_audit_log(amount);    // might throw
target.balance_ += amount;

// GOOD
notify_audit_log(amount);    // might throw — happens first
balance_ -= amount;
target.balance_ += amount;

Two things to verify:

  1. The throwing operation does not depend on the post-mutation state. (If notify_audit_log needs to know the new balance, you can’t reorder; you’ll need the side-copy approach sketched below.)
  2. The post-throw mutations do not throw. (If they do, the discipline is recursive: those throwing operations also need to come first, which may not be possible.)

When this discipline applies, it gives you the strong guarantee for free. When it doesn’t, you fall back to scope guards or side-copy patterns. This is the 80% rule: 80% of functions can be made strong-guarantee with simple reordering.
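
A minimal sketch of that side-copy fallback, for the case where the audit log needs the post-mutation balance. Account, Money, and the two-argument notify_audit_log are illustrative names, and Money assignment is assumed no-throw:

void Account::transfer(Account& target, Money amount) {
    Money new_balance = balance_ - amount;    // compute the result; no mutation yet
    notify_audit_log(amount, new_balance);    // might throw; nothing mutated, so safe
    balance_ = new_balance;                   // no-throw commits (assuming Money
    target.balance_ += amount;                // assignment cannot throw)
}

The structure is the same as the reorder: all throwing work precedes the first observable mutation.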

3. Use RAII (or its language’s equivalent) without exceptions

Every resource — memory, file handle, lock, network connection, database transaction — should be wrapped in an object whose destructor releases it. Or, in non-RAII languages, in a with / using / try-with-resources / defer block.

This is not optional. Code that does not use RAII for resource cleanup is broken on exception paths, full stop. There is no excuse for raw new/delete, raw lock/unlock, raw open/close in any modern language.

In code review:

  • Any raw pointer that owns memory: change to unique_ptr / shared_ptr / equivalent.
  • Any explicit unlock after a lock: change to lock_guard / scoped_lock / equivalent.
  • Any explicit close after open: change to with / using / try-with-resources.
  • Any explicit cleanup after a step: scope guard / defer.

This catches the resource-leak class of bug entirely. It does not catch the invariant-violation class, which is what the rest of the disciplines address.
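
For the last bullet, the scope guard itself is small enough to sketch. A minimal C++17 version, not library-grade (see the ScopeGuard paper in the bibliography for the real thing); note that the cleanup callable must be no-throw, per discipline 4 below:

#include <utility>

template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : f_(std::move(f)) {}
    ~ScopeGuard() { if (armed_) f_(); }          // f_ must not throw
    void dismiss() noexcept { armed_ = false; }  // call once the commit succeeds
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
private:
    F f_;
    bool armed_ = true;
};

// Usage: undo step 1 unless the whole sequence completes.
//   apply_step1();
//   ScopeGuard undo([&] { revert_step1(); });
//   step2_might_throw();
//   undo.dismiss();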

4. Make destructors no-throw

Or, in languages without destructors, make scope-exit cleanups no-throw.

A destructor that throws during stack unwinding from another exception terminates the program. If your destructor does anything beyond pointer cleanup — releases a lock, closes a file, sends a message — it might throw. Wrap in try {} catch (...) {}, log the failure, swallow.

~RemoteTransaction() {
    try {
        if (!committed_) rollback();
    } catch (...) {
        local_log_.warn("rollback during destruction failed");
    }
}

This is genuinely ugly, and you are right to wince at it. It is necessary. The alternative is std::terminate in production, with no recovery, in a code path that may rarely fire and won’t be exercised in tests.

In Rust, Drop::drop returns nothing, so there is no way to propagate an error with ?, and a panic from Drop during an unwind aborts the process; the equivalent discipline is to log and continue.

5. Make swap no-throw, and use it

For any non-trivial type, define a swap member function that is noexcept. This is the building block for copy-and-swap, and more generally for any pattern where you need to commit a side-built result into place. Standard-library types already do this; user-defined types often do not.

class Widget {
public:
    void swap(Widget& other) noexcept {
        using std::swap;
        swap(p_, other.p_);
        swap(state_, other.state_);
    }
    // ... rest of Widget; p_ and state_ are its data members
};

void swap(Widget& a, Widget& b) noexcept { a.swap(b); }

The cost is two functions and a discipline. The benefit is that you have a no-throw primitive available for any of the strong-guarantee patterns from chapter 4.
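
One concrete payoff, assuming Widget is copy-constructible and declares this operator: copy-and-swap assignment falls out directly, and if swap is noexcept it provides the strong guarantee by construction.

Widget& Widget::operator=(const Widget& other) {
    Widget tmp(other);   // all throwing work (the copy) happens on a side object
    swap(tmp);           // no-throw commit
    return *this;        // tmp's destructor releases the old state
}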

6. For each public mutating function, document the guarantee

In the function’s documentation comment (or in a code-review checklist applied to every public function), declare:

  • What invariants the function preserves.
  • What guarantee it provides under throw.
  • What the caller must do if the function throws.

Example:

/// Insert a new element into the cache.
///
/// Strong guarantee: if this function throws, the cache is unchanged.
/// Throws:
///   - std::bad_alloc if memory allocation fails
///   - InvalidKey if `key` does not satisfy `is_valid_key(key)`
void Cache::insert(Key key, Value value);

This is contract, not commentary. If you change the function later in a way that downgrades the guarantee, the documentation is wrong, which is a visible signal in review. Without the documentation, the guarantee is lost in the noise.

The convention in the C++ Standard Library, since C++11, is to document this consistently. The convention in your code base should be the same.

7. Test exception paths explicitly

For any function with non-trivial throw behavior, write tests that exercise the throw path. Inject a controlled throw at a known location; verify the postcondition holds; verify the function’s invariants are preserved.

TEST(Cache, RollbackOnInsertFailure) {
    Cache c;
    c.insert("a", "1");
    c.insert("b", "2");

    // make next allocation fail
    set_alloc_failure_after(0);
    EXPECT_THROW(c.insert("c", "3"), std::bad_alloc);
    set_alloc_failure_after(-1);

    // cache must be in pre-insert state
    EXPECT_EQ(c.size(), 2);
    EXPECT_EQ(c.get("a"), "1");
    EXPECT_EQ(c.get("b"), "2");
    EXPECT_FALSE(c.contains("c"));
}

The infrastructure for “make the next allocation fail” is small (a thread-local counter that operator new checks). The result is that exception paths are exercised, not merely hoped correct. Most exception-safety bugs ship because nobody ever ran the throwing path; exercising it catches them.
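
A minimal sketch of that hook, matching the set_alloc_failure_after calls in the test above (a production harness would also cover the array and aligned forms of operator new):

#include <cstddef>
#include <cstdlib>
#include <new>

static thread_local int g_fail_after = -1;        // -1 means "never fail"

void set_alloc_failure_after(int n) { g_fail_after = n; }

void* operator new(std::size_t size) {
    if (g_fail_after >= 0 && g_fail_after-- == 0)
        throw std::bad_alloc();                   // injected failure
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();                       // genuine exhaustion
}

void operator delete(void* p) noexcept { std::free(p); }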

Property-based testing with throw injection (chapter 10) generalizes this: instead of testing one specific throw point, test all of them. The combination is powerful.
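
A sketch of the exhaustive version, reusing the hook above: fail at allocation 0, then 1, then 2, and so on, until the operation completes without throwing, checking the postcondition at every failure point:

TEST(Cache, StrongGuaranteeAtEveryAllocation) {
    for (int fail_at = 0; ; ++fail_at) {
        Cache c;
        c.insert("a", "1");

        set_alloc_failure_after(fail_at);
        bool threw = false;
        try { c.insert("b", "2"); } catch (const std::bad_alloc&) { threw = true; }
        set_alloc_failure_after(-1);

        if (!threw) break;             // insert succeeded: every point was tested

        EXPECT_EQ(c.size(), 1);        // strong guarantee: cache unchanged on throw
        EXPECT_FALSE(c.contains("b"));
    }
}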

8. In concurrent code, do throwing work outside critical sections

The single concurrency-specific discipline that buys the most safety:

void Cache::insert(const Key& k, const Value& v) {
    auto entry = make_entry(k, v);   // throwing work, no lock
    std::lock_guard lock(mu_);
    storage_[k] = std::move(entry);  // caution: operator[] may still allocate;
                                     // reserve capacity up front (or splice in a
                                     // pre-built node) to make this truly no-throw
    metadata_.note_insert(k);        // no-throw
}

The principle is to push throwing operations ahead of the lock acquisition, and to ensure that everything inside the critical section is no-throw, or at most basic-guarantee with explicit rollback. The result is that the protected state is consistent at every point where the lock is released.

When you cannot do this — read-modify-write that needs the protected state — use atomic compare-exchange or accept the basic guarantee with explicit rollback.
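
The compare-exchange shape is “build fully, publish atomically.” A classic Treiber-stack push, sketched with hypothetical Node and head_ names:

#include <atomic>

struct Node { Value value; Node* next = nullptr; };   // Value: whatever the payload is
std::atomic<Node*> head_{nullptr};

void push(Node* n) {                  // n is fully constructed before publication
    n->next = head_.load(std::memory_order_relaxed);
    while (!head_.compare_exchange_weak(n->next, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
        // a failed CAS reloads the current head into n->next; just retry
    }
}

All the throwing work happens before the single atomic publish, which cannot fail halfway.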

9. At system boundaries, prefer idempotency over transactions

For any operation that crosses a network or system boundary, design the operation so it can be retried safely. Two-phase commit at the in-process level is achievable; two-phase commit across systems is hard, expensive, and frequently does not actually work.

The technique:

  • Each operation has an idempotency key.
  • The receiving system records the key on first receipt and ignores duplicates.
  • The sender retries on failure.

This converts “the strong guarantee across the network” into “at-least-once delivery with idempotency-based deduplication,” which is achievable. Stripe’s API famously does this; many AWS APIs accept client-supplied idempotency tokens (EC2’s ClientToken, for example); the pattern is widespread because the alternative is “distributed transactions,” which usually don’t work.

Not exception safety in the C++ sense, but exception safety in the system sense: the fact that an operation can be retried without compounding effects is what makes the higher-level “either it happened or it didn’t” actually true.
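
A receiver-side sketch of the dedup step (all names hypothetical; in a real system the recorded key and the operation’s effect must commit in the same transaction, or a crash between the two reintroduces the problem):

#include <string>
#include <unordered_map>

struct Request  { std::string idempotency_key; /* ... payload ... */ };
struct Response { /* ... result ... */ };

Response apply(const Request& req);   // the underlying operation

std::unordered_map<std::string, Response> completed_;   // durable in production

Response handle(const Request& req) {
    if (auto it = completed_.find(req.idempotency_key); it != completed_.end())
        return it->second;            // duplicate: replay the recorded answer
    Response r = apply(req);          // perform the actual effect
    completed_.emplace(req.idempotency_key, r);
    return r;
}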

10. Adopt at least one tool

Pick one and put it in CI:

  • C++: ASan + TSan + UBSan + libFuzzer for the main code; clang-tidy for style and obvious bugs.
  • Java: SpotBugs and Error Prone. Run them.
  • Rust: cargo clippy (already there). Take warnings seriously.
  • Solidity: Slither. Non-negotiable.
  • Python: mypy for type errors; property-based testing with Hypothesis for invariant violations.
  • Go: golangci-lint with errcheck and ineffassign.

These tools do not catch “the strong guarantee was violated”; they catch the downstream effects of exception-safety bugs (use-after-free, double-free, leaked resources). Catching those gets you most of the way, even though the abstraction is wrong.

11. Code review with a checklist

For PRs that touch mutating code, the reviewer asks:

  1. Does this function mutate state across multiple steps?
  2. Are any of those steps throwing operations?
  3. Are the throwing operations before or after the mutations?
  4. If after, is there a rollback or is the function only providing the basic guarantee?
  5. Is the basic guarantee sufficient for this function’s contract?

If the reviewer cannot answer these questions from reading the code, the PR needs a comment that documents the guarantee, or the code needs restructuring to make it obvious.

This is procedural, not technical. It works if you do it consistently and does not work if you don’t.

12. Recognize the same problem in disguise

Every time you encounter a “weird race condition,” a “cache consistency issue,” a “deployment failure mode,” or a “multi-step API call that left things in a strange state,” ask: is this exception safety with a different mechanism? It usually is.

The fix patterns are the same:

  • Order mutations after throwing/transferring operations.
  • Use a no-throw primitive for the commit step.
  • Roll back on failure with a scope guard or compensating action.
  • Make operations idempotent so retries are safe.
  • Make the partial-state period invisible (locks, transactions, copy-on-write).

Once you recognize the shape, the abstraction transfers across domains. The effort you put into thinking carefully about exception safety in one language pays off the next time you encounter a saga, a smart contract, a signal handler, or a cancelable async future.

A two-sentence summary

Exception safety is the discipline of writing code that is correct even when control flow leaves your function unexpectedly. The mechanism that causes the unexpected exit doesn’t matter — throw, panic, abort, external call, lock release, signal, page fault, network failure — and the patterns that handle it don’t vary either: RAII for cleanup, two-phase commit for atomicity, and the strong guarantee where it counts.

If you do only the throw-first-mutate-last reordering, the consistent RAII, and the documentation of guarantees on public mutators, you will have done more for the correctness of your codebase than is done for 95% of the production code in the world.

A note on humility

I want to close with a note. After eleven chapters and twelve years of thinking about this problem, I am still occasionally surprised by it. A function I would have sworn was strong-guarantee turns out to provide only the basic, because of a noexcept annotation I forgot to read on a member type. A reentrant call I did not see, because I assumed a callback was synchronous and it wasn’t. A signal handler that interacts with stdio in a way nobody noticed in code review.

The problem is genuinely subtle. The patterns help; the patterns are not magic. The discipline is what gets you most of the way, and the discipline is itself fragile because you have to apply it to every function, every line, every change, forever, while also shipping code on time.

I do not fully grok this. You should not fully grok it after one read. The people who appear to fully grok it have, in my experience, had it bite them enough times to develop a defensive crouch toward all mutating code, which is a state I recommend cultivating.

Good luck.

Further reading

  • Herb Sutter, “Exception-Safe Class Design,” Parts 1–3, C/C++ Users Journal, 2002. The clearest practical synthesis of the patterns.
  • Bjarne Stroustrup, “C++ Core Guidelines,” section E (Error handling). https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-errors
  • Peter Seibel, Practical Common Lisp, especially chapters 19–20. For the alternative model.
  • Pat Helland, “Standing on Distributed Shoulders of Giants,” ACM Queue 2016. For the system-level perspective.

The book ends here. The discipline begins now.

Bibliography and Sources

A consolidated list of references cited throughout the book, organized by topic. Where a paper or book is freely available online, the link is included. Where it is in print, you’ll have to find it.

Foundational papers and books

  • David Abrahams, “Exception-Safety in Generic Components,” Generic Programming: Proceedings of a Dagstuhl Seminar, Springer 2000. The paper that codified the three-guarantee vocabulary.
  • Bjarne Stroustrup, The Design and Evolution of C++, Addison-Wesley 1994. Especially §16 on the history of exception specifications.
  • Bjarne Stroustrup, The C++ Programming Language, special 3rd ed., Addison-Wesley 2000. Appendix E, “Standard-Library Exception Safety.”
  • Herb Sutter, Exceptional C++, Addison-Wesley 1999. The practical companion to Abrahams’s theoretical work.
  • Herb Sutter, More Exceptional C++, Addison-Wesley 2001.
  • Andrei Alexandrescu, Modern C++ Design, Addison-Wesley 2001.
  • Andrei Alexandrescu and Petru Marginean, “Generic<Programming>: Change the Way You Write Exception-Safe Code — Forever,” C/C++ Users Journal, December 2000. The ScopeGuard paper.
  • Edsger Dijkstra, “Go To Statement Considered Harmful,” Communications of the ACM 11:3, March 1968.

Exception-handling internals

  • Itanium C++ ABI, “Exception Handling”: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
  • “Zero-cost exceptions” — see the GCC and Clang documentation on .eh_frame and .gcc_except_table.
  • Microsoft Visual C++ exception model documentation (varies by platform; x86 and x64 differ).

Common Lisp condition system

  • Peter Seibel, Practical Common Lisp, Apress 2005. Chapters 19–20 cover the condition system.

Concurrency

Smart contract reentrancy

Field-guide post-mortems

  • SEC Release No. 70694 (In the Matter of Knight Capital Americas LLC), October 16, 2013. https://www.sec.gov/litigation/admin/2013/34-70694.pdf
  • The Heartbleed bug: https://heartbleed.com/
  • Cloudflare, “Incident report on memory leak caused by Cloudflare parser bug,” February 2017. https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/

Distributed systems and sagas

  • Pat Helland, “Life beyond Distributed Transactions,” CIDR 2007.
  • Pat Helland, “Standing on Distributed Shoulders of Giants,” ACM Queue 14:2, 2016.
  • “Saga Pattern,” Microservices.io: https://microservices.io/patterns/data/saga.html
  • Hector Garcia-Molina and Kenneth Salem, “Sagas,” ACM SIGMOD 1987 — the original paper.

Tooling

Adjacent industry context

  • Anders Hejlsberg interview, “The Trouble with Checked Exceptions,” 2003: https://www.artima.com/articles/the-trouble-with-checked-exceptions
  • Joshua Bloch, Effective Java, 3rd ed., Addison-Wesley 2018. Items 69–77 (the exceptions chapter).
  • Steve Klabnik and Carol Nichols, The Rust Programming Language, chapter 9 (Error Handling).
  • Nancy Leveson, Engineering a Safer World, MIT Press 2011.
  • Linux Kernel Development, Robert Love, 3rd ed., Addison-Wesley 2010, chapter 7.

On the Risks Forum

The Risks Forum (comp.risks) is a continuous low-volume mailing list, archived at http://catless.ncl.ac.uk/Risks, that has been documenting computing-related failure modes since 1985. A great many of its entries are exception-safety bugs in disguise. Reading the archives chronologically is unsettling.

License

This work is dedicated to the public domain under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

To the extent possible under law, the authors have waived all copyright and related or neighboring rights to Exceptionally Unsafe. You may copy, modify, distribute, and use the work, including for commercial purposes, all without asking permission.

The full legal text is in the LICENSE file in the repository.

In plain English: take it. Fork it. Translate it. Quote it. Steal it. Improve it. Claim it as your own if you want to. The book exists to be useful, not to be owned.