Other Places This Problem Hides

Now that we have the shape of the problem clearly in mind — partial state mutation interrupted by an unexpected control transfer — this chapter is a tour of other places it hides under different names. Each of these is a real, distinct domain, with its own vocabulary and its own folklore, but each is, structurally, the exception-safety problem.

Signal handlers and async-signal-safety

A POSIX signal can interrupt a process at almost any instruction. The signal handler runs on the same thread, in the same address space, with the original code’s state in whatever condition it happened to be in.

This is exactly the exception-safety problem, with extra hostility. The interruption is at the instruction level rather than at function-call boundaries. The original code was not written with the signal handler in mind; the signal handler may not know what state the original code was in.

POSIX defines a list of async-signal-safe functions — those that may be safely called from a signal handler. The list is short: write, read, _exit, signal, sigaction, a handful of others. Conspicuously absent: malloc, free, printf, fprintf, anything that touches stdio, anything that allocates memory, anything that takes a lock. Why? Because the signal could have interrupted the original code in the middle of any of those operations. If the original code was halfway through malloc and held malloc’s internal lock, and the signal handler calls malloc, the handler deadlocks (or worse, depending on how malloc is implemented).

The structural correspondence:

Exception safety                               Async-signal-safety
Function might throw                           Signal might be delivered
Stack unwinds, partial state visible           Signal handler runs, partial state visible
Use no-throw operations in critical sections   Use only async-signal-safe operations in handlers
Two-phase commit: do throwing work first       Do everything in main code; handler only signals

The standard pattern for signal handlers in production C: do as little as possible in the handler. Set a flag (specifically, a volatile sig_atomic_t) and let the main code poll it. The handler commits nothing itself — it defers all real work to the main code, which can do it safely.

#include <signal.h>

extern volatile int running;    // loop condition, set elsewhere
void handle_signal(void);
void do_work(void);

volatile sig_atomic_t signal_received = 0;

void handler(int sig) {
    (void)sig;
    signal_received = 1;
    // that's it. Do nothing else.
}

int main(void) {
    signal(SIGINT, handler);
    while (running) {
        if (signal_received) {
            handle_signal();    // safe; we're in main code
            signal_received = 0;
        }
        do_work();
    }
}

signalfd (Linux), kqueue/EVFILT_SIGNAL (BSD), and pthread_sigmask plus a dedicated signal-handling thread are all variations on the same idea: get the signal out of the asynchronous interrupt context and into a context where you can do real work safely.
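Python's signal.set_wakeup_fd offers a portable stand-in for the same idea, and makes a compact sketch of it (POSIX-only as written, since it uses SIGUSR1 and a pipe):

```python
import os
import select
import signal

# Turn the signal into bytes on a file descriptor: the C-level handler
# writes the signal number into the pipe, and ordinary synchronous code
# reads it out whenever it is ready -- the signalfd idea in miniature.
r, w = os.pipe()
os.set_blocking(w, False)                     # required by set_wakeup_fd
signal.set_wakeup_fd(w)
signal.signal(signal.SIGUSR1, lambda signum, frame: None)  # handler must exist

os.kill(os.getpid(), signal.SIGUSR1)          # deliver a signal to ourselves

ready, _, _ = select.select([r], [], [], 5.0)
data = os.read(r, 16)                         # one byte per delivered signal
# data[0] == signal.SIGUSR1: the signal is now just data on an fd
```

The handler here does even less than the flag-setting pattern above; all the real work moves into code that can block, allocate, and take locks freely.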

Interrupt handlers in kernel code

Move down the stack to the kernel, and the same problem reappears with even harder constraints.

A hardware interrupt — disk I/O completion, timer tick, network card receive — runs the interrupt handler at high CPU priority, possibly with other interrupts disabled, possibly on a CPU stack that is not the kernel’s normal stack. The handler must:

  • Not block on locks held by the code it interrupted (deadlock).
  • Not allocate memory in ways that might block (memory allocators are themselves interruptible).
  • Not call most of the kernel’s services, because those services may have been mid-operation.
  • Run quickly, because the interrupted code is waiting and other interrupts are queued.

The Linux kernel’s pattern is a split handler: a small “top half” that runs in interrupt context and acknowledges the hardware, and a larger “bottom half” (variously: tasklets, softirqs, workqueues) that runs at lower priority and does the actual work. The same shape as the userspace signal handler pattern: do as little as possible at high-disruption priority, defer the rest to a context where the constraints are weaker.

The exception-safety language for this: the top half must provide the no-throw guarantee to everything below it. It cannot leave shared state inconsistent, cannot fail in ways that propagate, cannot release locks it didn’t take. The “no-throw” here is at the level of “no failure is permitted to escape,” because the kernel above has no way to handle one if it did.

If you have written a kernel driver and dealt with the constraints on irq handlers, the rules in the C++ exception-safety chapters of this book should feel familiar. They are the same rules, applied at a different system layer.

Hardware exceptions: page faults and divide-by-zero

A hardware exception — page fault, divide-by-zero, illegal instruction, alignment violation — fires synchronously when an instruction can’t complete. The hardware transfers control to the kernel’s exception handler.

This is literally a CPU-level GOTO to an unguessable destination, with no software intermediary. The instruction was halfway through some operation; the hardware preserves enough state to return to the next instruction (or retry the failing one); the kernel decides what to do.

The interesting case for this book is page faults. A page fault is not a bug — it is the mechanism by which demand paging, copy-on-write, mmap’d files, and stack growth are implemented. Almost every memory access in a userspace program could, in principle, fault. The kernel handles the fault by allocating a page and resuming.

But: the original instruction is in some intermediate state when the fault occurs. On most architectures, the CPU re-runs the instruction after the kernel handles the fault, with the page now mapped. The instruction’s effects so far are either fully rolled back by hardware (good architectures) or partially preserved (some bad cases on older or stranger hardware). Most modern architectures (x86-64, ARM64) make memory accesses precise — the instruction is either fully complete or fully not — but this is itself a deliberate design property, paid for in microarchitectural complexity.

The exception-safety lens: hardware exceptions are the strong guarantee implemented in silicon. The instruction either committed its effects entirely or did not. The CPU spends real complexity to provide this; the alternative (basic guarantee at the instruction level) would make demand paging unimplementable, because no software fault handler could reason about what state the instruction left things in.

This is one of the deepest examples in the book of how the strong guarantee is the foundation on which other things rest. Strip it away at the hardware level and most of the operating system becomes incoherent.

Database transaction boundaries

A database transaction is commit-or-rollback. The strong guarantee, with industrial-strength implementation. ACID’s “A” is “atomicity,” which is exactly the property the strong guarantee names.

Most working programmers think of transactions as a database thing rather than an exception-safety thing. Look at it from the other direction:

  • A transaction commits or rolls back. The strong guarantee.
  • Rolled-back transactions leave no observable state. The strong guarantee’s contract.
  • The transaction log (write-ahead log, undo log) is the database’s mechanism for making rollback possible even after partial application — it is the system-level analog of a scope guard’s stored undo action.
  • Two-phase commit (the database protocol) is two-phase commit (the C++ pattern) at a different scale: prepare in a way that is reversible, then commit in a way that is not.

The vocabulary is older than C++’s. The mechanisms are the same.
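The stored-undo-action analogy can be made concrete with a toy undo log (a sketch; all names are illustrative):

```python
# A toy undo log: record each key's old value before overwriting it, so a
# partially applied transaction can be rolled back mid-flight. This is the
# in-process skeleton of the database's undo/write-ahead mechanism.
_MISSING = object()

class Transaction:
    def __init__(self, table):
        self.table = table
        self.undo_log = []                  # (key, old value), oldest first

    def put(self, key, value):
        self.undo_log.append((key, self.table.get(key, _MISSING)))
        self.table[key] = value             # mutate eagerly; undo info saved first

    def rollback(self):
        for key, old in reversed(self.undo_log):    # undo in reverse order
            if old is _MISSING:
                del self.table[key]
            else:
                self.table[key] = old
        self.undo_log.clear()

table = {"balance": 100}
tx = Transaction(table)
tx.put("balance", 70)
tx.put("fee", 5)
tx.rollback()
# table is back to {"balance": 100}; the partial transaction left no trace
```

The undo entries play exactly the role of a scope guard's stored rollback action, just recorded as data rather than as a destructor.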

A specific exception-safety problem at the transaction boundary: what does your application code do if the database driver throws after the transaction has been committed but before the commit’s confirmation reaches the application? Specifically:

try:
    cursor.execute("INSERT ...")
    conn.commit()                    # network call to DB
    update_local_cache(item)         # local mutation
except DatabaseError:
    rollback_local_cache(item)
    raise

If conn.commit() succeeds on the database side but the connection is dropped before the success is reported, the driver throws. The exception handler runs. update_local_cache was never called. The application’s local cache says “no item,” the database says “yes item.” The application has desynchronized from the system of record, in a way that no rollback can fix — the database commit did happen, irreversibly, before the failure was reported.

This is the Two Generals' Problem applied to a single client and a single server, and it has no in-process solution. The only way forward is idempotency on retry: design the operation so that retrying it is safe whether the previous attempt committed or not. Exactly-once delivery does not exist in the network model; at-least-once with idempotency does.
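The idempotency-key discipline can be sketched in a few lines (all names are illustrative; the "server" is an in-process stand-in):

```python
import uuid

# At-least-once delivery made safe by an idempotency key: the client
# attaches the same key to the original request and to every retry, and
# the server applies each key at most once.
processed = {}                        # server side: key -> stored result

def server_apply(key, operation):
    if key in processed:
        return processed[key]         # duplicate: replay the result, don't re-run
    result = operation()
    processed[key] = result
    return result

applied = []

def insert_item():
    applied.append("item")            # the irreversible commit
    return "row-1"

key = str(uuid.uuid4())               # one key per logical operation
first = server_apply(key, insert_item)     # commit succeeds on the server...
# ...but the confirmation is lost, so the client retries with the SAME key
second = server_apply(key, insert_item)
# the insert ran exactly once; both attempts report the same result
```

The key is what converts "did my commit happen?" from an unanswerable question into a safe retry.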

The exception-safety lesson: the strong guarantee, end-to-end, requires more than two-phase commit at the function level when there's a network in the middle. You can have the strong guarantee for the database. You can have the strong guarantee for the local cache. You cannot have it for "both atomically" without a distributed coordinator, and even then with caveats. This is the territory of consensus algorithms, which is a subject for its own book.

Saga compensations in distributed systems

When you can’t have transactions across services — and you usually can’t, in a microservices architecture — you have sagas. A saga is a sequence of local transactions, each in a different service, with compensating transactions defined for rollback.

Order Service:    create order        (compensate: cancel order)
Payment Service:  charge card         (compensate: refund card)
Inventory:        reserve stock       (compensate: release stock)
Shipping:         schedule delivery   (compensate: cancel delivery)

If step 3 fails, the saga runs the compensations for steps 1 and 2 in reverse: release nothing (step 3 didn’t run), refund the card, cancel the order.

This is literally the scope-guard pattern from chapter 4, scaled up to multiple services and made durable. Each “compensation” is the rollback action a scope guard would run. The “dismiss on success” is the saga’s normal completion. The orchestrator (or choreographed event flow) is what guarantees that compensations actually run.
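That correspondence can be sketched directly with a toy orchestrator (all names are illustrative):

```python
# A minimal saga runner: each step is (name, action, compensate). On failure,
# the compensations for the steps that completed run in reverse order, like
# a stack of scope guards unwinding.
def run_saga(steps):
    completed = []                    # (name, compensate) for finished steps
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()                # best-effort: a real saga must also
            raise                     # cope with compensations that fail
        completed.append((name, compensate))

log = []

def fail_reserve_stock():
    raise RuntimeError("out of stock")

steps = [
    ("order",   lambda: log.append("create order"), lambda: log.append("cancel order")),
    ("payment", lambda: log.append("charge card"),  lambda: log.append("refund card")),
    ("stock",   fail_reserve_stock,                 lambda: log.append("release stock")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass
# log: create order, charge card, refund card, cancel order
```

A production orchestrator differs mainly in durability: the `completed` list must survive a crash of the orchestrator itself, which is why real saga engines persist it.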

The way sagas differ from in-process scope guards:

  • Compensations may themselves fail. If the refund call fails, what now? You have a sequence of nested rollbacks, each of which can fail, and at some point a human has to look at the system and decide.
  • Compensations are not always exact inverses. “Send a marketing email” has no compensation: you can’t unsend the email. The saga either declines to include un-compensable steps, or accepts that some compensations are best-effort.
  • Compensations may not be valid if too much time has passed. Refunding a payment six months after the charge is harder than refunding it immediately. State drifts.

These are all reasons why distributed sagas are harder than they look, and why most production “distributed transaction” systems quietly accept that they are best-effort and have manual reconciliation processes for the cases where compensation fails.

The exception-safety language: a saga provides the basic guarantee at the system level (no resources leaked) and approximates the strong guarantee (rollback to before-state) but cannot generally provide it. “Approximates” is doing a lot of work in that sentence; it is in the gap between approximation and provision that real money lives.

Cooperative multitasking yields

A coroutine that yields to another coroutine has, from its own perspective, paused. From the perspective of any code that observes the yielded coroutine’s mutable state, the coroutine has left state in whatever shape it was in at the yield.

This is the same problem as a lock release in concurrent code: the yielded coroutine has not finished its mutation. Whatever state it was modifying may be partially modified. Other coroutines that read that state see the half-modified version.

In strict cooperative multitasking, the yielding coroutine controls when it yields, so it can ensure it yields only at points where its state is consistent. This is the cooperative version of “two-phase commit”: don’t yield in the middle of a multi-step state mutation. Modern async/await is closer to this — await is an explicit yield point, and the programmer can structure code so awaits happen between consistent states.
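A toy round-robin scheduler over Python generators makes the bad-yield-point case concrete (all names are illustrative):

```python
from collections import deque

# Two cooperatively scheduled coroutines. transfer_sloppy yields in the
# middle of a two-step mutation, so the observer that runs next sees the
# half-modified total.
balance = {"a": 100, "b": 0}
observations = []

def transfer_sloppy(amount):
    balance["a"] -= amount
    yield                             # BAD yield point: debit done, credit not
    balance["b"] += amount

def observer():
    observations.append(balance["a"] + balance["b"])
    yield

tasks = deque([transfer_sloppy(10), observer()])
while tasks:                          # round-robin at yield points
    task = tasks.popleft()
    try:
        next(task)
        tasks.append(task)
    except StopIteration:
        pass
# observations == [90]: the invariant "a + b == 100" was visibly broken
```

Moving the yield before the debit or after the credit restores the invariant at every yield point, which is the whole discipline in one line.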

But in preemptively scheduled lightweight tasks (Erlang processes, goroutines under Go's asynchronous preemption), the runtime can preempt at points the task did not choose. In those systems, the same care that applies to multi-threaded code applies, because preemption is exception-shaped: an unexpected control transfer that may leave state half-modified.

Memory allocation under low-memory conditions

A subtle one. A program that allocates memory under low-memory conditions may have an allocation throw (bad_alloc in C++) or return null (C, some other languages). The point at which the failure occurs is not predictable in advance — it depends on the running state of the entire process. Any allocation, in principle, could fail.

Code written to handle allocation failure gracefully is rare. Most code assumes allocations succeed, with bad_alloc propagated up to a top-level handler that logs and aborts. This is a defensible position — if you’re out of memory, your options are limited — but it means any function that allocates may be the source of a throw, which is most functions in any language with dynamic allocation, which is most languages.

The exception-safety implication: in a code base that wants to provide guarantees under allocation failure, the allocation discipline becomes pervasive. Every assignment of a vector, every use of std::string, every container operation is an allocation site. Codifying “this code is exception-safe under allocation failure” requires either avoiding allocations in critical paths (which is the embedded-systems and high-frequency-trading discipline) or accepting that the strong guarantee under allocation failure is not on offer.

The Linux kernel, famously, must allocate with GFP_ATOMIC in interrupt context, where the allocator will not sleep and may simply return null. Driver code is full of conditional logic for this: try to allocate, fall back to a slower path if that fails, ultimately drop a packet rather than block. This is exception safety, written as null-checks instead of try/catch, with the same structure.

Cancellation in async runtimes

We touched on this in chapter 7. Cancellation in async runtimes — Tokio, asyncio, .NET Tasks — is exception-shaped: a request to abort an in-progress operation, propagated either as an exception (most runtimes) or as a Drop (Rust). The cancelled operation has, by definition, not finished. Its in-progress state may be half-mutated.

The cancellation-safety question is: if your async function is cancelled at any await point, will it leave behind a consistent world?

This is the same question as exception safety. The framing is younger, the literature is thinner, and the patterns are still being worked out, but the underlying problem is the same. The Rust async community in particular has been building vocabulary around cancellation safety; expect the next decade of literature on it to repeat, in different terms, the patterns the C++ community established for exceptions in the 1990s.
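The asyncio version of the question fits in a few lines (a sketch; the unsafe ordering is deliberate):

```python
import asyncio

# Cancellation is delivered as CancelledError at an await point.
# transfer_unsafe awaits in the middle of a two-step mutation, so
# cancelling it mid-flight loses money.
balance = {"a": 100, "b": 0}

async def transfer_unsafe(amount):
    balance["a"] -= amount
    await asyncio.sleep(0)            # cancellation can land here...
    balance["b"] += amount            # ...and then this line never runs

async def main():
    task = asyncio.ensure_future(transfer_unsafe(10))
    await asyncio.sleep(0)            # let the task reach its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())
# balance is now {"a": 90, "b": 0}: ten units have vanished
```

The fix is the same two-phase discipline as everywhere else in this chapter: do the awaiting first, then perform the multi-step mutation with no await point in the middle.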

A general pattern emerges

Across all of these:

Domain                 The "control transfer"         The "partial state"                 The pattern
C++ exceptions         throw                          Local state when throw fires        Two-phase commit, RAII
Smart contracts        External call                  Storage state during call           Checks-effects-interactions
Concurrent code        Lock release / thread switch   Shared state at switch              Two-phase commit under lock
Signal handlers        Signal delivery                Process state when signal arrives   Async-signal-safe handlers
Hardware exceptions    Trap                           Instruction state at trap           Architectural precise exceptions
Database transactions  Transaction abort              Tuples mid-update                   Write-ahead log + rollback
Sagas                  Step failure                   Cross-service state                 Compensating transactions
Async/cancellation     Cancel signal                  Await-suspended state               Cancellation-safe code
Allocation failure     bad_alloc / null               Local state at allocation           Allocate first, mutate after

In each row, the same shape: something might transfer control out of the function unexpectedly; the function must be written so that whatever state it’s in at the moment of transfer is recoverable or compensatable. The local mechanism differs. The structural pattern is identical.

This is, I think, the most useful thing to take away from the book. Once you see the shape, you see it everywhere. And the patterns the C++ community developed for the exception case — RAII, two-phase commit, scope guards, the strong guarantee — turn out to be the patterns everyone needs, in every domain. They are sometimes rediscovered with different names. Sometimes they are not rediscovered at all, and the bugs that result are reported as “logic errors” or “race conditions” or “reentrancy attacks,” and the post-mortems treat them as separate phenomena. They are not separate phenomena.

The next chapter is about the tooling that exists, mostly inadequate, for catching these.

Further reading