
Exception Safety in Concurrent Code

Exception-safe sequential code is hard. Exception-safe concurrent code is roughly an order of magnitude harder, and the reason is not subtle: in sequential code, the time between “an exception fires” and “control reaches a handler” is short and involves only one thread of control. In concurrent code, that interval may be observable by other threads, which may act on the partially mutated state before the throwing thread has cleaned up, before any handler has run, before any rollback could conceivably take effect.

This chapter walks through the specific problems that emerge when exceptions and concurrency interact: lock-holding under throw, partial state visible across threads, the interaction with thread cancellation, and the surprisingly subtle question of what “the strong guarantee” even means in a multi-threaded context.

The basic problem: locks held under throw

class Cache {
    std::mutex mu_;
    std::unordered_map<std::string, std::string> entries_;
public:
    void put(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(mu_);
        entries_[key] = value;       // (1) might throw bad_alloc
        update_metadata(key);        // (2) might throw
        notify_observers(key);       // (3) might throw
    }
};

If update_metadata at (2) throws, the lock is released by lock_guard’s destructor — that part is fine. But what state is entries_ in? entries_[key] = value succeeded. Whatever invariant update_metadata was supposed to maintain in another part of the cache (perhaps a separate lookup table, an LRU list, something) has not been maintained. The cache is now internally inconsistent, the lock is released, and another thread is free to observe the half-updated state.

This is the internal version of the problem from chapter 1, with one new wrinkle: in sequential code, the throwing thread will eventually unwind to a handler that can decide to fix or ignore the inconsistency. In concurrent code, any other thread can call Cache::get between the throw and the handler, and will see whatever state the throw left.

Rust calls this poisoning and bakes it into std::sync::Mutex. If a thread panics while holding the lock, the next attempt to acquire the lock returns Err(PoisonError) instead of granting access. Code that wants to access the (potentially inconsistent) state has to call .into_inner() on the error explicitly, acknowledging that “yes, I know the state may be wrong, I’m taking responsibility.”

C++ has no equivalent. C++’s std::mutex happily releases on the throw and grants the next acquisition without comment. Java and Go are the same. The discipline of “leave the protected state consistent before you throw” is, in the non-Rust languages, entirely on the programmer’s shoulders, and the language has no way to remind you that you forgot.
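Nothing stops you from approximating poisoning in C++ yourself, though. A minimal sketch — the class name, `with_lock` interface, and `clear_poison` escape hatch are all invented for illustration, not any standard facility:

```cpp
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical poisoning wrapper: if the critical section exits via an
// exception, the protected state is marked poisoned, and later acquisitions
// throw until someone explicitly takes responsibility.
class PoisonableMutex {
    std::mutex mu_;
    bool poisoned_ = false;
public:
    // Run `fn` under the lock; mark the state poisoned if it throws.
    template <typename Fn>
    auto with_lock(Fn&& fn) {
        std::lock_guard<std::mutex> lock(mu_);
        if (poisoned_)
            throw std::runtime_error(
                "mutex poisoned: protected state may be inconsistent");
        try {
            return std::forward<Fn>(fn)();
        } catch (...) {
            poisoned_ = true;   // the protected invariant may now be broken
            throw;
        }
    }
    // Explicit acknowledgement, mirroring Rust's PoisonError::into_inner().
    void clear_poison() {
        std::lock_guard<std::mutex> lock(mu_);
        poisoned_ = false;
    }
};
```

The point is not that you should deploy this wrapper everywhere; it is that the poisoning idea is a small amount of code, and the real cost Rust pays is making every caller confront the poisoned case.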

The fix for the snippet above:

void put(const std::string& key, const std::string& value) {
    // do throwing work outside the lock if possible
    auto new_entry = make_entry(key, value);  // might throw
    auto new_metadata = compute_metadata(key); // might throw

    std::lock_guard<std::mutex> lock(mu_);
    // critical section is now no-throw
    entries_[key] = std::move(new_entry);
    metadata_[key] = std::move(new_metadata);
}

Reorder so all throwing operations happen outside the lock, and the critical section becomes a sequence of no-throw moves and assignments. (One caveat: entries_[key] on an unordered_map can still allocate — and throw — when inserting a new key; for a strictly no-throw commit, reserve capacity or pre-build a node outside the lock.) This is two-phase commit at the concurrent level: the “side copy” is constructed with no lock held, then committed under the lock with no-throw operations.

This pattern is not always achievable. If you need to read protected state in order to compute the new state (read-modify-write), you can’t do the computation outside the lock. The standard answer here is atomic compare-and-swap: read the current state, compute the new state without holding the lock, then atomically swap if the state hasn’t changed in the meantime, retrying if it has. Lock-free data structures generalize this. But that is a different chapter, in a different book.
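The shape of that retry loop is worth seeing once. A sketch for a simple numeric update — the account-balance framing is invented for illustration; any transformation works as long as the new value is computed from a snapshot:

```cpp
#include <atomic>

// Read-modify-write without holding a lock across the (possibly throwing)
// computation: snapshot, compute, then commit only if nothing changed.
std::atomic<int> balance{100};

bool apply_fee(int fee) {
    int current = balance.load(std::memory_order_relaxed);
    for (;;) {
        int desired = current - fee;    // computation happens lock-free; if
        if (desired < 0) return false;  // it throws, no lock is ever held
        // compare_exchange_weak reloads `current` on failure, so the loop
        // retries against a fresh snapshot. It may also fail spuriously,
        // which the loop absorbs.
        if (balance.compare_exchange_weak(current, desired,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed))
            return true;
    }
}
```

The exception-safety property is that the only mutation is the final compare-exchange, which is atomic and no-throw: a throw anywhere in the compute step leaves the shared value untouched.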

Strong guarantee under concurrency: redefined

Recall the strong guarantee: “either the operation completes, or the visible state is exactly as before.”

Under concurrency, visible to whom? Three answers, increasingly demanding:

  1. Visible to the same thread. After the throw, this thread sees the same state it would have seen without calling the operation. Easy, satisfied by careful sequential design.

  2. Visible to other threads after the throw is fully handled. Once this thread’s exception has propagated to its handler and any cleanup has run, other threads see consistent state. Achievable with two-phase commit under a lock, as above.

  3. Visible to other threads at every moment, including during the operation. Other threads may observe the operation in progress, but they see the state as either fully-before or fully-after, never partially-applied. Strict strong guarantee. Requires either no observable intermediate state (atomic update) or a coordination mechanism that makes the intermediate state invisible (RCU, copy-on-write).

Most real concurrent code provides (1) by accident, (2) by careful design, and (3) only when the data structure is explicitly designed for it. The C++ Standard Library’s concurrency primitives (std::shared_mutex, the various atomic types, etc.) provide (2) under their stated contracts and (3) only for individual atomic operations.

The interesting cases are when (3) should be required and isn’t. Example: a logging system where a log line is written to a buffer and a sequence number is incremented. If the buffer write throws after the sequence number is incremented, an external observer reading “current sequence number” sees a number that doesn’t correspond to any actual log line. Subtle, occasionally consequential.
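The logging bug has a mechanical fix: make the sequence number the commit point. Write the line first, then publish the number with a release store, so an observer that reads sequence number N is guaranteed that line N exists. A sketch, assuming a single writer — the fixed-size ring layout and class name are invented for illustration:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <string>

// Invented single-writer log: seq_ counts fully committed lines and is
// bumped only after the line it refers to is completely written, so it can
// never name a line that does not exist.
class SeqLog {
    std::array<std::string, 1024> lines_;
    std::atomic<std::uint64_t> seq_{0};
public:
    void append(const std::string& line) {
        std::uint64_t n = seq_.load(std::memory_order_relaxed);
        lines_[n % lines_.size()] = line;              // may throw: seq_ not yet bumped
        seq_.store(n + 1, std::memory_order_release);  // commit point, no-throw
    }
    std::uint64_t committed() const { return seq_.load(std::memory_order_acquire); }
    const std::string& line(std::uint64_t n) const { return lines_[n % lines_.size()]; }
};
```

If the buffer write throws, the sequence number was never incremented, and observers see the before-state — the increment-then-write ordering from the buggy version is simply reversed so that the no-throw operation is the last one.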

Thread cancellation: the special case

Some platforms support thread cancellation: another thread requests this thread to stop, and at well-defined cancellation points, the thread is interrupted. POSIX threads have pthread_cancel; Java has Thread.interrupt(); .NET has cooperative cancellation tokens.

In C++, on common implementations (glibc, for example), cancellation is delivered as a special exception — a “forced unwind” that runs destructors normally but cannot be ignored: catch (...) can catch it, but must re-throw it, and a handler that swallows it aborts the process. The intent is that cleanup runs but the cancellation ultimately cannot be stopped. In practice, this means that any code that is exception-safe in the ordinary sense is also cancellation-safe, modulo the rule that you must re-throw the cancellation exception — a rule the runtime enforces by aborting when you don’t.

This is mostly fine, but it introduces a subtle constraint: any catch (...) clause in C++ might be catching a cancellation exception, and re-throwing it (or letting it propagate) is essential to allowing the cancellation to actually take effect. The idiom:

try {
    do_work();
} catch (...) {
    cleanup();
    throw;  // critical: re-throw to propagate cancellation
}

Code that does catch (...) {} (catch all and swallow) will silently absorb thread cancellation, defeating the mechanism. This has caused real bugs.
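One way to sidestep the hazard entirely is to express the cleanup as a scope guard rather than a catch (...) clause: nothing is ever caught, so cancellation propagates untouched. A minimal hand-rolled sketch (std::experimental::scope_exit is the library shape of the same idea; the counter and `do_work_or_throw` are invented for illustration):

```cpp
#include <stdexcept>
#include <utility>

// Minimal scope guard: runs the cleanup on every exit path, including stack
// unwinding, without ever catching the in-flight exception. The cleanup
// itself must be no-throw, since destructors are noexcept by default.
template <typename Fn>
class ScopeExit {
    Fn fn_;
public:
    explicit ScopeExit(Fn fn) : fn_(std::move(fn)) {}
    ~ScopeExit() { fn_(); }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;
};

int cleanups_run = 0;  // stands in for real cleanup work

void do_work_or_throw(bool fail) {
    ScopeExit guard([] { ++cleanups_run; });
    // If this throws -- or is a forced unwind from pthread_cancel -- the
    // guard still runs, and since no handler exists here, nothing can be
    // accidentally swallowed.
    if (fail) throw std::runtime_error("work failed");
}
```

The design choice is the point: a destructor-based cleanup cannot absorb a cancellation exception, because there is no handler to forget to re-throw from.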

Java’s interruption is similar in spirit: InterruptedException is checked, and the convention is that catching it without re-throwing or restoring the interrupted status leaves the thread in a state where the interruption has been silently lost. Java Concurrency in Practice (chapter 7) covers this in detail; the short version is “call Thread.currentThread().interrupt() after catching InterruptedException if you don’t re-throw.”

In Go, context.Context is the cancellation mechanism, and it is not exception-shaped: it is a value-based signal, with code expected to check ctx.Done() at strategic points. This avoids the silent-swallow problem at the cost of explicit checks. Whether you prefer this trade-off depends on your views about visibility-versus-ergonomics in error handling.

Async cancellation

Async/await frameworks (C# Task, Python asyncio, JavaScript Promise/async, Rust async) all have to deal with cancellation across await points, where the call stack at the cancellation point is logical rather than physical. The languages handle this in different ways:

  • C# / .NET: CancellationToken passed explicitly. Awaited operations check the token; on cancellation, throw OperationCanceledException. The exception unwinds through await boundaries normally.
  • Python asyncio: Task.cancel() causes the next await to raise CancelledError inside the coroutine. The stack is logical; unwinding happens as CancelledError propagates up through the awaiting frames.
  • JavaScript: No built-in cancellation. AbortController provides an out-of-band signal; libraries are expected to check it. There is no exception-shaped cancellation mechanism in the language.
  • Rust async: Cancellation is “drop the future.” Drop runs destructors. The future cannot run any more. This is, in some ways, the cleanest model — cancellation is just resource cleanup of an unfinished computation — but it has surprising consequences for code that “must complete” parts of an operation.

The Rust model deserves a second look, because it interacts with exception safety in a way that is not obvious. If a Rust async function has done step_1() and is partway through await step_2(), and the future is dropped, step_2’s Drop runs, and the function’s local state is dropped, but step 1’s effects on external state remain. The basic guarantee for resource cleanup is preserved (the Drop chain runs), but the strong guarantee for the operation’s logical atomicity is not, because Rust can’t synthesize a rollback of step_1. As of this writing, the async-Rust community is still building patterns for this — the cancellation-safety literature is recent and ongoing.

Lock ordering and deadlock under exception

A specific concurrent failure mode worth flagging: if you acquire two locks and an exception fires between them, the cleanup must release them in the right order. RAII gets this right automatically (destructors run in reverse construction order). But if you acquire locks in a function and re-throw to the caller, the caller may then try to acquire the same locks in a different order, deadlocking.

void Service::transfer(Account& a, Account& b, int amount) {
    std::scoped_lock lk(a.mu_, b.mu_);  // deadlock-free acquisition
    // ...
}

std::scoped_lock (C++17) is the right answer: it acquires multiple mutexes in a deadlock-free order using std::lock, releases them in reverse on destruction, and is exception-safe by construction. Older code using std::lock_guard paired with std::lock separately is more error-prone. The footgun: if you lock a.mu_ first and then throw before acquiring b.mu_, you’re safe. But if a different code path locks b.mu_ first and then a.mu_, you have lock-ordering inconsistency, and that can deadlock with the first path. Exception cleanup never causes the deadlock; the deadlock was structural, just exposed by the exception path having different timing.

std::scoped_lock and the equivalent in other languages exist because the manual version is so easy to get wrong. Use them.
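For reference, the pre-C++17 manual idiom that std::scoped_lock replaces: std::lock acquires both mutexes in a deadlock-free order, and adopt_lock hands ownership to guards for exception-safe release. A sketch, with an invented two-field Account for illustration:

```cpp
#include <mutex>

struct Account {
    std::mutex mu_;
    long balance = 0;
};

// Pre-C++17 idiom: std::lock avoids the ordering deadlock, and the
// adopt_lock guards guarantee release even if the body throws.
void transfer(Account& a, Account& b, long amount) {
    std::lock(a.mu_, b.mu_);  // deadlock-free acquisition of both
    std::lock_guard<std::mutex> la(a.mu_, std::adopt_lock);
    std::lock_guard<std::mutex> lb(b.mu_, std::adopt_lock);
    a.balance -= amount;      // no-throw critical section
    b.balance += amount;
}
```

The error-prone part is remembering both halves: std::lock without the adopt_lock guards leaks the locks on a throw, and the guards without std::lock reintroduce the ordering deadlock. std::scoped_lock fuses them so neither half can be forgotten.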

A small disaster: the destructor-throws-during-unwinding interaction

Combined with concurrency, the rule “destructors must not throw” becomes harder to satisfy. A destructor that, say, releases a lock and then logs to a remote service is fine in normal operation. During stack unwinding — perhaps triggered by the very failure that made the remote service unreachable — the destructor’s logging call throws, the runtime sees an exception escaping a destructor, and std::terminate is called. (Since C++11, destructors are noexcept by default, so an escaping throw terminates even outside of unwinding.)

The fix is, again, to absorb exceptions in destructors:

~RemoteLogTransaction() {
    try {
        if (!committed_) remote_log_.send_rollback();
    } catch (...) {
        // swallow; we're already unwinding
        local_log_.write("rollback failed; remote service unreachable");
    }
}

This is ugly. It is also necessary. Every destructor in concurrent C++ code that does anything beyond pointer cleanup needs to consider this case.

What concurrent exception safety looks like, in practice

A short list of the disciplines that work:

  1. Do throwing work outside critical sections. Compute the new state with no lock held; commit under-lock with no-throw operations. Two-phase commit at the concurrent level.

  2. Use lock-free or atomic primitives where the strong concurrent guarantee is required. std::atomic<T>::compare_exchange_weak / compare_exchange_strong for read-modify-write. std::shared_mutex for read-mostly state. These are tools designed for the case where the state-consistency window must be zero.

  3. Use std::scoped_lock (C++17) or equivalents for multi-mutex acquisition. The deadlock-avoidance and exception-safe release are built in.

  4. Handle thread cancellation explicitly. Re-throw cancellation exceptions in C++. Restore interrupt status in Java. Honor cancellation tokens in C#. Treat cancellation as a first-class case, not as “exceptions are exceptional.”

  5. Audit destructors for exception throws, especially in concurrent code where the calling environment may be more error-prone than the development environment. A destructor that throws in production but not in test is the worst kind of production bug.

  6. Recognize that mutex poisoning (Rust) is a feature, not a bug. Other languages should learn from it. Where the language doesn’t help, leave a comment that the protected state may be inconsistent after a throw, and treat that as a first-class failure mode.

  7. Prefer immutable data with atomic pointer swaps over mutable data with mutexes. The exception-safety argument is: an immutable structure cannot be partially mutated, because it cannot be mutated at all. A swap of a pointer to it is atomic. The throw-during-mutation problem disappears. The cost is a copy.

The honest summary

Concurrent exception safety is not a discipline anyone gets right by intuition. It is a set of patterns, applied with care, and continually undermined by the ordinary engineering pressures of “this code worked, let’s not touch it” and “we don’t have time to audit every destructor.” The result, in practice, is that almost every long-lived C++ codebase contains some number of latent exception-safety bugs in its concurrent code, and these manifest occasionally as production incidents whose post-mortem says “we shipped a fix to handle the case where x throws while y was in state z.”

The next chapter is about how this exact problem reappears in a place that does not involve try/catch at all, but is nonetheless the same problem in different costume.

Further reading

  • Hans-J. Boehm, “Threads Cannot Be Implemented as a Library,” PLDI 2005. Foundational paper on why concurrency must be a language-level concept, with implications for exception interaction.
  • Java Concurrency in Practice, Goetz et al., chapter 7 (“Cancellation and Shutdown”). The clearest treatment of the interrupt-and-cancellation interaction in any language.
  • “Cancellation safety in async Rust” — see the tokio documentation and the Rust async working group’s ongoing discussion. As of this writing, the formal definition is still being refined.
  • The Rustonomicon, “Poisoning” section: https://doc.rust-lang.org/nomicon/poisoning.html