
Patterns That Work: Sagas, Outbox, Circuit Breakers, and Friends

Previous chapters have been about problems. This one is about solutions — or more precisely, about the patterns that smart engineers have repeatedly converged on when solving distributed systems problems. These aren’t novel ideas; they’re battle-tested approaches with known trade-offs.

If chapters 1-9 are the theory, this is the practice.


The Transactional Outbox Pattern

The problem it solves: You need to perform a database write and publish an event/send a message, and you need both to happen atomically (either both succeed or neither does).

This comes up constantly. You save a new order to the database and need to publish an OrderCreated event to a queue. If the database write succeeds but the message publish fails, your system is inconsistent. If the message publish succeeds but the database write fails, you’ve published a lie.

The naive solution — write to DB, then publish — breaks because these are two separate operations with no transaction spanning them:

// This is broken
db.save(order)  // succeeds
queue.publish(OrderCreated)  // network error!
// order exists in DB, event never published

The outbox pattern:

Write both to the database, in the same transaction:

BEGIN TRANSACTION;

INSERT INTO orders (id, ...) VALUES (...);
INSERT INTO outbox (id, event_type, payload, published)
  VALUES (uuid(), 'OrderCreated', '{"orderId": ...}', false);

COMMIT;

A separate process (the “outbox poller” or “relay”) reads unpublished outbox entries and publishes them to the queue:

import logging
import time

log = logging.getLogger("outbox")

def outbox_relay():
    while True:
        # FOR UPDATE SKIP LOCKED only holds row locks for the duration of a
        # transaction, so claim and publish inside one. SKIP LOCKED lets
        # multiple relay instances run without fighting over the same rows.
        with db.transaction():
            events = db.query("""
                SELECT * FROM outbox
                WHERE published = false
                ORDER BY created_at
                LIMIT 100
                FOR UPDATE SKIP LOCKED
            """)

            for event in events:
                try:
                    queue.publish(event.event_type, event.payload)
                    db.execute(
                        "UPDATE outbox SET published = true WHERE id = ?",
                        event.id
                    )
                except Exception:
                    log.error(f"Failed to publish {event.id}, will retry")

        time.sleep(0.1)  # poll interval

The guarantee: if the transaction commits, the event is in the outbox. The relay will publish it eventually, retrying if necessary. Because the relay can retry (and the queue consumer must be idempotent), this gives you at-least-once delivery with the ordering guarantees of your database.

For high-throughput systems, replace polling with Change Data Capture (CDC): tools like Debezium tail the database’s replication log (binlog for MySQL, WAL for Postgres) and publish events for each committed change. Zero polling overhead; events published in near-real-time.
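A CDC deployment for the outbox table might look like the following Debezium connector configuration. This is a sketch, not a complete config: the hostname, database, and table names are placeholders, and credentials, offsets, and serialization settings are omitted. The connector class and the outbox EventRouter transform are real Debezium components.

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.internal",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```

The EventRouter transform reads each committed outbox row from the WAL and routes it to a topic derived from the row's event type, so the relay process disappears entirely.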


Saga Pattern: Distributed Transactions Without 2PC

The problem it solves: A business operation spans multiple services, each with its own database. You need the operation to be consistent — either all services complete their step, or the operation is rolled back across all of them.

The classic example: placing an order requires:

  1. Reserving inventory (Inventory Service)
  2. Charging the payment method (Payment Service)
  3. Creating the shipment (Shipping Service)
  4. Sending the confirmation email (Notification Service)

If payment succeeds but inventory reservation fails, you’ve charged a customer for something you can’t ship. If shipment succeeds but payment fails, you’ve shipped for free. You need transactional guarantees across services that don’t share a database.

Two-phase commit across services is technically possible but terrible in practice: it requires all services to implement 2PC, creates tight coupling, and leaves the system blocked if the coordinator fails.

Sagas take a different approach: decompose the distributed transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes it.

Place Order Saga:
    Step 1: Reserve inventory        → compensate: release reservation
    Step 2: Charge payment           → compensate: refund charge
    Step 3: Create shipment          → compensate: cancel shipment
    Step 4: Send confirmation email  → compensate: (none / send cancellation)

If any step fails, execute the compensating transactions for all completed steps in reverse order:

Step 1: Reserve inventory       ✓
Step 2: Charge payment          ✓
Step 3: Create shipment         ✗ (failure!)
→ Compensate step 2: Refund charge
→ Compensate step 1: Release reservation
Saga complete (rolled back)

Choreography vs Orchestration

There are two approaches to implementing sagas:

Choreography: Each service publishes events. Other services listen for events and take the next action.

Order Service publishes OrderCreated
  → Inventory Service hears it, reserves inventory, publishes InventoryReserved
    → Payment Service hears it, charges card, publishes PaymentProcessed
      → Shipping Service hears it, creates shipment

Advantages: decoupled, no single point of failure. Disadvantages: the business logic is distributed across services, hard to see the full flow, complex to debug.
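The choreography flow above can be sketched with a toy in-memory broker (the `Broker` class and handler names here are illustrative, not a real messaging library). The point to notice: each service only knows the events it consumes and emits; nobody owns the whole flow.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub/sub, standing in for a real message broker."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)

broker = Broker()
log = []  # records what each "service" did, for illustration

# Inventory Service: reacts to OrderCreated
def on_order_created(event):
    log.append(f"inventory reserved for {event['order_id']}")
    broker.publish("InventoryReserved", event)

# Payment Service: reacts to InventoryReserved
def on_inventory_reserved(event):
    log.append(f"payment charged for {event['order_id']}")
    broker.publish("PaymentProcessed", event)

# Shipping Service: reacts to PaymentProcessed
def on_payment_processed(event):
    log.append(f"shipment created for {event['order_id']}")

broker.subscribe("OrderCreated", on_order_created)
broker.subscribe("InventoryReserved", on_inventory_reserved)
broker.subscribe("PaymentProcessed", on_payment_processed)

# One published event triggers the whole chain, step by step.
broker.publish("OrderCreated", {"order_id": "o-1"})
```

Tracing a failure through this chain means reading three services' logs and reconstructing the event sequence by hand, which is exactly the debugging cost mentioned above.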

Orchestration: A central “saga orchestrator” tells each service what to do next.

Order Orchestrator:
  1. Call Inventory Service: reserve(orderId)
  2. Call Payment Service: charge(orderId)
  3. Call Shipping Service: create_shipment(orderId)
  if any step fails:
  4. Call Payment Service: refund(orderId)  (if charged)
  5. Call Inventory Service: release(orderId)  (if reserved)

Advantages: business logic is in one place, easy to visualize and debug, easier to implement complex compensation logic. Disadvantages: the orchestrator is a central point of failure (mitigated by making it stateless and persisting state to a database).

The orchestration implementation: Store the saga state in a database. On each step completion (success or failure), update the state. The orchestrator can be restarted from any point — it just picks up from the last saved state.

class OrderSaga:
    def __init__(self, db, inventory_svc, payment_svc, shipping_svc):
        self.db = db
        self.inventory = inventory_svc
        self.payment = payment_svc
        self.shipping = shipping_svc

    def execute(self, order_id):
        state = self.db.get_or_create_saga_state(order_id)

        if state.step < 1:
            result = self.inventory.reserve(order_id)
            if not result.success:
                return self.compensate(order_id, step=0)
            self.db.update_saga_state(order_id, step=1, inventory_reservation_id=result.id)

        if state.step < 2:
            result = self.payment.charge(order_id)
            if not result.success:
                return self.compensate(order_id, step=1)
            self.db.update_saga_state(order_id, step=2, payment_id=result.id)

        if state.step < 3:
            result = self.shipping.create_shipment(order_id)
            if not result.success:
                return self.compensate(order_id, step=2)
            self.db.update_saga_state(order_id, step=3, shipment_id=result.id)

        self.db.update_saga_state(order_id, status="completed")

    def compensate(self, order_id, step):
        state = self.db.get_saga_state(order_id)

        if step >= 2 and state.payment_id:
            self.payment.refund(state.payment_id)

        if step >= 1 and state.inventory_reservation_id:
            self.inventory.release(state.inventory_reservation_id)

        self.db.update_saga_state(order_id, status="compensated")

What sagas don’t give you

Sagas provide atomicity (either all steps complete or they’re compensated) but not isolation. While a saga is executing, other sagas can see intermediate states. If two sagas both try to reserve the same last unit of inventory, one might succeed and then later fail at payment, while the other was told “no inventory” incorrectly.

This is the dirty reads problem in saga terms. Mitigations: semantic locks (mark resources as “reserved” before they’re fully committed), process managers that coordinate conflicting sagas, careful ordering of steps (reserve last-item-in-stock at the end, not the beginning).
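A semantic lock can be as simple as a conditional update that moves stock from `available` to `reserved` atomically, so a competing saga sees the pending reservation instead of a dirty read. This sketch uses SQLite for self-containment; the schema and column names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, available INTEGER, reserved INTEGER)"
)
db.execute("INSERT INTO inventory VALUES ('widget', 1, 0)")  # one unit left

def try_reserve(sku, qty):
    # Atomic conditional update: succeeds only if enough unreserved stock
    # remains. The WHERE clause is the semantic lock check.
    cur = db.execute(
        "UPDATE inventory SET available = available - ?, reserved = reserved + ? "
        "WHERE sku = ? AND available >= ?",
        (qty, qty, sku, qty),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows touched means another saga holds the stock

first = try_reserve("widget", 1)   # succeeds: the last unit is now semantically locked
second = try_reserve("widget", 1)  # fails cleanly instead of double-selling
```

If the first saga later fails at payment, its compensating transaction moves the unit from `reserved` back to `available`, and the second customer can be retried or notified.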


Circuit Breaker (Detailed Implementation Notes)

Covered in the fault tolerance chapter, but worth expanding on implementation details.

What counts as a failure?

  • HTTP 5xx responses (server errors)
  • Network timeouts
  • Connection refused errors

What doesn’t count as a failure?

  • HTTP 4xx responses (client errors — these aren’t the downstream service’s fault)
  • Business logic errors (an order can’t be fulfilled because it’s invalid — that’s not a service failure)

Misclassifying 4xx as failures causes circuit breakers to trip on bad client input, which is not what you want.
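A minimal classification function makes the rule concrete (the exception types here are stand-ins for whatever your HTTP client raises):

```python
def counts_as_circuit_failure(status_code=None, exception=None) -> bool:
    """Decide whether an outcome should count against the circuit breaker."""
    if exception is not None:
        # Timeouts and refused connections mean the dependency is unhealthy.
        return isinstance(exception, (TimeoutError, ConnectionRefusedError))
    if status_code is not None:
        return 500 <= status_code <= 599  # server errors only
    return False

assert counts_as_circuit_failure(status_code=503)
assert not counts_as_circuit_failure(status_code=404)  # client error: don't trip
assert not counts_as_circuit_failure(status_code=422)  # invalid order: business error
assert counts_as_circuit_failure(exception=TimeoutError())
```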

Sliding window vs count-based thresholds

Count-based: “trip after 5 failures.” Simple, but without time decay, stale failures linger: five failures accumulated over the past hour would keep the circuit tripped even though the service has been healthy for the last 55 minutes.

Time-based: “trip if error rate exceeds 50% in the last 10 seconds.” More accurate. Requires tracking recent history. Libraries like Resilience4j implement sliding windows.
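A time-based window can be sketched as a deque of timestamped outcomes that discards samples older than the window (this mirrors the concept behind Resilience4j's time-based sliding window, not its actual API):

```python
import time
from collections import deque

class SlidingWindowStats:
    """Error rate over the last `window` seconds."""
    def __init__(self, window=10.0, clock=time.monotonic):
        self.window = window
        self.clock = clock        # injectable for testing
        self.samples = deque()    # (timestamp, ok) pairs

    def record(self, ok: bool):
        now = self.clock()
        self.samples.append((now, ok))
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)
```

The breaker then trips when `stats.error_rate()` crosses the configured threshold; old failures age out automatically instead of pinning the circuit open.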

Half-open probing strategy

Don’t just send one probe when moving to HALF_OPEN. The single probe might fail by chance, sending you back to OPEN when the service has actually recovered. Consider:

  • Sending a configurable number of probes before fully closing
  • Using a lower failure threshold to re-open from HALF_OPEN
  • Incrementally increasing allowed traffic (10% → 25% → 50% → 100%)
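The first option can be sketched as a small probe counter (class name and threshold are illustrative):

```python
class HalfOpenProbe:
    """Require N consecutive probe successes before fully closing."""
    def __init__(self, required_successes=3):
        self.required = required_successes
        self.successes = 0

    def record(self, ok: bool) -> str:
        if not ok:
            self.successes = 0
            return "OPEN"  # a failed probe sends us back to OPEN
        self.successes += 1
        return "CLOSED" if self.successes >= self.required else "HALF_OPEN"
```

A single lucky success no longer closes the circuit; a single unlucky failure still re-opens it, which is the conservative default most implementations choose.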

Read-Your-Writes via Sticky Sessions or Version Tokens

The problem: You write to a primary database, immediately redirect to a page that reads from a replica. The replica hasn’t caught up. Your write is invisible. User reports a bug.

Solution 1: Sticky reads to primary after writes

Track a “recently wrote” flag (in a session cookie, JWT claim, or request header). For a short window (e.g., 30 seconds) after a write, route reads to the primary.

def handle_write(user_id, data):
    db.primary.write(data)
    session["wrote_at"] = time.time()
    return redirect(...)

def handle_read(user_id):
    wrote_recently = (
        "wrote_at" in session and
        time.time() - session["wrote_at"] < 30
    )
    db_to_use = db.primary if wrote_recently else db.replica
    return db_to_use.read(user_id)

Disadvantage: routes potentially expensive reads to the primary, partially defeating the purpose of replicas.

Solution 2: Replication position tokens

After a write, capture the WAL/replication position (a log sequence number). Pass this as a token to subsequent requests. On the replica, wait until the replica has caught up to at least that position before serving the read.

Client writes → Primary returns LSN: 12345
Client next request includes token: min_lsn=12345
Replica checks: is my current LSN ≥ 12345?
  If yes: serve read
  If no: wait briefly, recheck (or route to primary)

This is more precise than time-based routing. PostgreSQL exposes its WAL LSN; MySQL exposes binlog coordinates. CockroachDB has built-in follower reads with this mechanism.
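Against PostgreSQL, the token flow looks roughly like this. `pg_current_wal_lsn()`, `pg_last_wal_replay_lsn()`, and `pg_wal_lsn_diff()` are real Postgres functions; the `primary`/`replica` connection objects and their `query_scalar`/`query` helpers are hypothetical wrappers around your database driver.

```python
def write_and_get_token(primary, sql, params):
    primary.execute(sql, params)
    # Capture the WAL position that includes this write; return it to the client.
    return primary.query_scalar("SELECT pg_current_wal_lsn()")

def read_with_token(replica, primary, min_lsn, sql, params):
    # Has the replica replayed past the client's token?
    caught_up = replica.query_scalar(
        "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0", (min_lsn,)
    )
    target = replica if caught_up else primary  # fall back if the replica is behind
    return target.query(sql, params)
```

In production you would also bound how long a lagging replica is allowed to make you wait before falling back to the primary.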


Event Sourcing

The pattern: Instead of storing the current state of an entity, store the sequence of events that produced that state. Current state is computed by replaying events.

Traditional approach:
  accounts table: { id: 1, balance: 150 }

Event sourcing approach:
  events table:
    { account_id: 1, type: "Deposit", amount: 100, at: T1 }
    { account_id: 1, type: "Deposit", amount: 100, at: T2 }
    { account_id: 1, type: "Withdrawal", amount: 50, at: T3 }

  Current balance: sum events to get 150
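Replaying is just a fold over the event log. A minimal sketch, with event shapes mirroring the example above (a real system would cache snapshots instead of replaying from zero every time):

```python
def apply(balance, event):
    """Apply one event to the running state."""
    if event["type"] == "Deposit":
        return balance + event["amount"]
    if event["type"] == "Withdrawal":
        return balance - event["amount"]
    raise ValueError(f"unknown event type: {event['type']}")

events = [
    {"account_id": 1, "type": "Deposit", "amount": 100},
    {"account_id": 1, "type": "Deposit", "amount": 100},
    {"account_id": 1, "type": "Withdrawal", "amount": 50},
]

balance = 0
for event in events:
    balance = apply(balance, event)
# balance is now 150
```

Time travel falls out for free: replay only the events with `at <= T` and you have the balance as of time T.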

Advantages:

  • Full audit trail: you know exactly how you got to current state
  • Time travel: compute state at any historical point
  • Event replay: rebuild projections from scratch (useful for adding new views)
  • Natural fit with CQRS and event-driven architecture

Disadvantages:

  • Querying current state requires aggregation (mitigated with snapshots and read models)
  • Schema evolution of events is hard (old events must still be interpretable)
  • Mental model shift — requires discipline to work with
  • Storage grows without bound unless snapshots and compaction are implemented

Event sourcing is not universally applicable, but for domains where audit, history, and the ability to re-derive state are important (financial systems, legal records, collaborative editing), it’s a natural fit.


CQRS: Command Query Responsibility Segregation

The pattern: Separate the read path from the write path. Write operations (commands) update an authoritative data store. Read operations (queries) read from one or more separate read models optimized for specific queries.

Write path:
  Command → validate → execute → update write store → publish events

Read path:
  Event listener → update read model(s)
  Query → read from read model

Why it helps: Write stores don’t need to be optimized for queries. Read models can be denormalized, pre-aggregated, and optimized for specific access patterns. Multiple read models can be maintained for different query patterns.

Write store: normalized relational tables
Read model A: denormalized table for customer order history (optimized for "show me my orders")
Read model B: Elasticsearch index for full-text order search
Read model C: materialized view for monthly revenue reporting
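A projection keeping read model A current might look like the following sketch. The event names and read-model shape are hypothetical; in practice this listener consumes the same events the write side publishes (for example, via the outbox).

```python
order_history = {}  # customer_id -> list of denormalized order rows

def project_order_created(event):
    row = {
        "order_id": event["order_id"],
        "total": event["total"],
        "status": "created",
    }
    order_history.setdefault(event["customer_id"], []).append(row)

def project_order_shipped(event):
    for row in order_history.get(event["customer_id"], []):
        if row["order_id"] == event["order_id"]:
            row["status"] = "shipped"

# As events arrive, the read model stays (eventually) in sync.
project_order_created({"customer_id": "c1", "order_id": "o1", "total": 40})
project_order_shipped({"customer_id": "c1", "order_id": "o1"})
```

"Show me my orders" is now a single keyed lookup against the read model — no joins against the normalized write store.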

The trade-off: Read models are derived from events and may lag behind the write store. Queries may see slightly stale data. This is the eventual consistency in CQRS.

CQRS pairs naturally with event sourcing (events update the read models) but can also be used with traditional state-based persistence.


Sidecar/Proxy Pattern for Cross-Cutting Concerns

The pattern: Run a proxy process alongside each service instance. The proxy handles cross-cutting concerns (retries, circuit breaking, load balancing, tracing, mutual TLS) so the service doesn’t have to.

Service code:
  "Call http://payment-service/charge"
  (no retry logic, no circuit breaker, no tracing)

Sidecar proxy intercepts:
  - Adds distributed trace headers
  - Enforces mTLS
  - Applies retry policy
  - Applies circuit breaker
  - Load balances across payment-service instances
  - Reports metrics

Service sees a simple HTTP call.
The sidecar handles everything else.

This is the service mesh pattern: Envoy, Linkerd, Consul Connect. The infrastructure concerns are handled at the infrastructure level, not in application code.

The trade-off: Operational complexity. You’re now running two processes per service instance. The proxy is another thing that can fail, another thing to configure, another thing to understand. The benefit (uniform policy enforcement, language-agnostic implementation) is real, but the cost is real too.


Inbox Pattern (Idempotent Message Consumption)

Companion to the outbox pattern. When consuming messages from a queue, you need to handle duplicates — the queue might deliver the same message more than once (at-least-once delivery).

The pattern: Before processing a message, check if you’ve already processed it. If yes, skip processing (return success). If no, process it and record that you’ve done so.

def handle_message(message):
    message_id = message.headers["message-id"]

    # Check if already processed (atomic with processing via DB transaction)
    with db.transaction():
        if db.query("SELECT 1 FROM processed_messages WHERE id = ?", message_id):
            return  # Already processed, skip

        # Process the message
        process_business_logic(message.body)

        # Record as processed
        db.execute(
            "INSERT INTO processed_messages (id, processed_at) VALUES (?, NOW())",
            message_id
        )

This requires messages to have a stable, unique ID. Most message brokers (Kafka, SQS, RabbitMQ) provide message IDs. For Kafka, the combination of topic + partition + offset is a unique identifier.
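A common variant inverts the order: insert the message ID first and let the primary-key constraint do the dedup, so a duplicate aborts before any business logic runs. A sketch using SQLite for self-containment (the in-memory `processed` list stands in for real business effects):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_messages (id TEXT PRIMARY KEY)")

processed = []  # stand-in for real business effects

def handle_message(message_id, body):
    try:
        with db:  # one transaction: dedup insert + business effects
            db.execute(
                "INSERT INTO processed_messages (id) VALUES (?)", (message_id,)
            )
            processed.append(body)  # process_business_logic(body)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: already handled, safely skip

handle_message("m-1", "charge $10")
handle_message("m-1", "charge $10")  # redelivery: no double charge
```

Note the caveat either way: this only makes the *database* effects idempotent. If the business logic has external side effects (sending an email, calling another service), those need their own idempotency story.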


Patterns Summary

Pattern                  Problem Solved                     Key Trade-off
Transactional Outbox     Atomic DB write + event publish    Added latency; requires outbox poller
Saga (choreography)      Distributed transactions           Hard to visualize; complex compensation
Saga (orchestration)     Distributed transactions           Orchestrator is central point of concern
Circuit Breaker          Cascade failure prevention         Configuration complexity; false trips
Idempotency Key          Safe retries                       Requires unique ID generation; storage
Read-Your-Writes         Replication lag consistency        Routes load to primary
Event Sourcing           Audit trail, history, replay       Storage growth; schema evolution
CQRS                     Read/write optimization            Eventual consistency on reads
Sidecar/Service Mesh     Cross-cutting concerns             Operational complexity
Inbox / Deduplication    Idempotent message consumption     Storage for processed message IDs

None of these patterns are magic. Each adds complexity in exchange for solving a specific problem. The discipline is matching the pattern to the problem — not reaching for a saga when a simple transaction would do, and not “just using a transaction” when you actually need a saga.


Next: the chapter that ties everything together — how to make actual architecture decisions for your actual system, with your actual constraints.