Introduction
Welcome to a book about event-driven architecture. If you picked this up hoping for a breezy overview with a few diagrams and a lot of hand-waving, I have bad news: we are going to get into the weeds. If you picked this up because you are tired of reading blog posts that describe event-driven architecture as "a paradigm where services communicate through events" and then move on as though that explains anything — you are in the right place.
Event-driven architecture (EDA) is one of the most powerful — and most misunderstood — approaches to building distributed systems. It is also one of the oldest. Long before microservices became a conference circuit favourite, long before "real-time" became a product requirement for applications that could comfortably run on a cron job, engineers were building systems that reacted to things that happened in the world. The ideas are not new. The tooling has gotten dramatically better. The ways to get it wrong have scaled accordingly.
This chapter sets the stage. We will define what an event actually is (more carefully than you might expect), contrast EDA with request-response architectures, trace a brief history, and — critically — talk about when you should not use it. Then we will lay out the roadmap for the rest of the book.
What Is an Event?
An event is a record of something that happened. Past tense. Immutable. Done.
This sounds obvious, but it is the single most important concept in the entire book, and getting it wrong cascades into every design decision that follows. An event is not a request. It is not an instruction. It is not a suggestion. It is a statement of fact about the past.
// This is an event:
{
  "type": "OrderPlaced",
  "timestamp": "2025-11-14T09:32:17.443Z",
  "data": {
    "orderId": "ord-7829",
    "customerId": "cust-441",
    "totalAmount": 149.99,
    "currency": "USD"
  }
}

// This is NOT an event — this is a command:
{
  "type": "PlaceOrder",
  "data": {
    "customerId": "cust-441",
    "items": [...]
  }
}
The distinction matters enormously. An event describes what has happened. A command describes what someone wants to happen. The grammar is the giveaway: events are past participles (OrderPlaced, PaymentProcessed, UserRegistered), commands are imperatives (PlaceOrder, ProcessPayment, RegisterUser). If your "event" is named CreateInvoice, you do not have an event. You have a command wearing an event's clothing, and it will cause you problems.
Immutability
Events are immutable because the past is immutable. You cannot un-place an order. You can cancel it — which is a new event (OrderCancelled) — but the original OrderPlaced event remains true. This is not a technical constraint; it is a philosophical one that happens to have profound technical implications.
Immutability gives you an append-only log of everything that happened in your system. This log is the foundation of event sourcing, audit trails, temporal queries, and replay-based debugging. It is also the reason event-driven systems can be so much harder to "fix" than traditional ones. You cannot just UPDATE a row and pretend the mistake never happened. The mistake is in the log. You deal with it by appending a correction, not by rewriting history.
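To make "append a correction, not rewrite history" concrete, here is a minimal illustrative sketch in Python. The `Event` and `EventLog` types are invented for this example, not part of any broker library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    type: str
    data: dict

class EventLog:
    """Append-only: no update, no delete — only new facts."""
    def __init__(self):
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def all(self) -> tuple[Event, ...]:
        return tuple(self._events)

log = EventLog()
log.append(Event("OrderPlaced", {"orderId": "ord-7829"}))
# A mistake is handled by appending a new fact, not by editing the old one:
log.append(Event("OrderCancelled", {"orderId": "ord-7829", "reason": "placed in error"}))
assert len(log.all()) == 2  # the original OrderPlaced event is still in the log
```

Note there is no `update` method to call: the cancellation does not erase the placement, it follows it.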
Events Are Not Messages
This is a distinction that trips up even experienced engineers. A message is a transport mechanism — it is how you get data from point A to point B. An event is a semantic concept — it is the data itself, the fact that something happened. You publish events as messages, but the event exists independently of how (or whether) it is transmitted. An order was placed whether or not your message broker was up at the time. The event is the truth; the message is the delivery vehicle.
This distinction matters when you start thinking about reliability. If the message is lost, the event still happened. Your system needs to be designed to handle that gap.
Request-Response vs Event-Driven: The Paradigm Shift
Most developers grew up building request-response systems. Client sends a request, server processes it, server sends a response, client continues. It is synchronous (or at least synchronous-feeling), sequential, and easy to reason about. The call stack is your friend. Debugging is a matter of following the thread.
Event-driven architecture inverts this model. Instead of "I need X, so I will ask service B for X and wait," the pattern becomes "something happened, and I will tell anyone who cares." The producer does not know — or care — who is listening. The consumer does not know — or care — who produced the event. This is temporal and spatial decoupling, and it is the core value proposition of EDA.
Temporal Decoupling
In a request-response system, the caller and the callee must both be alive at the same time. If the payment service is down, the order service cannot complete the order. In an event-driven system, the order service publishes OrderPlaced and moves on. The payment service processes it whenever it comes back up. The two services do not need to be alive simultaneously.
This sounds wonderful, and it often is. It also means that "the order was placed five minutes ago but payment hasn't been processed yet" is a normal system state, not an error. Your users, your support team, and your monitoring dashboards all need to understand this. If they do not, you will spend your weekends explaining why the system is "broken" when it is, in fact, working exactly as designed.
Spatial Decoupling
The producer does not know who consumes its events. This means you can add new consumers without modifying the producer. The order service does not need to know that a new analytics pipeline just subscribed to OrderPlaced events. This is a genuine superpower for evolving systems. It is also a genuine liability for understanding them — more on that in the observability chapter.
The Cost of Decoupling
Every architectural decision is a trade-off, and the EDA community has historically been better at selling the benefits than acknowledging the costs. So let us be direct:
- Debugging is harder. There is no call stack. There is no single request ID that flows through the system by default (you have to build this with correlation IDs). When something goes wrong, you are reconstructing causality from distributed logs.
- Ordering is harder. Events may arrive out of order. They may arrive multiple times. They may not arrive at all. Your consumers need to handle all of these cases.
- Testing is harder. You cannot just mock a function call. You need to simulate event flows, deal with eventual consistency, and test for scenarios that are difficult to reproduce deterministically.
- Monitoring is harder. "Is the system healthy?" is a surprisingly complex question when the answer depends on the lag of seventeen consumer groups.
- Explaining it to stakeholders is harder. "The data will be consistent... eventually" is not a sentence that inspires confidence in people who sign off on budgets.
None of these costs are deal-breakers. All of them are real. The rest of this book is, in large part, about how to manage them.
A Brief History: From Message Queues to Modern Event Streaming
Event-driven architecture did not spring fully formed from a Kafka whitepaper. The ideas have been evolving for decades.
The Message Queue Era (1980s–2000s)
The first generation of asynchronous messaging was built around message queues. IBM MQ (née MQSeries) launched in 1993, though the concepts predate the product. The model was simple: producers put messages on a queue, consumers take messages off. Once a message is consumed, it is gone. This is destructive consumption, and it worked well for point-to-point integration between enterprise systems.
The Java Message Service (JMS) specification, first released in 1998, standardised the API. AMQP, whose development began in 2003 (with the 1.0 spec arriving in 2011), tried to standardise the wire protocol. RabbitMQ, launched in 2007, made message queuing accessible to developers who did not have an IBM sales team on speed dial.
This era gave us pub/sub as a first-class pattern, dead letter queues, message acknowledgement, and the beginnings of reliable asynchronous communication. It also gave us enterprise integration patterns — Gregor Hohpe and Bobby Woolf's 2003 book of that title remains essential reading.
The Event Streaming Era (2011–Present)
Then LinkedIn built Kafka, and the world shifted.
Apache Kafka, open-sourced in 2011 and graduating from the Apache Incubator in 2012, introduced a fundamentally different model: the distributed commit log. Instead of destructive consumption, Kafka retains events. Consumers maintain their own position (offset) in the log and can re-read events at will. This turned event infrastructure from a plumbing concern into a data platform.
The implications were profound. If events are retained, you can:
- Replay them to rebuild state or recover from bugs.
- Add new consumers that process the entire history of events, not just new ones.
- Build materialised views from event streams.
- Decouple storage from processing — the log is both a communication channel and a database.
This model — events as a log, consumers as independent readers — is the foundation of modern event-driven architecture. Kafka was first, but the idea has been adopted by Apache Pulsar, Redpanda, Amazon Kinesis, Azure Event Hubs, and others. Each has different trade-offs, which we explore in Part 2 of this book.
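The log-plus-offsets model is simple enough to sketch in a few lines. This is a toy illustration (the `CommitLog` class is invented for this example and stands in for a real broker):

```python
class CommitLog:
    """Events are retained; each consumer tracks its own read position (offset)."""
    def __init__(self):
        self._log: list[dict] = []

    def publish(self, event: dict) -> None:
        self._log.append(event)  # append-only, nothing is ever removed

    def read(self, offset: int) -> list[dict]:
        return self._log[offset:]  # a consumer reads from wherever it left off

log = CommitLog()
log.publish({"type": "OrderPlaced", "orderId": "ord-1"})
log.publish({"type": "OrderPlaced", "orderId": "ord-2"})

# A caught-up consumer at offset 2 sees nothing new:
assert log.read(2) == []

# A brand-new consumer starts at offset 0 and sees the entire history:
assert len(log.read(0)) == 2

# Replay is nothing more than re-reading from an earlier offset:
assert log.read(0)[0]["orderId"] == "ord-1"
```

The key property is that consuming is non-destructive: reading never changes the log, so any number of consumers can read at their own pace.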
The Cloud-Native Era (2018–Present)
The most recent wave has been cloud-managed event services and serverless event routing. AWS EventBridge, Google Eventarc, Azure Event Grid — these services abstract away the infrastructure and focus on event routing, filtering, and transformation. They trade control for convenience, and for many use cases, the trade is worth it.
CloudEvents, a CNCF specification for describing events in a common way, emerged in 2018 and reached 1.0 in 2019. It is the closest thing we have to a universal event envelope standard, and we will discuss it in Chapter 2.
Where EDA Fits
Event-driven architecture is not a universal solvent. It is a tool, and like all tools, it excels in some contexts and is actively harmful in others. Here is where it tends to shine.
Microservices Communication
This is the poster-child use case. When you decompose a monolith into services, those services need to communicate. Request-response (typically HTTP/gRPC) works for synchronous queries, but for cross-service state changes — "an order was placed, now inventory, billing, shipping, and analytics all need to react" — events are the natural model.
The alternative is a web of point-to-point API calls, where the order service must know about and call every downstream service. This creates tight coupling, cascading failures, and a deployment dependency graph that makes your release calendar weep.
Data Pipelines and Analytics
Event streams are a natural fit for data ingestion. Instead of batch ETL jobs that run nightly and break silently, you get a continuous stream of business events flowing into your data warehouse, your ML feature store, and your real-time dashboards. Kafka was literally built for this at LinkedIn.
IoT and Sensor Data
When you have thousands (or millions) of devices emitting telemetry, request-response is not viable. The devices fire events, and your backend processes them asynchronously, at whatever rate it can sustain. Back-pressure mechanisms, windowed aggregation, and stream processing are the natural tools here.
Real-Time User Experiences
Live notifications, collaborative editing, activity feeds, real-time pricing — anything where the user expects to see changes as they happen, rather than after a page refresh. Event-driven architecture provides the infrastructure for pushing state changes to interested parties.
Integration Across Organisational Boundaries
When system A is maintained by team X and system B is maintained by team Y, and neither team wants to be on-call for the other's deployments, events provide a natural boundary. Team X publishes events describing what happened in their domain. Team Y subscribes and reacts on their own schedule. The event schema is the contract; the teams need not coordinate beyond that.
The Promise and the Price
Let us be honest about both sides.
The Promise
- Loose coupling. Services can evolve independently. Adding a new consumer does not require changing the producer.
- Scalability. Event consumers can be scaled independently. You can have one producer and fifty consumers, each processing the same stream at different rates.
- Resilience. If a consumer goes down, events are retained (in a streaming system) and processed when it recovers. No data is lost.
- Auditability. An event log is a natural audit trail. You know what happened, when, and (if your events are well-designed) why.
- Temporal freedom. Systems do not need to be available simultaneously. This is genuinely transformative for global systems spanning time zones.
The Price
- Eventual consistency. Your system will have periods where different services have different views of the world. This is not a bug; it is a fundamental property. But it will surprise anyone who expects read-after-write consistency.
- Operational complexity. You are now running a distributed system with an event broker at its heart. That broker needs to be monitored, scaled, secured, and upgraded. It is a critical dependency.
- Debugging difficulty. When a user reports "my order shows as confirmed but I was never charged," tracing the cause requires correlating events across multiple services, possibly with different retention periods.
- Schema management. Events are contracts between services. Changing an event schema is like changing an API — except your "API" might have dozens of consumers you do not know about.
- Duplicate processing. In any distributed system, messages may be delivered more than once. Your consumers must be idempotent. This is easy to say and surprisingly hard to do well.
- Out-of-order processing. Depending on your broker and partitioning strategy, events may arrive out of order. Your consumers need to handle this gracefully.
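One common way to tolerate both duplicate and out-of-order delivery is a last-writer-wins projection keyed on a per-entity version number. This is a hedged sketch that assumes your events carry such a version field, which is an assumption, not a given:

```python
class LatestStateProjector:
    """Keeps only the newest state per entity, tolerating late and duplicate events."""
    def __init__(self):
        self._versions: dict[str, int] = {}  # highest version applied per entity
        self.state: dict[str, dict] = {}

    def apply(self, event: dict) -> None:
        entity_id = event["entityId"]
        if event["version"] <= self._versions.get(entity_id, 0):
            return  # stale or duplicate: a newer version has already been applied
        self._versions[entity_id] = event["version"]
        self.state[entity_id] = event["data"]

p = LatestStateProjector()
p.apply({"entityId": "ord-1", "version": 2, "data": {"status": "shipped"}})
p.apply({"entityId": "ord-1", "version": 1, "data": {"status": "placed"}})  # arrives late
assert p.state["ord-1"]["status"] == "shipped"  # the late, older event did not regress state
```

This handles "latest state" projections; consumers that must process every event in causal order need stronger tools (partitioning by key, sequence gap detection), which later chapters cover.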
The rest of this book is largely about managing these costs without losing the benefits.
When NOT to Use EDA
This section might be the most valuable in the chapter, because the industry has a habit of reaching for event-driven architecture in situations where a simple function call would do the job.
Simple CRUD Applications
If your application is a straightforward create-read-update-delete interface backed by a single database, you do not need events. A web framework, a relational database, and some well-structured SQL will serve you better, faster, and with dramatically less operational overhead. The fact that CRUD is "boring" does not mean it is wrong.
Low-Traffic Internal Tools
If your internal admin dashboard handles fifty requests a day, the complexity of an event-driven architecture is not justified. The benefits of decoupling kick in when systems need to scale independently, evolve independently, or handle failure independently. If everything runs on one server and the on-call rotation is "Dave," a monolith is fine. More than fine — it is correct.
When You Need Synchronous Guarantees
Some operations genuinely need synchronous, transactional guarantees. "Debit account A and credit account B" is the classic example. You can build this with events (and we will discuss sagas in Chapter 3), but the complexity cost is significant. If your entire domain is dominated by operations that need immediate consistency, EDA is fighting your requirements rather than serving them.
When Your Team Is Not Ready
This is the uncomfortable one. Event-driven architecture requires a team that understands distributed systems, eventual consistency, idempotency, and asynchronous debugging. If your team is unfamiliar with these concepts, introducing EDA will not go well. The architecture will degrade into a distributed monolith — all the coupling of a monolith with all the operational complexity of a distributed system. Train first, then migrate.
When You Are Optimising Prematurely
"We might need to scale to millions of users someday" is not a reason to adopt EDA today. Build the simplest thing that works. If and when you hit scaling challenges, you will know specifically what needs to change. Speculative architecture is the enemy of shipping software.
Roadmap of This Book
This book is divided into two parts.
Part 1: Event-Driven Architecture Deep Dive
Part 1 covers the concepts, patterns, and practices that apply regardless of which broker you choose:
- Chapter 2: Core Concepts — Events, commands, queries, event design, CloudEvents, idempotency, and ordering.
- Chapter 3: Fundamental Patterns — Pub/sub, event sourcing, CQRS, sagas, the outbox pattern, and change data capture.
- Chapter 4: Schema Evolution and Contracts — How to change event schemas without breaking consumers. Compatibility rules, schema registries, and versioning strategies.
- Chapter 5: Error Handling and Delivery Guarantees — At-most-once, at-least-once, exactly-once (and why that last one comes with an asterisk). Dead letter queues, retry policies, and poison messages.
- Chapter 6: Observability and Debugging — Distributed tracing, correlation IDs, consumer lag monitoring, and the art of figuring out what went wrong.
- Chapter 7: Security and Access Control — Encryption, authentication, authorisation, and the event-specific challenges of securing a pub/sub system.
- Chapter 8: Testing Event-Driven Systems — Unit testing, integration testing, contract testing, and chaos engineering for event flows.
- Chapter 9: Anti-Patterns and Pitfalls — The distributed monolith, god events, event storms, and other ways to ruin a perfectly good architecture.
Part 2: The Broker Showdown
Part 2 is a detailed, opinionated evaluation of every major (and several minor) event broker available today:
- Chapters 10–25 cover individual brokers: Kafka, RabbitMQ, Pulsar, AWS services, Google and Azure services, Redis Streams, NATS, ActiveMQ, ZeroMQ, Redpanda, Memphis, Solace, Chronicle Queue, Aeron, and a chapter on the more obscure options.
- Chapter 26 provides a comprehensive comparison matrix.
- Chapter 27 is a selection guide — a decision framework for choosing the right broker for your specific needs.
Each broker chapter follows the same structure: architecture overview, strengths, weaknesses, operational characteristics, and honest guidance on when it is (and is not) the right choice. No vendor brochures. No "it depends" without explaining what it depends on.
A Note on Examples
Code examples throughout this book use pseudocode or language-agnostic notation unless a specific broker's client library is being discussed. The concepts are the same whether you are writing Java, Python, Go, TypeScript, or Rust. Where broker-specific examples are necessary, we lean toward the most commonly used client libraries.
We assume familiarity with basic distributed systems concepts (networks are unreliable, clocks are not synchronised, processes can fail). If those ideas are unfamiliar, we recommend reading through the first few chapters of Martin Kleppmann's Designing Data-Intensive Applications before continuing. It is the best prerequisite we can recommend, and we will reference it frequently.
Let us begin.
Core Concepts
Before you design your first event, before you choose a broker, before you argue with your team about whether Kafka is overkill — you need a shared vocabulary. This chapter establishes one. The distinctions here are not academic. Getting them wrong leads to architectures that look event-driven on the diagram but behave like a distributed monolith in production.
Events vs Commands vs Queries
This is the foundational taxonomy. Every message flowing through your system is one of these three things, and conflating them is the single most common design mistake in event-driven systems.
Events
An event is a notification that something happened. Past tense. Immutable. The producer is stating a fact about its own domain.
{
  "type": "InvoiceIssued",
  "source": "billing-service",
  "time": "2025-11-14T10:15:33Z",
  "data": {
    "invoiceId": "inv-9921",
    "customerId": "cust-441",
    "amount": 250.00,
    "currency": "EUR"
  }
}
Key properties:
- Past tense. InvoiceIssued, not IssueInvoice.
- Owned by the producer. The billing service decides what an InvoiceIssued event looks like. Consumers do not get a vote (though they get a voice via schema negotiation — see Chapter 4).
- No expectation of a response. The producer fires and forgets. It does not know or care who consumes the event.
- Immutable. Once published, an event is a historical fact. You do not update events; you publish new ones.
Commands
A command is an instruction to do something. Imperative. Directed at a specific recipient. The sender expects something to happen as a result.
{
  "type": "SendWelcomeEmail",
  "target": "email-service",
  "data": {
    "recipientEmail": "alice@example.com",
    "templateId": "welcome-v2",
    "locale": "en-GB"
  }
}
Key properties:
- Imperative. SendWelcomeEmail, ProcessRefund, ShipOrder.
- Directed. There is one intended recipient. If you are broadcasting a command to "whoever wants to handle it," you have an event in disguise.
- May fail. The recipient may reject the command. The sender typically needs to know about this.
- Has coupling. The sender knows about the recipient and its capabilities. This is point-to-point messaging, not pub/sub.
Queries
A query is a request for information. It does not change state. It is synchronous in nature, even when implemented over asynchronous transport.
{
  "type": "GetOrderStatus",
  "data": {
    "orderId": "ord-7829"
  }
}
Queries in an event-driven system are typically handled via request-response (HTTP, gRPC) rather than through the event broker. The async query pattern exists but adds complexity that is rarely justified. If you find yourself routing queries through Kafka, step back and ask what problem you are actually solving.
Why the Distinction Matters
When you conflate events and commands, you end up with producers that expect consumers to act in specific ways. This recreates the coupling that event-driven architecture was supposed to eliminate. The producer starts to fail when the consumer does not behave as expected. You have reinvented RPC with extra steps and worse debugging.
A helpful litmus test: if the producer's correctness depends on what the consumer does with the message, you have a command, not an event. Design accordingly. Commands go to specific services over point-to-point channels. Events go to topics where anyone can subscribe.
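The litmus test can be made concrete with a toy in-memory dispatcher. This is an illustrative sketch, not a real broker API: events fan out to every subscriber with no response, while a command goes to exactly one registered handler whose result flows back to the sender:

```python
class Broker:
    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks (events: fan-out)
        self._handlers = {}     # command type -> one handler (commands: point-to-point)

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        # Every subscriber gets the event; the producer neither knows nor cares who they are.
        for cb in self._subscribers.get(topic, []):
            cb(event)

    def register_handler(self, command_type, handler):
        self._handlers[command_type] = handler

    def send(self, command_type, command):
        # Exactly one intended recipient; the sender sees the outcome (or the failure).
        return self._handlers[command_type](command)

broker = Broker()
seen = []
broker.subscribe("orders", lambda e: seen.append(("billing", e["orderId"])))
broker.subscribe("orders", lambda e: seen.append(("analytics", e["orderId"])))
broker.publish("orders", {"type": "OrderPlaced", "orderId": "ord-7829"})
assert len(seen) == 2  # one event, two independent consumers

broker.register_handler("SendWelcomeEmail", lambda c: f"sent to {c['recipientEmail']}")
result = broker.send("SendWelcomeEmail", {"recipientEmail": "alice@example.com"})
assert result == "sent to alice@example.com"
```

Notice the asymmetry: `publish` returns nothing, while `send` returns a result. That difference is the coupling the rest of the section describes.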
Anatomy of a Well-Designed Event
A surprising number of production incidents trace back to poorly designed events. Missing timestamps, absent correlation IDs, ambiguous types — these are not theoretical problems. They are the things that make your on-call engineers weep at 3 AM.
Here is what a well-designed event includes:
Event ID
A globally unique identifier for this specific event instance. UUIDs (v4 or v7) are the standard choice. UUIDv7 is preferable where available because it is time-ordered, which makes log analysis and debugging easier.
"id": "01944b3c-8f3a-7d1e-a2b3-4c5d6e7f8901"
This ID serves multiple purposes:
- Deduplication. When a consumer receives the same event twice (and it will), the ID lets it recognise the duplicate.
- Tracing. You can follow a specific event through the system.
- Idempotency keys. Consumers can use the event ID to ensure they process each event exactly once.
Do not use auto-incrementing integers. They are not globally unique, they leak information about your event volume, and they create a coordination bottleneck.
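A minimal sketch of ID-based deduplication, assuming an in-memory set of processed IDs (a real consumer would persist this set atomically alongside the side effect; the class and field names here are invented for illustration):

```python
class IdempotentConsumer:
    """Records processed event IDs so redelivery of the same event is a no-op."""
    def __init__(self):
        self._processed: set[str] = set()
        self.side_effects = 0  # counts how many times real work actually ran

    def handle(self, event: dict) -> None:
        if event["id"] in self._processed:
            return  # duplicate delivery: already handled, skip silently
        self.side_effects += 1  # the real business logic would go here
        self._processed.add(event["id"])

consumer = IdempotentConsumer()
event = {"id": "evt-0001", "type": "OrderPlaced"}
consumer.handle(event)
consumer.handle(event)  # the broker redelivers the same event
assert consumer.side_effects == 1
```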
Timestamp
When the event occurred. Not when it was published, not when it was received — when the thing that the event describes happened. Use ISO 8601 format with timezone information. UTC is strongly preferred.
"time": "2025-11-14T10:15:33.447Z"
Include sub-second precision. You will need it for ordering, debugging, and performance analysis. Millisecond precision is the minimum; microsecond is better.
A word of caution: wall-clock timestamps are not reliable for ordering. Clocks drift between machines, NTP corrections can cause jumps, and two events that happened "at the same time" on different machines may have timestamps that suggest a different order. We discuss ordering properly later in this chapter. Use timestamps for human-readable debugging, not for determining causal order.
Source
The identity of the system, service, or component that produced the event. This should be a stable identifier, not a hostname or IP address (which change with deployments).
"source": "billing-service"
or, following the CloudEvents URI convention:
"source": "/services/billing/eu-west-1"
Event Type
A namespaced string that identifies what kind of event this is. Use a consistent naming convention across your organisation.
"type": "com.example.billing.InvoiceIssued"
Some conventions:
- Reverse domain notation: com.example.billing.InvoiceIssued
- Dot-separated hierarchy: billing.invoice.issued
- Simple PascalCase: InvoiceIssued (often sufficient for smaller systems)
Pick one. Enforce it. The naming convention matters less than consistency.
Payload (Data)
The actual business data. What was ordered, how much was charged, which user signed up. This is the part that varies between event types.
"data": {
"invoiceId": "inv-9921",
"customerId": "cust-441",
"lineItems": [
{ "sku": "WIDGET-42", "quantity": 3, "unitPrice": 49.99 },
{ "sku": "GADGET-7", "quantity": 1, "unitPrice": 100.03 }
],
"totalAmount": 250.00,
"currency": "EUR"
}
The payload design — what to include and what to omit — is one of the most consequential decisions in EDA. We will tackle this in the "Fat Events vs Thin Events" section below.
Metadata
Non-business data about the event itself. This typically includes:
- Schema version: "dataschema": "https://schemas.example.com/billing/invoice-issued/v2"
- Content type: "datacontenttype": "application/json"
- Correlation ID: "correlationid": "corr-abc-123" (for tracing a business process across multiple events)
- Causation ID: "causationid": "evt-xyz-789" (the ID of the event that caused this one)
"metadata": {
"schemaVersion": "2.1.0",
"correlationId": "corr-abc-123",
"causationId": "evt-xyz-789",
"contentType": "application/json",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}
Correlation ID
This deserves special emphasis. A correlation ID is a unique identifier that follows a business process across multiple events and services. When a customer places an order, the initial OrderPlaced event gets a correlation ID. Every subsequent event in that order's lifecycle — PaymentProcessed, InventoryReserved, ShipmentCreated — carries the same correlation ID.
Without correlation IDs, debugging a multi-step business process in an event-driven system is an exercise in despair. You are searching through millions of events across dozens of services trying to reconstruct what happened to one order. With correlation IDs, you filter by a single value and get the complete story.
Make correlation IDs mandatory. Reject events that do not have them. You will thank yourself.
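Propagation is mechanical once you commit to it. Here is a sketch of deriving a follow-up event that preserves the correlation ID and records causation; the `follow_up` helper is invented for illustration:

```python
import uuid

def follow_up(parent: dict, event_type: str, data: dict) -> dict:
    """Derive a new event in the same business process as its parent."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "correlationid": parent["correlationid"],  # same business process
        "causationid": parent["id"],               # this specific event caused it
        "data": data,
    }

order_placed = {
    "id": str(uuid.uuid4()),
    "type": "OrderPlaced",
    "correlationid": str(uuid.uuid4()),  # a new business process starts here
    "data": {"orderId": "ord-7829"},
}
payment = follow_up(order_placed, "PaymentProcessed", {"orderId": "ord-7829"})

assert payment["correlationid"] == order_placed["correlationid"]
assert payment["causationid"] == order_placed["id"]
```

The correlation ID is minted once, at the start of the process, and then only ever copied; the causation ID changes at every hop, giving you both the "which process" filter and the "what caused what" chain.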
Putting It All Together
A complete, well-designed event:
{
  "specversion": "1.0",
  "id": "01944b3c-8f3a-7d1e-a2b3-4c5d6e7f8901",
  "type": "com.example.billing.InvoiceIssued",
  "source": "/services/billing/eu-west-1",
  "time": "2025-11-14T10:15:33.447Z",
  "datacontenttype": "application/json",
  "dataschema": "https://schemas.example.com/billing/invoice-issued/v2",
  "correlationid": "corr-abc-123",
  "causationid": "evt-xyz-789",
  "data": {
    "invoiceId": "inv-9921",
    "customerId": "cust-441",
    "totalAmount": 250.00,
    "currency": "EUR"
  }
}
You will notice this looks suspiciously like a CloudEvents envelope. That is not a coincidence. We will discuss CloudEvents formally later in this chapter.
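A consumer (or a publish-time guard) can cheaply reject malformed envelopes by checking the four context attributes that the CloudEvents 1.0 specification marks as required: specversion, id, source, and type. A minimal sketch, with function and variable names invented for illustration:

```python
# The four context attributes that CloudEvents 1.0 requires on every event.
REQUIRED = ("specversion", "id", "source", "type")

def validate_envelope(event: dict) -> list[str]:
    """Return the list of missing or empty required attributes (empty list = valid)."""
    return [attr for attr in REQUIRED if not event.get(attr)]

good = {
    "specversion": "1.0",
    "id": "evt-1",
    "source": "/services/billing",
    "type": "com.example.billing.InvoiceIssued",
}
assert validate_envelope(good) == []
assert validate_envelope({"id": "evt-2"}) == ["specversion", "source", "type"]
```

Rejecting bad envelopes at the edge is far cheaper than discovering them in a consumer at 3 AM.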
Event Taxonomy
Not all events are created equal, and not all events serve the same purpose. Understanding the different kinds of events will save you from shoehorning every use case into a single pattern.
Domain Events
A domain event represents something meaningful that happened within a bounded context. It uses the language of the domain (the "ubiquitous language" if you are a Domain-Driven Design practitioner, which, in the context of EDA, you probably should be).
OrderPlaced
PaymentAuthorised
ShipmentDispatched
AccountSuspended
Domain events are the bread and butter of event-driven systems. They are raised by aggregates, published to event streams, and consumed by other bounded contexts. They describe business facts.
A well-designed domain event should be understandable by a domain expert, not just a developer. If your event is called EntityStateTransition_V2, you have abstracted away all the meaning.
Integration Events
An integration event is a domain event that has been explicitly designed for consumption by other bounded contexts or external systems. The distinction matters because internal domain events may contain implementation details that should not leak across service boundaries.
For example, internally your billing service might raise an InvoiceTaxCalculationCompleted event with detailed tax breakdown data structures specific to your billing logic. The integration event published externally might be a simplified InvoiceIssued with just the total amount and a tax summary.
Integration events form your public API. Treat them with the same care you would treat any public interface: version them, document them, and do not change them without warning.
Notification Events
A notification event tells you that something happened but carries minimal data — just enough to identify what changed, not the details of the change.
{
  "type": "OrderUpdated",
  "data": {
    "orderId": "ord-7829"
  }
}
The consumer, upon receiving this, must call back to the source service (via an API) to get the current state. This is the thinnest possible event, and it has a very specific use case: when the event data is large, changes frequently, and most consumers only care about the latest state.
The downside is obvious: you have reintroduced coupling. The consumer now depends on the producer's API being available. You have traded event payload size for runtime dependency. This is sometimes the right trade-off, but go in with your eyes open.
Event-Carried State Transfer
The opposite of a notification event. An event-carried state transfer includes the complete current state of the entity in the event payload.
{
  "type": "CustomerProfileUpdated",
  "data": {
    "customerId": "cust-441",
    "email": "alice@example.com",
    "name": "Alice Wonderland",
    "tier": "gold",
    "address": {
      "street": "42 Looking Glass Lane",
      "city": "Oxford",
      "postcode": "OX1 2JD",
      "country": "GB"
    },
    "preferences": {
      "newsletter": true,
      "smsNotifications": false
    }
  }
}
The consumer can build and maintain a local copy of the producer's data without ever calling the producer's API. This is full decoupling — no runtime dependency whatsoever. The consumer has everything it needs in the event.
The cost is larger events, more bandwidth, and the risk of stale data if events are delayed. It also means every consumer gets a full copy of data it might not need, which has privacy implications (does the shipping service really need the customer's newsletter preferences?).
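A sketch of the consumer side: a local read model maintained purely from event-carried state, with no call back to the producer. The class and field names are invented for illustration:

```python
class CustomerReadModel:
    """A consumer-local copy of customer data, built entirely from events.
    The producer's API is never called."""
    def __init__(self):
        self._customers: dict[str, dict] = {}

    def on_profile_updated(self, event: dict) -> None:
        data = event["data"]
        # The event carries the full current state, so we can simply replace our copy.
        self._customers[data["customerId"]] = data

    def email_for(self, customer_id: str) -> str:
        return self._customers[customer_id]["email"]

model = CustomerReadModel()
model.on_profile_updated({
    "type": "CustomerProfileUpdated",
    "data": {"customerId": "cust-441", "email": "alice@example.com", "tier": "gold"},
})
assert model.email_for("cust-441") == "alice@example.com"
```

The read model answers queries from its own copy even if the producer is down; the price, as noted above, is that the copy is only as fresh as the last event received.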
Fat Events vs Thin Events
This is one of the most debated design decisions in EDA, and the answer — annoyingly — is "it depends." But we can at least make the trade-offs explicit.
Thin Events
A thin event contains the minimum information needed to identify what happened:
{
  "type": "OrderPlaced",
  "data": {
    "orderId": "ord-7829"
  }
}
Advantages:
- Small payload, low bandwidth
- No risk of leaking data to consumers who should not have it
- Event schema rarely changes (there is not much to change)
Disadvantages:
- Consumers must call the producer's API to get details (coupling)
- The producer's API must handle the resulting query load
- If the producer is down, consumers are stuck
- You cannot replay thin events to rebuild state (the API state may have changed since the event was published)
Fat Events
A fat event contains all the data a consumer could reasonably need:
{
"type": "OrderPlaced",
"data": {
"orderId": "ord-7829",
"customerId": "cust-441",
"customerEmail": "alice@example.com",
"items": [
{ "sku": "WIDGET-42", "name": "Premium Widget", "quantity": 3, "unitPrice": 49.99 }
],
"totalAmount": 149.97,
"currency": "USD",
"shippingAddress": { ... },
"billingAddress": { ... },
"placedAt": "2025-11-14T10:15:33Z"
}
}
Advantages:
- True decoupling — consumers need nothing else
- Events are self-contained and replayable
- Consumers can build local read models without API calls
- Works even when the producer is offline
Disadvantages:
- Larger payloads, more bandwidth and storage
- Schema evolution is harder (more fields to manage)
- Risk of data leakage (every consumer gets every field)
- Event may include data the producer had to fetch from elsewhere, introducing latency at publish time
The Pragmatic Middle Ground
In practice, most successful systems land somewhere in between: events include enough data for the majority of consumers to operate independently, while acknowledging that edge cases may require an API call. The common pattern is to include the key identifiers plus the data that changed:
{
"type": "OrderPlaced",
"data": {
"orderId": "ord-7829",
"customerId": "cust-441",
"items": [
{ "sku": "WIDGET-42", "quantity": 3, "unitPrice": 49.99 }
],
"totalAmount": 149.97,
"currency": "USD"
}
}
The shipping address is not included because most consumers do not need it. The shipping service, which does need it, can call the order API. The analytics service, which just needs the amount and item count, has everything it needs in the event.
The guiding principle: include data that most consumers need, exclude data that few consumers need, and always include enough to identify the entity and the change.
Event Envelopes and the CloudEvents Specification
An event envelope is the standard wrapper around your event data — the metadata fields that every event should carry, regardless of its business content. You can design your own, but there is a strong argument for adopting the CloudEvents specification.
CloudEvents
CloudEvents is a CNCF (Cloud Native Computing Foundation) specification that defines a common structure for event metadata. Version 1.0 was released in 2019 and has since been adopted by most major cloud providers and many open-source projects.
The required attributes are:
| Attribute | Type | Description |
|---|---|---|
specversion | String | CloudEvents spec version (currently "1.0") |
id | String | Unique event identifier |
source | URI-ref | Context in which the event happened |
type | String | Type of event |
Optional but recommended attributes:
| Attribute | Type | Description |
|---|---|---|
time | Timestamp | When the event occurred |
datacontenttype | String | Content type of data (e.g., application/json) |
dataschema | URI | Schema that data adheres to |
subject | String | Subject of the event in context of source |
Extension attributes (you define these):
| Attribute | Description |
|---|---|
correlationid | Business process correlation identifier |
causationid | ID of the event that caused this one |
partitionkey | Key for ordering/partitioning |
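Putting the required, recommended, and extension attributes together, a JSON-encoded CloudEvent for the running order example might look like this (the attribute values are illustrative):

```json
{
  "specversion": "1.0",
  "id": "evt-123",
  "source": "/services/order-service",
  "type": "com.example.orders.OrderPlaced",
  "time": "2025-11-14T10:15:33Z",
  "datacontenttype": "application/json",
  "subject": "ord-7829",
  "correlationid": "corr-9a1",
  "causationid": "evt-122",
  "partitionkey": "ord-7829",
  "data": {
    "orderId": "ord-7829",
    "customerId": "cust-441"
  }
}
```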
Why Adopt CloudEvents?
- Interoperability. If you ever need to integrate with cloud services, serverless functions, or third-party systems, CloudEvents is the lingua franca.
- Tooling. SDKs exist for every major language. Parsers, validators, and protocol bindings are available off the shelf.
- Convention over invention. Designing your own envelope means designing your own conventions, documenting them, building tooling for them, and training every new hire. CloudEvents has done this work for you.
- Protocol bindings. CloudEvents defines standard ways to map events onto HTTP, Kafka, AMQP, MQTT, and other transports. This removes an entire category of "how do we encode the metadata" debates.
When Not to Adopt CloudEvents
If you are in a high-performance, low-latency context (financial trading, gaming, real-time telemetry), the overhead of JSON-encoded CloudEvents metadata may be unacceptable. In these environments, custom binary envelopes with Protocol Buffers, FlatBuffers, or SBE (Simple Binary Encoding) are common. The concepts are the same; the serialisation differs.
Also, if your system is entirely internal and you have strong existing conventions, migrating to CloudEvents may not be worth the disruption. But if you are starting from scratch, adopt CloudEvents. Seriously. You will not regret having a standard.
Idempotency: Your Best Friend
In distributed systems, messages can be delivered more than once. Your broker might retry on timeout. Your consumer might crash after processing an event but before acknowledging it. A network partition might cause a producer to retry a publish. The result is the same: duplicate events.
Idempotency means that processing the same event multiple times produces the same result as processing it once. This is not optional in event-driven systems. It is a requirement.
Why Duplicates Happen
Consider this sequence:
1. Consumer receives event OrderPlaced with ID evt-123.
2. Consumer processes the event (creates an invoice).
3. Consumer crashes before sending an acknowledgement to the broker.
4. Broker, having received no acknowledgement, redelivers evt-123.
5. Consumer receives the event again.
If the consumer naively processes the event again, the customer gets two invoices. This is not a theoretical concern — it is a Tuesday.
Strategies for Idempotency
1. Idempotency Key Table
Maintain a table (or set) of processed event IDs. Before processing, check if the event ID has been seen:
function handleEvent(event):
if eventStore.hasBeenProcessed(event.id):
log("Duplicate event, skipping: " + event.id)
return
// Process the event
processBusinessLogic(event)
// Record that we've processed it
eventStore.markAsProcessed(event.id)
The critical subtlety: the business logic and the markAsProcessed call must be in the same transaction. If they are not, you have a gap where a crash between the two leads to either lost events or duplicates.
function handleEvent(event):
transaction:
if eventStore.hasBeenProcessed(event.id):
return
processBusinessLogic(event)
eventStore.markAsProcessed(event.id)
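Here is a runnable version of that transactional pattern using SQLite (table and column names are illustrative). The primary key on the processed-events table makes the duplicate check and the "mark as processed" step a single atomic operation, and the business write commits or rolls back with it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE invoices (order_id TEXT PRIMARY KEY, amount REAL);
""")

def handle_event(event: dict) -> bool:
    """Process an event exactly once. Returns False for duplicates."""
    try:
        with db:  # one transaction: commits on success, rolls back on error
            # A duplicate event ID violates the primary key and aborts
            # the whole transaction, so the business write never repeats.
            db.execute("INSERT INTO processed_events VALUES (?)", (event["id"],))
            db.execute("INSERT INTO invoices VALUES (?, ?)",
                       (event["data"]["orderId"], event["data"]["totalAmount"]))
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: skip silently

event = {"id": "evt-123", "data": {"orderId": "ord-7829", "totalAmount": 149.97}}
assert handle_event(event) is True   # first delivery: processed
assert handle_event(event) is False  # redelivery: harmlessly absorbed
```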
2. Natural Idempotency
Some operations are naturally idempotent. Setting a value is idempotent; incrementing a value is not.
// Idempotent: processing this twice has the same effect as once
UPDATE users SET email = 'alice@example.com' WHERE id = 'cust-441'
// NOT idempotent: processing this twice doubles the effect
UPDATE accounts SET balance = balance + 100 WHERE id = 'acct-992'
Where possible, design your event handlers to use naturally idempotent operations. Instead of "add $100 to the balance," use "set the balance to $1,350 as of event evt-123." This requires the event to carry enough state, which circles back to the fat-vs-thin debate.
3. Conditional Writes
Use optimistic concurrency control: only apply the change if the current state matches what you expect.
UPDATE orders
SET status = 'shipped', version = 4
WHERE id = 'ord-7829' AND version = 3
If the event is processed twice, the second attempt finds version = 4 instead of 3, the update affects zero rows, and the duplicate is harmlessly absorbed.
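In application code, the "zero rows affected" outcome is visible as the statement's row count. A sketch with SQLite (the schema mirrors the example above):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("INSERT INTO orders VALUES ('ord-7829', 'confirmed', 3)")

def apply_shipped(order_id: str, expected_version: int) -> bool:
    """Apply the change only if the version still matches expectations."""
    cur = db.execute(
        "UPDATE orders SET status = 'shipped', version = ? "
        "WHERE id = ? AND version = ?",
        (expected_version + 1, order_id, expected_version),
    )
    return cur.rowcount == 1  # 0 rows: stale or duplicate, absorbed

assert apply_shipped("ord-7829", 3) is True   # first delivery applies
assert apply_shipped("ord-7829", 3) is False  # duplicate finds version 4
```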
4. Deduplication at the Broker Level
Some brokers support producer-side deduplication (Kafka's idempotent producer, for example). This prevents duplicate publishing but does not protect against duplicate consumption. You still need consumer-side idempotency.
The Idempotency Window
You cannot store every event ID forever. At some point, you need to prune the idempotency table. The question is: how long should you keep IDs?
This depends on your redelivery window. If your broker retries for up to 7 days, your idempotency table needs to retain IDs for at least 7 days (plus a safety margin). In practice, 14 to 30 days is common. After that, if a duplicate somehow arrives, you accept the vanishingly small risk.
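Pruning then reduces to a scheduled delete against the idempotency table. A sketch in Postgres syntax, assuming the table carries a processed_at timestamp column:

```sql
-- Run on a schedule; 14 days comfortably covers a 7-day redelivery
-- window plus a safety margin.
DELETE FROM processed_events
WHERE processed_at < NOW() - INTERVAL '14 days';
```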
For event-sourced systems, the idempotency window is effectively infinite — you have the full event history, and deduplication is inherent.
Causality and Ordering
Ordering is the problem that makes distributed systems researchers write papers with titles like "Time, Clocks, and the Ordering of Events in a Distributed System" (Lamport, 1978). It is also the problem that makes practitioners swear at their screens when events arrive in the wrong order.
The Fundamental Problem
In a distributed system, there is no single global clock. Two events that happen on different machines may have timestamps that suggest order A→B, when in fact the causal order was B→A (because one machine's clock was ahead). Wall-clock time is unreliable for ordering.
Even within a single machine, if events are published to different partitions of a topic, they may be consumed in a different order than they were produced. Ordering guarantees in most brokers are per-partition, not global.
Wall-Clock Time
Despite its unreliability, wall-clock time (timestamps) is what most systems use for ordering. This works well enough when:
- Events are produced by the same service (clocks are likely synchronised within a cluster).
- Sub-second ordering precision is not required.
- NTP is configured and functioning on all machines.
It breaks down when:
- Events come from different services on different machines.
- Precise ordering matters (financial transactions, inventory counts).
- Clock skew exceeds your tolerance.
For most business applications, wall-clock timestamps with NTP synchronisation are "good enough." But you should know when they are not.
Sequence Numbers
Within a single partition or stream, most brokers assign a monotonically increasing sequence number (Kafka calls it an offset, Pulsar calls it a message ID). This gives you a total order within a partition.
The trick is ensuring that causally related events end up in the same partition. The standard approach is to partition by an entity ID (e.g., order ID), so all events for a given order are in the same partition and thus totally ordered.
// Publishing with a partition key ensures ordering per-entity
producer.publish(
topic: "orders",
partitionKey: event.data.orderId,
value: event
)
This works well for entity-level ordering. It does not help with ordering across entities ("did the payment happen before or after the inventory check?").
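The "same key, same partition" guarantee typically comes from hashing the key modulo the partition count. A minimal sketch (real brokers use their own hash functions, e.g. Kafka's default partitioner uses murmur2, but any stable hash gives the property that matters):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a partition key to a partition.

    Every event with the same key lands in the same partition and is
    therefore totally ordered relative to its siblings.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All events for one order map to one partition, run after run.
p = partition_for("ord-7829", 12)
assert all(partition_for("ord-7829", 12) == p for _ in range(100))
```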
Logical Clocks
A Lamport clock is a counter that each process maintains. The rules are simple:
- Before sending a message, increment the counter and include it in the message.
- Upon receiving a message, set your counter to max(local, received) + 1.
This gives you a partial order: if event A's Lamport timestamp is less than event B's, and there is a causal chain from A to B, then A happened before B. But if two events have no causal relationship, their Lamport timestamps tell you nothing about which happened first.
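The two rules fit in a few lines. A sketch:

```python
class LamportClock:
    """Logical clock implementing the two rules above."""

    def __init__(self) -> None:
        self.counter = 0

    def tick_send(self) -> int:
        """Rule 1: increment before sending; stamp the message."""
        self.counter += 1
        return self.counter

    def tick_receive(self, received: int) -> int:
        """Rule 2: jump past both clocks on receipt."""
        self.counter = max(self.counter, received) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
stamp = a.tick_send()   # A's counter becomes 1; message carries 1
b.tick_receive(stamp)   # B's counter becomes max(0, 1) + 1 = 2
```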
Vector Clocks
Vector clocks extend Lamport clocks to capture the full causal history. Each process maintains a vector of counters, one per process. This allows you to determine whether two events are causally related or concurrent.
// Process A's vector clock after sending: [A:3, B:1, C:2]
// Process B's vector clock after receiving: [A:3, B:4, C:2]
// Comparing two vector clocks:
// [A:3, B:1, C:2] < [A:3, B:4, C:2] → first causally precedes second
// [A:3, B:1, C:2] || [A:2, B:4, C:2] → concurrent (neither precedes)
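The comparison sketched above can be implemented directly. A minimal version, representing each clock as a dict of per-process counters:

```python
def compare(vc1: dict, vc2: dict) -> str:
    """Compare two vector clocks.

    Returns 'before' if vc1 causally precedes vc2, 'after' for the
    reverse, 'equal' if identical, and 'concurrent' if neither
    precedes the other.
    """
    keys = vc1.keys() | vc2.keys()
    less = any(vc1.get(k, 0) < vc2.get(k, 0) for k in keys)
    greater = any(vc1.get(k, 0) > vc2.get(k, 0) for k in keys)
    if less and greater:
        return "concurrent"  # no causal relationship either way
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"

# The two examples from the text:
assert compare({"A": 3, "B": 1, "C": 2}, {"A": 3, "B": 4, "C": 2}) == "before"
assert compare({"A": 3, "B": 1, "C": 2}, {"A": 2, "B": 4, "C": 2}) == "concurrent"
```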
Vector clocks are elegant but have practical challenges:
- The vector grows with the number of processes. In a microservices system with hundreds of services, the overhead is significant.
- Garbage collection of vector clock entries is non-trivial.
- Most developers find them confusing (this is a statement about adoption feasibility, not developer intelligence).
In practice, vector clocks are used in databases (Dynamo, Riak) more than in application-level event systems. For most EDA use cases, partition-level ordering combined with entity-based partitioning is sufficient.
Handling Out-of-Order Events
Regardless of your ordering strategy, consumers should be prepared for out-of-order delivery. The strategies are:
1. Buffer and Reorder
Collect events in a buffer, sort by sequence number or timestamp, and process in order. This adds latency and complexity but guarantees order.
function onEventReceived(event):
buffer.add(event)
while buffer.hasNextInSequence(lastProcessedSequence + 1):
nextEvent = buffer.removeNextInSequence(lastProcessedSequence + 1)
process(nextEvent)
lastProcessedSequence = nextEvent.sequenceNumber
2. Last-Write-Wins
If you only care about the latest state, ignore events with a timestamp older than the last processed event for a given entity.
function onEventReceived(event):
currentTimestamp = stateStore.getLastUpdated(event.entityId)
if event.timestamp <= currentTimestamp:
log("Stale event, skipping")
return
process(event)
stateStore.setLastUpdated(event.entityId, event.timestamp)
This is simple but lossy — intermediate states are silently dropped.
3. Version Checking
Include a version number in your events. Only process an event if its version is exactly currentVersion + 1. If it is higher, buffer it until the gap is filled. If it is lower or equal, discard it as a duplicate.
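A sketch of that logic, tracking versions and buffered events per entity (the field names are illustrative):

```python
class VersionedConsumer:
    """Process events strictly in version order, per entity.

    The exactly-next version is processed; future versions are
    buffered until the gap fills; past versions are discarded.
    """

    def __init__(self) -> None:
        self.versions: dict[str, int] = {}    # entity -> last processed version
        self.pending: dict[tuple, dict] = {}  # (entity, version) -> buffered event
        self.processed: list[dict] = []

    def on_event(self, event: dict) -> None:
        entity, version = event["entityId"], event["version"]
        current = self.versions.get(entity, 0)
        if version <= current:
            return                                   # duplicate: discard
        if version > current + 1:
            self.pending[(entity, version)] = event  # gap: buffer
            return
        self.processed.append(event)
        self.versions[entity] = version
        # Drain any buffered event that is now next in line.
        nxt = self.pending.pop((entity, version + 1), None)
        if nxt:
            self.on_event(nxt)

c = VersionedConsumer()
c.on_event({"entityId": "ord-1", "version": 1})
c.on_event({"entityId": "ord-1", "version": 3})  # arrives early: buffered
c.on_event({"entityId": "ord-1", "version": 2})  # fills the gap; 3 drains
```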
4. Accept the Chaos
For some use cases — analytics, logging, non-critical notifications — ordering does not matter. An analytics dashboard that counts orders does not care whether OrderPlaced for order 100 arrives before or after OrderPlaced for order 101. If ordering does not affect correctness, do not pay the cost of enforcing it.
The Pragmatic Summary
For most systems:
- Partition by entity ID to get per-entity ordering.
- Use correlation IDs and causation IDs to reconstruct causal chains during debugging.
- Make consumers tolerant of out-of-order delivery where possible.
- Reserve strict global ordering for the rare cases where it is genuinely required (and accept the throughput cost).
Trying to achieve strict global ordering across a distributed system is technically possible but operationally expensive. Usually, you do not need it. When you think you do, check twice.
Chapter Summary
The concepts in this chapter are the vocabulary of event-driven architecture:
- Events, commands, and queries are fundamentally different things. Do not conflate them.
- A well-designed event has an ID, timestamp, source, type, payload, and metadata (including correlation and causation IDs).
- Events come in different flavours: domain events, integration events, notification events, and event-carried state transfer. Each has different trade-offs.
- The fat vs thin event debate is a trade-off between decoupling and payload size. Lean toward fat events unless you have a good reason not to.
- CloudEvents provides a standard envelope format. Adopt it unless you have specific reasons not to.
- Idempotency is not optional. Design every consumer to handle duplicate events correctly.
- Ordering is harder than it looks. Use partition-level ordering, entity-based partitioning, and accept that global ordering is usually not worth the cost.
With this vocabulary established, we can move on to the patterns that put these concepts to work.
Fundamental Patterns
Concepts are lovely. Patterns are how you actually build things. This chapter covers the foundational patterns of event-driven architecture — the recurring solutions to recurring problems that show up in every non-trivial EDA implementation. Some of these patterns are simple enough to implement in an afternoon. Others are career-defining rabbit holes. We will cover both.
Publish/Subscribe — The Gateway Pattern
Publish/subscribe (pub/sub) is the simplest and most widely used event-driven pattern. A producer publishes events to a topic (or channel, or subject — the terminology varies by broker). Consumers subscribe to topics and receive events as they are published. The producer does not know who the consumers are. The consumers do not coordinate with each other.
┌──────────┐ ┌─────────┐ ┌────────────┐
│ Producer │──event──▶│ Topic │──event──▶│ Consumer A │
│ │ │ │──event──▶│ Consumer B │
│ │ │ │──event──▶│ Consumer C │
└──────────┘ └─────────┘ └────────────┘
Fan-Out
Every subscriber gets a copy of every event. This is the default pub/sub behaviour and is useful when multiple services need to react to the same event independently. OrderPlaced fans out to the billing service, the inventory service, the analytics pipeline, and the notification service — all receiving the same event, all acting on it differently.
Consumer Groups
When you need competing consumers — multiple instances of the same service sharing the workload — you use consumer groups (Kafka terminology) or competing consumers (RabbitMQ terminology). Events on a topic are distributed among group members so that each event is processed by exactly one member of the group.
┌─────────┐ ┌────────────────────────────┐
│ Topic │──event──▶│ Consumer Group "billing" │
│ │ │ ┌────────┐ ┌────────┐ │
│ │ │ │ inst-1 │ │ inst-2 │ │
│ │ │ └────────┘ └────────┘ │
└─────────┘ └────────────────────────────┘
This gives you horizontal scalability for consumers. Need to process events faster? Add more instances to the group.
Topic Design
Topic design is underappreciated. A few principles:
- One event type per topic is the simplest model and the one we recommend for most cases: orders.placed, orders.shipped, payments.processed. It makes subscription, filtering, and schema management straightforward.
- One topic per entity (all events for orders go to orders) is simpler operationally but requires consumers to filter by event type, which adds complexity to every consumer.
- Avoid mega-topics. A single topic called events that carries every event in your system is technically possible and practically a nightmare. Consumers cannot subscribe selectively, schema management is impossible, and consumer lag on one event type affects all others.
The right granularity depends on your broker, your volume, and your team's preferences. But err on the side of more topics rather than fewer. It is easier to merge topics than to split them.
When Pub/Sub Is Enough
For many systems, basic pub/sub is all you need. Events are published, consumers react, and life is good. You do not need event sourcing. You do not need CQRS. You need a topic and some subscribers.
The urge to over-engineer is strong in the EDA community. Resist it. If pub/sub solves your problem, declare victory and move on.
Event Streaming vs Message Queuing
These terms are often used interchangeably, which is unfortunate because they describe fundamentally different architectures. The distinction determines your system's capabilities around replay, retention, and consumer independence.
Message Queuing (Destructive Consumption)
In a traditional message queue (RabbitMQ, ActiveMQ, SQS), a message is consumed and then removed from the queue. Once a consumer acknowledges a message, it is gone. Other consumers cannot read it. You cannot replay it.
Producer → [msg-1, msg-2, msg-3] → Consumer
↓
msg-1 consumed
(deleted from queue)
Characteristics:
- Messages are transient — consumed once and discarded.
- Delivery semantics are per-message: acknowledged, rejected, or dead-lettered.
- Consumers compete for messages (one message → one consumer).
- No replay capability (by default; some brokers offer limited retention).
- Excellent for task distribution and work queues.
Event Streaming (Non-Destructive Consumption)
In an event streaming system (Kafka, Pulsar, Redpanda, Kinesis), events are appended to a log and retained for a configurable period (or indefinitely). Consumers maintain their own position (offset) in the log. Multiple consumers can read the same events independently, at different speeds, from different positions.
Log: [evt-1] [evt-2] [evt-3] [evt-4] [evt-5]
↑ ↑
Consumer A Consumer B
(offset 2) (offset 5)
Characteristics:
- Events are retained — the log is the source of truth.
- Consumers are independent — each tracks its own offset.
- Replay is native — reset the offset, reprocess the history.
- Multiple consumer groups can read the same events at different rates.
- The log is both a communication channel and a storage system.
When Each Matters
Use message queuing when:
- You have task-based workloads (send an email, process a payment, generate a report).
- You need per-message routing logic (route to different queues based on headers).
- Message order is less important than delivery guarantees.
- You do not need replay.
Use event streaming when:
- Events are facts you want to retain (business events, audit logs, state changes).
- Multiple consumers need to process the same events independently.
- You need replay capability (rebuilding state, recovering from bugs, backfilling new consumers).
- You are building event sourcing, CQRS, or stream processing pipelines.
- Ordering within a partition matters.
The grey zone: Many modern brokers blur the line. RabbitMQ has added stream support. Kafka can be used for task distribution. But the core architectures are different, and choosing the wrong one for your use case leads to either unnecessary complexity or missing capabilities.
A common mistake is using Kafka as a task queue. It works, but you are fighting the abstraction. Kafka's strengths — retention, replay, multi-consumer — are irrelevant for "process this job once and forget about it." Conversely, using RabbitMQ when you need event replay is possible (with Streams) but unnatural.
Pick the tool that matches your use case. This is easier advice to give than to follow, which is why Part 2 of this book exists.
Event Sourcing
Event sourcing is the pattern that makes architects' eyes light up and operations teams' eyes narrow. The idea is simple: instead of storing the current state of an entity, you store the sequence of events that led to that state. The current state is derived by replaying the events.
The Core Idea
Traditional approach (state-based):
┌─────────────────────────────┐
│ orders table │
│ ─────────────────────────── │
│ id: ord-7829 │
│ status: shipped │
│ total: 149.97 │
│ updated_at: 2025-11-14 │
└─────────────────────────────┘
Event-sourced approach:
┌─────────────────────────────────────────┐
│ events for ord-7829 │
│ ─────────────────────────────────────── │
│ 1. OrderPlaced { total: 149.97 } │
│ 2. PaymentProcessed { paymentId: ... } │
│ 3. OrderConfirmed { } │
│ 4. ShipmentCreated { trackingId: ... } │
│ 5. OrderShipped { carrier: "DHL" } │
└─────────────────────────────────────────┘
Current state = replay(events 1..5) → { status: "shipped", total: 149.97, ... }
The event store is an append-only log. You never update or delete events. To get the current state, you replay all events for an entity through a state-building function (sometimes called a fold, reducer, or aggregate hydration).
function buildOrderState(events):
state = { status: "new" }
for event in events:
switch event.type:
case "OrderPlaced":
state.orderId = event.data.orderId
state.total = event.data.total
state.status = "placed"
case "PaymentProcessed":
state.paymentId = event.data.paymentId
state.status = "paid"
case "OrderShipped":
state.carrier = event.data.carrier
state.status = "shipped"
case "OrderCancelled":
state.status = "cancelled"
state.cancelReason = event.data.reason
return state
Benefits
Complete audit trail. You know exactly what happened and when. Not "the order is shipped" but "the order was placed at 10:15, paid at 10:16, confirmed at 10:17, and shipped at 14:32." Regulated industries (finance, healthcare) love this.
Temporal queries. "What was the state of this order at 10:30?" Replay events up to that timestamp and you have the answer. Try doing that with a mutable database row.
Bug recovery. If a bug corrupted state, you can fix the replay logic and rebuild correct state from the event history. You cannot do this if you have been overwriting state in a database.
Event-driven by nature. The events are already there. Publishing them to a broker for other services to consume is trivial.
Debugging. When something goes wrong, you have the complete history. No more "the order is in a weird state and we don't know how it got there."
Costs
Event sourcing is not free, and the costs are often underestimated.
Replay performance. An entity with 10,000 events takes time to rebuild. Snapshots (periodic saves of the current state, so you only replay events since the last snapshot) mitigate this but add complexity.
function getOrderState(orderId):
snapshot = snapshotStore.getLatest(orderId) // e.g., state at event 9500
events = eventStore.getEventsAfter(orderId, snapshot.version) // events 9501-10000
state = snapshot.state
for event in events:
state = applyEvent(state, event)
return state
Schema evolution. Your events are immutable, but your understanding of the domain evolves. What happens when you need to change the structure of an event type that has millions of historical instances? You need upcasting — transforming old event formats to new ones during replay. This is manageable but requires discipline. Chapter 4 covers this in detail.
Storage. You are storing every event that ever happened. For high-volume systems, this adds up. Compaction strategies exist (keeping only the latest event per key) but they undermine the "complete history" benefit.
Complexity. Event sourcing is a significant mental model shift. Developers accustomed to CRUD need to learn to think in terms of events and projections. This is a training and hiring cost.
Query complexity. "Show me all orders over $100 that were placed this week" is a trivial SQL query against a state table. Against an event store, it requires either a projection (see CQRS below) or a full scan of events, neither of which is simple.
When Event Sourcing Is Overkill
Event sourcing is overkill when:
- Your domain does not benefit from temporal queries or audit trails.
- Your entities have simple lifecycles (created, maybe updated, maybe deleted).
- Your team is not prepared for the complexity.
- Your read patterns are complex and varied (event sourcing alone makes querying painful).
A user profile that is updated occasionally and queried frequently does not benefit from event sourcing. A financial ledger that must maintain a complete, auditable history absolutely does.
The most common mistake is adopting event sourcing system-wide. Most systems have a few aggregates that benefit from it (the ones with complex state machines and audit requirements) and many that do not. Apply it selectively.
CQRS — Command Query Responsibility Segregation
CQRS separates the write model (how you accept and validate changes) from the read model (how you query and display data). In an event-driven context, this typically means:
- Commands are validated and processed by the write model, which emits events.
- Events are consumed by one or more read model projectors, which build queryable views.
- Queries are served from the read models.
┌──────────┐ command ┌─────────────┐ events ┌──────────────┐
│ Client │───────────▶│ Write Model │───────────▶│ Event Store │
│ │ │ (Domain) │ │ / Broker │
└──────────┘ └─────────────┘ └──────┬───────┘
│
┌────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Read Model│ │ Read Model│ │ Read Model│
│ (List) │ │ (Detail) │ │ (Search) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Postgres │ │ Redis │ │ Elastic │
│ │ │ │ │ search │
└──────────┘ └──────────┘ └──────────┘
The Read Model Projection Pattern
A projection (or projector) is a function that consumes events and builds a read model — a denormalized, query-optimized view of the data. Each read model is tailored to a specific query pattern.
// Projector for the "order summary" read model
function projectOrderSummary(event):
switch event.type:
case "OrderPlaced":
db.upsert("order_summaries", {
orderId: event.data.orderId,
customerName: event.data.customerName,
total: event.data.total,
status: "placed",
placedAt: event.time
})
case "OrderShipped":
db.update("order_summaries",
where: { orderId: event.data.orderId },
set: { status: "shipped", shippedAt: event.time }
)
case "OrderCancelled":
db.update("order_summaries",
where: { orderId: event.data.orderId },
set: { status: "cancelled" }
)
The read model can use whatever storage is optimal for the query pattern:
- PostgreSQL for complex joins and ad-hoc queries.
- Redis for low-latency key-value lookups.
- Elasticsearch for full-text search.
- A flat file for exports (seriously — sometimes a CSV updated by a projector is the right answer).
Benefits of CQRS
Independent scaling. Reads and writes can be scaled independently. Most systems are read-heavy; CQRS lets you optimise and scale the read path without touching the write path.
Optimised read models. Instead of one normalised schema that serves all queries poorly, you have multiple denormalised schemas, each optimised for its specific query pattern. The "order list" view has exactly the columns it needs. The "order detail" view has different columns. The "order search" view uses a search engine.
Polyglot persistence. Different read models can use different storage technologies. This sounds like overengineering until you realise that serving full-text search from a relational database and serving key-value lookups from Elasticsearch are both terrible ideas.
Costs of CQRS
Eventual consistency. The read model lags behind the write model. A user who places an order and immediately views their order list may not see the new order. This gap is typically milliseconds to seconds, but it exists, and your UI needs to handle it.
Common mitigation: after a write, redirect the user to a confirmation page that reads from the write model (or uses the data from the write response), not from the read model. By the time the user navigates to the order list, the projection has caught up.
Projection complexity. Each read model is a consumer that must correctly process every relevant event type. Bugs in projectors lead to incorrect read models, and the fix is to replay events and rebuild the projection — which requires the events to be retained (hello, event streaming).
Operational overhead. You are now maintaining multiple databases. Each needs monitoring, backup, and capacity planning. This is a real cost.
CQRS Without Event Sourcing
CQRS does not require event sourcing. You can have a traditional database as your write model and use database triggers, CDC (change data capture), or application-level events to update read models. This gives you the read-model benefits without the full complexity of event sourcing.
Conversely, event sourcing without CQRS is possible but painful — querying an event store directly is slow and limiting.
The two patterns are complementary but independent. Use the combination that matches your needs.
Sagas: Choreography vs Orchestration
A saga is a pattern for managing long-running business transactions that span multiple services. In a monolith, you would wrap the whole thing in a database transaction. In a distributed system, you cannot (distributed transactions exist but are a special circle of performance hell). Instead, each step in the saga is a local transaction, and if a step fails, you execute compensating transactions to undo the previous steps.
The Problem
Place an order. Reserve inventory. Process payment. Create shipment. If the payment fails, you need to release the reserved inventory. If shipment creation fails, you need to refund the payment and release the inventory. Each service owns its own data. There is no global transaction coordinator.
Choreography
In a choreography-based saga, there is no central coordinator. Each service listens for events and acts autonomously.
Order Service Inventory Service Payment Service
│ │ │
│── OrderPlaced ───────▶│ │
│ │── InventoryReserved ─▶│
│ │ │── PaymentProcessed ──▶
│◀── OrderConfirmed ────│◀──────────────────────│
And when things go wrong:
Order Service Inventory Service Payment Service
│ │ │
│── OrderPlaced ───────▶│ │
│ │── InventoryReserved ─▶│
│                       │◀── PaymentFailed ─────│
│◀── InventoryReleased ─│ │
│── OrderFailed ───────▶│ │
Advantages:
- No single point of failure. No central coordinator that must be highly available.
- Loose coupling. Services react to events, not instructions.
- Easy to add new steps — just subscribe to the relevant events.
Disadvantages:
- The business process is implicit — it exists in the aggregate behaviour of all services, not in any one place. Understanding the complete saga requires reading the code of every participating service.
- Debugging a failed saga requires correlating events across multiple services (correlation IDs are essential here).
- Cyclic dependencies can emerge, where Service A reacts to Service B which reacts to Service A.
- Adding compensating logic to every service increases complexity across the board.
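To make the "implicit process" point concrete, here is a minimal sketch of a choreographed participant. The in-memory bus, service state, and event names are illustrative stand-ins, not a real broker API; the point is that the inventory service reacts to events on its own, including the compensating step.

```python
# Minimal sketch of choreography: each service subscribes to events and
# reacts autonomously. The in-memory "bus" is a stand-in for a real broker.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # record of published events, for inspection

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, data):
        self.log.append((event_type, data))
        for handler in self.handlers[event_type]:
            handler(data)

bus = Bus()
reserved = set()  # inventory service's local state

# Inventory service: reacts to OrderPlaced (forward step) and to
# PaymentFailed (compensating step). No coordinator tells it what to do.
def on_order_placed(data):
    reserved.add(data["orderId"])
    bus.publish("InventoryReserved", data)

def on_payment_failed(data):
    reserved.discard(data["orderId"])  # compensate: release the stock
    bus.publish("InventoryReleased", data)

bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("PaymentFailed", on_payment_failed)

# Simulate the unhappy path: order placed, then payment fails downstream.
bus.publish("OrderPlaced", {"orderId": "ord-7829"})
bus.publish("PaymentFailed", {"orderId": "ord-7829"})

print([e for e, _ in bus.log])
# ['OrderPlaced', 'InventoryReserved', 'PaymentFailed', 'InventoryReleased']
```

Note that nothing in this file describes the saga as a whole: the flow only exists in the aggregate event log, which is exactly the visibility problem described above.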
Orchestration
In an orchestration-based saga, a central orchestrator (sometimes called a saga coordinator or process manager) directs the flow. It sends commands to services and listens for their responses.
Saga Orchestrator
│
│── ReserveInventory ──────▶ Inventory Service
│◀── InventoryReserved ─────│
│
│── ProcessPayment ────────▶ Payment Service
│◀── PaymentProcessed ──────│
│
│── CreateShipment ─────────▶ Shipping Service
│◀── ShipmentCreated ────────│
│
│── ConfirmOrder ───────────▶ Order Service
The orchestrator knows the complete flow. It maintains the state of the saga and handles compensations.
function handleSaga(orderId):
saga = { orderId, status: "started", steps: [] }
// Step 1: Reserve inventory
send(command: "ReserveInventory", data: { orderId })
response = await("InventoryReserved" or "InventoryReservationFailed")
if response.type == "InventoryReservationFailed":
saga.status = "failed"
send(command: "RejectOrder", data: { orderId, reason: "no inventory" })
return
saga.steps.push("inventory_reserved")
// Step 2: Process payment
send(command: "ProcessPayment", data: { orderId })
response = await("PaymentProcessed" or "PaymentFailed")
if response.type == "PaymentFailed":
saga.status = "compensating"
send(command: "ReleaseInventory", data: { orderId }) // compensate step 1
send(command: "RejectOrder", data: { orderId, reason: "payment failed" })
return
saga.steps.push("payment_processed")
// Step 3: Confirm order
send(command: "ConfirmOrder", data: { orderId })
saga.status = "completed"
Advantages:
- The business process is explicit and readable. You can look at the orchestrator and understand the entire saga.
- Compensating logic is centralised. Easier to test and reason about.
- No risk of cyclic dependencies.
- The orchestrator can implement complex logic (retries, timeouts, parallel steps) in one place.
Disadvantages:
- The orchestrator is a single point of failure. If it goes down, in-flight sagas stall.
- The orchestrator has knowledge of all participating services, which is a form of coupling.
- Risk of the orchestrator becoming a "god service" that contains too much business logic.
Choosing Between Them
Use choreography when:
- The saga is simple (2-3 steps).
- The participants are independently developed and deployed.
- You value autonomy and decoupling over process visibility.
Use orchestration when:
- The saga is complex (4+ steps, branching logic, parallel steps).
- You need clear visibility into the saga's state.
- Compensating transactions are complex and benefit from centralisation.
- You are in a regulated industry that requires process auditability.
Many real-world systems use both. Simple flows are choreographed; complex flows are orchestrated. This is not inconsistency — it is pragmatism.
Event-Driven State Machines
A state machine defines the valid states an entity can be in and the transitions between them. In an event-driven system, events trigger state transitions.
┌────────────┐
OrderPlaced│ │
┌───────────────▶ Placed │
│ │ │
│ └─────┬──────┘
│ │ PaymentProcessed
│ ┌─────▼──────┐
│ │ │
│ │ Paid │
│ │ │
│ └─────┬──────┘
│ │ OrderShipped
│ ┌─────▼──────┐
│ │ │
│ │ Shipped │
│ │ │
│ └─────┬──────┘
│ │ OrderDelivered
│ ┌─────▼──────┐
│ │ │
│ │ Delivered │
│ │ │
│ └────────────┘
│
│ OrderCancelled (from Placed or Paid)
│ ┌────────────┐
└──────────────▶│ Cancelled │
└────────────┘
The state machine enforces business rules:
function handleEvent(currentState, event):
switch (currentState, event.type):
case ("placed", "PaymentProcessed"):
return "paid"
case ("paid", "OrderShipped"):
return "shipped"
case ("shipped", "OrderDelivered"):
return "delivered"
case ("placed", "OrderCancelled"):
return "cancelled"
case ("paid", "OrderCancelled"):
return "cancelled" // with refund compensation
case ("shipped", "OrderCancelled"):
reject("Cannot cancel a shipped order")
default:
reject("Invalid transition: " + currentState + " → " + event.type)
Why State Machines Matter in EDA
Without explicit state machines, you end up with implicit state logic scattered across event handlers. "Can we ship this order?" turns into checking four different fields instead of asking "is the order in the paid state?" State machines make invariants explicit and transitions auditable.
They also make invalid states unrepresentable (or at least rejectable). An order cannot transition from "delivered" to "placed." The state machine enforces this. Without it, you are relying on every developer who touches the code to remember the rules.
State machines pair naturally with event sourcing: the event history is the transition history, and the current state is derived by replaying transitions through the state machine.
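That pairing can be sketched in a few lines: an explicit transition table encoding the same rules as the diagram above, plus a replay function that derives the current state from the event history. The names are illustrative, not a prescribed API.

```python
# Sketch: deriving current state by replaying events through an explicit
# transition table (the same rules as the order state machine above).
TRANSITIONS = {
    ("placed", "PaymentProcessed"): "paid",
    ("paid", "OrderShipped"): "shipped",
    ("shipped", "OrderDelivered"): "delivered",
    ("placed", "OrderCancelled"): "cancelled",
    ("paid", "OrderCancelled"): "cancelled",  # with refund compensation
}

def apply(state, event_type):
    try:
        return TRANSITIONS[(state, event_type)]
    except KeyError:
        raise ValueError(f"Invalid transition: {state} -> {event_type}")

def replay(event_types, initial="placed"):
    # The event history IS the transition history; current state is derived.
    state = initial
    for event_type in event_types:
        state = apply(state, event_type)
    return state

print(replay(["PaymentProcessed", "OrderShipped"]))  # shipped
```

An invalid history (say, cancelling after shipping) raises instead of silently producing a nonsense state, which is the whole point.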
The Outbox Pattern
The outbox pattern solves one of the most insidious problems in event-driven architecture: the dual-write problem.
The Problem
Your service needs to update its database and publish an event. These are two separate operations. If the database write succeeds but the event publish fails, your database has the new state but no one else knows about it. If the event is published but the database write fails, everyone else thinks something happened that did not actually persist.
// This is broken
function placeOrder(order):
database.save(order) // Step 1: succeeds
eventBroker.publish(OrderPlaced) // Step 2: fails (broker is down)
// Database has the order, but no event was published.
// The rest of the system doesn't know the order exists.
You cannot solve this with a try-catch that rolls back the database on publish failure, because the publish might have succeeded from the broker's perspective even if your client timed out waiting for the acknowledgement. Welcome to distributed systems.
The Solution
Instead of publishing the event directly, write it to an outbox table in the same database, in the same transaction as the business data.
function placeOrder(order):
transaction:
database.save(order)
database.insertIntoOutbox({
id: newId(),
type: "OrderPlaced",
payload: { orderId: order.id, total: order.total },
createdAt: now(),
published: false
})
A separate process (the outbox publisher or relay) polls the outbox table and publishes events to the broker:
// Outbox relay (runs continuously)
function publishOutboxEvents():
while true:
events = database.query(
"SELECT * FROM outbox WHERE published = false ORDER BY createdAt LIMIT 100"
)
for event in events:
eventBroker.publish(event)
database.update("outbox", { id: event.id }, { published: true })
sleep(100ms)
Because the business data and the outbox entry are written in the same transaction, they are guaranteed to be consistent. Either both exist or neither does. The relay process handles eventually publishing the event, with at-least-once semantics (if it crashes after publishing but before marking the event as published, it will re-publish on restart — which is why consumers must be idempotent).
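The two pieces above can be sketched end to end with SQLite standing in for the service's database. Table and column names are illustrative; a real relay would loop, batch, and handle broker errors.

```python
# Sketch of the transactional outbox. SQLite stands in for the database.
import json, sqlite3, uuid
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, type TEXT, payload TEXT,
    created_at TEXT, published INTEGER DEFAULT 0)""")

def place_order(order_id, total):
    # Both writes commit together or not at all: no dual-write window.
    with db:  # one transaction
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (id, type, payload, created_at) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "OrderPlaced",
             json.dumps({"orderId": order_id, "total": total}),
             datetime.now(timezone.utc).isoformat()))

def relay_once(publish):
    # One poll cycle of the outbox relay: publish, then mark published.
    rows = db.execute(
        "SELECT id, type, payload FROM outbox WHERE published = 0 "
        "ORDER BY created_at LIMIT 100").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # a crash before the next
        with db:                                  # line means re-publish on
            db.execute(                           # restart: at-least-once
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

place_order("ord-7829", 149.99)
relay_once(lambda t, p: print("published", t, p))
```

The gap between publishing and marking the row as published is exactly where at-least-once delivery comes from, hence the idempotency requirement on consumers.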
Outbox Cleanup
The outbox table grows. You need a cleanup process that removes (or archives) published events after a retention period. This is typically a scheduled job:
DELETE FROM outbox WHERE published = true AND createdAt < NOW() - INTERVAL '7 days'
Polling vs Log-Tailing
The polling approach (query the outbox table periodically) is simple but introduces latency — events are not published until the next poll cycle. For lower latency, you can use database log-tailing (Change Data Capture) to detect new outbox entries and publish them immediately. We cover CDC next.
The Transactional Outbox in Practice
The outbox pattern is the standard recommendation for reliable event publishing from transactional systems. It is used in production at scale by organisations that have discovered the hard way that "publish then save" and "save then publish" are both broken.
The main drawback is that it ties your event publishing to your database technology. If your service does not use a relational database, you need an alternative (e.g., event sourcing, where the event store is the source of truth and events are published from the store).
Change Data Capture (CDC)
Change Data Capture is the pattern of capturing changes from a database's transaction log and publishing them as events. Instead of the application being responsible for publishing events, the database's own change log becomes the event source.
How It Works
Every database maintains a transaction log (write-ahead log in PostgreSQL, binlog in MySQL, oplog in MongoDB). CDC tools read this log and convert changes into events.
┌────────────┐ writes ┌──────────────┐
│ Application│─────────────▶│ Database │
│ │ │ │
└────────────┘ └──────┬───────┘
│ transaction log
┌──────▼───────┐
│ CDC Tool │
│ (Debezium) │
└──────┬───────┘
│ events
┌──────▼───────┐
│ Event Broker │
│ (Kafka) │
└──────────────┘
Debezium
Debezium is the dominant open-source CDC platform. It supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and others. It runs as a Kafka Connect connector and produces change events to Kafka topics.
A Debezium change event for a PostgreSQL table looks something like:
{
"before": {
"id": 7829,
"status": "placed",
"total": 149.97
},
"after": {
"id": 7829,
"status": "shipped",
"total": 149.97
},
"source": {
"connector": "postgresql",
"db": "orders",
"table": "orders",
"txId": 559,
"lsn": 33692736
},
"op": "u",
"ts_ms": 1700000133447
}
This includes both the before and after state, the source table, the transaction ID, and the log sequence number. It is a faithful representation of what the database saw.
CDC vs Application-Level Events
CDC and application-level events solve different problems and are not interchangeable.
| Aspect | CDC | Application Events |
|---|---|---|
| Source | Database transaction log | Application code |
| Granularity | Row-level changes | Business-level events |
| Semantics | "Row X changed from A to B" | "OrderShipped" |
| Domain language | Database schema language | Business domain language |
| Coupling | Consumers coupled to DB schema | Consumers coupled to event schema |
| Completeness | Captures ALL changes, including those from direct DB modifications | Only captures changes the application knows about |
When to Use CDC
Outbox relay. CDC is an excellent way to implement the outbox pattern without polling. Write events to the outbox table, and let CDC pick them up and publish them to Kafka. This gives you low-latency event publishing with transactional guarantees.
Legacy system integration. You have a legacy system that writes to a database but does not publish events. CDC lets you capture those changes without modifying the legacy code. This is the "strangler fig" approach to migration: wrap the old system in events and gradually replace it.
Data pipeline ingestion. Streaming database changes into a data warehouse or data lake. This is CDC's original use case and remains one of its strongest.
Audit logging. Every change to a table, captured automatically, without relying on the application to remember to log it.
When Not to Use CDC
When you need business-level events. A CDC event says "the status column changed from placed to shipped." An application-level event says "OrderShipped" and includes the carrier, tracking number, and estimated delivery date. If your consumers need business semantics, CDC alone is insufficient — you will need a transformation layer.
When schema coupling is unacceptable. CDC consumers are coupled to your database schema. If you rename a column, every CDC consumer breaks. Application-level events provide an abstraction layer between your internal schema and your public event contract.
When database compatibility is uncertain. CDC depends on database-specific features (logical replication in PostgreSQL, binlog in MySQL). If you might change databases, the CDC pipeline needs to change too.
CDC + Outbox: The Best of Both Worlds
The most robust pattern combines CDC with the outbox:
- Application writes business-level events (with domain semantics, proper naming, and a well-designed schema) to an outbox table.
- CDC captures new outbox entries from the transaction log.
- CDC publishes them to the event broker.
This gives you:
- Transactional consistency (outbox is written in the same transaction as the business data).
- Low latency (CDC detects changes in near-real-time, no polling).
- Business-level event semantics (the outbox contains properly designed events, not raw table changes).
- No polling overhead.
Debezium's outbox event router is purpose-built for this pattern.
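As a rough sketch of what wiring this up involves (the connector name, database coordinates, and table name below are placeholders, and the full option set for the EventRouter transform lives in the Debezium documentation), a Kafka Connect configuration for the pattern looks something like:

```json
{
  "name": "orders-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```

The application only ever writes to the outbox table; the connector and transform handle capture, routing, and publishing.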
Patterns in Combination
These patterns do not exist in isolation. In practice, they combine:
- Event sourcing + CQRS: The event store is the write model. Projectors build read models from the event stream. This is the most common combination and the one most people mean when they say "event sourcing."
- Sagas + outbox: Each saga step writes its command/event to an outbox table, ensuring reliable publishing.
- CDC + event streaming: Database changes are captured by CDC and published to Kafka, where stream processors transform and enrich them.
- Pub/sub + state machines: Events trigger state transitions in downstream services, with the state machine enforcing valid transitions.
The art is knowing which patterns to apply where. Not every service needs event sourcing. Not every interaction needs a saga. Not every database change needs CDC. The worst event-driven architectures are the ones that apply every pattern everywhere, creating a system of such overwhelming complexity that no one can reason about it.
Start with pub/sub. Add complexity only when a specific problem demands it. Document why each pattern was chosen. Your future self — and the poor soul who inherits your system — will be grateful.
Chapter Summary
The patterns in this chapter form the toolkit of event-driven architecture:
- Pub/sub is the foundation — simple, effective, and sufficient for many use cases.
- Event streaming (log-based, with retention) is fundamentally different from message queuing (destructive consumption). Choose based on whether you need replay.
- Event sourcing stores events as the source of truth. Powerful for audit, temporal queries, and bug recovery; expensive in complexity and operational cost. Apply selectively.
- CQRS separates reads from writes, enabling optimised read models. It complements event sourcing but does not require it.
- Sagas manage distributed transactions. Use choreography for simple flows, orchestration for complex ones, or both.
- State machines make transitions explicit and invalid states unrepresentable.
- The outbox pattern solves the dual-write problem with transactional guarantees.
- CDC captures database changes as events, enabling legacy integration, data pipelines, and low-latency outbox publishing.
Each pattern has costs. Apply them deliberately. The next chapter covers how to evolve the schemas of the events these patterns produce — because the only thing harder than designing an event schema is changing one.
Schema Evolution and Contracts
You've got your events flowing, your consumers humming, your dashboards green. Life is good. Then someone adds a field to an event, and three services fall over at 2 AM on a Saturday. Welcome to schema evolution — the problem that everyone acknowledges is important and nobody budgets time for until it's too late.
Schema evolution is the discipline of changing what your events look like over time without setting fire to every consumer downstream. It sounds straightforward. It is not. It is, in fact, the hardest problem you will underestimate in an event-driven architecture, because it sits at the intersection of technical constraints, organizational politics, and the fundamental human inability to predict the future.
Why Schema Evolution Is the Hardest Problem You'll Underestimate
In a monolithic application, changing a data structure is a compile-time problem. You change the struct, the compiler screams at you, you fix the fifteen call sites, and you go home. In an event-driven system, changing an event schema is a deployment-time problem spread across multiple teams, multiple services, and multiple time zones.
Here's what makes it genuinely difficult:
- Producers and consumers deploy independently. You cannot coordinate a simultaneous upgrade across all services that touch an event. You will have old producers running alongside new consumers, and new producers running alongside old consumers, sometimes for days, sometimes for months, sometimes — let's be honest — forever.
- Events are durable. Unlike HTTP request/response payloads that exist for milliseconds, events sit in logs and queues. A Kafka topic with a 30-day retention period means your new consumer must handle events written by code that was deployed a month ago. Possibly code written by an engineer who has since left the company.
- The blast radius is invisible. When you change an HTTP API, you can look at your API gateway logs and enumerate your callers. When you change an event schema, you may not even know who's consuming it. That analytics team that tapped into your order events six months ago? They didn't tell you. They're about to have a very bad day.
- Serialization formats have opinions. Your choice of serialization format determines what kinds of changes are safe, what kinds are dangerous, and what kinds are impossible. Choose wrong early, pay forever.
The net result is that schema evolution requires a level of discipline and tooling that feels disproportionate to the apparent simplicity of "I just want to add a field." But the alternative — no discipline, no tooling — is a system that becomes progressively more terrifying to change.
Schema Formats: The Serialization Wars
Before we talk about evolving schemas, we need to talk about what schemas are, because the format you choose constrains everything that follows.
JSON (with JSON Schema)
JSON is the lingua franca of web APIs, and plenty of event-driven systems use it too. It's human-readable, self-describing (sort of), universally supported, and — this is the critical part — has no built-in schema enforcement.
JSON Schema exists as a specification for describing the shape of JSON documents, but it's a validation layer bolted on after the fact, not a feature of the format itself. This means:
- Producers can emit whatever they want. There's no serialization step that rejects invalid events. You need explicit validation.
- Consumers must be defensive. You cannot trust that a field exists, has the right type, or means what you think it means.
- Schema evolution is "easy" in the worst sense. Anyone can add, remove, or change fields at any time. The format won't stop them. The 3 AM page will.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"orderId": { "type": "string", "format": "uuid" },
"customerId": { "type": "string" },
"totalAmount": { "type": "number", "minimum": 0 },
"currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
"createdAt": { "type": "string", "format": "date-time" }
},
"required": ["orderId", "customerId", "totalAmount", "currency", "createdAt"],
"additionalProperties": false
}
That additionalProperties: false at the bottom? It's the source of a religious war. Set it to false and you get strict validation but break forward compatibility (consumers reject events with new fields). Set it to true (or omit it) and you get forward compatibility but lose the ability to catch typos and garbage fields.
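To make the tradeoff concrete, here is a stdlib-only sketch of strict versus lenient handling of unknown fields. A real system would use a JSON Schema validator library; this hand-rolled check only illustrates the additionalProperties dilemma, and the field values are made up.

```python
# Hand-rolled sketch of the additionalProperties tradeoff: strict mode
# catches typos but rejects events from newer producers that added a field.
REQUIRED = {"orderId", "customerId", "totalAmount", "currency", "createdAt"}

def validate(event, strict):
    missing = REQUIRED - event.keys()
    if missing:
        return f"missing fields: {sorted(missing)}"
    extra = event.keys() - REQUIRED
    if strict and extra:
        # additionalProperties: false behaviour
        return f"unknown fields: {sorted(extra)}"
    return "ok"

v2_event = {"orderId": "ord-1", "customerId": "c-1", "totalAmount": 10.0,
            "currency": "USD", "createdAt": "2025-11-14T09:32:17Z",
            "discountCode": "SAVE10"}  # field added by a newer producer

print(validate(v2_event, strict=True))   # unknown fields: ['discountCode']
print(validate(v2_event, strict=False))  # ok
```

The strict consumer breaks the moment a producer evolves; the lenient one silently accepts garbage field names. Neither option is free.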
Verdict: JSON is fine for systems with a small number of well-coordinated teams. It becomes increasingly painful as the number of independent producers and consumers grows. The lack of a built-in schema enforcement mechanism means you're relying on convention and discipline, which — let's be charitable — have a mixed track record.
Apache Avro
Avro is the schema evolution format. It was designed for exactly this problem, and it shows. An Avro schema defines the structure of your data, and the serialization/deserialization process is schema-aware: the writer's schema and the reader's schema are both available at read time, and the Avro library resolves differences between them.
{
"type": "record",
"name": "OrderCreated",
"namespace": "com.example.orders",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "customerId", "type": "string" },
{ "name": "totalAmount", "type": "double" },
{ "name": "currency", "type": "string", "default": "USD" },
{ "name": "createdAt", "type": "long", "logicalType": "timestamp-millis" },
{
"name": "discountCode",
"type": ["null", "string"],
"default": null
}
]
}
Key Avro features for schema evolution:
- Default values make adding fields backward-compatible. Old data missing the new field gets the default.
- Union types (like ["null", "string"]) let you make fields optional without contortions.
- The reader uses its own schema, so it only sees the fields it cares about, even if the writer included extras.
- Schema resolution rules are explicit and well-defined. You don't have to guess what happens when schemas diverge.
The cost? Avro requires both the writer's and reader's schemas at deserialization time. In practice, this means you need a schema registry (more on that shortly). Also, Avro's binary encoding is not human-readable, which makes debugging harder — you can't just cat a message and see what's in it.
Verdict: If schema evolution is a first-class concern (and it should be), Avro is the strongest choice. The Confluent ecosystem is built around it for good reason.
Protocol Buffers (Protobuf)
Google's Protocol Buffers take a different approach. Schemas are defined in .proto files and compiled into language-specific code. Every field has a numeric tag, and the wire format uses these tags rather than field names.
syntax = "proto3";
package orders;
message OrderCreated {
string order_id = 1;
string customer_id = 2;
double total_amount = 3;
string currency = 4;
int64 created_at = 5;
// Added in v2 — old consumers will ignore this field
optional string discount_code = 6;
}
Protobuf's evolution model:
- Adding fields is safe as long as you use new tag numbers. Old readers ignore unknown tags.
- Removing fields is safe as long as you never reuse the tag number. (The reserved keyword helps enforce this.)
- Renaming fields is free because the wire format uses tag numbers, not names. This is either a feature or a footgun depending on your perspective.
- Changing field types is dangerous. Some type changes are compatible (e.g., int32 to int64), but most are not.
// DANGER: reserved tags and names prevent accidental reuse
message OrderCreated {
reserved 7, 8; // These tag numbers are retired
reserved "legacy_field"; // This name is retired
string order_id = 1;
// ... rest of fields
}
The proto3 syntax removed required fields entirely (everything is implicitly optional with a zero/empty default), which simplifies evolution but makes it harder to express "this field must be present" — a constraint you then have to enforce in application code.
Verdict: Protobuf is excellent for evolution, has superb cross-language support, and produces compact wire formats. The tag-based approach is fundamentally sound. The main friction is the code generation step, which some teams find annoying and others find essential.
Apache Thrift
Thrift, originally from Facebook, is similar to Protobuf in concept: an IDL that compiles to language-specific code with tagged fields on the wire. It supports required, optional, and default fields.
struct OrderCreated {
1: required string orderId,
2: required string customerId,
3: required double totalAmount,
4: optional string currency = "USD",
5: required i64 createdAt,
6: optional string discountCode
}
Thrift's evolution rules mirror Protobuf's: new fields with new IDs are safe, removing optional fields is safe, never reuse field IDs. The required keyword is a trap — once a field is required, you can never remove it without breaking existing readers, which is why Protobuf dropped the concept entirely in proto3.
Verdict: Thrift is a perfectly serviceable choice, but it's lost mindshare to Protobuf. Unless you're already invested in the Thrift ecosystem, there's little reason to choose it for new projects.
The Comparison, Honestly
| Concern | JSON Schema | Avro | Protobuf | Thrift |
|---|---|---|---|---|
| Human readability | Excellent | Poor (binary) | Poor (binary) | Poor (binary) |
| Schema enforcement | Opt-in | Built-in | Built-in (codegen) | Built-in (codegen) |
| Evolution rules | Ad hoc | Formal, well-defined | Formal, tag-based | Formal, tag-based |
| Schema registry support | Varies | Excellent (Confluent) | Good | Limited |
| Language support | Universal | Good (JVM-centric) | Excellent | Good |
| Wire format size | Large | Compact | Very compact | Compact |
| Debugging ease | Easy | Hard | Hard | Hard |
Compatibility Types: A Precise Vocabulary
When we say a schema change is "compatible," we need to be specific about which direction the compatibility runs. There are three types, and conflating them is a reliable source of production incidents.
Backward Compatibility
A new schema is backward compatible if it can read data written with the old schema.
This is the most common requirement. Your consumers upgrade to the new schema, and they can still process events that were written before the upgrade.
Safe changes under backward compatibility:
- Adding a field with a default value. Old events don't have it; the default fills in.
- Removing a field. New readers just ignore it. (But old readers might still need it — see forward compatibility.)
// Schema v1
{
"type": "record",
"name": "OrderCreated",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "totalAmount", "type": "double" }
]
}
// Schema v2 — BACKWARD COMPATIBLE (added field with default)
{
"type": "record",
"name": "OrderCreated",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "totalAmount", "type": "double" },
{ "name": "currency", "type": "string", "default": "USD" }
]
}
A v2 reader processing a v1 event will see currency as "USD". No crash, no null pointer, no existential crisis.
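What Avro's schema resolution does for this case can be approximated in a few lines. This is only the default-value rule, not real Avro resolution (which also handles unions, promotions, and aliases); the schema representation is simplified to (name, default) pairs.

```python
# Rough approximation of Avro's default rule: the reader's schema supplies
# defaults for fields the old event lacks. None here means "no default".
v2_fields = [
    ("orderId", None),
    ("totalAmount", None),
    ("currency", "USD"),  # added in v2 with a default
]

def read_with_schema(record, reader_fields):
    out = {}
    for name, default in reader_fields:
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default  # old event, new field: fill the default
        else:
            raise ValueError(f"no value and no default for {name}")
    return out

v1_event = {"orderId": "ord-7829", "totalAmount": 149.99}  # written by v1
print(read_with_schema(v1_event, v2_fields))
# {'orderId': 'ord-7829', 'totalAmount': 149.99, 'currency': 'USD'}
```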
Forward Compatibility
A new schema is forward compatible if data written with the new schema can be read by old consumers.
This is what you need when producers upgrade before consumers — which, in a microservices world, is roughly always. Your payment service ships a new version of OrderCreated on Tuesday, but the fulfillment service won't deploy until Thursday. Forward compatibility means Thursday isn't a disaster.
Safe changes under forward compatibility:
- Adding a field. Old readers ignore fields they don't know about (assuming the format supports this — Avro and Protobuf do, strict JSON Schema does not).
- Removing a field that has a default. Old readers expecting the field get the default.
Full Compatibility
Full compatibility means the schema is both backward and forward compatible. This is the gold standard and the hardest to maintain.
Safe changes under full compatibility:
- Adding a field with a default value. New readers use the default for old events; old readers ignore the new field.
That's... basically it. Full compatibility is restrictive by design. It means:
- You cannot add required fields (breaks backward compatibility).
- You cannot remove fields without defaults (breaks forward compatibility).
- You cannot change field types.
- You cannot rename fields (in name-based formats).
This sounds limiting, and it is. It's also the only level that lets producers and consumers upgrade in any order without coordination. The constraint is the feature.
Transitive Compatibility
There's one more dimension: transitive compatibility means the new schema is compatible not just with the immediately previous version, but with all previous versions.
Non-transitive: v3 is compatible with v2 (but maybe not v1). Transitive: v3 is compatible with v2 AND v1 AND every version before that.
You want transitive compatibility. You almost certainly want transitive compatibility. If your Kafka topic has a 30-day retention period and you've released three schema versions this month, consumers processing old events need compatibility all the way back, not just one version.
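The difference is easy to see in code. Here is a toy compatibility checker where a schema is simplified to a mapping of field name to default (None meaning "no default"); real registries check far more than this one rule.

```python
# Sketch: backward compatibility against ONE previous version vs ALL
# previous versions (transitive). Schemas are {field_name: default_or_None}.
def backward_compatible(new, old):
    # A new reader must handle old data: every field the new schema has
    # that the old schema lacks needs a default.
    return all(default is not None
               for name, default in new.items() if name not in old)

def transitively_compatible(new, history):
    return all(backward_compatible(new, old) for old in history)

v1 = {"orderId": None, "totalAmount": None}
v2 = {"orderId": None, "totalAmount": None, "currency": "USD"}
v3 = {"orderId": None, "totalAmount": None, "currency": None}  # default dropped

print(backward_compatible(v3, v2))            # True: v2 data has currency
print(transitively_compatible(v3, [v1, v2]))  # False: v1 data does not
```

v3 passes a one-step check against v2 but fails against v1, which is precisely the event that is still sitting in your 30-day retention window.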
Schema Registries: The Adult Supervision Your Events Need
A schema registry is a centralized service that stores and manages event schemas. It is not optional. I mean, technically it's optional in the same way that seatbelts are optional — you can absolutely drive without one, and everything will be fine right up until it isn't.
What a Schema Registry Does
- Stores schemas with version history. Every version of every event schema, forever (or until you clean up, which — spoiler — you won't).
- Assigns schema IDs. Each schema version gets a unique numeric ID. Producers embed this ID in the event payload so consumers know which schema to use for deserialization.
- Enforces compatibility. This is the killer feature. When a producer tries to register a new schema version, the registry checks it against previous versions and rejects it if it violates the compatibility rules. This is your safety net. This is the thing that prevents the 2 AM page.
- Provides schema lookup. Consumers fetch schemas by ID, so they can deserialize events without needing the schema baked into their code at compile time.
The Major Registries
Confluent Schema Registry
The 800-pound gorilla. Tightly integrated with the Kafka ecosystem, supports Avro, Protobuf, and JSON Schema. Provides compatibility checking at the subject level (where a subject typically maps to a topic + record type).
# Register a schema
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":[{\"name\":\"orderId\",\"type\":\"string\"}]}"}' \
http://localhost:8081/subjects/orders-value/versions
# Check compatibility before registering
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":[{\"name\":\"orderId\",\"type\":\"string\"},{\"name\":\"currency\",\"type\":\"string\",\"default\":\"USD\"}]}"}' \
http://localhost:8081/compatibility/subjects/orders-value/versions/latest
The compatibility levels it supports:
- BACKWARD (default): new schema can read old data
- FORWARD: old schema can read new data
- FULL: both directions
- BACKWARD_TRANSITIVE, FORWARD_TRANSITIVE, FULL_TRANSITIVE: same, but against all versions
- NONE: no checking (the "I like to live dangerously" setting)
Licensing note: The Confluent Schema Registry is under the Confluent Community License, not Apache 2.0. This matters if you're building a competing SaaS product. For most internal use cases, it's fine.
Apicurio Registry
The open-source alternative, Apache 2.0 licensed. Supports Avro, Protobuf, JSON Schema, GraphQL, OpenAPI, and more. It can use Kafka, SQL databases, or in-memory storage as its backend.
Apicurio provides compatibility checking and supports the same compatibility levels as Confluent. It also offers a REST API that's mostly compatible with Confluent's, so migration between the two is not catastrophic.
When to use it: When you want open-source licensing, when you need support for non-Kafka brokers, or when you need to store non-event schemas (like OpenAPI specs) in the same registry.
AWS Glue Schema Registry
Amazon's managed offering. Integrates with Kinesis, MSK, and Lambda. Supports Avro and JSON Schema (Protobuf support was added later and with caveats).
When to use it: When you're all-in on AWS and want managed infrastructure with IAM integration. The compatibility checking is solid. The ecosystem integration is convenient. The vendor lock-in is real.
Schema ID Embedding
In practice, the producer serializes an event like this:
[Magic Byte (1)][Schema ID (4 bytes)][Avro Payload (N bytes)]
The magic byte (0x00) signals that this is a schema-registry-aware payload. The 4-byte schema ID tells the consumer exactly which schema to fetch for deserialization. This is the Confluent wire format, and it's become a de facto standard even outside the Confluent ecosystem.
# Python example with confluent-kafka and Avro
from confluent_kafka.avro import AvroProducer
producer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}, default_value_schema=order_created_schema)

producer.produce(
    topic='orders',
    value={
        'orderId': 'ord-123',
        'customerId': 'cust-456',
        'totalAmount': 99.99,
        'currency': 'USD',
        'createdAt': 1679616000000
    }
)
producer.flush()
The producer doesn't manually embed the schema ID — the AvroProducer handles registration and embedding transparently. This is the happy path. The unhappy path is when someone bypasses the serializer and writes raw JSON to an Avro topic, which will be detected approximately never and cause havoc approximately always.
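One cheap defense against that unhappy path is a consumer-side guard that checks the wire-format header before attempting deserialization. Here is a minimal sketch; the header layout is the Confluent wire format described above, while the function itself is illustrative and not part of any client library:

```python
import struct

def parse_wire_format(payload: bytes):
    """Split a Confluent-wire-format payload into (schema_id, avro_bytes).

    Raises ValueError when the magic byte is missing -- for example,
    when someone wrote raw JSON to an Avro topic.
    """
    if len(payload) < 5 or payload[0] != 0x00:
        raise ValueError("not a schema-registry-aware payload (missing magic byte)")
    # Big-endian 4-byte schema ID follows the magic byte
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, payload[5:]

# A well-formed payload: magic byte, schema ID 42, then the Avro bytes
ok = b"\x00" + struct.pack(">I", 42) + b"avro-bytes"
print(parse_wire_format(ok))  # (42, b'avro-bytes')
```

With a guard like this, the raw-JSON writer is detected on the first read rather than "approximately never."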
Versioning Strategies
So you need to change an event schema. How do you manage the transition? There are several strategies, each with different tradeoffs between safety, complexity, and how much you trust your fellow engineers.
Additive-Only Changes (The Golden Rule)
The simplest and most robust strategy: you only ever add new optional fields with default values. You never remove fields. You never change field types. You never rename things.
// v1
{ "orderId": "string", "totalAmount": "double" }
// v2 — added optional field
{ "orderId": "string", "totalAmount": "double", "currency": "string (default: USD)" }
// v3 — added another optional field
{ "orderId": "string", "totalAmount": "double", "currency": "string (default: USD)", "discountPercent": "double (default: 0.0)" }
This is boring. This is predictable. This works. Every version is fully compatible with every other version, transitively. Consumers can upgrade at their leisure. Producers can upgrade without coordination.
The downside is that your schema accumulates fields over time. That legacyCustomerType field that was added in 2019 and hasn't been populated since 2021? It's still there. It will always be there. You're building geological strata in your event schemas.
Semantic Versioning for Events
Borrowing from library versioning: MAJOR.MINOR.PATCH for event schemas.
- PATCH (1.0.0 -> 1.0.1): Documentation changes, no schema change.
- MINOR (1.0.0 -> 1.1.0): Backward-compatible additions (new optional fields).
- MAJOR (1.0.0 -> 2.0.0): Breaking changes.
The version can live in the event metadata:
{
"metadata": {
"eventType": "OrderCreated",
"schemaVersion": "2.1.0",
"timestamp": "2025-03-15T10:30:00Z"
},
"payload": {
"orderId": "ord-123",
"customerId": "cust-456",
"totalAmount": 99.99
}
}
This gives consumers information to route or reject events. A consumer that handles OrderCreated v1.x can skip v2.x events (or send them to a dead letter queue) rather than crash trying to parse them.
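That routing decision can be expressed in a few lines. A sketch, where the event layout matches the example above and the handler and DLQ callables are hypothetical stand-ins:

```python
def route_event(event, handle, send_to_dlq):
    """Dispatch on the major schema version: process 1.x, divert the rest.

    `handle` and `send_to_dlq` are stand-ins for real handler and DLQ code.
    """
    version = event["metadata"]["schemaVersion"]
    major = int(version.split(".")[0])
    if major == 1:
        handle(event["payload"])
    else:
        # Unknown major version: park it for review instead of crashing
        send_to_dlq(event, f"unsupported schema version {version}")

handled, parked = [], []
route_event({"metadata": {"schemaVersion": "1.2.0"}, "payload": {"orderId": "ord-1"}},
            handled.append, lambda e, reason: parked.append(reason))
route_event({"metadata": {"schemaVersion": "2.0.0"}, "payload": {}},
            handled.append, lambda e, reason: parked.append(reason))
# handled now holds the v1 payload; parked holds the v2 rejection reason
```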
The catch: Semantic versioning requires discipline and agreement on what constitutes a breaking change. You'd be surprised how many teams debate whether adding a new enum value is a minor or major change. (It depends on the serialization format: in Avro, it can be breaking if the reader has a fixed enum. In Protobuf, it's fine. In JSON, it depends on whether consumers validate enums.)
The "v2 Topic" Approach
When you have a truly breaking change, one brutal-but-effective strategy is to create a new topic:
orders.order-created.v1 → the old events
orders.order-created.v2 → the new events
The v1 topic continues to receive events from old producers and serve old consumers. The v2 topic starts receiving events from upgraded producers. Over time, you migrate consumers from v1 to v2 and eventually decommission v1.
Advantages:
- Clean separation. No compatibility gymnastics.
- Old and new consumers can coexist indefinitely.
- Easy to reason about — each topic has exactly one schema.
Disadvantages:
- You need a migration period where both topics are active.
- Producers might need to dual-write during the transition (write to both v1 and v2).
- Consumers need to be updated to read from the new topic, which requires coordination — the very thing you were trying to avoid.
- Topic proliferation. After a few years, you have v1, v2, and v3, and nobody's sure if v1 is still active.
# Dual-writing during migration period
class OrderEventProducer:
    def __init__(self, producer, migration_active=True):
        self.producer = producer
        self.migration_active = migration_active

    def publish_order_created(self, order):
        # Always write to v2
        v2_event = self._to_v2_schema(order)
        self.producer.produce('orders.order-created.v2', v2_event)

        # During migration, also write to v1 for lagging consumers
        if self.migration_active:
            v1_event = self._to_v1_schema(order)
            self.producer.produce('orders.order-created.v1', v1_event)

    def _to_v1_schema(self, order):
        return {'orderId': order.id, 'totalAmount': order.total}

    def _to_v2_schema(self, order):
        return {
            'orderId': order.id,
            'totalAmount': order.total,
            'currency': order.currency,
            'lineItems': [item.to_dict() for item in order.items]
        }
Event Upcasting
A middle ground: keep a single topic but transform old events to the new schema at read time. The consumer maintains "upcasters" that know how to convert from old schema versions to the current one.
public class OrderCreatedUpcaster {

    public OrderCreatedV3 upcast(JsonNode event, int schemaVersion) {
        return switch (schemaVersion) {
            case 1 -> upcastFromV1(event);
            case 2 -> upcastFromV2(event);
            case 3 -> parseV3(event);
            default -> throw new UnknownSchemaVersionException(schemaVersion);
        };
    }

    private OrderCreatedV3 upcastFromV1(JsonNode event) {
        return OrderCreatedV3.builder()
            .orderId(event.get("orderId").asText())
            .totalAmount(event.get("totalAmount").asDouble())
            .currency("USD")       // v1 didn't have currency, assume USD
            .lineItems(List.of())  // v1 didn't have line items
            .build();
    }

    private OrderCreatedV3 upcastFromV2(JsonNode event) {
        return OrderCreatedV3.builder()
            .orderId(event.get("orderId").asText())
            .totalAmount(event.get("totalAmount").asDouble())
            .currency(event.get("currency").asText())
            .lineItems(List.of())  // v2 didn't have line items
            .build();
    }
}
This keeps the data layer simple (one topic, old events stay as-is) at the cost of application complexity (every consumer needs upcasting logic). It works well when you have a small number of consumers and can keep the upcasters in a shared library.
Breaking Changes and Migration Strategies
Sometimes you need to make a genuinely breaking change. The field type needs to change from a string to a structured object. The event needs to be split into two events. The semantics are changing fundamentally.
The Expand-and-Contract Pattern
Borrowed from database migrations, this is the safest approach for most breaking changes:
Phase 1: Expand. Add the new field alongside the old one. Producers populate both. This is a backward-compatible change.
// Phase 1: Both fields present
{
"orderId": "ord-123",
"customerName": "Jane Doe",
"customer": {
"id": "cust-456",
"name": "Jane Doe",
"email": "jane@example.com"
}
}
Phase 2: Migrate. Update all consumers to read from the new field. Verify (through metrics and monitoring) that no consumer is still reading the old field.
Phase 3: Contract. Remove the old field. Producers stop populating it.
The total time for this process is "however long it takes to get every consumer team to update their code," which in practice ranges from "two weeks" to "we gave up and the old field is still there."
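During Phase 2, consumers can read the new field with a fallback to the old one, so the same code works against both expanded and pre-expansion events. A small sketch against the JSON shape shown above:

```python
def customer_name(event: dict) -> str:
    """Prefer the new structured `customer` field; fall back to the
    legacy flat `customerName` field for pre-expansion events."""
    customer = event.get("customer")
    if customer and "name" in customer:
        return customer["name"]
    return event["customerName"]

# Pre-expansion event: only the flat field exists
assert customer_name({"orderId": "ord-1", "customerName": "Jane Doe"}) == "Jane Doe"

# Phase 1 event: both fields present, structured one wins
assert customer_name({
    "orderId": "ord-2",
    "customerName": "Jane Doe",
    "customer": {"id": "cust-456", "name": "Jane Doe", "email": "jane@example.com"},
}) == "Jane Doe"
```

Once the fallback branch stops firing in production (which your metrics should confirm), Phase 3 is safe.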
The Event Splitter
When one event needs to become two, use a splitter service:
[Producer] → OrderCreated → [Splitter] → OrderCreated (slim)
                                       → OrderLineItemsCreated
The splitter consumes the old combined event and publishes two new events. Old consumers continue reading the old topic. New consumers read the new topics. The splitter runs until all consumers have migrated.
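The core of the splitter is a pure transformation from one envelope to two. A sketch, with illustrative field names; the surrounding consume/produce loop is ordinary client code:

```python
def split_order_created(combined: dict) -> tuple:
    """Split a combined OrderCreated event into a slim OrderCreated
    plus a separate OrderLineItemsCreated event."""
    slim = {
        "type": "OrderCreated",
        "orderId": combined["orderId"],
        "customerId": combined["customerId"],
        "totalAmount": combined["totalAmount"],
    }
    line_items = {
        "type": "OrderLineItemsCreated",
        "orderId": combined["orderId"],  # correlation key back to the order
        "lineItems": combined["lineItems"],
    }
    return slim, line_items
```

The splitter then publishes `slim` and `line_items` to their respective new topics for each combined event it consumes.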
The Nuclear Option: Replay with Transform
If you need to transform the entire history of events, you can:
- Create a new topic with the new schema.
- Write a batch job that reads every event from the old topic, transforms it, and writes it to the new topic.
- Migrate consumers to the new topic.
- Decommission the old topic.
This is expensive, disruptive, and sometimes the only option. It's the schema evolution equivalent of "we're going to need a bigger boat."
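Step 2 is structurally simple, which is part of why it's tempting. A toy sketch, where the iterable and the `produce` callable stand in for real consumer and producer clients over the old and new topics:

```python
def replay_with_transform(old_events, transform, produce):
    """Read every event from the old topic, transform it, and write it
    to the new topic. Returns the number of events migrated."""
    count = 0
    for event in old_events:
        produce(transform(event))
        count += 1
    return count

new_topic = []
migrated = replay_with_transform(
    [{"totalAmount": 99.99}, {"totalAmount": 10.0}],
    lambda e: {**e, "currency": "USD"},  # toy transform: add a default currency
    new_topic.append,
)
# migrated == 2; every event in new_topic now carries a currency
```

The simplicity is deceptive: at production scale this job runs for hours, must be resumable, and must preserve per-key ordering, which is where the expense and disruption come from.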
Consumer-Driven Contracts
Here's an uncomfortable truth: the producer doesn't know what the consumer needs. The producer publishes what it has, and hopes it's enough. Consumer-driven contracts flip this: consumers declare what they need, and the system verifies that the producer provides it.
The Concept
Each consumer publishes a contract describing the minimum set of fields and types it requires from an event. These contracts are tested against the producer's schema in CI/CD, and a failure blocks deployment.
// Consumer contract: fulfillment-service needs from OrderCreated
{
"consumer": "fulfillment-service",
"event": "OrderCreated",
"requiredFields": {
"orderId": "string",
"customerId": "string",
"shippingAddress": {
"street": "string",
"city": "string",
"postalCode": "string",
"country": "string"
}
}
}
Implementing Consumer-Driven Contracts
The tooling for this in the event-driven world is less mature than for HTTP APIs (where Pact is the standard bearer), but the principle is the same:
- Consumers define contracts specifying what fields they read and what types they expect.
- Contracts are stored in a shared repository or the schema registry.
- CI/CD pipelines verify that the producer's schema satisfies all consumer contracts before deploying.
- A producer cannot make a change that violates any consumer's contract.
# Simplified contract verification
def verify_contract(producer_schema: dict, consumer_contract: dict) -> list:
    violations = []
    for field_name, expected_type in consumer_contract['requiredFields'].items():
        producer_field = find_field(producer_schema, field_name)
        if producer_field is None:
            violations.append(f"Missing required field: {field_name}")
        elif not types_compatible(producer_field['type'], expected_type):
            violations.append(
                f"Type mismatch for {field_name}: "
                f"producer has {producer_field['type']}, "
                f"consumer expects {expected_type}"
            )
    return violations
The Organizational Challenge
Consumer-driven contracts sound great in a conference talk. In practice, they require:
- Every consumer team to actually write and maintain contracts (they won't, at first).
- A culture where producer teams accept that they can't break consumers (some will resist).
- Tooling in CI/CD to enforce contracts (this is the easy part, surprisingly).
- A governance process for resolving conflicts when a producer needs to make a breaking change that violates a contract (this is the hard part, unsurprisingly).
The payoff is worth it. Consumer-driven contracts transform schema evolution from "we hope nothing breaks" to "we know nothing breaks, because the pipeline told us."
The Schema Graveyard: What Happens When Nobody Cleans Up
Let me paint you a picture. It's three years into your event-driven architecture. Your schema registry has 847 schemas across 312 subjects. Your OrderCreated event is on version 23. Versions 1 through 17 haven't been produced in over a year, but they're still registered because nobody's sure if there are old events in the topic that use them. There are 14 schemas that contain the word "test" or "temp" in their name. Three schemas are registered under subjects that correspond to topics that no longer exist. Nobody has the full picture. Nobody wants to touch it.
This is the schema graveyard, and it's what happens when you have a schema registry but no schema governance.
How to Avoid It
- Ownership. Every schema has an owner (a team, not a person). The owner is responsible for its lifecycle, including deprecation and removal.
- Deprecation process. Before removing a schema version, mark it as deprecated. Give consumers a timeline. Check topic retention periods — if the topic has 7-day retention, you only need to keep schemas that were used within the last 7 days.
- Automated cleanup. Write a job that:
  - Identifies schema versions not referenced by any event in the topic (scan the topic headers or schema IDs).
  - Cross-references with consumer group offsets to ensure no consumer will encounter old events.
  - Reports orphaned schemas for review.
- Schema lifecycle metadata. Tag schemas with creation date, owning team, deprecation status, and "last seen in production" timestamps.
# Schema lifecycle tracking
schema_metadata = {
    "subject": "orders-value",
    "version": 17,
    "created_at": "2023-06-15T10:00:00Z",
    "created_by": "order-service-team",
    "status": "deprecated",
    "deprecated_at": "2024-09-01T00:00:00Z",
    "deprecation_reason": "Replaced by v18 — added structured address",
    "removal_eligible_after": "2024-10-01T00:00:00Z",
    "last_seen_in_production": "2024-08-28T14:22:00Z"
}
- Registry-level metrics. Track the number of active schemas, the rate of new registrations, the average number of versions per subject, and the number of subjects with no recent activity. Alert when these metrics suggest entropy is winning.
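The heart of the automated cleanup described above is a set difference: versions the registry knows about, minus versions still referenced by live events. A toy sketch; in practice the two sets come from a registry client and a topic scan, which are elided here:

```python
def orphaned_versions(registered: set, seen_in_topic: set) -> set:
    """Schema versions registered in the registry but not referenced by
    any event still in the topic -- candidates for deprecation review."""
    return registered - seen_in_topic

# Versions 1-3 are registered but no longer appear in any retained event
print(sorted(orphaned_versions({1, 2, 3, 17, 18}, {17, 18})))  # [1, 2, 3]
```

Before acting on the result, cross-reference consumer group offsets as described above: a schema version is only truly orphaned if no consumer can still encounter an event that uses it.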
The Hard Truth
You will not maintain perfect schema hygiene. The graveyard will form. The goal is not prevention but mitigation — keeping the graveyard small, well-mapped, and distinguishable from the schemas that are actually alive.
Code Examples: Compatible vs. Breaking Changes
Let's make this concrete with Avro examples showing what the schema registry will accept and what it will reject.
Compatible: Adding an Optional Field
// v1
{
"type": "record",
"name": "UserRegistered",
"namespace": "com.example.users",
"fields": [
{ "name": "userId", "type": "string" },
{ "name": "email", "type": "string" },
{ "name": "registeredAt", "type": "long" }
]
}
// v2 — COMPATIBLE under BACKWARD, FORWARD, and FULL
{
"type": "record",
"name": "UserRegistered",
"namespace": "com.example.users",
"fields": [
{ "name": "userId", "type": "string" },
{ "name": "email", "type": "string" },
{ "name": "registeredAt", "type": "long" },
{ "name": "displayName", "type": ["null", "string"], "default": null }
]
}
A v1 reader processing a v2 event: ignores displayName. A v2 reader processing a v1 event: sets displayName to null. Everyone's happy.
Breaking: Adding a Required Field Without a Default
// v1
{
"type": "record",
"name": "UserRegistered",
"fields": [
{ "name": "userId", "type": "string" },
{ "name": "email", "type": "string" }
]
}
// v2 — BREAKS BACKWARD COMPATIBILITY
{
"type": "record",
"name": "UserRegistered",
"fields": [
{ "name": "userId", "type": "string" },
{ "name": "email", "type": "string" },
{ "name": "phoneNumber", "type": "string" }
]
}
A v2 reader processing a v1 event: crashes. Where's phoneNumber? There's no default. The Avro deserializer throws. Your registry should reject this if backward compatibility is enforced.
Breaking: Changing a Field Type
// v1: totalAmount is a double
{ "name": "totalAmount", "type": "double" }
// v2: totalAmount is now a record — BREAKS EVERYTHING
{
"name": "totalAmount",
"type": {
"type": "record",
"name": "Money",
"fields": [
{ "name": "amount", "type": "long" },
{ "name": "currency", "type": "string" }
]
}
}
This is the kind of change that requires the expand-and-contract pattern. Add totalAmountV2 as a new field with the structured type, migrate consumers, then deprecate totalAmount.
Breaking: Removing a Field Without a Default (Breaks Forward Compatibility)
// v1
{
"type": "record",
"name": "OrderCreated",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "priority", "type": "string" },
{ "name": "totalAmount", "type": "double" }
]
}
// v2 — removed priority, BREAKS FORWARD COMPATIBILITY
{
"type": "record",
"name": "OrderCreated",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "totalAmount", "type": "double" }
]
}
A v1 reader (old consumer) processing a v2 event: where's priority? If the v1 schema has no default for priority, deserialization fails. The fix: priority should have had a default value from the start, or you need forward-compatible consumers that tolerate missing fields.
Protobuf: Safe Field Removal with Reserved Tags
// v1
message OrderCreated {
string order_id = 1;
string priority = 2;
double total_amount = 3;
}
// v2 — removed priority, reserved the tag
message OrderCreated {
reserved 2;
reserved "priority";
string order_id = 1;
double total_amount = 3;
}
In Protobuf, this is safe. Old readers encountering a v2 message will see priority as empty string (the default). New readers encountering a v1 message will just skip tag 2. The reserved keyword prevents anyone from accidentally reusing tag 2 for a different field in the future, which would be a subtle and devastating bug.
Summary
Schema evolution is not a one-time design decision; it's an ongoing discipline. The choices you make early — serialization format, compatibility level, registry adoption — determine how painful or painless changes will be for the lifetime of the system.
The advice, in brief:
- Pick a schema format with built-in evolution support. Avro or Protobuf. Not raw JSON. Not "we'll be careful."
- Use a schema registry. Enforce compatibility checking in CI/CD and at registration time.
- Default to FULL_TRANSITIVE compatibility. It's restrictive, and that's the point.
- Make additive-only changes the norm. New optional fields with sensible defaults. Boring is beautiful.
- Use expand-and-contract for breaking changes. It takes longer. It's worth it.
- Invest in consumer-driven contracts. They're organizational work disguised as technical work, and they're worth it.
- Plan for the graveyard. Because it's coming whether you plan or not.
Schema evolution is the tax you pay for the privilege of independent deployability. Pay it cheerfully, automate it aggressively, and never, ever skip the compatibility check.
Error Handling and Delivery Guarantees
In a synchronous system, error handling is straightforward: the call fails, you get an exception, you show the user a sad face, and everyone moves on. In an event-driven system, error handling is a philosophy, a lifestyle, and occasionally a source of existential dread.
The fundamental challenge is this: when a producer publishes an event and walks away, who is responsible when something goes wrong? The producer has already moved on. The broker is just a pipe. The consumer might not even exist yet. The event could fail to process three days from now, on a server in a region you've never heard of, for a reason that has nothing to do with the original business logic. And you need a plan for that.
This chapter covers the guarantees your system can (and cannot) provide, the strategies for handling failures gracefully, and the tools for dealing with events that refuse to be processed.
The Three Delivery Guarantees (and What They Actually Mean)
Every messaging system documentation page features a section on delivery guarantees, typically presented with the gravitas of constitutional law. There are three:
At-Most-Once
The event is delivered zero or one times. It might be lost, but it will never be duplicated.
Implementation: the producer fires the event and doesn't wait for acknowledgment. Or the broker acknowledges receipt but the consumer doesn't acknowledge processing. If anything goes wrong — network blip, consumer crash, broker hiccup — the event is gone.
# At-most-once: fire and forget
producer.send('orders', event)
# Did it arrive? Who knows. Moving on.
When it's appropriate: Metrics, analytics, logging — data where losing a few events is acceptable and duplicates would skew your numbers. If your dashboard can tolerate showing 99.97% of events instead of 100%, at-most-once is simpler and faster.
When it's not: Financial transactions, order processing, anything where losing an event means losing money or trust.
At-Least-Once
The event is delivered one or more times. It will never be lost, but it might be duplicated.
Implementation: the producer retries until it gets an acknowledgment. The consumer processes the event and then acknowledges it. If the consumer crashes after processing but before acknowledging, the broker redelivers the event, and the consumer processes it again.
# At-least-once: producer configured with acks='all' and retries=3,
# so send() is retried until the broker acknowledges the write
producer.send('orders', event)
# Consumer side: process then commit
event = consumer.poll()
process(event) # This succeeds
consumer.commit() # But what if we crash here?
# If we crash between process() and commit(),
# the event will be redelivered. We'll process it twice.
This is the most common guarantee in practice, because it's achievable without exotic infrastructure. The tradeoff is that your consumers must handle duplicates, which is the topic of the next section.
Exactly-Once
The event is delivered exactly one time. Never lost, never duplicated. The holy grail.
And now for the uncomfortable part.
Why "Exactly-Once" Is Mostly Marketing
"Exactly-once delivery" is one of those phrases that sounds simple, means something specific and narrow in the contexts where it's achievable, and means something impossible in the general case. Let's untangle it.
The Two Generals Problem
Distributed systems theory has proven — proven, mathematically, not just "it's really hard" — that exactly-once delivery between two independent processes over an unreliable network is impossible. This is a consequence of the Two Generals Problem. You cannot guarantee that both the sender and receiver agree on whether a message was delivered, because the acknowledgment itself can be lost.
If the producer sends an event and the ack gets lost, the producer doesn't know if the event was received. It can either:
- Not retry: risking event loss (at-most-once).
- Retry: risking duplication (at-least-once).
There is no third option. Physics doesn't care about your SLA.
What "Exactly-Once" Actually Means in Practice
When Kafka or Pulsar or any other system claims "exactly-once," they mean one of two things:
-
Exactly-once within the broker. Kafka's exactly-once semantics (EOS) guarantee that a produce-consume-produce cycle within Kafka doesn't duplicate events. The broker uses producer IDs and sequence numbers to deduplicate, and transactions to ensure atomic writes across multiple partitions. This is real, it works, and it's a significant engineering achievement. But it only applies to the Kafka-to-Kafka pipeline.
-
Effectively-once with idempotent consumers. The system delivers at-least-once, but the consumer is designed so that processing the same event multiple times has the same effect as processing it once. The event might be delivered more than once, but it's processed exactly once in terms of its effect on the world.
// Kafka exactly-once: transactional produce-consume-produce
Properties props = new Properties();
props.put("transactional.id", "order-processor-1");
props.put("enable.idempotence", true);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();

try {
    producer.beginTransaction();

    // Consume from input topic
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // Process and produce to output topic
        ProducerRecord<String, String> output = transform(record);
        producer.send(output);
    }

    // Atomically commit consumer offsets and producer writes
    producer.sendOffsetsToTransaction(currentOffsets, consumerGroupMetadata);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}
This is powerful and useful. It's also limited to the Kafka-internal pipeline. The moment your consumer calls an external HTTP API, writes to a database, or sends an email, you're back to at-least-once with idempotency requirements.
The Honest Version
Here's what you should tell your stakeholders: "Our system provides at-least-once delivery with idempotent processing, which means every event will be processed, and processing it more than once won't cause incorrect behavior. This is what the industry calls 'effectively once' and it's what 'exactly-once' actually means in any real-world system that interacts with external services."
That's less catchy than "exactly-once," but it has the advantage of being true.
Idempotent Consumers: The Real Solution
Since you're going to receive duplicates — and you are, it's not a question of if — your consumers need to handle them gracefully. An idempotent operation is one where performing it multiple times has the same result as performing it once.
Strategies for Idempotency
1. Natural Idempotency
Some operations are naturally idempotent:
- SET balance = 100 (idempotent: same result every time)
- INSERT OR UPDATE customer SET name = 'Jane' (idempotent)
- DELETE FROM orders WHERE id = 'ord-123' (idempotent after first execution)
Some are not:
- INCREMENT balance BY 10 (not idempotent: each execution adds 10)
- INSERT INTO ledger (amount) VALUES (10) (not idempotent: creates new rows)
- SEND EMAIL to customer (very not idempotent: customer gets annoyed)
2. Deduplication with Event IDs
Every event should carry a unique identifier. The consumer tracks which IDs it has already processed and skips duplicates.
class IdempotentConsumer:
    def __init__(self, db, consumer_name):
        self.db = db
        self.consumer_name = consumer_name

    def handle_event(self, event):
        event_id = event['metadata']['eventId']

        # Check if we've already processed this event
        if self.db.exists('processed_events', event_id):
            log.info(f"Skipping duplicate event: {event_id}")
            return

        # Process the event within a transaction
        with self.db.transaction() as tx:
            self._process(event, tx)
            # Record that we've processed this event
            tx.insert('processed_events', {
                'event_id': event_id,
                'processed_at': datetime.utcnow(),
                'consumer': self.consumer_name
            })

    def _process(self, event, tx):
        # Actual business logic here
        order = event['payload']
        tx.insert('orders', {
            'id': order['orderId'],
            'customer_id': order['customerId'],
            'total': order['totalAmount']
        })
Critical detail: The business logic and the deduplication record must be in the same transaction. If you process the event, crash before recording it, and then process it again on redelivery, you've defeated the purpose.
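An alternative to the explicit existence check is to let the database enforce the deduplication: make the event ID a primary key on the processed-events table and treat a uniqueness violation as "already processed." A self-contained sketch with sqlite3; the table names echo the example above, and the flat event shape is simplified for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")

def handle_event(event: dict) -> bool:
    """Process the event once; return False for a duplicate delivery."""
    try:
        with db:  # the business write and the dedup record commit atomically
            db.execute("INSERT INTO processed_events VALUES (?)",
                       (event["eventId"],))
            db.execute("INSERT INTO orders VALUES (?, ?)",
                       (event["orderId"], event["total"]))
        return True
    except sqlite3.IntegrityError:
        # event_id already recorded: the whole transaction rolled back,
        # so this redelivery had no effect
        return False

evt = {"eventId": "evt-1", "orderId": "ord-123", "total": 99.99}
assert handle_event(evt) is True
assert handle_event(evt) is False  # redelivery is a no-op
```

This closes the crash window described above: there is no moment where the business write exists without the dedup record, because they are the same transaction.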
3. Idempotency Keys in External Calls
When your consumer calls an external service (payment gateway, email provider, shipping API), pass an idempotency key so the external service can deduplicate on its end.
def process_payment(event):
    idempotency_key = f"payment-{event['orderId']}-{event['eventId']}"
    response = payment_gateway.charge(
        amount=event['totalAmount'],
        currency=event['currency'],
        idempotency_key=idempotency_key  # Gateway deduplicates on this
    )
    return response
Most serious payment APIs support this. Stripe, for example, accepts an Idempotency-Key header that guarantees the same charge isn't processed twice, regardless of how many times you call the API with that key.
4. Conditional Writes
Use optimistic concurrency control to make writes idempotent:
def apply_discount(event, db):
    order_id = event['orderId']
    expected_version = event['orderVersion']

    rows_affected = db.execute("""
        UPDATE orders
        SET discount = %s, version = version + 1
        WHERE id = %s AND version = %s
    """, [event['discount'], order_id, expected_version])

    if rows_affected == 0:
        # Either the order doesn't exist or the version doesn't match.
        # If the version doesn't match, we've already applied this
        # (or a later) update. Either way, we're done.
        log.info(f"Conditional write skipped for order {order_id}")
Retry Strategies
When processing fails, you retry. But how you retry matters enormously. A naive retry strategy can turn a transient failure into a cascading outage.
Immediate Retry
Retry instantly. This works for genuinely transient errors — a momentary network blip, a brief connection pool exhaustion. It fails catastrophically for errors that need time to resolve, because you're hammering the failing service at full speed.
# Don't do this in production without a limit
def process_with_immediate_retry(event, max_retries=3):
    for attempt in range(max_retries):
        try:
            return process(event)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            continue  # Try again immediately
Use when: The error is almost certainly a momentary glitch, and the downstream service can handle the retry volume. So, almost never.
Fixed Delay
Wait a fixed amount of time between retries. Better than immediate retry because it gives the downstream system time to recover.
def process_with_fixed_delay(event, max_retries=3, delay_seconds=5):
    for attempt in range(max_retries):
        try:
            return process(event)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay_seconds)
Use when: You have a rough idea of how long recovery takes and the volume of retries is modest.
Exponential Backoff
Each retry waits longer than the last: 1s, 2s, 4s, 8s, 16s, and so on. This is the standard approach for most failure scenarios, because it starts optimistic (maybe it's a quick fix) and becomes progressively more patient.
def process_with_exponential_backoff(event, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return process(event)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 1, 2, 4, 8, 16
            time.sleep(delay)
Exponential Backoff with Jitter
Here's the problem with exponential backoff: if 100 consumers all fail at the same time (because a downstream service went down), they all retry at the same time, creating a thundering herd that takes down the service again as soon as it recovers.
Jitter randomizes the retry delay to spread out the load:
import random
def process_with_backoff_and_jitter(event, max_retries=5, base_delay=1, max_delay=60):
    for attempt in range(max_retries):
        try:
            return process(event)
        except TransientError:
            if attempt == max_retries - 1:
                raise
            # Full jitter: random delay between 0 and the exponential ceiling
            exp_delay = min(base_delay * (2 ** attempt), max_delay)
            delay = random.uniform(0, exp_delay)
            time.sleep(delay)
AWS's architecture blog has an excellent analysis of jitter strategies. In their simulations, the "full jitter" approach (random between 0 and the exponential ceiling) beats "equal jitter" and performs on par with "decorrelated jitter," while being the simplest to implement. Use full jitter. Your future self will thank you.
The Retry Budget
Beyond per-event retries, consider a system-wide retry budget: a limit on the total number of retries per time window across all events.
class RetryBudget:
    def __init__(self, max_retries_per_minute=100):
        self.max_retries = max_retries_per_minute
        self.retry_count = 0
        self.window_start = time.time()

    def can_retry(self):
        now = time.time()
        if now - self.window_start > 60:
            self.retry_count = 0
            self.window_start = now
        return self.retry_count < self.max_retries

    def record_retry(self):
        self.retry_count += 1

retry_budget = RetryBudget(max_retries_per_minute=100)

def process_with_budget(event):
    try:
        return process(event)
    except TransientError:
        if retry_budget.can_retry():
            retry_budget.record_retry()
            requeue(event)
        else:
            send_to_dlq(event)
Without a retry budget, a sustained downstream outage can cause your retry queue to grow without bound, consuming memory and network resources and potentially causing your consumer to fall behind on non-failing events.
Dead Letter Queues: Your Event Purgatory
When an event has exhausted its retries and still can't be processed, it goes to the dead letter queue (DLQ). The DLQ is where events go to wait for a human to figure out what went wrong.
Anatomy of a Good DLQ
A DLQ isn't just a dumping ground. A well-designed DLQ includes:
- The original event, unmodified.
- Error metadata: the exception message, stack trace, consumer name, timestamp of last failure, number of retry attempts.
- Routing metadata: which topic it came from, which consumer group failed on it, which partition and offset.
import traceback
from datetime import datetime

def send_to_dlq(event, error, context):
dlq_envelope = {
'originalEvent': event,
'error': {
'message': str(error),
'type': type(error).__name__,
'stackTrace': traceback.format_exc(),
'timestamp': datetime.utcnow().isoformat()
},
'source': {
'topic': context.topic,
'partition': context.partition,
'offset': context.offset,
'consumerGroup': context.consumer_group
},
'retryHistory': {
'attempts': context.retry_count,
'firstAttempt': context.first_attempt_time.isoformat(),
'lastAttempt': datetime.utcnow().isoformat()
}
}
producer.send(f"{context.topic}.dlq", dlq_envelope)
DLQ Processing Patterns
Events in the DLQ aren't dead — they're in purgatory. You need processes for dealing with them:
Manual review and replay: An operator examines the failed event, fixes the underlying issue (deploys a bug fix, corrects bad data), and replays the event back to the original topic.
Automated retry with delay: A separate consumer reads from the DLQ, waits a configurable period (hours or days, not seconds), and resubmits events to the original topic. This handles cases where the failure was caused by a temporary condition that resolved itself.
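The automated-retry-with-delay pattern can be sketched as a redrive policy: resubmit a DLQ event only after it has cooled off for a configurable period, and only a bounded number of times. This reuses the DLQ envelope fields from earlier in the chapter; the `redriveCount` field and the function names are illustrative, not a standard API.

```python
from datetime import datetime, timedelta

def should_resubmit(dlq_envelope, now, cooloff=timedelta(hours=6), max_redrives=3):
    """Resubmit only after the cool-off period, and only a few times total."""
    last_attempt = datetime.fromisoformat(dlq_envelope['retryHistory']['lastAttempt'])
    redrives = dlq_envelope.get('redriveCount', 0)
    return redrives < max_redrives and (now - last_attempt) >= cooloff

def redrive(dlq_envelope, producer, now):
    """Resubmit the original event to its source topic, tracking redrive count."""
    if not should_resubmit(dlq_envelope, now):
        return False
    dlq_envelope['redriveCount'] = dlq_envelope.get('redriveCount', 0) + 1
    producer.send(dlq_envelope['source']['topic'], dlq_envelope['originalEvent'])
    return True
```

The redrive count matters: without it, a deterministic failure would bounce between the main topic and the DLQ forever.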
Automated triage: A DLQ consumer classifies errors and routes events to different handling queues:
- Deserialization errors → schema mismatch queue (probably needs a code fix)
- Timeout errors → delayed retry queue (probably transient)
- Validation errors → data quality queue (probably needs manual correction)
class DLQTriageConsumer:
def handle(self, dlq_event):
error_type = dlq_event['error']['type']
if error_type in ('SerializationError', 'SchemaError'):
self.route_to('schema-issues', dlq_event)
elif error_type in ('TimeoutError', 'ConnectionError'):
self.route_to('delayed-retry', dlq_event)
elif error_type == 'ValidationError':
self.route_to('data-quality', dlq_event)
else:
self.route_to('unknown-errors', dlq_event)
self.alert_on_call_engineer(dlq_event)
The DLQ Naming Convention
Use a consistent naming scheme so it's obvious which DLQ belongs to which topic:
orders.order-created → orders.order-created.dlq
payments.payment-processed → payments.payment-processed.dlq
Or, for consumer-specific DLQs (when multiple consumers process the same topic and might fail for different reasons):
orders.order-created.fulfillment-service.dlq
orders.order-created.analytics-service.dlq
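A convention is only useful if every service applies it identically, so it's worth centralizing in a tiny helper rather than string-formatting it in each consumer. A minimal sketch:

```python
def dlq_topic(topic, consumer_group=None):
    """Derive the DLQ topic name from a source topic, optionally per-consumer."""
    if consumer_group:
        return f"{topic}.{consumer_group}.dlq"
    return f"{topic}.dlq"
```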
Poison Pills: Events That Will Never Succeed
A poison pill is an event that will cause the consumer to fail no matter how many times you retry it. Corrupted data, malformed JSON, an event that triggers a bug in your processing logic — these will never succeed, and retrying them is worse than useless.
Identifying Poison Pills
The first step is recognizing that an event is a poison pill rather than a transient failure:
class EventProcessor:
MAX_RETRIES_TRANSIENT = 5
MAX_RETRIES_TOTAL = 10
def process(self, event, retry_count):
try:
# Deserialization — if this fails, it's a poison pill
parsed = self.deserialize(event)
except (json.JSONDecodeError, SchemaError) as e:
# Deterministic failure. Don't retry.
self.send_to_dlq(event, e, poison_pill=True)
return
try:
# Business logic — might be transient or permanent
self.handle(parsed)
except TransientError as e:
if retry_count < self.MAX_RETRIES_TRANSIENT:
self.retry(event, retry_count + 1)
else:
self.send_to_dlq(event, e, poison_pill=False)
except ValidationError as e:
# Deterministic business logic failure
self.send_to_dlq(event, e, poison_pill=True)
except Exception as e:
# Unknown error — retry a few times, then DLQ
if retry_count < self.MAX_RETRIES_TOTAL:
self.retry(event, retry_count + 1)
else:
self.send_to_dlq(event, e, poison_pill=False)
The key insight: categorize errors as deterministic (poison pill) or non-deterministic (transient). Deterministic failures should go straight to the DLQ — retrying them wastes time and, more importantly, blocks processing of subsequent events if you're consuming from an ordered partition.
The Poison Pill Partition Problem
In Kafka, messages within a partition are processed in order. If your consumer encounters a poison pill and keeps retrying it, every subsequent message in that partition is blocked. This is the single most common cause of "consumer lag alert firing, nobody knows why."
Partition 0: [msg1] [msg2] [POISON] [msg4] [msg5] [msg6] ...
                              ↑
                              Consumer stuck here.
                              msg4, msg5, msg6 are waiting.
                              Lag is growing.
                              Your pager is about to go off.
The solution: detect poison pills quickly (within 1-2 retries), send them to the DLQ, and move on. Do not allow a single bad event to block an entire partition.
Circuit Breakers in Async Systems
The circuit breaker pattern, borrowed from electrical engineering, prevents a failing downstream service from being hammered with requests. In synchronous systems, it's well-understood: after N consecutive failures, the circuit "opens" and all requests fail fast without calling the downstream service. After a timeout, the circuit enters "half-open" state and allows a single test request through.
In async systems, circuit breakers are trickier because the consumer doesn't make synchronous calls in the traditional sense. But the principle still applies when your consumer depends on external services.
class CircuitBreaker:
CLOSED = 'closed'
OPEN = 'open'
HALF_OPEN = 'half_open'
def __init__(self, failure_threshold=5, reset_timeout=60):
self.state = self.CLOSED
self.failure_count = 0
self.failure_threshold = failure_threshold
self.reset_timeout = reset_timeout
self.last_failure_time = None
def call(self, func, *args, **kwargs):
if self.state == self.OPEN:
if time.time() - self.last_failure_time > self.reset_timeout:
self.state = self.HALF_OPEN
else:
raise CircuitOpenError("Circuit breaker is open")
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
self.failure_count = 0
self.state = self.CLOSED
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = self.OPEN
# Usage in an event consumer
circuit_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30)
def process_order(event):
try:
circuit_breaker.call(payment_service.charge, event['orderId'], event['amount'])
except CircuitOpenError:
# Don't even try — the payment service is down.
# Pause consumption or route to a retry topic.
requeue_with_delay(event, delay_seconds=30)
The Consumer Pause Pattern
When the circuit opens, you have a choice: let events pile up in the broker (consumer lag grows) or pause the consumer. Pausing is usually better:
class CircuitAwareConsumer:
def run(self):
while True:
if self.circuit_breaker.state == CircuitBreaker.OPEN:
# Stop fetching new events until the circuit closes
log.warning("Circuit open — pausing consumer")
time.sleep(self.circuit_breaker.reset_timeout)
continue
events = self.consumer.poll(timeout_ms=1000)
for event in events:
self.process(event)
This keeps the events safe in the broker (where they have retention and replay capability) rather than accumulating in memory.
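With confluent_kafka specifically, a cleaner variant uses the client's `pause()`/`resume()` methods instead of sleeping: a paused consumer keeps polling (and therefore heartbeating), so it isn't ejected from the group during a long outage. A sketch of one loop iteration; `breaker` is any object with a `.state` attribute as in the CircuitBreaker above, and the fake-friendly structure is for illustration:

```python
def consume_step(consumer, breaker, process, paused):
    """Run one iteration of the poll loop; return the updated paused flag."""
    if breaker.state == 'open' and not paused:
        consumer.pause(consumer.assignment())   # stop fetching new records
        paused = True
    elif breaker.state != 'open' and paused:
        consumer.resume(consumer.assignment())  # circuit closed; fetch again
        paused = False
    msg = consumer.poll(timeout=1.0)            # still heartbeats while paused
    if msg is not None:
        process(msg)
    return paused
```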
Ordering vs. Retry Tension
Here's a genuinely thorny problem: what happens when you need both ordered processing and retry capability?
The Problem
Consider a partition with events for the same entity:
Partition 0: [OrderCreated:ord-1] [OrderUpdated:ord-1] [OrderShipped:ord-1]
These must be processed in order — you can't ship an order before creating it. But what if OrderCreated fails transiently? You want to retry it, but you can't skip ahead to OrderUpdated while you wait.
If you retry by requeuing to the end of the topic, you've now got:
Partition 0: ... [OrderUpdated:ord-1] [OrderShipped:ord-1] ... [OrderCreated:ord-1]
Processing OrderUpdated before OrderCreated is incorrect. Processing OrderShipped before either is nonsensical.
Solutions
1. Block and Retry In-Place
The simplest approach: keep retrying the failing event without advancing the consumer offset. Subsequent events in the partition wait.
def consume_with_ordered_retry(consumer, max_retries=10):
while True:
event = consumer.poll()
retries = 0
while retries < max_retries:
try:
process(event)
consumer.commit()
break
except TransientError:
retries += 1
time.sleep(exponential_backoff(retries))  # e.g. min(base * 2 ** retries, cap), plus jitter
else:  # while-else: runs only when the loop ends without break (retries exhausted)
# Exhausted retries — DLQ and move on
send_to_dlq(event)
consumer.commit()
Downside: You're blocking the entire partition. If the failure persists, lag grows, and every entity in that partition is affected, not just the one with the failing event.
2. Per-Entity Retry with Buffering
Track which entities are "in retry" and buffer subsequent events for those entities while processing events for other entities normally.
class OrderedRetryConsumer:
def __init__(self):
self.retry_buffer = defaultdict(list) # entity_id -> [events]
self.entities_in_retry = set()
def process(self, event):
entity_id = event['entityId']
if entity_id in self.entities_in_retry:
# Buffer this event — we'll process it after the retry succeeds
self.retry_buffer[entity_id].append(event)
return
try:
handle(event)
except TransientError:
self.entities_in_retry.add(entity_id)
schedule_retry(event, on_success=self._flush_buffer,
on_failure=self._send_entity_to_dlq)
def _flush_buffer(self, entity_id):
self.entities_in_retry.discard(entity_id)
for buffered_event in self.retry_buffer.pop(entity_id, []):
self.process(buffered_event)
def _send_entity_to_dlq(self, entity_id):
self.entities_in_retry.discard(entity_id)
for buffered_event in self.retry_buffer.pop(entity_id, []):
send_to_dlq(buffered_event, reason="Prior event for entity failed")
Downside: Complexity. Memory pressure if many entities are in retry simultaneously. You're reimplementing a significant chunk of what the broker does.
3. Retry Topics with Ordering Keys
Use a dedicated retry topic with the same partitioning key, so entity ordering is maintained within the retry flow:
Main topic (partition by orderId): events flow normally
↓ (on failure)
Retry topic (partition by orderId): failed events, with delay
↓ (after delay)
Main topic: retried events rejoin the main flow
This is the approach Uber uses in their event processing infrastructure (documented in their engineering blog), and it's well-suited for high-volume systems.
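The crucial detail is that the retry topic is produced with the SAME key as the main topic, so all events for an entity still hash to the same partition. A minimal sketch under that assumption; the envelope fields and the `notBefore` delay mechanism are illustrative, not a specific framework's API:

```python
import json
import time

def route_to_retry(producer, topic, key, event, delay_seconds):
    """On failure, publish to '<topic>.retry' with the same partitioning key."""
    envelope = {'event': event, 'key': key, 'source': topic,
                'notBefore': time.time() + delay_seconds}
    producer.send(f"{topic}.retry", key=key, value=json.dumps(envelope))

def retry_step(producer, envelope_json, now):
    """Re-inject a retry-topic event once its delay has elapsed; True if sent."""
    envelope = json.loads(envelope_json)
    if now < envelope['notBefore']:
        return False  # delay not yet elapsed; leave it for a later pass
    producer.send(envelope['source'], key=envelope['key'],
                  value=json.dumps(envelope['event']))
    return True
```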
Transactional Outbox: Reliable Publishing
There's a failure mode that trips up every event-driven system eventually: the dual-write problem. Your service needs to update its database AND publish an event, atomically. If either succeeds without the other, the system is inconsistent.
# THE DANGEROUS WAY — dual write
def create_order(order):
db.insert('orders', order) # Step 1: write to DB
producer.send('orders', event) # Step 2: publish event
# What if we crash between step 1 and step 2?
# DB has the order, but no event was published.
# Or: what if step 1 succeeds, step 2 fails?
The Outbox Pattern
Instead of publishing directly, write the event to an "outbox" table in the same database transaction as the business data. A separate process reads the outbox and publishes to the broker.
def create_order(order):
with db.transaction() as tx:
# Business data and event in the SAME transaction
tx.insert('orders', order.to_dict())
tx.insert('outbox', {
'id': uuid4(),
'aggregate_type': 'Order',
'aggregate_id': order.id,
'event_type': 'OrderCreated',
'payload': json.dumps(order.to_event()),
'created_at': datetime.utcnow(),
'published': False
})
# Transaction commits atomically — both or neither.
The outbox publisher runs as a separate process:
class OutboxPublisher:
def __init__(self, db, producer, poll_interval=1):
self.db = db
self.producer = producer
self.poll_interval = poll_interval
def run(self):
while True:
events = self.db.query(
"SELECT * FROM outbox WHERE published = FALSE "
"ORDER BY created_at LIMIT 100"
)
for event in events:
try:
self.producer.send(
topic=f"{event['aggregate_type'].lower()}s",
key=event['aggregate_id'],
value=event['payload']
)
self.db.update('outbox',
{'published': True, 'published_at': datetime.utcnow()},
where={'id': event['id']})
except Exception as e:
log.error(f"Failed to publish outbox event {event['id']}: {e}")
# Will retry on next poll
time.sleep(self.poll_interval)
Change Data Capture as an Alternative
Instead of polling the outbox table, use change data capture (CDC) to stream changes from the outbox table to the broker. Debezium is the standard tool for this:
[Application] → writes to → [Database (outbox table)]
↓ CDC
[Debezium Connector]
↓
[Kafka Topic]
CDC eliminates the polling overhead and provides lower latency. It also means your application doesn't need to know about the broker at all — it just writes to the database, and the infrastructure handles the rest.
The tradeoff: CDC adds operational complexity (you're running Debezium, which needs its own monitoring and care). But for high-volume systems, it's worth it.
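As a rough sketch, a Debezium outbox deployment is mostly configuration: a connector watches the outbox table and the EventRouter transform turns each row into a properly routed event. Exact option names and defaults vary by Debezium version and database, so treat this as illustrative:

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload"
  }
}
```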
Error Handling in Sagas: Compensating Transactions
A saga is a sequence of local transactions across multiple services, where each step either succeeds or triggers compensating actions to undo the effects of previous steps. Error handling in sagas is where things get genuinely interesting — and by "interesting" I mean "complex enough to warrant its own whiteboard session."
The Choreography Approach
In a choreographed saga, each service listens for events and decides what to do next, including how to compensate:
1. OrderService: OrderCreated →
2. PaymentService: (hears OrderCreated) → PaymentCharged →
3. InventoryService: (hears PaymentCharged) → InventoryReserved →
4. ShippingService: (hears InventoryReserved) → ShipmentScheduled
# But what if step 3 fails?
3. InventoryService: (hears PaymentCharged) → InventoryReservationFailed →
2. PaymentService: (hears InventoryReservationFailed) → PaymentRefunded →
1. OrderService: (hears PaymentRefunded) → OrderCancelled
Every forward step has a corresponding compensating step. The compensating steps must be idempotent (because they might be triggered more than once) and must be tolerant of partial state (because the forward step might have partially completed).
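Both properties can be made concrete with a small sketch of a compensation handler: it checks for an existing refund before doing anything (idempotent under redelivery) and treats a never-completed charge as "nothing to undo" (tolerant of partial state). `db` here is an illustrative key-value store, not a real client:

```python
def refund_payment(db, order_id):
    """Idempotent, partial-state-tolerant compensation for a payment charge."""
    if db.get(('refund', order_id)):      # already compensated; replay is a no-op
        return 'already_refunded'
    charge = db.get(('charge', order_id))
    if charge is None:                    # forward step never completed
        return 'nothing_to_refund'
    db[('refund', order_id)] = {'amount': charge['amount']}
    return 'refunded'
```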
The Orchestration Approach
An orchestrator service coordinates the saga and handles failures explicitly:
class OrderSaga:
def __init__(self, order_id):
self.order_id = order_id
self.state = 'STARTED'
self.completed_steps = []
def execute(self):
steps = [
('reserve_inventory', self.reserve_inventory, self.release_inventory),
('charge_payment', self.charge_payment, self.refund_payment),
('schedule_shipping', self.schedule_shipping, self.cancel_shipping),
]
for step_name, forward, compensate in steps:
try:
forward()
self.completed_steps.append((step_name, compensate))
self.state = f'{step_name}_COMPLETED'
except Exception as e:
log.error(f"Saga step {step_name} failed: {e}")
self.state = f'{step_name}_FAILED'
self._compensate()
return
self.state = 'COMPLETED'
def _compensate(self):
# Compensate in reverse order
for step_name, compensate in reversed(self.completed_steps):
try:
compensate()
except Exception as e:
# Compensation failure — this is the nightmare scenario.
# Log it, alert a human, and pray.
log.critical(
f"COMPENSATION FAILED for {step_name}: {e}. "
f"Manual intervention required for order {self.order_id}"
)
When Compensation Fails
What happens when the compensating transaction itself fails? This is the question that keeps saga designers up at night.
Options:
- Retry the compensation with exponential backoff. Most compensation failures are transient.
- Log and alert. A human investigates and manually corrects the state.
- Compensation journal. Write all pending compensations to a durable store. A background process retries them until they succeed.
There is no fully automatic solution. At some point, a human may need to reconcile state across services. Design your system so that identifying and fixing inconsistencies is possible, even if it's not pleasant.
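The compensation-journal option can be sketched in a few lines: pending compensations go to a durable store, and a background sweep retries each until it succeeds. The class and method names are illustrative; in production the list would be a database table and the sweep a scheduled job:

```python
class CompensationJournal:
    def __init__(self):
        self.pending = []  # in production: a durable table, not an in-memory list

    def record(self, action_name, args):
        self.pending.append({'action': action_name, 'args': args, 'attempts': 0})

    def sweep(self, actions):
        """Retry every pending compensation; keep only the ones that still fail."""
        still_pending = []
        for entry in self.pending:
            try:
                actions[entry['action']](*entry['args'])
            except Exception:
                entry['attempts'] += 1
                still_pending.append(entry)
        self.pending = still_pending
```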
Monitoring and Alerting on DLQs
A DLQ that nobody monitors is worse than no DLQ at all — it gives you the illusion of safety while events silently rot.
Essential DLQ Metrics
# Prometheus-style metrics you should be tracking
dlq_events_total:
description: "Total events sent to DLQ"
labels: [source_topic, consumer_group, error_type]
alert: "Rate > 10/min for 5 minutes"
dlq_events_pending:
description: "Events in DLQ not yet resolved"
labels: [source_topic, consumer_group]
alert: "Count > 100 for 30 minutes"
dlq_event_age_seconds:
description: "Age of oldest unresolved DLQ event"
labels: [source_topic, consumer_group]
alert: "Age > 3600 (1 hour)"
dlq_resolution_rate:
description: "Rate of events being resolved from DLQ"
labels: [source_topic, resolution_type] # replay, discard, manual_fix
The DLQ Dashboard
Your DLQ dashboard should answer these questions at a glance:
- How many events are in purgatory right now? (Total, by source topic, by error type.)
- Is the inflow rate increasing? (A spike means something broke. A gradual increase means something is degrading.)
- How old is the oldest event? (A DLQ event that's been sitting there for a week is a DLQ event that nobody's looking at.)
- What's the distribution of error types? (One dominant error type suggests a single root cause. Many error types suggest broader problems.)
- Are events being resolved? (If the inflow exceeds the outflow, the DLQ is growing. That's a problem.)
Alert Fatigue Warning
Be judicious with alerts. A DLQ that receives one event per hour is normal wear and tear in a large system. A DLQ that receives one hundred events per minute is an incident. Set your thresholds based on your system's normal behavior, not on the theoretical ideal of zero DLQ events.
# Alert logic — alert on anomalies, not on absolute counts
class DLQAlertEvaluator:
def evaluate(self, current_rate, baseline_rate):
if current_rate > baseline_rate * 5:
return Alert.CRITICAL, f"DLQ rate is 5x above baseline ({current_rate}/min vs {baseline_rate}/min)"
elif current_rate > baseline_rate * 2:
return Alert.WARNING, f"DLQ rate is elevated ({current_rate}/min vs {baseline_rate}/min)"
else:
return Alert.OK, None
Summary
Error handling in event-driven systems is not a feature you add at the end. It's a fundamental architectural concern that shapes your design from day one.
The essential lessons:
- Accept at-least-once delivery. Build idempotent consumers. Stop chasing the exactly-once unicorn for anything that touches external systems.
- Categorize errors early. Distinguish between transient failures (retry) and deterministic failures (DLQ immediately). Don't waste time retrying poison pills.
- Use exponential backoff with jitter. Always. No exceptions. The thundering herd is real and it is not your friend.
- Design your DLQs like first-class citizens. Rich error metadata, clear naming conventions, monitoring and alerting, resolution workflows.
- Use the transactional outbox pattern for reliable publishing. The dual-write problem will bite you; it's a matter of when, not if.
- Plan for saga compensation failures. Sometimes the undo fails too. Have a manual fallback.
- Monitor your DLQs. An unmonitored DLQ is a lie you're telling yourself about system reliability.
The difference between a fragile event-driven system and a resilient one isn't the happy path — it's how thoroughly you've thought about everything that can go wrong. And in a distributed system, the things that can go wrong are limited only by your imagination and Murphy's law.
Observability and Debugging
Congratulations. You've built an event-driven system. Events flow between services, business logic executes asynchronously, and everything works beautifully — until it doesn't, and you spend four hours staring at six different log aggregators trying to figure out why a customer's order vanished into the void between the payment service and the fulfillment service.
Debugging event-driven systems is a fundamentally different discipline from debugging monolithic or synchronous systems. The tools are different, the mindset is different, and the difficulty level is — let's be diplomatic — elevated. This chapter covers the observability practices that will save you from despair, or at least reduce the despair to manageable levels.
Why Traditional Debugging Fails in Event-Driven Systems
In a monolith, you can set a breakpoint, step through code, and watch a request flow from entry point to database and back. The execution is linear, the state is local, and the call stack tells you everything you need to know.
In an event-driven system, none of this is true:
- There is no call stack. An event is published by Service A, consumed by Service B (maybe minutes later), which publishes another event consumed by Service C. The "stack" spans processes, machines, and time. Your debugger can't step across a Kafka topic.
- Execution is non-linear. A single incoming event might fan out to ten consumers. Each consumer might publish additional events. The execution graph is a DAG, not a stack.
- Time is a variable. In a synchronous system, cause and effect are milliseconds apart. In an async system, they might be seconds, minutes, or hours apart. The event that caused a failure at 3 PM might have been published at 11 AM. Good luck finding that in your logs.
- State is distributed. The full state of a business process is spread across multiple services' databases, multiple topic partitions, and multiple consumer group offsets. No single service has the complete picture.
- Reproduction is hard. You can't just "replay the request" because the system state has changed since the original event was published. Other events have been processed, database rows have been modified, external services have been called. The window of reproduction closed before you even knew there was a bug.
Traditional logging — printing "processing order ord-123" in each service — gives you fragments. What you need is a way to stitch those fragments together into a coherent narrative. That's observability.
The Three Pillars: Logs, Metrics, Traces
The observability community has settled on three complementary signal types. You need all three. Skipping one is like removing a leg from a three-legged stool — technically possible to balance, but you'll fall eventually.
Logs: What Happened
Structured logs are the foundation. Not printf debugging, not unstructured text that you'll regex later — structured, machine-parseable log events with consistent fields.
{
"timestamp": "2025-03-15T14:22:33.456Z",
"level": "INFO",
"service": "fulfillment-service",
"correlationId": "corr-abc-123",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"eventType": "OrderCreated",
"eventId": "evt-789",
"orderId": "ord-456",
"message": "Processing order for fulfillment",
"durationMs": 23
}
Non-negotiable fields in every log line:
- correlationId: ties together all logs for a single business operation.
- traceId and spanId: tie into distributed tracing (more on this below).
- service: which service emitted this log.
- eventType and eventId: which event triggered this work.
- timestamp: with timezone. Always UTC. Fight me.
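With the stdlib logging module, these fields can be attached via `extra` and serialized by a small JSON formatter. A minimal sketch; the field list mirrors the example log line above, and the class name is illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying context fields if present."""
    CONTEXT_FIELDS = ('service', 'correlationId', 'traceId', 'spanId',
                      'eventType', 'eventId')

    def format(self, record):
        line = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                line[field] = getattr(record, field)
        return json.dumps(line)
```

Usage looks like `log.info("Processing order", extra={'correlationId': corr_id, 'service': 'fulfillment-service'})`, with the formatter attached to the handler.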
Metrics: How It's Going
Metrics are aggregated numerical data over time. They answer "how many," "how fast," and "how broken."
Essential event-driven metrics:
# Producer metrics
events_published_total{topic, event_type} # Counter
event_publish_duration_seconds{topic} # Histogram
event_publish_errors_total{topic, error_type} # Counter
# Consumer metrics
events_consumed_total{topic, consumer_group} # Counter
event_processing_duration_seconds{topic, event_type} # Histogram
event_processing_errors_total{topic, error_type} # Counter
consumer_lag{topic, partition, consumer_group} # Gauge
# Broker metrics
topic_message_count{topic} # Gauge
partition_offset_latest{topic, partition} # Gauge
# DLQ metrics
dlq_events_total{source_topic, error_type} # Counter
dlq_events_pending{source_topic} # Gauge
The two most important metrics in any event-driven system: event processing duration (is anything getting slow?) and consumer lag (is anything falling behind?). If you monitor nothing else, monitor these.
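Consumer lag itself is just arithmetic: the newest offset the broker holds for a partition minus the offset the consumer group has committed, summed across partitions. A sketch with plain dicts; in practice the numbers come from the broker's admin API or an exporter:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Return (per-partition lag, total lag) for a consumer group."""
    lag = {p: latest_offsets[p] - committed_offsets.get(p, 0)
           for p in latest_offsets}
    return lag, sum(lag.values())
```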
Traces: The Journey
A distributed trace represents the end-to-end journey of a request through multiple services. In an event-driven system, it represents the journey of a business operation through multiple event-processing steps.
A trace is composed of spans, each representing a unit of work:
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
[Span 1: api-gateway] POST /orders (12ms)
└── [Span 2: order-service] CreateOrder (8ms)
└── [Span 3: order-service] PublishOrderCreated (3ms)
└── [Span 4: payment-service] ProcessPayment (45ms)
├── [Span 5: payment-service] ChargeCard (40ms)
└── [Span 6: payment-service] PublishPaymentProcessed (2ms)
└── [Span 7: fulfillment-service] CreateShipment (15ms)
└── [Span 8: fulfillment-service] PublishShipmentCreated (2ms)
The challenge in event-driven systems is that spans 3 and 4 are separated by a Kafka topic. The payment service doesn't receive an HTTP call from the order service — it receives an event from a topic. The trace context must be propagated through the event for the trace to remain connected.
Correlation IDs: Threading Context Through Async Flows
A correlation ID is a unique identifier generated at the beginning of a business operation and carried through every subsequent event and service call. It's the thread that lets you pull on one end and unravel the entire operation.
Generating Correlation IDs
The correlation ID is typically generated at the system boundary — the API gateway, the initial event producer, or whatever first receives the business request:
import uuid
class APIGateway:
def handle_request(self, request):
# Generate or extract correlation ID
correlation_id = request.headers.get(
'X-Correlation-ID',
str(uuid.uuid4())
)
# Pass to downstream service
response = order_service.create_order(
order_data=request.body,
correlation_id=correlation_id
)
return response
Propagating Through Events
The correlation ID must travel with the event. Put it in the event metadata, not the payload — it's infrastructure context, not business data.
class OrderService:
def create_order(self, order_data, correlation_id):
order = Order.create(order_data)
event = {
'metadata': {
'eventId': str(uuid.uuid4()),
'eventType': 'OrderCreated',
'correlationId': correlation_id,
'causationId': None, # This is the root event
'timestamp': datetime.utcnow().isoformat(),
'source': 'order-service'
},
'payload': order.to_dict()
}
producer.send('orders', key=order.id, value=event)
return order
The Causation Chain
Beyond correlation IDs, maintain causation IDs to track which event caused which. The causation ID of an event is the event ID of the event that triggered its creation.
class PaymentService:
def handle_order_created(self, event):
correlation_id = event['metadata']['correlationId']
causing_event_id = event['metadata']['eventId']
# Process payment...
payment = process_payment(event['payload'])
# Publish with causation chain
payment_event = {
'metadata': {
'eventId': str(uuid.uuid4()),
'eventType': 'PaymentProcessed',
'correlationId': correlation_id, # Same correlation ID
'causationId': causing_event_id, # Points to OrderCreated
'timestamp': datetime.utcnow().isoformat(),
'source': 'payment-service'
},
'payload': payment.to_dict()
}
producer.send('payments', key=payment.order_id, value=payment_event)
With this chain, you can reconstruct the complete event lineage for any business operation:
OrderCreated (evt-001, causation: null, correlation: corr-abc)
└── PaymentProcessed (evt-002, causation: evt-001, correlation: corr-abc)
└── InventoryReserved (evt-003, causation: evt-002, correlation: corr-abc)
└── ShipmentCreated (evt-004, causation: evt-003, correlation: corr-abc)
Kafka Header Propagation
In Kafka, use message headers for metadata propagation rather than embedding it in the payload:
from confluent_kafka import Producer
def publish_event(producer, topic, key, payload, correlation_id, causation_id, trace_context):
headers = [
('correlation-id', correlation_id.encode('utf-8')),
('causation-id', causation_id.encode('utf-8') if causation_id else b''),
('event-type', payload['eventType'].encode('utf-8')),
('trace-parent', trace_context.encode('utf-8')),
('source-service', 'order-service'.encode('utf-8')),
]
producer.produce(
topic=topic,
key=key.encode('utf-8'),
value=json.dumps(payload).encode('utf-8'),
headers=headers
)
producer.flush()
Headers have several advantages over payload-embedded metadata: they're accessible without deserializing the event, they can be read by infrastructure tooling (monitoring, routing) that doesn't understand the payload schema, and they keep infrastructure concerns separate from business data.
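That first advantage is worth seeing concretely: a piece of infrastructure (an auditing sidecar, a router, a metrics agent) can act on an event from its headers alone, never touching the payload. Kafka clients deliver header values as raw bytes; the helper names here are illustrative:

```python
def read_headers(raw_headers):
    """Decode a Kafka-style [(name, bytes)] header list into a dict."""
    return {name: value.decode('utf-8') if value else None
            for name, value in (raw_headers or [])}

def is_interesting(raw_headers, wanted_event_types):
    """Filter on event type without deserializing the payload."""
    return read_headers(raw_headers).get('event-type') in wanted_event_types
```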
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTel) is the industry-standard framework for distributed tracing (and metrics, and logs, but tracing is where it shines in event-driven systems). The key challenge is propagating trace context through message brokers, which weren't designed with tracing in mind.
The W3C Trace Context Standard
The traceparent header carries trace context in a standardized format:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace-id (128-bit)               span-id (64-bit) |
             version                                              flags (01 = sampled)
Producer-Side: Injecting Trace Context
from opentelemetry import trace
from opentelemetry.propagate import inject
tracer = trace.get_tracer("order-service")
def publish_order_created(order):
with tracer.start_as_current_span("publish_order_created") as span:
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "orders")
span.set_attribute("messaging.destination_kind", "topic")
span.set_attribute("order.id", order.id)
# Inject trace context into carrier (event headers)
headers = {}
inject(headers) # Injects traceparent and tracestate
# Convert to Kafka header format
kafka_headers = [
(k, v.encode('utf-8')) for k, v in headers.items()
]
producer.produce(
topic='orders',
key=order.id,
value=serialize(order.to_event()),
headers=kafka_headers
)
Consumer-Side: Extracting and Continuing the Trace
from opentelemetry.propagate import extract
tracer = trace.get_tracer("payment-service")
def handle_message(message):
# Extract trace context from Kafka headers
carrier = {
h[0]: h[1].decode('utf-8')
for h in message.headers() or []
}
ctx = extract(carrier)
# Create a new span linked to the producer's span
with tracer.start_as_current_span(
"process_order_created",
context=ctx,
kind=trace.SpanKind.CONSUMER,
attributes={
"messaging.system": "kafka",
"messaging.source": message.topic(),
"messaging.message_id": message.key().decode('utf-8'),
"messaging.kafka.partition": message.partition(),
"messaging.kafka.offset": message.offset(),
}
) as span:
try:
process_payment(message.value())
span.set_status(trace.StatusCode.OK)
except Exception as e:
span.set_status(trace.StatusCode.ERROR, str(e))
span.record_exception(e)
raise
The Produce-Consume Link
The connection between the producer span and the consumer span is what makes distributed tracing valuable in event-driven systems. Without it, you have two disconnected traces. With it, you have a continuous narrative from "user clicked 'Place Order'" to "warehouse picked the item."
OpenTelemetry supports two linking strategies:
- Parent-child: The consumer span is a child of the producer span. This creates a single trace that includes both the producing and consuming work. Simple and intuitive, but can create very wide traces if one event triggers many consumers.
- Links: The consumer span is a new trace root with a link to the producer span. This keeps traces manageable in fan-out scenarios but requires tooling that can follow links across traces.
# Link-based approach for fan-out scenarios
from opentelemetry.trace import Link
def handle_message_with_link(message):
    # Build the carrier dict from Kafka headers (as in handle_message above)
    carrier = {h[0]: h[1].decode('utf-8') for h in message.headers() or []}
producer_context = extract(carrier)
producer_span_context = trace.get_current_span(producer_context).get_span_context()
with tracer.start_as_current_span(
"process_event",
links=[Link(producer_span_context)],
kind=trace.SpanKind.CONSUMER
) as span:
process(message)
Event Lineage and Causation Chains
Distributed tracing gives you the how — which services participated and how long they took. Event lineage gives you the what — which events caused which other events, forming a directed graph of causation.
Building an Event Lineage Store
class EventLineageStore:
"""Stores and queries event causation relationships."""
def __init__(self, db):
self.db = db
def record_event(self, event_id, event_type, correlation_id,
causation_id, source_service, timestamp):
self.db.insert('event_lineage', {
'event_id': event_id,
'event_type': event_type,
'correlation_id': correlation_id,
'causation_id': causation_id,
'source_service': source_service,
'timestamp': timestamp
})
def get_full_chain(self, correlation_id):
"""Get all events in a business operation, ordered by time."""
return self.db.query(
"SELECT * FROM event_lineage "
"WHERE correlation_id = %s "
"ORDER BY timestamp",
[correlation_id]
)
def get_descendants(self, event_id):
"""Get all events caused (directly or transitively) by an event."""
return self.db.query("""
WITH RECURSIVE chain AS (
SELECT * FROM event_lineage WHERE event_id = %s
UNION ALL
SELECT el.* FROM event_lineage el
JOIN chain c ON el.causation_id = c.event_id
)
SELECT * FROM chain ORDER BY timestamp
""", [event_id])
def get_root_cause(self, event_id):
"""Walk up the causation chain to find the originating event."""
return self.db.query("""
WITH RECURSIVE chain AS (
SELECT * FROM event_lineage WHERE event_id = %s
UNION ALL
SELECT el.* FROM event_lineage el
JOIN chain c ON el.event_id = c.causation_id
)
SELECT * FROM chain WHERE causation_id IS NULL
""", [event_id])
Visualizing Event Flow
A lineage store lets you answer questions that are otherwise nearly impossible:
- "Show me every event that resulted from this customer's order" (get descendants of the root event).
- "This payment failed — what started the process?" (walk up the causation chain).
- "These two services are both updating this entity — is there a shared upstream event?" (find common ancestor).
Query: get_full_chain('corr-abc-123')
Timeline:
14:22:33.100 [order-service] OrderCreated (evt-001)
14:22:33.250 [payment-service] PaymentProcessed (evt-002, caused by evt-001)
14:22:33.400 [inventory-service] InventoryReserved (evt-003, caused by evt-002)
14:22:33.500 [notification-service] CustomerNotified (evt-004, caused by evt-002)
14:22:34.100 [fulfillment-service] ShipmentCreated (evt-005, caused by evt-003)
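The lineage store can only answer these questions if causation is captured at publish time: whenever a consumer reacts to an event by publishing a new one, the new event's causation_id must be the id of the event being handled, and the correlation_id must flow through unchanged. A minimal sketch of that publish path (the `producer.send` and `lineage_store.record_event` signatures follow this chapter's earlier sketches; adapt them to your actual client):

```python
import uuid
from datetime import datetime, timezone

def publish_caused_by(producer, lineage_store, topic, payload,
                      triggering_event, source_service):
    """Publish a derived event and record its lineage in one place."""
    event = {
        'eventId': str(uuid.uuid4()),
        # Correlation id flows unchanged through the whole business operation
        'correlationId': triggering_event['correlationId'],
        # Causation id is the id of the specific event we are reacting to
        'causationId': triggering_event['eventId'],
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'payload': payload,
    }
    producer.send(topic, event)
    lineage_store.record_event(
        event_id=event['eventId'],
        event_type=topic,
        correlation_id=event['correlationId'],
        causation_id=event['causationId'],
        source_service=source_service,
        timestamp=event['timestamp'],
    )
    return event
```

Put this helper in your shared event library, not in each service — lineage that depends on every developer remembering to set causation_id will have holes exactly where you need it most.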
Consumer Lag Monitoring: The Canary in the Coal Mine
Consumer lag is the difference between the latest offset in a topic partition and the offset that a consumer group has committed. It tells you how far behind a consumer is — how many unprocessed events are waiting.
Why Lag Matters
A small, stable lag is normal. Events arrive faster than you process them during bursts, and you catch up during lulls. This is fine.
A growing lag is a problem. It means your consumer is falling further behind, which means:
- Events are being processed with increasing delay (SLA violations).
- If the lag grows past the topic's retention period, events will be deleted before they're processed (data loss).
- The consumer is likely degraded or stuck (processing too slowly, or blocked by a poison pill).
Monitoring Lag
# Using confluent_kafka to check consumer lag per partition
from confluent_kafka import Consumer, TopicPartition

def check_consumer_lag(bootstrap_servers, consumer_group, topic):
    # A throwaway consumer in the target group can read both the
    # committed offsets and the high watermarks
    consumer = Consumer({
        'bootstrap.servers': bootstrap_servers,
        'group.id': consumer_group,
        'enable.auto.commit': False,
    })
    metadata = consumer.list_topics(topic)
    partitions = [TopicPartition(topic, p)
                  for p in metadata.topics[topic].partitions]
    # Committed offsets for this consumer group
    committed = consumer.committed(partitions)
    total_lag = 0
    for tp in committed:
        # get_watermark_offsets returns (low, high); high is the next
        # offset to be written, i.e. the latest position
        _, high = consumer.get_watermark_offsets(tp)
        committed_offset = tp.offset if tp.offset >= 0 else 0
        lag = high - committed_offset
        total_lag += lag
        if lag > 10000:
            # alert() is your paging hook
            alert(f"High lag on {topic}/{tp.partition}: "
                  f"{lag} events behind (consumer: {consumer_group})")
    consumer.close()
    return total_lag
Lag-Based Alerts
Set up tiered alerts:
# Alert rules for consumer lag
alerts:
- name: consumer_lag_warning
condition: "consumer_lag > 1000 for 5 minutes"
severity: warning
message: "Consumer group {consumer_group} is {lag} events behind on {topic}"
- name: consumer_lag_critical
condition: "consumer_lag > 10000 for 10 minutes"
severity: critical
message: "Consumer group {consumer_group} is severely behind ({lag} events). Check for stuck consumers or poison pills."
- name: consumer_lag_approaching_retention
condition: "consumer_lag_time_seconds > (retention_seconds * 0.8)"
severity: critical
message: "Consumer group {consumer_group} lag is approaching topic retention. DATA LOSS IMMINENT."
The last alert is the one that should wake people up at night. When lag (measured in time, not just events) approaches the topic's retention period, events will start being deleted before they're consumed. This is irrecoverable data loss.
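Converting offset lag into time lag requires the age of the oldest unconsumed message — for example, by polling one message at the group's committed offset with a separate probe consumer group and reading its broker timestamp. The arithmetic itself is simple; a minimal sketch (the probe mechanics and function names are illustrative):

```python
import time

def time_lag_seconds(oldest_unconsumed_ts_ms, now_s=None):
    """Lag measured in time: age of the oldest unconsumed message.

    oldest_unconsumed_ts_ms is the broker timestamp (epoch ms) of the
    message at the group's committed offset. None means the consumer
    is fully caught up.
    """
    if oldest_unconsumed_ts_ms is None:
        return 0.0
    now_s = time.time() if now_s is None else now_s
    return max(0.0, now_s - oldest_unconsumed_ts_ms / 1000.0)

def retention_risk(lag_seconds, retention_seconds, threshold=0.8):
    """True when time-lag has eaten more than `threshold` of the
    topic's retention window — the data-loss-imminent condition."""
    return lag_seconds > retention_seconds * threshold
```

The point of the pure functions is testability: the alert condition is arithmetic, and only the probe that fetches the timestamp touches the broker.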
Lag as a Health Signal
Consumer lag is the single most useful health signal for an event-driven system. Trend it, dashboard it, alert on it. A system where lag is stable and low is healthy. A system where lag is growing is sick, even if nothing else looks wrong yet.
Dead Letter Queue Dashboards
We covered DLQs in Chapter 5. Here's the observability angle.
What Your DLQ Dashboard Needs
┌─────────────────────────────────────────────────────┐
│ Dead Letter Queue Overview │
├─────────────────────────────────────────────────────┤
│ │
│ Total Pending Events: 47 Oldest: 2h 15m │
│ │
│ Inflow (last hour): 12/hr Outflow: 8/hr │
│ │
│ By Source Topic: │
│ orders.order-created │████████░░│ 23 │
│ payments.payment-processed │███░░░░░░░│ 11 │
│ inventory.stock-updated │██░░░░░░░░│ 8 │
│ shipping.label-created │█░░░░░░░░░│ 5 │
│ │
│ By Error Type: │
│ TimeoutError │██████░░░░│ 19 │
│ ValidationError │████░░░░░░│ 14 │
│ SerializationError │███░░░░░░░│ 9 │
│ Unknown │█░░░░░░░░░│ 5 │
│ │
│ Trend (24h): │
│ ▃▃▂▂▁▁▁▂▃▅▇█▇▅▃▃▂▂▁▁▁▁▁ │
│ ^ ^ │
│ midnight noon │
│ │
└─────────────────────────────────────────────────────┘
DLQ Event Explorer
Beyond aggregate metrics, you need the ability to drill into individual events:
from datetime import datetime

class DLQExplorer:
    """REST API for investigating DLQ events."""
    def __init__(self, db, lineage_store, producer):
        self.db = db
        self.lineage_store = lineage_store
        self.producer = producer
    def get_events(self, source_topic=None, error_type=None,
since=None, limit=50):
query = "SELECT * FROM dlq_events WHERE 1=1"
params = []
if source_topic:
query += " AND source_topic = %s"
params.append(source_topic)
if error_type:
query += " AND error_type = %s"
params.append(error_type)
if since:
query += " AND failed_at > %s"
params.append(since)
query += " ORDER BY failed_at DESC LIMIT %s"
params.append(limit)
return self.db.query(query, params)
def get_event_detail(self, dlq_event_id):
event = self.db.get('dlq_events', dlq_event_id)
# Enrich with lineage data
event['lineage'] = self.lineage_store.get_full_chain(
event['correlation_id']
)
# Include related successful events for context
event['related_events'] = self.get_related_events(
event['correlation_id']
)
return event
def replay_event(self, dlq_event_id):
event = self.db.get('dlq_events', dlq_event_id)
original_topic = event['source_topic']
original_event = event['original_event']
self.producer.send(original_topic, original_event)
self.db.update('dlq_events', dlq_event_id,
{'status': 'replayed', 'replayed_at': datetime.utcnow()})
Event Replay and Time-Travel Debugging
One of the genuine advantages of event-driven architecture (specifically, event sourcing or log-based systems like Kafka) is the ability to replay events. This is your time machine.
Replaying for Debugging
When a consumer produces incorrect results, you can:
- Reset the consumer group offset to a point before the bug.
- Fix the bug and deploy the new consumer.
- Replay all events from the reset point.
# Reset consumer group to a specific timestamp
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group fulfillment-service \
--topic orders \
--reset-offsets --to-datetime 2025-03-15T10:00:00.000 \
--execute
Warnings:
- This replays all events from that timestamp, not just the one you're debugging. If your consumer has side effects (sends emails, charges credit cards), you need to disable those during replay or use a separate consumer group.
- Replay with side effects is how you email a customer 47 times. Ask me how I know.
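A side-effect guard makes this failure mode structurally impossible rather than a matter of remembering. A minimal sketch, assuming a hypothetical notifier with two independent guards — a replay flag set when running under a replay consumer group, and an idempotency record keyed by event id (all names here are illustrative):

```python
class GuardedNotifier:
    """Wraps a side-effecting action so replays cannot re-trigger it."""
    def __init__(self, send_email, sent_log, replay_mode=False):
        self.send_email = send_email   # the real side effect
        self.sent_log = sent_log       # persistent set of handled event ids
        self.replay_mode = replay_mode # True under a replay consumer group

    def notify(self, event):
        if self.replay_mode:
            return 'skipped-replay'
        if event['eventId'] in self.sent_log:
            # Even live redelivery sends at most one email per event
            return 'skipped-duplicate'
        self.send_email(event['payload']['customerEmail'], event['payload'])
        self.sent_log.add(event['eventId'])
        return 'sent'
```

The idempotency guard also protects you against ordinary at-least-once redelivery, so it earns its keep outside of replays too.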
The Replay Consumer Pattern
A safer approach is a dedicated replay consumer that processes events without side effects, just to reconstruct state:
from uuid import uuid4
from kafka import KafkaConsumer  # kafka-python client

class ReplayConsumer:
"""Replays events to reconstruct state at a point in time."""
def __init__(self, topic, target_timestamp):
self.consumer = KafkaConsumer(
topic,
group_id=f'replay-{uuid4()}', # Unique group — won't affect production
auto_offset_reset='earliest',
enable_auto_commit=False
)
self.target_timestamp = target_timestamp
self.state = {}
def replay(self, entity_id=None):
        for message in self.consumer:
            # kafka-python exposes the broker timestamp in epoch milliseconds
            if message.timestamp > self.target_timestamp:
                break
            event = deserialize(message.value)  # your schema-specific decoder
if entity_id and event.get('entityId') != entity_id:
continue
self._apply(event)
return self.state
def _apply(self, event):
entity_id = event['entityId']
if entity_id not in self.state:
self.state[entity_id] = {}
# Apply event to state (event sourcing projection)
event_type = event['metadata']['eventType']
handler = getattr(self, f'_handle_{event_type}', None)
if handler:
handler(event, self.state[entity_id])
def _handle_OrderCreated(self, event, state):
state.update(event['payload'])
state['status'] = 'created'
def _handle_OrderShipped(self, event, state):
state['status'] = 'shipped'
state['shippingInfo'] = event['payload']['shippingInfo']
Time-Travel Queries
If you're using event sourcing, you can reconstruct the state of any entity at any point in time:
def get_order_state_at(order_id, target_time):
"""What did this order look like at the given timestamp?"""
events = event_store.get_events(
aggregate_id=order_id,
up_to=target_time
)
state = {}
for event in events:
apply_event(state, event)
return state
# "What was the order state when the payment was processed?"
payment_event = event_store.get_event('evt-002')
order_state = get_order_state_at('ord-456', payment_event['timestamp'])
This is extremely powerful for debugging: "The fulfillment service says the order had no shipping address, but the customer definitely entered one. Let's see what the order looked like at the exact moment the fulfillment event was published."
Debugging Patterns: The "Event Detective" Workflow
When something goes wrong in an event-driven system, here's a systematic approach:
Step 1: Start with the Symptom
"Customer says their order confirmation email never arrived."
Step 2: Find the Correlation ID
Look up the order in your system, find the correlation ID for the business operation:
SELECT correlation_id FROM orders WHERE order_id = 'ord-789';
-- Result: corr-def-456
Step 3: Pull the Event Chain
Query your event lineage store for all events in this correlation:
SELECT event_id, event_type, source_service, timestamp, causation_id
FROM event_lineage
WHERE correlation_id = 'corr-def-456'
ORDER BY timestamp;
evt-101 OrderCreated order-service 14:22:33.100 NULL
evt-102 PaymentProcessed payment-service 14:22:33.250 evt-101
evt-103 InventoryReserved inventory-service 14:22:33.400 evt-102
evt-104 ShipmentCreated fulfillment-service 14:22:34.100 evt-103
Notice anything? There's no CustomerNotified event. The notification service never fired.
Step 4: Check the Consumer
Is the notification service consuming from the right topic? Is it healthy?
# Check consumer lag for notification-service
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group notification-service --describe
Output shows the notification service is caught up on the payments topic. So it received PaymentProcessed but didn't emit CustomerNotified. Why?
Step 5: Check the Logs
# Search logs for this correlation ID in the notification service
correlationId:corr-def-456 AND service:notification-service
{
"timestamp": "2025-03-15T14:22:33.260Z",
"level": "ERROR",
"service": "notification-service",
"correlationId": "corr-def-456",
"eventId": "evt-102",
"message": "Failed to render email template: missing field 'customerEmail'",
"error": "KeyError: 'customerEmail'"
}
Found it. The PaymentProcessed event didn't include the customer's email, and the notification service failed. But where did the failure go?
Step 6: Check the DLQ
SELECT * FROM dlq_events
WHERE correlation_id = 'corr-def-456'
AND source_service = 'notification-service';
If the event is in the DLQ, you know the failure was caught. If it's not, the failure was swallowed silently — a logging-only error handler with no DLQ routing. That's its own bug.
Step 7: Fix and Replay
Fix the notification service to handle the missing field (or fix the payment service to include it), then replay the event from the DLQ.
This entire workflow — symptom, correlation, chain, consumer, logs, DLQ, fix — should take minutes, not hours. It takes minutes only if you've invested in the tooling described in this chapter.
Common Observability Anti-Patterns
1. The Log Volcano
Logging everything at DEBUG level in production because "we might need it." You're generating terabytes of logs, your log aggregator is expensive and slow, and the signal-to-noise ratio is atrocious. Nobody can find anything.
Fix: Log at INFO for normal operations, WARN for recoverable anomalies, ERROR for failures. Use structured logging with consistent fields. Enable DEBUG temporarily and selectively during active investigations.
2. The Metric Desert
No metrics, or only infrastructure metrics (CPU, memory, disk). You know the machine is healthy but have no idea if the application is working correctly.
Fix: Instrument your business logic. Every event processed, every external call, every state transition should have metrics. Focus on the RED method: Rate, Errors, Duration.
3. Orphan Traces
Traces that stop at the broker boundary because nobody propagated the trace context through events. You have perfect visibility within each service and zero visibility across services — which is exactly where you need it most.
Fix: Make trace context propagation a standard part of your event production and consumption libraries. Don't rely on individual developers to remember.
4. The Missing Correlation ID
Some events have correlation IDs, some don't. Some services propagate them, some generate new ones. The result is fragmented event chains that you can't stitch together.
Fix: Correlation ID propagation must be enforced at the framework level, not the application level. Use middleware or interceptors that automatically extract and inject correlation IDs. Reject events that don't have one.
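Framework-level enforcement can be as small as a wrapper around your send function plus a consumer-side check. A conceptual sketch (the class and method names are invented for illustration; the real version lives in your shared event library):

```python
import uuid

class CorrelationEnforcer:
    """Inject correlation ids on produce; reject events without one."""
    HEADER = 'correlation-id'

    def wrap_producer(self, send):
        """Wrap a send(topic, value, headers=...) callable so every
        outgoing event carries a correlation id."""
        def send_with_correlation(topic, value, headers=None):
            headers = dict(headers or {})
            # Inject an id only if the caller didn't propagate one
            headers.setdefault(self.HEADER, str(uuid.uuid4()))
            send(topic, value, headers=headers)
            return headers[self.HEADER]
        return send_with_correlation

    def check_consumer(self, headers):
        """Fail fast on events that arrive without a correlation id."""
        if self.HEADER not in (headers or {}):
            raise ValueError('Rejected: event has no correlation-id header')
        return headers[self.HEADER]
```

Note the asymmetry: producers get a missing id filled in (a fresh id is better than none at the start of a chain), while consumers reject outright, because an event mid-chain with no id means a producer upstream is broken and should be fixed.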
5. Alert Noise
Alerting on every DLQ event, every momentary lag spike, every transient error. Your on-call engineers learn to ignore alerts, which means they ignore the real ones too.
Fix: Alert on sustained anomalies, not individual events. Use anomaly detection rather than static thresholds where possible. Have distinct channels for warnings (investigate when convenient) and critical alerts (investigate now).
6. The Post-Mortem Information Gap
Something fails on Friday, you investigate on Monday, and the relevant logs have been rotated, the metrics resolution is too coarse to see the spike, and the trace was sampled away.
Fix: Retain high-resolution data for at least 7 days. Keep aggregated data for months. Never sample traces for error conditions — always capture traces for failed operations at 100%.
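The "never sample error traces" rule can be enforced centrally rather than per service. A conceptual OpenTelemetry Collector tail-sampling policy — field names follow the contrib tailsamplingprocessor, but verify against your collector version before relying on this:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep 100% of traces that contain an error span
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Sample the healthy remainder to control cost
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail sampling decides after the trace completes, which is exactly what error-aware sampling needs; head sampling cannot know a trace will fail when it starts.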
7. Observing Only the Happy Path
All your dashboards show throughput and latency for successful events. You have no visibility into failures, retries, or DLQ activity. The system looks green while 2% of events are silently failing.
Fix: Failure metrics should be as prominent as success metrics. A dashboard that only shows success rate is a lie of omission.
Tools of the Trade
A brief, opinionated survey of the observability ecosystem for event-driven systems.
Distributed Tracing
Jaeger: Open-source, CNCF graduated project. Good for Kubernetes-native deployments. Supports OpenTelemetry natively. The UI is functional if not beautiful. Scales well with Elasticsearch or Cassandra as the storage backend.
Zipkin: The original open-source distributed tracing tool. Simpler than Jaeger, with a cleaner UI. Good for smaller deployments. Less active development than Jaeger.
Grafana Tempo: A cost-effective trace storage backend that integrates with Grafana. Uses object storage (S3, GCS) instead of dedicated databases, which dramatically reduces cost at scale. The tradeoff is query latency — searching for traces is slower than Jaeger, but looking up a trace by ID is fast.
Datadog APM: SaaS, fully managed, excellent UI, deep integration with metrics and logs. Expensive. The trace-to-log correlation is genuinely good. If you're already paying for Datadog, the APM is worth enabling.
Metrics
Prometheus + Grafana: The open-source standard. Prometheus scrapes metrics, Grafana visualizes them. Grafana's dashboarding is best-in-class. Prometheus's pull-based model works well for Kubernetes but can be awkward for short-lived processes.
Datadog: SaaS metrics with excellent tagging and correlation. The DogStatsD agent is easy to integrate. The cost scales with the number of custom metrics, which can get expensive.
Logging
ELK Stack (Elasticsearch, Logstash, Kibana): The classic. Powerful but operationally heavy. Elasticsearch clusters require care and feeding.
Grafana Loki: Log aggregation that indexes labels, not content. Much cheaper than Elasticsearch for high-volume logs. Query language (LogQL) is good. Full-text search is slower than Elasticsearch.
Datadog Logs: SaaS, integrates with traces and metrics. The log-to-trace correlation feature is excellent for event-driven debugging. The pricing is per-gigabyte ingested, which concentrates the mind wonderfully on what you actually need to log.
Kafka-Specific Observability
Kafka Manager / CMAK (Cluster Manager for Apache Kafka): Basic cluster management and consumer lag monitoring. Aging but functional.
Kafka UI / Conduktor / AKHQ: Modern UIs for exploring topics, consumer groups, and messages. Essential for development and debugging.
Burrow: LinkedIn's consumer lag monitoring tool. Evaluates lag as a sliding window rather than a point-in-time snapshot, which produces more meaningful alerts. Highly recommended if you're running Kafka.
Putting It All Together
Here's a complete example of an instrumented event consumer that incorporates correlation IDs, trace propagation, structured logging, metrics, and error handling:
import json
import time
import uuid
import logging
from opentelemetry import trace
from opentelemetry.propagate import extract
from prometheus_client import Counter, Histogram
# Metrics
events_processed = Counter(
'events_processed_total',
'Total events processed',
['topic', 'event_type', 'status']
)
processing_duration = Histogram(
'event_processing_duration_seconds',
'Event processing duration',
['topic', 'event_type']
)
tracer = trace.get_tracer("fulfillment-service")
logger = logging.getLogger("fulfillment-service")
class InstrumentedConsumer:
def __init__(self, consumer, processor, dlq_producer):
self.consumer = consumer
self.processor = processor
self.dlq_producer = dlq_producer
def run(self):
for message in self.consumer:
self._handle_message(message)
def _handle_message(self, message):
# Extract headers
headers = {h[0]: h[1].decode('utf-8') for h in (message.headers() or [])}
correlation_id = headers.get('correlation-id', str(uuid.uuid4()))
event_type = headers.get('event-type', 'unknown')
# Extract trace context
ctx = extract(headers)
# Start a span linked to the producer
with tracer.start_as_current_span(
f"process_{event_type}",
context=ctx,
kind=trace.SpanKind.CONSUMER,
attributes={
"messaging.system": "kafka",
"messaging.source": message.topic(),
"messaging.kafka.partition": message.partition(),
"messaging.kafka.offset": message.offset(),
"correlation.id": correlation_id,
"event.type": event_type,
}
) as span:
start_time = time.time()
# Structured log: event received
logger.info("Event received", extra={
'correlationId': correlation_id,
'eventType': event_type,
'topic': message.topic(),
'partition': message.partition(),
'offset': message.offset(),
})
try:
event = json.loads(message.value())
self.processor.handle(event, correlation_id)
# Success metrics and logging
duration = time.time() - start_time
events_processed.labels(
topic=message.topic(),
event_type=event_type,
status='success'
).inc()
processing_duration.labels(
topic=message.topic(),
event_type=event_type
).observe(duration)
span.set_status(trace.StatusCode.OK)
logger.info("Event processed successfully", extra={
'correlationId': correlation_id,
'eventType': event_type,
'durationMs': int(duration * 1000),
})
except Exception as e:
# Error metrics and logging
duration = time.time() - start_time
events_processed.labels(
topic=message.topic(),
event_type=event_type,
status='error'
).inc()
span.set_status(trace.StatusCode.ERROR, str(e))
span.record_exception(e)
logger.error("Event processing failed", extra={
'correlationId': correlation_id,
'eventType': event_type,
'error': str(e),
'errorType': type(e).__name__,
'durationMs': int(duration * 1000),
})
# Send to DLQ with full context
self._send_to_dlq(message, e, correlation_id)
def _send_to_dlq(self, message, error, correlation_id):
import traceback
dlq_event = {
'originalEvent': message.value().decode('utf-8'),
'error': {
'message': str(error),
'type': type(error).__name__,
'stackTrace': traceback.format_exc(),
},
'context': {
'correlationId': correlation_id,
'topic': message.topic(),
'partition': message.partition(),
'offset': message.offset(),
'failedAt': time.time(),
}
}
self.dlq_producer.send(
f"{message.topic()}.dlq",
json.dumps(dlq_event).encode('utf-8')
)
This is not a small amount of code. That's the honest truth about observability in event-driven systems: it's substantial, it's pervasive, and it needs to be baked into your event processing framework, not sprinkled on top by individual developers.
Summary
Observability in event-driven systems is not a nice-to-have; it's a prerequisite for operating with confidence. The async, distributed nature of these systems means that without deliberate investment in observability, you are flying blind.
The essentials:
- Correlation IDs are mandatory. Every event, every log line, every trace. No exceptions. Enforce this at the framework level.
- Propagate trace context through events. Use OpenTelemetry and the W3C Trace Context standard. Instrument your producer and consumer libraries once, and every service benefits.
- Build event lineage tracking. Causation chains let you reconstruct the complete story of a business operation. This is your primary debugging tool.
- Monitor consumer lag religiously. It's the single best health indicator for an event-driven system. Alert on sustained growth, not momentary spikes.
- Instrument your DLQs. Dashboard them, alert on them, build tooling to investigate and replay failed events.
- Invest in replay capability. The ability to time-travel through your event history is one of the few genuine advantages event-driven systems have over request-response architectures. Use it.
- Standardize your observability stack across all services. Fragmented tooling means fragmented visibility, which means fragmented debugging.
The investment is significant. The alternative — debugging a production incident by grepping through six services' logs, correlating timestamps by hand, and guessing at causation — is significantly worse. Build the tooling, maintain the discipline, and your future on-call self will be grateful.
Security and Access Control
You built a beautiful event-driven system. Microservices hum along, events flow like water, and your architecture diagram looks like something out of a conference talk. Then someone points out that every service can read every topic, your PII is sitting in plaintext in an append-only log that you promised was immutable, and your GDPR compliance officer has started drinking at lunch.
Welcome to security in event-driven systems — where the attack surface is larger, the blast radius is wider, and the consequences of getting it wrong are distributed across every consumer that ever touched the data.
The Expanded Attack Surface
In a traditional request-response system, your security perimeter is relatively well-defined. You have API gateways, authentication middleware, and a clear sense of who is talking to whom. In an event-driven system, you have... more.
Consider what you're actually defending:
- The broker itself — a centralized nervous system that, if compromised, gives an attacker access to every conversation in your organization.
- Producers — any service that can write to a topic can inject poisoned events that every downstream consumer will dutifully process.
- Consumers — any service that can read from a topic gets access to every event ever published there, potentially including historical data going back years.
- The network between all of the above — events in transit are just bytes on a wire, and bytes on a wire can be read.
- The storage layer — events at rest on broker disks, in consumer state stores, in dead letter queues, in replay buffers. Your data has more copies than a bestselling novel.
- The schema registry — whoever controls the schema controls what producers can say and what consumers expect. Schema poisoning is a real attack vector.
The fundamental problem is this: event-driven architectures trade direct service-to-service communication for indirect communication through a shared medium. That shared medium becomes a high-value target. It's the difference between intercepting a phone call between two people and tapping the entire telephone exchange.
Threat Modeling for EDA
If you're not doing threat modeling for your event-driven systems, you're not doing security — you're doing hope. The STRIDE framework adapts well:
| Threat | EDA Manifestation |
|---|---|
| Spoofing | A rogue producer impersonates a legitimate service and publishes fraudulent events |
| Tampering | Events are modified in transit or an attacker alters committed events on disk |
| Repudiation | A producer denies publishing an event; no audit trail to prove otherwise |
| Information Disclosure | A consumer reads topics it shouldn't; PII leaks through overly broad subscriptions |
| Denial of Service | A producer floods a topic, overwhelming consumers; a consumer creates excessive lag |
| Elevation of Privilege | A service with read-only access gains write access to a topic; a consumer modifies broker configuration |
Authentication and Authorization for Producers and Consumers
Authentication: Proving You Are Who You Claim
Every client connecting to your broker — producer or consumer — needs to prove its identity. The days of "it's on the internal network, so it's fine" ended roughly around the time that the concept of "internal network" became a polite fiction.
SASL (Simple Authentication and Security Layer) is the most common framework for broker authentication. The name contains the word "Simple," which should immediately make you suspicious. It supports multiple mechanisms:
- SASL/PLAIN — username and password sent in cleartext. Only acceptable over TLS. If you're using this without TLS, you don't have authentication; you have a suggestion.
- SASL/SCRAM (Salted Challenge Response Authentication Mechanism) — challenge-response protocol that avoids sending the password over the wire. SHA-256 or SHA-512. A meaningful improvement over PLAIN.
- SASL/GSSAPI (Kerberos) — enterprise-grade authentication via Kerberos tickets. If your organization already runs Active Directory, this integrates naturally. If it doesn't, setting up Kerberos just for your message broker is a special kind of masochism.
- SASL/OAUTHBEARER — OAuth 2.0 bearer tokens. The modern choice for organizations that have already invested in an identity provider. Tokens are short-lived and can carry fine-grained claims.
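For comparison with the mTLS configuration below, a SCRAM client setup is considerably lighter-weight. A sketch in Java-client properties form — the SCRAM credential must first be created on the broker side, and the principal name here is illustrative:

```properties
# Client properties for SASL/SCRAM-SHA-512 over TLS
bootstrap.servers=broker1:9093
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="order-service" \
  password="${KAFKA_PASSWORD}";
ssl.truststore.location=/var/app/ssl/ca.truststore.jks
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
```

The operational trade is passwords-in-secrets-management instead of certificates-in-PKI; SCRAM rotation is simpler, but you lose the per-service certificate identity that mTLS provides.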
mTLS (Mutual TLS) — both client and server present certificates. This is the gold standard for service-to-service authentication in event-driven systems. Each service gets its own certificate, and the broker verifies it before allowing any operations.
# Kafka broker configuration for mTLS
listeners=SSL://broker1:9093
ssl.keystore.location=/var/kafka/ssl/kafka.server.keystore.jks
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
ssl.truststore.location=/var/kafka/ssl/kafka.server.truststore.jks
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
ssl.client.auth=required
ssl.endpoint.identification.algorithm=https
# Map the Distinguished Name from client certs to a Kafka principal
ssl.principal.mapping.rules=RULE:^CN=([a-zA-Z0-9._-]+),.*$/$1/,DEFAULT
# Kafka producer configuration for mTLS
bootstrap.servers=broker1:9093
security.protocol=SSL
ssl.keystore.location=/var/app/ssl/producer.keystore.jks
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
ssl.truststore.location=/var/app/ssl/ca.truststore.jks
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
ssl.endpoint.identification.algorithm=https
# Python producer with mTLS
import os

from confluent_kafka import Producer
producer = Producer({
'bootstrap.servers': 'broker1:9093',
'security.protocol': 'SSL',
'ssl.ca.location': '/var/app/ssl/ca-cert.pem',
'ssl.certificate.location': '/var/app/ssl/client-cert.pem',
'ssl.key.location': '/var/app/ssl/client-key.pem',
'ssl.key.password': os.environ['SSL_KEY_PASSWORD'],
'ssl.endpoint.identification.algorithm': 'https',
})
Certificate Management Reality Check: mTLS is excellent security and terrible operations. You now need to provision, distribute, rotate, and revoke certificates for every service. You need a PKI (Public Key Infrastructure), or at minimum a tool like HashiCorp Vault, cert-manager (in Kubernetes), or a service mesh that handles this for you. If your plan is "we'll manage the certs manually," your plan is to have an outage on the day a cert expires and nobody noticed.
Authorization: Proving You're Allowed to Do What You're Trying to Do
Authentication tells you who. Authorization tells you what they can do. These are frequently conflated by people who should know better.
Topic-Level and Event-Level Access Control
ACLs (Access Control Lists)
The most straightforward model. Each topic has a list of principals (users or services) and the operations they're allowed to perform.
# Kafka ACL examples
# Allow the order-service to produce to the orders topic
kafka-acls --bootstrap-server broker1:9093 \
--command-config admin.properties \
--add \
--allow-principal User:order-service \
--operation Write \
--topic orders
# Allow the shipping-service to consume from the orders topic
kafka-acls --bootstrap-server broker1:9093 \
--command-config admin.properties \
--add \
--allow-principal User:shipping-service \
--operation Read \
--topic orders \
--group shipping-consumer-group
# Deny-by-default: do NOT add a blanket "deny User:*" ACL. In Kafka,
# DENY rules take precedence over ALLOW rules, so denying User:* would
# also block the principals granted above. Instead, configure the broker
# to reject any request that matches no ACL:
#   allow.everyone.if.no.acl.found=false   (broker server.properties)
ACLs work. They're simple to understand, simple to audit, and simple to get wrong at scale. When you have 50 services and 200 topics, you have potentially 10,000 ACL entries to manage. This is when people start looking at RBAC.
RBAC (Role-Based Access Control)
Instead of granting permissions to individual services, you define roles and assign services to roles.
# Role definitions (conceptual)
role: order-writer
permissions:
  - topic: orders
    operations: [Write, Describe]
  - topic: order-events
    operations: [Write, Describe]

role: order-reader
permissions:
  - topic: orders
    operations: [Read, Describe]
  - topic: order-events
    operations: [Read, Describe]

# Role assignments
principal: order-service     -> roles: [order-writer]
principal: shipping-service  -> roles: [order-reader]
principal: billing-service   -> roles: [order-reader]
principal: analytics-service -> roles: [order-reader]
Confluent Platform offers built-in RBAC. Open-source Kafka does not — you'll need to build or buy an authorization plugin that implements the Authorizer interface. RabbitMQ has a plugin-based authorization model. Pulsar has a multi-tenant authorization model built in. The managed cloud brokers (AWS MSK, Confluent Cloud, etc.) generally have RBAC as a feature.
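The role-resolution logic behind the conceptual definitions above fits in a few lines. A minimal sketch — the role names, the in-memory tables, and the `is_authorized` helper are all illustrative; a real deployment would implement this inside the broker's authorizer plugin:

```python
# Minimal RBAC resolution: principal -> roles -> permitted (topic, operation) pairs.
ROLES = {
    "order-writer": {("orders", "Write"), ("orders", "Describe"),
                     ("order-events", "Write"), ("order-events", "Describe")},
    "order-reader": {("orders", "Read"), ("orders", "Describe"),
                     ("order-events", "Read"), ("order-events", "Describe")},
}

ASSIGNMENTS = {
    "order-service": ["order-writer"],
    "shipping-service": ["order-reader"],
}

def is_authorized(principal: str, topic: str, operation: str) -> bool:
    """Deny by default: permit only if some assigned role grants the pair."""
    return any((topic, operation) in ROLES.get(role, set())
               for role in ASSIGNMENTS.get(principal, []))

print(is_authorized("order-service", "orders", "Write"))     # True
print(is_authorized("shipping-service", "orders", "Write"))  # False
```

Note that the sets make the check O(1) per role, and an unknown principal or unknown role silently resolves to "deny" — exactly the default you want.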
ABAC (Attribute-Based Access Control)
The most flexible and most complex model. Access decisions are based on attributes of the principal, the resource, the action, and the environment.
# ABAC policy (conceptual)
PERMIT if:
    subject.department == "finance" AND
    resource.topic.classification == "financial" AND
    action == "Read" AND
    environment.time.hour BETWEEN 6 AND 22 AND
    environment.network.zone == "corporate"
ABAC is powerful. It can express policies that RBAC cannot, such as time-based restrictions or data classification rules. It's also significantly harder to reason about, audit, and debug. When an engineer at 2 AM is trying to figure out why a consumer can't read from a topic, "check the attribute-based policy engine" is not the answer they want to hear.
Practical guidance: Start with ACLs. Move to RBAC when ACL management becomes painful. Move to ABAC only when you have a genuine requirement that RBAC cannot express and you have the tooling and expertise to manage it. Most organizations never need ABAC for their event infrastructure.
Event-Level Access Control
Topic-level access control is coarse-grained. What if different events on the same topic have different sensitivity levels? An OrderPlaced event might be fine for the analytics team, but an OrderPaymentProcessed event contains card details they shouldn't see.
Options:
- Separate topics by sensitivity — the simplest approach. Put sensitive events on a restricted topic. This works but leads to topic proliferation.
- Field-level encryption — encrypt sensitive fields within events. Consumers without the decryption key see ciphertext. More on this below.
- Event-level authorization in the consumer — the consumer checks whether it's authorized to process each event type. This is enforcement at the wrong layer and depends on consumers being honest, which is the security equivalent of the honor system.
- Broker-side filtering with authorization — some brokers support server-side filtering (Pulsar's message filtering, for example). You can combine this with authorization to prevent certain events from being delivered to certain consumers. This is broker-specific and often limited.
Encryption: In-Transit and At-Rest
Encryption In-Transit (TLS/mTLS)
If you're running a production message broker without TLS, stop reading this chapter and go fix that. I'll wait.
TLS encrypts the communication channel between producers, consumers, and the broker. mTLS adds mutual authentication (covered above). The configuration is straightforward but the operational overhead is real.
# Kafka broker TLS configuration
listeners=SSL://0.0.0.0:9093
advertised.listeners=SSL://broker1.example.com:9093
ssl.keystore.type=PKCS12
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.p12
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.key.password=${KEY_PASSWORD}
ssl.truststore.type=PKCS12
ssl.truststore.location=/etc/kafka/ssl/truststore.p12
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
# TLS version — TLSv1.3 if your JVM supports it
ssl.enabled.protocols=TLSv1.3,TLSv1.2
ssl.protocol=TLSv1.3
# Cipher suites — be explicit, don't rely on defaults
ssl.cipher.suites=TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256
Performance impact: TLS adds CPU overhead for encryption/decryption. On modern hardware with AES-NI instructions, the impact is typically 5-15% throughput reduction. If this is unacceptable, you have unusual requirements or unusual hardware. The answer is never "skip TLS." The answer is "get better hardware" or "use TLS offloading."
Inter-broker communication: Don't forget to encrypt communication between broker nodes. In a Kafka cluster, replication traffic between brokers can contain the same sensitive data as producer/consumer traffic. Configure inter.broker.listener.name to use an SSL listener.
Encryption At-Rest
TLS protects data in motion. It does nothing for data sitting on the broker's disks. If someone gains access to the broker's filesystem — through a compromised host, a stolen backup, or a decommissioned disk that wasn't properly wiped — they can read every event in plaintext.
Broker-level disk encryption:
- Full-disk encryption (LUKS, BitLocker, AWS EBS encryption) — transparent to the broker, protects against physical disk theft and improper decommissioning. Does not protect against a compromised broker process or an attacker with OS-level access.
- Filesystem-level encryption (eCryptfs, fscrypt) — similar protection to full-disk, slightly more targeted.
Broker-native encryption at rest:
- Confluent Platform offers transparent encryption at rest.
- AWS MSK offers encryption at rest via KMS.
- Most managed broker services offer this as a checkbox. Check the checkbox.
Application-level encryption (end-to-end):
The producer encrypts the event payload before publishing; the consumer decrypts after consuming. The broker never sees plaintext. This is the strongest model but requires key management at the application layer.
// Application-level envelope encryption for Kafka events
public class EncryptingSerializer implements Serializer<Event> {
    private final KmsClient kmsClient;
    private final String masterKeyId;

    @Override
    public byte[] serialize(String topic, Event event) {
        // 1. Generate a data encryption key (DEK)
        GenerateDataKeyResponse dataKey = kmsClient.generateDataKey(
            GenerateDataKeyRequest.builder()
                .keyId(masterKeyId)
                .keySpec(DataKeySpec.AES_256)
                .build()
        );

        // 2. Encrypt the event payload with the DEK
        byte[] plaintext = jsonSerializer.serialize(event);
        byte[] ciphertext = aesEncrypt(plaintext, dataKey.plaintext().asByteArray());

        // 3. Package the encrypted DEK alongside the ciphertext
        EncryptedEnvelope envelope = new EncryptedEnvelope(
            dataKey.ciphertextBlob().asByteArray(), // encrypted DEK
            ciphertext,                             // encrypted payload
            "AES-256-GCM",                          // algorithm
            masterKeyId                             // key reference
        );
        return envelopeSerializer.serialize(envelope);
    }
}
The downside: the broker cannot inspect, filter, route, or compact based on encrypted payloads. You lose broker-side processing capabilities. Compaction, in particular, becomes problematic — the broker can't determine which events share a key if the key is encrypted.
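The consumer-side half of the envelope scheme mirrors the serializer: unwrap the DEK, then decrypt the payload. A runnable Python sketch of the full round trip, with a local Fernet key standing in for the KMS master key (in production the DEK would be wrapped and unwrapped by KMS calls and the master key would never leave the KMS):

```python
import json
from cryptography.fernet import Fernet

# Stand-in for the KMS master key. Real systems never hold this in process
# memory; they send the wrapped DEK to the KMS for unwrapping.
master = Fernet(Fernet.generate_key())

def encrypt_envelope(payload: dict) -> dict:
    dek = Fernet.generate_key()                         # 1. fresh data encryption key
    ciphertext = Fernet(dek).encrypt(json.dumps(payload).encode())
    return {
        "encrypted_dek": master.encrypt(dek).decode(),  # 2. DEK wrapped by master key
        "ciphertext": ciphertext.decode(),              # 3. payload encrypted with DEK
        "algorithm": "Fernet(AES-128-CBC+HMAC)",
    }

def decrypt_envelope(envelope: dict) -> dict:
    dek = master.decrypt(envelope["encrypted_dek"].encode())   # unwrap the DEK
    plaintext = Fernet(dek).decrypt(envelope["ciphertext"].encode())
    return json.loads(plaintext)

env = encrypt_envelope({"orderId": "ord-7829", "totalAmount": 149.99})
assert decrypt_envelope(env) == {"orderId": "ord-7829", "totalAmount": 149.99}
```

The per-event DEK is the point of the pattern: compromising one event's key reveals one event, and the master key only ever encrypts small DEKs, never bulk data.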
Field-Level Encryption for Sensitive Data in Events
Full-payload encryption is a blunt instrument. Most of your event data isn't sensitive — it's the two or three fields containing email addresses, phone numbers, or payment details that need protection. Field-level encryption lets you encrypt only the sensitive fields, leaving the rest in plaintext for routing, filtering, and debugging.
{
  "eventType": "OrderPlaced",
  "orderId": "ord-12345",
  "timestamp": "2025-11-15T10:30:00Z",
  "customerId": "cust-789",
  "customerEmail": "ENC[AES256-GCM:AwEBAQx2...base64...]",
  "customerPhone": "ENC[AES256-GCM:BxFCAgR3...base64...]",
  "shippingAddress": "ENC[AES256-GCM:CyGDCgS4...base64...]",
  "items": [
    {"sku": "WIDGET-001", "quantity": 2, "price": 29.99}
  ],
  "totalAmount": 59.98,
  "currency": "USD"
}
The event is still routable by orderId, filterable by eventType, and inspectable for debugging — but the PII fields are opaque to anyone without the decryption key.
Implementation Approaches
Approach 1: Custom serializer/deserializer
from cryptography.fernet import Fernet
import json
import os

class FieldLevelEncryption:
    """Encrypts specified fields in an event payload."""

    def __init__(self, key: bytes, sensitive_fields: list[str]):
        self.fernet = Fernet(key)
        self.sensitive_fields = set(sensitive_fields)

    def encrypt_event(self, event: dict) -> dict:
        encrypted = event.copy()
        for field in self.sensitive_fields:
            if field in encrypted and encrypted[field] is not None:
                plaintext = str(encrypted[field]).encode('utf-8')
                encrypted[field] = f"ENC[{self.fernet.encrypt(plaintext).decode('utf-8')}]"
        return encrypted

    def decrypt_event(self, event: dict) -> dict:
        decrypted = event.copy()
        for field in self.sensitive_fields:
            if field in decrypted and isinstance(decrypted[field], str) \
                    and decrypted[field].startswith("ENC["):
                token = decrypted[field][4:-1].encode('utf-8')
                decrypted[field] = self.fernet.decrypt(token).decode('utf-8')
        return decrypted
# Usage
encryptor = FieldLevelEncryption(
key=os.environ['FIELD_ENCRYPTION_KEY'].encode(),
sensitive_fields=['customerEmail', 'customerPhone', 'shippingAddress']
)
# Producer side
raw_event = {
"eventType": "OrderPlaced",
"orderId": "ord-12345",
"customerEmail": "alice@example.com",
"customerPhone": "+1-555-0123",
"shippingAddress": "123 Main St, Springfield",
"totalAmount": 59.98
}
encrypted_event = encryptor.encrypt_event(raw_event)
producer.produce(topic='orders', value=json.dumps(encrypted_event))
# Consumer side (authorized consumer with the key)
decrypted_event = encryptor.decrypt_event(encrypted_event)
# decrypted_event['customerEmail'] == 'alice@example.com'
# Consumer side (unauthorized consumer without the key)
# They see: "ENC[gAAAAABh...]" and can do nothing with it
Approach 2: Schema-driven encryption
Define which fields are sensitive in the schema itself, and let the serialization layer handle encryption automatically.
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "customerEmail", "type": "string",
     "confluent:tags": ["PII", "ENCRYPTED"]},
    {"name": "customerPhone", "type": "string",
     "confluent:tags": ["PII", "ENCRYPTED"]},
    {"name": "totalAmount", "type": "double"}
  ]
}
Confluent's schema-level encryption and similar tools use the schema metadata to determine which fields to encrypt, using keys managed through a KMS. This is cleaner than hand-rolling encryption in every producer and consumer, but it ties you to a specific vendor's ecosystem.
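A hand-rolled version of the tag-driven idea is simple: walk the schema, collect the fields carrying an ENCRYPTED tag, and encrypt only those. The schema shape and tag key below are illustrative, not Confluent's actual wire format:

```python
from cryptography.fernet import Fernet

# Illustrative schema: "tags" is a stand-in for vendor-specific field metadata.
SCHEMA = {
    "name": "OrderPlaced",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "customerEmail", "type": "string", "tags": ["PII", "ENCRYPTED"]},
        {"name": "totalAmount", "type": "double"},
    ],
}

def encrypted_fields(schema: dict) -> set[str]:
    """Names of fields whose schema entry carries the ENCRYPTED tag."""
    return {f["name"] for f in schema["fields"] if "ENCRYPTED" in f.get("tags", [])}

fernet = Fernet(Fernet.generate_key())

def encrypt_tagged(event: dict, schema: dict) -> dict:
    """Encrypt only the schema-tagged fields, leaving the rest in plaintext."""
    out = event.copy()
    for field in encrypted_fields(schema):
        if field in out:
            out[field] = fernet.encrypt(str(out[field]).encode()).decode()
    return out

event = {"orderId": "ord-1", "customerEmail": "alice@example.com", "totalAmount": 59.98}
enc = encrypt_tagged(event, SCHEMA)
assert enc["orderId"] == "ord-1"                    # untagged fields untouched
assert enc["customerEmail"] != "alice@example.com"  # tagged field is ciphertext
```

The appeal of keeping the sensitivity markers in the schema is that producers and consumers cannot drift: add a PII field, tag it once, and every serializer that reads the schema encrypts it.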
Key Management for Field-Level Encryption
The encryption is only as good as the key management. Options:
- Envelope encryption via KMS (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault): A master key in the KMS encrypts per-field data encryption keys. Producers request a DEK from the KMS, encrypt the field, and store the encrypted DEK alongside the data. Consumers retrieve the encrypted DEK, call the KMS to decrypt it, and use it to decrypt the field. This is the correct approach.
- Shared symmetric key in environment variables: Works for small deployments. Rotation is painful. If the key leaks, every event ever encrypted with it is compromised. Not recommended for production.
- Per-consumer keys: Different consumers get different keys, enabling differential access. The producer encrypts sensitive fields multiple times (once per authorized consumer's key) or uses an intermediary re-encryption service. Complex but powerful.
PII in Event Payloads — GDPR's Revenge on Event Sourcing
Here is where event-driven architecture and data protection regulation have a philosophical disagreement.
Event-driven systems love immutability. Events are facts. They happened. You don't change them. This is a core architectural principle, and it gives you audit trails, replay capability, and temporal queries.
GDPR (and CCPA, LGPD, PIPEDA, and the growing family of privacy regulations) loves the right to erasure. Individuals can request that their personal data be deleted. All of it. Including the copy you made three years ago in that event log you forgot about.
These two principles are, on their face, incompatible. And yet, here you are, needing to satisfy both.
The Scope of the Problem
PII leaks into event payloads in ways you don't expect:
- Obvious: customerEmail, customerName, shippingAddress
- Less obvious: ipAddress in click events, userAgent strings, GPS coordinates in location events
- Sneaky: free-text fields like orderNotes ("Please deliver to Alice Smith at 123 Main St"), correlation IDs that embed user IDs, URLs that contain email addresses as query parameters
If you're doing event sourcing, the problem is worse. Your entire state is derived from events. Deleting an event doesn't just remove data — it potentially corrupts the state of every downstream projection.
Strategies for PII Compliance
Strategy 1: Don't store PII in events.
The simplest approach: events reference external entities by opaque ID rather than embedding PII.
// BAD: PII embedded in event
{
  "eventType": "OrderPlaced",
  "customerName": "Alice Smith",
  "customerEmail": "alice@example.com",
  "shippingAddress": "123 Main St, Springfield, IL 62701"
}

// BETTER: PII referenced by ID
{
  "eventType": "OrderPlaced",
  "customerId": "cust-a1b2c3",
  "shippingAddressId": "addr-x9y8z7"
}
The PII lives in a mutable data store (a database) where it can be updated or deleted. Events contain only references. When a consumer needs the PII, it looks it up — and if the data has been deleted, the lookup returns nothing.
The downside: you lose the self-contained nature of events. Consumers now need access to external data stores. Replay becomes complicated because the external data may have changed since the event was produced. You've traded one problem for another, but the new problem is at least one that databases have been solving for decades.
Strategy 2: Crypto-shredding (the practical answer).
This deserves its own section. See below.
Strategy 3: Event transformation and redaction pipelines.
A dedicated service sits between the raw event stream and downstream consumers, stripping or masking PII fields before forwarding events. The raw stream is access-controlled and retained only as long as legally required; the redacted stream is what most consumers see.
Producer -> [raw-orders topic] -> PII Redaction Service -> [clean-orders topic] -> Consumers
                                          |
                                   (strips PII,
                                    replaces with
                                    hashes or tokens)
This works but introduces an additional service to maintain, a latency overhead, and a risk that the redaction logic misses a field.
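The redaction step itself can be small. A minimal sketch, with hypothetical field names, replacing PII with deterministic HMAC tokens so that downstream consumers can still join or deduplicate on the same underlying value without ever seeing it:

```python
import hashlib
import hmac
import os

# Hypothetical field list; a real service would drive this from schema metadata.
PII_FIELDS = {"customerEmail", "customerPhone", "shippingAddress"}
SALT = os.environ.get("REDACTION_SALT", "dev-only-salt").encode()

def redact(event: dict) -> dict:
    """Replace PII fields with keyed-hash tokens (same input -> same token)."""
    clean = event.copy()
    for field in PII_FIELDS & clean.keys():
        token = hmac.new(SALT, str(clean[field]).encode(), hashlib.sha256).hexdigest()
        clean[field] = f"tok_{token[:16]}"
    return clean

raw = {"eventType": "OrderPlaced", "orderId": "ord-1",
       "customerEmail": "alice@example.com"}
clean = redact(raw)
assert clean["orderId"] == "ord-1"                  # non-PII untouched
assert clean["customerEmail"].startswith("tok_")    # PII tokenized
assert redact(raw) == clean                         # deterministic: joins still work
```

Using a keyed HMAC rather than a bare hash matters: with a plain SHA-256 of an email address, anyone can confirm a guessed address by hashing it themselves.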
The Right to Be Forgotten vs. Immutable Event Logs — The Crypto-Shredding Pattern
Crypto-shredding is the industry's best answer to the immutability-vs-deletion paradox. The idea is elegant:
- All PII in events is encrypted with a key that is unique to the data subject (the person whose data it is).
- When that person exercises their right to erasure, you don't delete the events — you delete the encryption key.
- Without the key, the encrypted PII fields are indistinguishable from random bytes. The data is effectively destroyed while the event structure remains intact.
Implementation
import json
from cryptography.fernet import Fernet

class CryptoShredding:
    """
    Per-subject encryption keys for PII in events.
    Deleting the key == deleting the data.
    """

    def __init__(self, key_store):
        """
        key_store: a durable, secure store mapping subject_id -> encryption_key.
        Could be HashiCorp Vault, AWS KMS, a dedicated database, etc.
        """
        self.key_store = key_store

    def get_or_create_key(self, subject_id: str) -> bytes:
        """Get the encryption key for a data subject, creating one if needed."""
        key = self.key_store.get(subject_id)
        if key is None:
            key = Fernet.generate_key()
            self.key_store.put(subject_id, key)
        return key

    def encrypt_pii(self, subject_id: str, event: dict,
                    pii_fields: list[str]) -> dict:
        """Encrypt PII fields using the subject's key."""
        key = self.get_or_create_key(subject_id)
        fernet = Fernet(key)
        encrypted = event.copy()
        encrypted['_pii_subject'] = subject_id  # track whose key to use
        encrypted['_pii_fields'] = pii_fields   # track which fields are encrypted
        for field in pii_fields:
            if field in encrypted:
                plaintext = json.dumps(encrypted[field]).encode('utf-8')
                encrypted[field] = fernet.encrypt(plaintext).decode('utf-8')
        return encrypted

    def decrypt_pii(self, event: dict) -> dict:
        """Decrypt PII fields. Returns event with '[DELETED]' if key is gone."""
        subject_id = event.get('_pii_subject')
        pii_fields = event.get('_pii_fields', [])
        if not subject_id:
            return event
        key = self.key_store.get(subject_id)
        decrypted = event.copy()
        for field in pii_fields:
            if field in decrypted:
                if key is None:
                    # Key has been shredded — data is effectively deleted
                    decrypted[field] = '[DELETED]'
                else:
                    fernet = Fernet(key)
                    plaintext = fernet.decrypt(decrypted[field].encode('utf-8'))
                    decrypted[field] = json.loads(plaintext.decode('utf-8'))
        return decrypted

    def forget_subject(self, subject_id: str):
        """
        Exercise the right to be forgotten.
        Delete the key, and all PII for this subject becomes unrecoverable.
        """
        self.key_store.delete(subject_id)
        # That's it. Every event containing this subject's PII
        # is now cryptographically shredded.
Crypto-Shredding Considerations
- Key storage is critical. The key store is now the most important database in your system. If you lose the keys accidentally, you've accidentally GDPR-deleted all your customer data. Back it up. Replicate it. Treat it like the crown jewels it is.
- Key rotation. If you rotate a subject's key, you need to re-encrypt all events containing that subject's PII with the new key. For event sourcing, this means rewriting history — which you said you wouldn't do, but here we are.
- Downstream copies. Crypto-shredding works for the event log. It doesn't help with consumers that decrypted the PII and stored it in their own databases. You need a coordinated deletion process across all consumers. This is the part that makes compliance officers nervous.
- Performance. Per-subject keys mean a KMS lookup for every event containing PII. Caching helps but introduces a window where a deleted key might still be cached. Set reasonable TTLs.
- Legal acceptance. Check with your legal team whether your regulators consider crypto-shredding equivalent to deletion. Most European DPAs accept it, but "most" is not "all."
Audit Trails and Compliance
Event-driven systems have a natural advantage for audit trails: they already record what happened. The challenge is making that record trustworthy, complete, and queryable.
What to Audit
- Producer actions: Who published what, when, to which topic. Include the producer's authenticated identity, the event type, a timestamp, and enough metadata to reconstruct the action.
- Consumer actions: Who consumed what, when. This is harder to capture since consumption is typically a pull operation, but broker-side access logs can provide it.
- Administrative actions: Topic creation/deletion, ACL changes, schema updates, configuration changes. These are the actions an attacker would take to cover their tracks.
- Access denials: Failed authentication attempts, authorization failures, rate limit hits. These are often more interesting than successful operations.
Implementing Audit Trails
// Interceptor-based audit logging for Kafka producers
public class AuditProducerInterceptor implements ProducerInterceptor<String, byte[]> {
    private final AuditLogger auditLogger;

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        auditLogger.logProduceAttempt(
            AuditEvent.builder()
                .principal(SecurityContext.getCurrentPrincipal())
                .action("PRODUCE")
                .topic(record.topic())
                .partition(record.partition())
                .key(record.key())
                .eventType(extractEventType(record.headers()))
                .timestamp(Instant.now())
                .sourceIp(SecurityContext.getSourceIp())
                .payloadSizeBytes(record.value().length)
                // Do NOT log the payload itself — that defeats the purpose
                // of field-level encryption
                .build()
        );
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            auditLogger.logProduceFailure(metadata, exception);
        } else {
            auditLogger.logProduceSuccess(metadata);
        }
    }
}
Do not log event payloads in audit trails. This seems counterintuitive, but the audit trail's purpose is to record who did what, not to create yet another copy of the data. Logging payloads creates a second unencrypted copy of PII outside your carefully encrypted event stream. Log metadata only: event type, topic, key, timestamp, principal, outcome.
Tamper-Evident Audit Logs
An attacker who compromises your system will attempt to modify audit logs to cover their tracks. Options for tamper evidence:
- Append-only storage — write audit logs to a system that doesn't support mutation (S3 with Object Lock, WORM storage).
- Hash chaining — each audit entry includes a hash of the previous entry, creating a blockchain-like chain. Tampering with any entry invalidates all subsequent hashes.
- External attestation — periodically send a hash of your audit log to an external timestamping service. This proves the log existed in a particular state at a particular time.
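Hash chaining, the second option, is only a few lines of code: each entry commits to the hash of its predecessor, so editing any earlier entry invalidates everything after it. A minimal sketch:

```python
import hashlib
import json

def append_entry(log: list, entry: dict) -> None:
    """Append an audit entry, linking it to the hash of the previous one."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = {**entry, "prev_hash": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edit to any entry breaks the chain."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

log = []
append_entry(log, {"action": "PRODUCE", "topic": "orders", "principal": "order-service"})
append_entry(log, {"action": "READ", "topic": "orders", "principal": "shipping-service"})
assert verify_chain(log)

log[0]["principal"] = "attacker"   # tamper with history...
assert not verify_chain(log)       # ...and the chain no longer verifies
```

On its own this detects tampering but does not prevent truncation of the tail; combining it with append-only storage or periodic external attestation closes that gap.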
Schema Validation as a Security Boundary
Your schema registry isn't just a convenience for managing data formats. It's a security boundary. A schema defines what a valid event looks like, and rejecting events that don't conform to the schema is a form of input validation — the most fundamental security control there is.
Schema Validation as Input Filtering
Without schema validation, a malicious producer can:
- Inject oversized events that exhaust consumer memory
- Include unexpected fields that trigger deserialization vulnerabilities
- Embed malicious content (script injection payloads, SQL injection strings) in text fields
- Send malformed data that causes consumer crashes (null pointer exceptions, type confusion)
With schema validation enforced at the broker or serialization layer:
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.orders",
  "fields": [
    {
      "name": "orderId",
      "type": "string",
      "doc": "UUID format order identifier",
      "pattern": "^ord-[a-f0-9]{8}$"
    },
    {
      "name": "totalAmount",
      "type": "double",
      "min": 0.0,
      "max": 1000000.0
    },
    {
      "name": "currency",
      "type": {
        "type": "enum",
        "name": "Currency",
        "symbols": ["USD", "EUR", "GBP"]
      }
    }
  ]
}
A caveat: constraints like pattern, min, and max are not part of the core Avro specification; enforcing them requires a validation layer in the serializer or registry-level rules (Confluent's Data Quality Rules, for example). Schema validation won't stop a determined attacker, but it raises the bar significantly. It's the event-driven equivalent of parameterized queries — it doesn't solve every security problem, but not using it is inexcusable.
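Such a validation layer is not much code. A sketch mirroring the constraints in the example schema above (the constraint table and field names are taken from that example; everything else is illustrative):

```python
import re

CONSTRAINTS = {
    "orderId": {"pattern": r"^ord-[a-f0-9]{8}$"},
    "totalAmount": {"min": 0.0, "max": 1_000_000.0},
    "currency": {"enum": {"USD", "EUR", "GBP"}},
}

def validate(event: dict) -> list:
    """Return a list of violations; an empty list means the event is valid."""
    errors = []
    for field, rules in CONSTRAINTS.items():
        value = event.get(field)
        if value is None:
            errors.append(f"{field}: missing")
            continue
        if "pattern" in rules and not re.match(rules["pattern"], str(value)):
            errors.append(f"{field}: bad format")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: above maximum")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: not an allowed value")
    return errors

good = {"orderId": "ord-1a2b3c4d", "totalAmount": 149.99, "currency": "USD"}
bad = {"orderId": "ord-INJECTED", "totalAmount": -5, "currency": "XXX"}
assert validate(good) == []
assert len(validate(bad)) == 3   # bad format, below minimum, not allowed
```

Run this in the serializer on the producer side and again in the deserializer on the consumer side: the producer check keeps garbage out of the topic, and the consumer check protects you from producers that skipped theirs.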
Securing Schema Registries
The schema registry is a control plane component. Whoever controls the schema controls the contract between producers and consumers. An attacker who can modify a schema can:
- Add fields that legitimate consumers don't expect, potentially causing crashes.
- Change field types to trigger deserialization vulnerabilities.
- Remove required fields, breaking downstream processing.
- Weaken validation constraints, allowing malicious payloads through.
Hardening the Schema Registry
- Authentication and authorization: The schema registry should require authentication. Not all users need the same access. Producers need read access (to validate against the current schema). Schema administrators need write access. Consumers need read access. Nobody else needs any access.
- Change control: Schema changes should go through a review process, not be applied directly by producers at runtime. Treat schema changes like database migrations — reviewed, tested, and deployed through a pipeline.
- Compatibility enforcement: Enable strict compatibility checking (backward, forward, or full compatibility). This prevents breaking changes from being registered even if an attacker gains write access.
- Network isolation: The schema registry should not be accessible from the public internet. It should be on a private network, accessible only to services that need it.
- Audit logging: Log every schema read and write operation. Alert on unexpected schema modifications.
# Confluent Schema Registry — enable authentication
# In schema-registry.properties:
authentication.method=BASIC
authentication.roles=admin,developer,readonly
authentication.realm=SchemaRegistry
# Enable HTTPS
listeners=https://0.0.0.0:8081
ssl.keystore.location=/etc/schema-registry/ssl/keystore.p12
ssl.keystore.password=${KEYSTORE_PASSWORD}
ssl.truststore.location=/etc/schema-registry/ssl/truststore.p12
ssl.truststore.password=${TRUSTSTORE_PASSWORD}
Network Segmentation and Broker Hardening
Network Architecture
The broker cluster should live in a dedicated network segment, separated from application services by firewalls or security groups. The principle of least privilege applies at the network level:
┌─────────────────────────────────────────────────┐
│ Internet │
└──────────────────────┬──────────────────────────┘
│ (no direct access)
┌──────────────────────┴──────────────────────────┐
│ API Gateway / Load Balancer │
└──────────────────────┬──────────────────────────┘
│
┌──────────────────────┴──────────────────────────┐
│ Application Services Network │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Service A │ │ Service B │ │ Service C │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
└────────┼──────────────┼──────────────┼──────────┘
│ │ │
┌────┴──────────────┴──────────────┴────┐
│ Broker Network (restricted) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Broker 1 │ │Broker 2 │ │Broker 3 │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ ┌──────────────────────────────────┐ │
│ │ ZooKeeper / KRaft │ │
│ └──────────────────────────────────┘ │
└────────────────────────────────────────┘
Broker Hardening Checklist
- Disable all unauthenticated listeners. No PLAINTEXT listeners in production. Zero.
- Enable TLS for all client connections and inter-broker communication.
- Enable authentication (mTLS or SASL) for all connections.
- Enable authorization (ACLs at minimum).
- Restrict ZooKeeper/KRaft access to broker nodes only. ZooKeeper in particular is a treasure trove of cluster metadata and historically has had minimal authentication.
- Disable auto-topic creation. auto.create.topics.enable=false. A producer that can create arbitrary topics is a producer that can create a topic named admin-commands and confuse your monitoring.
- Set resource limits: message.max.bytes, max.request.size, quota.producer.default, quota.consumer.default. Prevent any single client from monopolizing the cluster.
- Run the broker process as a non-root user.
- Enable JMX authentication if JMX is exposed. Unauthenticated JMX can be used for remote code execution.
- Keep the broker software up to date. CVEs happen.
Common Security Anti-Patterns
Anti-Pattern 1: "It's on the internal network"
The internal network is not a security boundary. It's a speed bump at best. Internal networks get compromised. Employees go rogue. Contractors have access. That one legacy server running Windows Server 2008 in the corner? It has flat network access to your Kafka cluster.
Fix: Zero trust. Authenticate and authorize every connection, regardless of network location.
Anti-Pattern 2: Shared credentials across services
All services use the same username/password or the same TLS certificate. If one service is compromised, the attacker has the credentials of every service.
Fix: Unique credentials per service. This is what mTLS with per-service certificates gives you automatically.
Anti-Pattern 3: Overly broad topic access
Every service can read from and write to every topic. This is the default configuration for most brokers, and it is exactly as secure as leaving your front door open because you live in a nice neighborhood.
Fix: Deny by default. Grant the minimum required access per service.
Anti-Pattern 4: PII in plaintext everywhere
PII is in events, in consumer databases, in logs, in monitoring dashboards, in Slack alerts that say "Order from alice@example.com failed." Every copy is a liability.
Fix: Encrypt PII at the source. Mask PII in logs and alerts. Use opaque identifiers where possible.
Anti-Pattern 5: No schema validation
Any producer can send any bytes to any topic. A bug in one producer sends garbage data that crashes three consumers and corrupts a database.
Fix: Mandatory schema validation at the serialization layer, ideally enforced by the broker.
Anti-Pattern 6: Secrets in event payloads
API keys, tokens, passwords embedded in events because "the downstream service needs them to call the API." Congratulations, your secrets are now stored in an append-only log, replicated across three broker nodes, retained for seven days, and readable by every consumer of that topic.
Fix: Never put secrets in events. Use a secrets manager. Pass references, not values.
Anti-Pattern 7: Ignoring the control plane
All security focus is on the data plane (events in topics) while the control plane (topic management, ACL management, schema management, consumer group management) is wide open.
Fix: Secure the control plane at least as rigorously as the data plane. Ideally more so.
Summary
Security in event-driven systems is not a feature you bolt on after the architecture is built. It's a set of constraints that must inform the architecture from day one. The expanded attack surface — broker, producers, consumers, network, storage, schema registry — demands a comprehensive approach:
- Authenticate every connection with mTLS or SASL.
- Authorize every operation with ACLs or RBAC.
- Encrypt in transit with TLS and at rest with disk or application-level encryption.
- Protect PII with field-level encryption and crypto-shredding for deletion compliance.
- Validate schemas to prevent malformed or malicious events.
- Audit everything, but audit metadata, not payloads.
- Harden brokers, registries, and the network.
The best event-driven security is invisible to developers who are doing the right thing and an impenetrable wall to everyone else. Achieving that is hard work. But the alternative — discovering your GDPR exposure when a regulator asks for proof of deletion from your immutable event log — is harder.
Testing Event-Driven Systems
Testing synchronous request-response systems is straightforward. Call a function, get a result, assert on the result. Testing event-driven systems is... not that. You publish an event and then wait. Something might happen. Eventually. Probably. Somewhere else. In a different process. On a different machine. And your test needs to verify that the right thing happened without being able to observe it directly.
If your test suite for an event-driven system looks the same as your test suite for a REST API, one of two things is true: either your system isn't actually event-driven, or your tests aren't actually testing anything.
Why Testing Async Systems Is Fundamentally Different
The core difficulty is that event-driven systems break the temporal coupling between cause and effect. In a synchronous system:
request -> processing -> response (all in one call, one thread, one moment)
In an event-driven system:
publish event -> ??? -> eventually a consumer processes it -> ??? -> maybe a side effect occurs
This introduces several testing challenges that don't exist in the synchronous world:
- Non-deterministic timing. When you publish an event, you don't know when it will be consumed. It depends on broker latency, consumer lag, partition assignment, rebalancing, and the phase of the moon.
- No return value. A producer gets an acknowledgment that the event was written to the broker. It does not get confirmation that any consumer processed it, let alone processed it correctly.
- Distributed state. The outcome of processing an event might be a state change in a different service's database, the publication of another event, or a call to an external API. Your test needs to observe state in a different system.
- Ordering is conditional. Events may arrive in order within a partition and out of order across partitions. Your tests need to account for both cases.
- Exactly-once is a spectrum. Your tests need to verify behavior under at-least-once delivery, which means verifying idempotency, which means running the same event through the same consumer multiple times and asserting on the outcome.
- Infrastructure dependency. You can't meaningfully test event-driven behavior without a broker (or a convincing fake). This pushes more of your testing into integration territory.
The test pyramid, that beloved conference slide, starts to look more like a test diamond — a thin layer of unit tests at the bottom, a fat layer of integration tests in the middle, and a thin layer of end-to-end tests at the top.
Unit Testing Event Producers and Consumers in Isolation
Despite everything I just said about the difficulty of testing async systems, unit tests still have a role. The trick is knowing what to test at the unit level and what to push to integration tests.
Testing Producers
A producer's job is to create a well-formed event and hand it to the broker. The unit test should verify the event creation, not the broker interaction.
# The producer logic, separated from broker interaction
class OrderEventProducer:
def __init__(self, event_publisher):
self.event_publisher = event_publisher
def create_order_placed_event(self, order) -> dict:
"""Create an OrderPlaced event from an Order domain object."""
return {
"eventType": "OrderPlaced",
"eventId": str(uuid.uuid4()),
"timestamp": datetime.utcnow().isoformat() + "Z",
"version": 1,
"data": {
"orderId": order.id,
"customerId": order.customer_id,
"items": [
{"sku": item.sku, "quantity": item.quantity, "price": str(item.price)}
for item in order.items
],
"totalAmount": str(order.total_amount),
"currency": order.currency,
}
}
def publish_order_placed(self, order):
event = self.create_order_placed_event(order)
self.event_publisher.publish("orders", key=order.id, value=event)
return event
# Unit test — no broker needed
class TestOrderEventProducer:
def test_creates_valid_order_placed_event(self):
order = Order(
id="ord-123",
customer_id="cust-456",
items=[OrderItem(sku="WIDGET-001", quantity=2, price=Decimal("29.99"))],
total_amount=Decimal("59.98"),
currency="USD"
)
producer = OrderEventProducer(event_publisher=Mock())
event = producer.create_order_placed_event(order)
assert event["eventType"] == "OrderPlaced"
assert event["version"] == 1
assert event["data"]["orderId"] == "ord-123"
assert event["data"]["totalAmount"] == "59.98"
assert len(event["data"]["items"]) == 1
assert event["data"]["items"][0]["sku"] == "WIDGET-001"
def test_publish_calls_event_publisher_with_correct_topic_and_key(self):
mock_publisher = Mock()
producer = OrderEventProducer(event_publisher=mock_publisher)
order = Order(id="ord-123", customer_id="cust-456", items=[],
total_amount=Decimal("0"), currency="USD")
producer.publish_order_placed(order)
mock_publisher.publish.assert_called_once()
call_args = mock_publisher.publish.call_args
assert call_args[0][0] == "orders" # topic
assert call_args[1]["key"] == "ord-123" # partition key
The key insight: separate event construction from event transmission. The construction logic is pure business logic — testable with unit tests. The transmission is infrastructure interaction — testable with integration tests.
Testing Consumers
A consumer's job is to receive an event, validate it, and perform some action. The unit test should verify the action logic, assuming a valid event arrives.
# Consumer logic, separated from broker interaction
class OrderEventConsumer:
def __init__(self, inventory_service, notification_service):
self.inventory_service = inventory_service
self.notification_service = notification_service
def handle_order_placed(self, event: dict):
"""Process an OrderPlaced event."""
order_data = event["data"]
# Reserve inventory for each item
for item in order_data["items"]:
self.inventory_service.reserve(
sku=item["sku"],
quantity=item["quantity"],
order_id=order_data["orderId"]
)
# Notify the customer
self.notification_service.send_order_confirmation(
customer_id=order_data["customerId"],
order_id=order_data["orderId"]
)
# Unit test — no broker needed
class TestOrderEventConsumer:
def test_reserves_inventory_for_each_item(self):
inventory = Mock()
notifications = Mock()
consumer = OrderEventConsumer(inventory, notifications)
event = {
"eventType": "OrderPlaced",
"data": {
"orderId": "ord-123",
"customerId": "cust-456",
"items": [
{"sku": "WIDGET-001", "quantity": 2, "price": "29.99"},
{"sku": "GADGET-002", "quantity": 1, "price": "49.99"},
],
}
}
consumer.handle_order_placed(event)
assert inventory.reserve.call_count == 2
inventory.reserve.assert_any_call(
sku="WIDGET-001", quantity=2, order_id="ord-123"
)
inventory.reserve.assert_any_call(
sku="GADGET-002", quantity=1, order_id="ord-123"
)
def test_sends_order_confirmation(self):
inventory = Mock()
notifications = Mock()
consumer = OrderEventConsumer(inventory, notifications)
event = {
"eventType": "OrderPlaced",
"data": {
"orderId": "ord-123",
"customerId": "cust-456",
"items": [],
}
}
consumer.handle_order_placed(event)
notifications.send_order_confirmation.assert_called_once_with(
customer_id="cust-456", order_id="ord-123"
)
This pattern — extracting the handler logic from the message consumption loop — is the single most important testing technique for event-driven consumers. If your event handler is tangled up with your KafkaConsumer.poll() loop, your tests will be tangled up with Kafka too.
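One way to keep that separation honest is a thin consumption loop that only polls, decodes, commits, and delegates. A minimal sketch (the `poll`/`commit` interface is an assumption, loosely modeled on confluent-kafka's consumer; `FakeConsumer` is a hypothetical test double):

```python
import json

def run_consumer_loop(consumer, handler, max_messages=None):
    """Thin consumption loop: all business logic lives in `handler`.

    `consumer` is anything with poll(timeout) -> message-or-None and
    commit(). In tests, pass a fake consumer -- no broker required.
    """
    processed = 0
    while max_messages is None or processed < max_messages:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            break
        event = json.loads(msg)   # decoding stays in the loop
        handler(event)            # business logic stays out of it
        consumer.commit()
        processed += 1
    return processed

# A fake consumer makes the loop itself unit-testable:
class FakeConsumer:
    def __init__(self, messages):
        self.messages = list(messages)
        self.commits = 0

    def poll(self, timeout):
        return self.messages.pop(0) if self.messages else None

    def commit(self):
        self.commits += 1
```

Because the handler is just a callable, the same loop runs in production with a real consumer and in tests with the fake, and neither case touches the other's concerns.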
Contract Testing with Pact and Similar Tools
Unit tests verify that your producer creates the right shape of event and your consumer handles that shape correctly. But who verifies that the shape the producer creates is the shape the consumer expects?
This is the contract problem, and it's amplified in event-driven systems where producers and consumers are developed by different teams, deployed independently, and communicate only through events that flow through a broker.
What Is Contract Testing?
A contract test verifies that two systems agree on the format of the messages they exchange, without requiring both systems to be running simultaneously. It's the event-driven equivalent of "did you read the API docs?" except automated, mandatory, and not dependent on anyone actually writing or reading docs.
Pact for Event-Driven Systems
Pact is the most widely used contract testing framework. It was originally designed for HTTP APIs but supports message-based interactions through its message pact feature.
# Consumer-side Pact test (consumer defines what it expects)
# This is the "consumer-driven" part — the consumer defines the contract
from pact import MessageConsumer, Provider, Like, EachLike
def test_order_placed_contract():
    pact = MessageConsumer('ShippingService').has_pact_with(
        Provider('OrderService'),
        pact_dir='./pacts'
    )
expected_event = {
"eventType": "OrderPlaced",
"version": 1,
"data": {
"orderId": Like("ord-12345"), # any string matching pattern
"customerId": Like("cust-789"),
"items": EachLike({
"sku": Like("WIDGET-001"),
"quantity": Like(1), # any integer
"price": Like("29.99"), # any string (decimal)
}),
"shippingAddress": {
"street": Like("123 Main St"),
"city": Like("Springfield"),
"state": Like("IL"),
"zip": Like("62701"),
"country": Like("US"),
}
}
}
(pact
.given("an order exists")
.expects_to_receive("an OrderPlaced event")
.with_content(expected_event)
.with_metadata({"topic": "orders", "contentType": "application/json"}))
# The handler that processes this event
with pact:
handler = OrderEventHandler()
handler.handle(expected_event)
# Pact writes a contract file (pact JSON) to ./pacts/
# This file is shared with the OrderService (provider) for verification
# Producer-side Pact verification (provider verifies it meets the contract)
from pact import MessageProvider
def test_order_service_satisfies_shipping_contract():
    def order_placed_message_factory():
        """
        Produce an actual OrderPlaced event using real producer code.
        Pact will compare this against the consumer's expectations.
        """
        order = create_test_order()
        producer = OrderEventProducer(event_publisher=Mock())
        return producer.create_order_placed_event(order)

    provider = MessageProvider(
        provider='OrderService',
        consumer='ShippingService',
        pact_dir='./pacts',  # read the contract the consumer defined
        message_providers={
            "an OrderPlaced event": order_placed_message_factory,
        },
    )
    # Verify: does the actual event match what the consumer expects?
    with provider:
        provider.verify()
The Pact Broker
In practice, consumer pact files need to get to the provider somehow. The Pact Broker is a central service that stores and shares pacts:
Consumer (Shipping) --[publishes pact]--> Pact Broker <--[fetches pact]-- Provider (Orders)
|
[verification results]
|
CI pipeline: "can I deploy?"
The Pact Broker also supports the "can I deploy?" query: before deploying a new version of a service, ask the broker whether all contracts with all counterparties are still satisfied. If not, the deployment is blocked.
Alternatives to Pact
- Spring Cloud Contract — generates tests from contract definitions. More opinionated, Spring-ecosystem specific.
- Schema Registry compatibility checks — not exactly contract testing, but schema compatibility enforcement (backward, forward, full) provides a similar guarantee: new schemas won't break existing consumers.
- AsyncAPI — an OpenAPI-like specification for async APIs. Useful for documentation and code generation, less mature for contract testing.
Consumer-Driven Contracts — Letting Consumers Define Expectations
Consumer-driven contracts (CDC) invert the traditional model. Instead of the producer defining "here's what I send" and consumers adapting, the consumer defines "here's what I need" and the producer verifies it can provide it.
This matters because in event-driven systems, a single topic might have dozens of consumers, each caring about different fields. The producer team doesn't necessarily know which fields each consumer depends on.
How CDC Works in Practice
- Consumer team writes a contract: "We (Shipping Service) need OrderPlaced events to contain `orderId`, `customerId`, and `shippingAddress` with at least `street`, `city`, and `zip`."
- Contract is shared with the producer (via Pact Broker, git repo, or artifact repository).
- Producer's CI pipeline verifies that the events it produces satisfy all consumer contracts.
- If a producer change breaks a contract, the producer's build fails — not the consumer's. This is the critical difference. The team making the breaking change is the team that gets the failing test.
The Governance Question
CDC only works if there's a process for managing contracts. Without governance, you get:
- Consumers adding contracts for fields that were never intended to be stable ("you renamed `customerId` to `customer_id` and broke us").
- An ever-growing set of contracts that prevents the producer from evolving ("we can't add the new field because consumer X's contract doesn't include it and their test fails on unknown fields").
- Contract proliferation where every consumer has a slightly different view of the same event.
The solution is a contract review process — producers and consumers agree on what constitutes the stable interface of an event, and contracts are written against that interface, not against the full event payload.
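The stable-interface idea can be reduced to plain code: the consumer publishes the dotted field paths it depends on, and the producer's test suite checks a sample event against them. This is an illustrative sketch only (real tooling like Pact adds type matchers and broker integration); `SHIPPING_CONTRACT` and its paths are hypothetical:

```python
def satisfies_contract(event: dict, required_paths: list[str]) -> list[str]:
    """Return the required dotted paths that are missing from `event`."""
    missing = []
    for path in required_paths:
        node = event
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                missing.append(path)
                break
    return missing

# The Shipping team's declared stable interface (hypothetical):
SHIPPING_CONTRACT = [
    "data.orderId",
    "data.customerId",
    "data.shippingAddress.street",
    "data.shippingAddress.city",
    "data.shippingAddress.zip",
]
```

The producer's build calls `satisfies_contract(sample_event, SHIPPING_CONTRACT)` for each consumer and fails on any non-empty result, so the breaking change surfaces in the producer's CI, not the consumer's.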
Integration Testing with Embedded/In-Memory Brokers
Unit tests with mocked brokers verify logic. Integration tests with real (or real-enough) brokers verify that your code actually works when messages flow through infrastructure.
Testcontainers
Testcontainers is the industry standard for integration testing with real infrastructure. It starts Docker containers for your tests and tears them down afterward. It supports Kafka, RabbitMQ, Pulsar, Redis, and essentially every broker you might use.
// Java integration test with Testcontainers and Kafka
@Testcontainers
class OrderEventIntegrationTest {
@Container
static KafkaContainer kafka = new KafkaContainer(
DockerImageName.parse("confluentinc/cp-kafka:7.5.0")
).withKraft(); // Use KRaft mode, no ZooKeeper needed
private final ObjectMapper objectMapper = new ObjectMapper();
private Producer<String, String> producer;
private Consumer<String, String> consumer;
@BeforeEach
void setUp() {
// Configure producer
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
kafka.getBootstrapServers());
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
StringSerializer.class.getName());
producer = new KafkaProducer<>(producerProps);
// Configure consumer
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
kafka.getBootstrapServers());
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
consumer = new KafkaConsumer<>(consumerProps);
}
@Test
void orderPlacedEventFlowsThroughBroker() throws Exception {
String topic = "orders-" + UUID.randomUUID(); // unique topic per test
consumer.subscribe(Collections.singletonList(topic));
// Produce an event
String event = """
{
"eventType": "OrderPlaced",
"orderId": "ord-123",
"customerId": "cust-456"
}
""";
producer.send(new ProducerRecord<>(topic, "ord-123", event)).get();
// Consume and verify
ConsumerRecords<String, String> records = ConsumerRecords.empty();
Instant deadline = Instant.now().plusSeconds(10);
while (records.isEmpty() && Instant.now().isBefore(deadline)) {
records = consumer.poll(Duration.ofMillis(500));
}
assertFalse(records.isEmpty(), "Expected to receive the event");
ConsumerRecord<String, String> record = records.iterator().next();
assertEquals("ord-123", record.key());
JsonNode eventNode = objectMapper.readTree(record.value());
assertEquals("OrderPlaced", eventNode.get("eventType").asText());
}
@AfterEach
void tearDown() {
producer.close();
consumer.close();
}
}
# Python integration test with testcontainers
import pytest
from testcontainers.kafka import KafkaContainer
from confluent_kafka import Producer, Consumer
import json
import time
import uuid
@pytest.fixture(scope="module")
def kafka_container():
with KafkaContainer(image="confluentinc/cp-kafka:7.5.0") as kafka:
yield kafka
@pytest.fixture
def kafka_producer(kafka_container):
producer = Producer({
'bootstrap.servers': kafka_container.get_bootstrap_server(),
})
yield producer
@pytest.fixture
def kafka_consumer(kafka_container):
consumer = Consumer({
'bootstrap.servers': kafka_container.get_bootstrap_server(),
'group.id': f'test-group-{uuid.uuid4()}',
'auto.offset.reset': 'earliest',
})
yield consumer
consumer.close()
def test_order_event_round_trip(kafka_producer, kafka_consumer):
topic = f"orders-{uuid.uuid4()}"
# Produce
event = {
"eventType": "OrderPlaced",
"orderId": "ord-123",
"customerId": "cust-456",
}
kafka_producer.produce(
topic=topic,
key="ord-123",
value=json.dumps(event),
)
kafka_producer.flush()
# Consume
kafka_consumer.subscribe([topic])
messages = []
deadline = time.time() + 10
while not messages and time.time() < deadline:
msg = kafka_consumer.poll(timeout=0.5)
if msg and not msg.error():
messages.append(msg)
assert len(messages) == 1
received = json.loads(messages[0].value())
assert received["eventType"] == "OrderPlaced"
assert received["orderId"] == "ord-123"
Testcontainers Tips and Warnings
- Use unique topic names per test. Reusing topic names across tests leads to test pollution, where one test's events leak into another test. A UUID suffix on the topic name solves this.
- Use unique consumer group IDs per test. Same reason. Consumer groups track offsets; shared groups mean shared state.
- Set `auto.offset.reset=earliest`. Otherwise your consumer might miss events that were produced before it subscribed.
- Container startup time is real. A Kafka container takes 10-30 seconds to start. Use `scope="module"` or `scope="session"` fixtures to share a container across tests.
- Docker must be available. This seems obvious until your CI pipeline doesn't have Docker. Testcontainers needs a Docker daemon. For CI, you might need Docker-in-Docker (DinD) or a remote Docker host.
Embedded Brokers (When Docker Isn't Available)
Some brokers offer embeddable or in-memory versions for testing:
- Embedded Kafka (`spring-kafka-test` provides `EmbeddedKafka` for Spring Boot applications)
- RabbitMQ has `rabbitmq-mock` libraries
- Redis has embedded alternatives like `embedded-redis`
- Pulsar offers a standalone mode suitable for testing
Embedded brokers start faster than containers but may not behave identically to production brokers. They're a pragmatic choice when Docker isn't available but should not be your only level of integration testing.
End-to-End Testing Strategies — The Timing Problem
End-to-end tests for event-driven systems verify that a complete flow works: an API call triggers an event, the event is consumed, a side effect occurs, and the final state is correct.
The fundamental challenge is the timing problem: when do you check for the expected outcome?
The Naive Approach (Don't Do This)
def test_order_flow_end_to_end():
# Place an order via API
response = requests.post("http://order-service/orders", json=order_data)
assert response.status_code == 201
order_id = response.json()["orderId"]
# Wait for the event to be processed
time.sleep(5) # <-- THIS IS THE PROBLEM
# Check that shipping was created
shipping = requests.get(f"http://shipping-service/shipments?orderId={order_id}")
assert shipping.status_code == 200
A fixed sleep is the most common approach and the worst. It's either too short (flaky test) or too long (slow test). Usually both, depending on the day.
The Polling Approach (Better)
def test_order_flow_end_to_end():
response = requests.post("http://order-service/orders", json=order_data)
order_id = response.json()["orderId"]
# Poll until the expected state appears or timeout
shipment = poll_until(
fn=lambda: requests.get(
f"http://shipping-service/shipments?orderId={order_id}"
),
condition=lambda r: r.status_code == 200 and r.json().get("status") == "CREATED",
timeout_seconds=30,
interval_seconds=0.5,
)
assert shipment.json()["orderId"] == order_id
def poll_until(fn, condition, timeout_seconds, interval_seconds):
"""Poll a function until a condition is met or timeout expires."""
deadline = time.time() + timeout_seconds
last_result = None
while time.time() < deadline:
last_result = fn()
if condition(last_result):
return last_result
time.sleep(interval_seconds)
raise TimeoutError(
f"Condition not met within {timeout_seconds}s. "
f"Last result: {last_result}"
)
Polling is better because it adapts to the actual processing time. When the system is fast, the test is fast. When the system is slow, the test waits longer (up to the timeout).
The Event-Driven Approach (Best)
Instead of polling for the outcome, subscribe to an output event that signals completion:
def test_order_flow_end_to_end():
# Subscribe to the outcome event BEFORE triggering the flow
outcome_consumer = create_consumer(topic="shipment-events")
# Trigger the flow
response = requests.post("http://order-service/orders", json=order_data)
order_id = response.json()["orderId"]
# Wait for the outcome event
event = wait_for_event(
consumer=outcome_consumer,
predicate=lambda e: (
e["eventType"] == "ShipmentCreated" and
e["data"]["orderId"] == order_id
),
timeout_seconds=30,
)
assert event["data"]["orderId"] == order_id
assert event["data"]["status"] == "CREATED"
This is the most natural approach for event-driven systems — you're testing the system the way it actually works, by observing events.
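The `wait_for_event` helper used above is not standard library; a minimal sketch, assuming the consumer exposes a `poll(timeout)` that returns a JSON-encoded message or `None`:

```python
import json
import time

def wait_for_event(consumer, predicate, timeout_seconds=30, poll_interval=0.5):
    """Consume until an event matching `predicate` arrives or time runs out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        msg = consumer.poll(timeout=poll_interval)
        if msg is None:
            continue
        event = json.loads(msg)
        if predicate(event):
            return event
        # Non-matching events (other tests, other orders) are skipped.
    raise TimeoutError(f"No matching event within {timeout_seconds}s")
```

Filtering on the predicate rather than asserting on the first message is what makes this safe to run against a shared topic: events from concurrent tests simply pass through.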
Testing Event Ordering and Idempotency
Ordering Tests
If your system depends on event ordering (and most event-driven systems do, at least within a partition), you need tests that verify ordering is preserved.
def test_events_processed_in_order():
"""Verify that events for the same key are processed in order."""
order_id = f"ord-{uuid.uuid4()}"
events = [
{"eventType": "OrderPlaced", "orderId": order_id, "sequence": 1},
{"eventType": "OrderPaid", "orderId": order_id, "sequence": 2},
{"eventType": "OrderShipped", "orderId": order_id, "sequence": 3},
]
# Produce all events with the same key (same partition)
for event in events:
producer.produce(
topic="orders",
key=order_id,
value=json.dumps(event),
)
producer.flush()
# Consume and verify ordering
consumed = consume_n_events(consumer, n=3, timeout=15)
consumed_sequences = [e["sequence"] for e in consumed]
assert consumed_sequences == [1, 2, 3], \
f"Events received out of order: {consumed_sequences}"
Idempotency Tests
At-least-once delivery means your consumer might see the same event twice. Your tests should verify that processing an event twice produces the same result as processing it once.
def test_order_placed_handler_is_idempotent():
"""Processing the same event twice should not create duplicate side effects."""
handler = OrderPlacedHandler(
inventory_service=real_inventory_service,
db=test_database,
)
event = {
"eventId": "evt-12345",
"eventType": "OrderPlaced",
"data": {
"orderId": "ord-123",
"items": [{"sku": "WIDGET-001", "quantity": 2}],
}
}
# Process the event twice
handler.handle(event)
handler.handle(event) # duplicate delivery
# Verify the side effect happened exactly once
reservations = test_database.query(
"SELECT COUNT(*) FROM inventory_reservations WHERE order_id = %s",
("ord-123",)
)
assert reservations == 1, \
f"Expected 1 reservation, got {reservations}. Handler is not idempotent."
Test the deduplication mechanism explicitly. If your idempotency relies on an `eventId` stored in a deduplication table, write a test that verifies the deduplication table is populated after first processing and checked before second processing. Don't assume the mechanism works — prove it.
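The shape of that mechanism, reduced to its essentials (an in-memory set stands in for the deduplication table; a real handler would use a database table with a unique constraint on `eventId`):

```python
class IdempotentHandler:
    """Wraps a handler with eventId-based deduplication."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_event_ids = set()   # stand-in for a dedup table

    def handle(self, event):
        event_id = event["eventId"]
        if event_id in self.seen_event_ids:
            return False              # duplicate delivery: skip, but still ack
        self.handler(event)
        self.seen_event_ids.add(event_id)
        return True
```

Note the ordering: in a real implementation the dedup record and the side effect must commit in the same transaction, otherwise a crash between the two either reintroduces duplicates or drops events.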
Chaos Engineering: What Happens When the Broker Dies?
You've tested the happy path. Events flow, consumers process, state is correct. Now test what happens when the infrastructure misbehaves — because in production, it will.
Toxiproxy: Network Chaos Made Easy
Toxiproxy sits between your services and the broker, injecting network faults on demand.
# Toxiproxy setup for Kafka chaos testing
from toxiproxy import Toxiproxy
toxiproxy = Toxiproxy()
# Create a proxy in front of Kafka
kafka_proxy = toxiproxy.create(
name="kafka",
listen="0.0.0.0:19092",
upstream="kafka-broker:9092"
)
# Your application connects through the proxy, not the real broker
producer = Producer({
    'bootstrap.servers': 'toxiproxy:19092',  # proxy address
    # ... other config
})
def test_producer_retries_on_network_latency():
    """Producer should succeed even with network latency."""
    # Add 2 seconds of latency
    kafka_proxy.add_toxic(
        name="latency",
        type="latency",
        attributes={"latency": 2000, "jitter": 500}
    )
    try:
        # Producer should still succeed; flush blocks until all
        # outstanding messages are delivered or the timeout expires
        producer.produce("orders", json.dumps(event))
        remaining = producer.flush(timeout=30)
        assert remaining == 0, f"{remaining} messages undelivered"
    finally:
        kafka_proxy.remove_toxic("latency")
def test_consumer_recovers_from_broker_disconnect():
"""Consumer should resume processing after a broker outage."""
# Produce some events
for i in range(10):
producer.produce("orders", f'{{"seq": {i}}}')
producer.flush()
# Cut the connection
kafka_proxy.add_toxic(
name="disconnect",
type="reset_peer",
attributes={"timeout": 0}
)
# Wait for the outage to be noticed
time.sleep(5)
# Restore the connection
kafka_proxy.remove_toxic("disconnect")
# Verify the consumer catches up
consumed = consume_all_events(consumer, topic="orders", timeout=30)
assert len(consumed) == 10, f"Expected 10 events, got {len(consumed)}"
Chaos Scenarios to Test
| Scenario | What You're Testing | How to Inject |
|---|---|---|
| Broker becomes unreachable | Producer retries, consumer reconnection | Toxiproxy reset_peer or Docker pause |
| High network latency | Timeout handling, request timeouts | Toxiproxy latency toxic |
| Degraded/lossy network | Retry logic, duplicate handling | Toxiproxy slicer or limit_data toxics (approximation; there is no true packet-loss toxic) |
| Broker disk full | Producer backpressure, error handling | Fill the container's disk |
| Consumer process crash mid-processing | Offset management, reprocessing | kill -9 the consumer process |
| Rebalancing during processing | At-least-once processing, offset commit timing | Add/remove consumers from the group |
| Schema registry unavailable | Serialization failure handling | Stop the schema registry container |
Load Testing Event Pipelines
Event-driven systems have different failure modes under load than synchronous systems. Instead of returning HTTP 503, they silently accumulate lag. A system that looks fine at 1,000 events/second might fall apart at 10,000 — but the failure manifests as growing consumer lag, not immediate errors.
# Simple load test for event throughput
import time
from confluent_kafka import Producer
import json
def load_test_producer(bootstrap_servers, topic, num_events, batch_size=1000):
producer = Producer({
'bootstrap.servers': bootstrap_servers,
'linger.ms': 50, # batch events for better throughput
'batch.num.messages': 10000,
'queue.buffering.max.messages': 100000,
'compression.type': 'lz4',
})
start = time.time()
delivered = 0
errors = 0
def delivery_callback(err, msg):
nonlocal delivered, errors
if err:
errors += 1
else:
delivered += 1
for i in range(num_events):
event = {
"eventType": "LoadTestEvent",
"sequence": i,
"timestamp": time.time(),
"payload": "x" * 500, # ~500 byte payload
}
producer.produce(
topic=topic,
key=str(i % 100), # distribute across 100 keys
value=json.dumps(event),
callback=delivery_callback,
)
        # Periodically serve delivery callbacks and free queue space.
        # (A full flush() here would stall batching; poll(0) does not block.)
        if i % batch_size == 0:
            producer.poll(0)
producer.flush()
elapsed = time.time() - start
print(f"Produced {num_events} events in {elapsed:.2f}s")
print(f"Throughput: {num_events / elapsed:.0f} events/sec")
print(f"Delivered: {delivered}, Errors: {errors}")
# Run it
load_test_producer(
bootstrap_servers="broker1:9092",
topic="load-test",
num_events=1_000_000
)
What to measure during a load test:
- Producer throughput (events/second)
- Consumer throughput (events/second consumed)
- Consumer lag (the gap between what's been produced and what's been consumed)
- End-to-end latency (time from produce to consume — embed a timestamp in the event)
- Broker resource utilization (CPU, memory, disk I/O, network)
- Error rates (serialization failures, timeout errors, rebalancing events)
The most important metric is consumer lag under sustained load. If lag grows without bound, your consumers can't keep up, and you need to add consumers (up to the partition count), add partitions, or make per-event processing faster.
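Consumer lag itself is simple arithmetic per partition: the broker's end offset minus the group's committed offset. A sketch (the offset maps keyed by partition number are an assumed representation; the real values come from client APIs such as Kafka's watermark and committed-offset lookups):

```python
def total_consumer_lag(end_offsets: dict[int, int],
                       committed_offsets: dict[int, int]) -> int:
    """Sum of (end offset - committed offset) across partitions.

    A partition with no committed offset counts as fully lagging.
    """
    lag = 0
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag += max(0, end - committed)
    return lag
```

Plot this over time during the load test: flat or shrinking lag is healthy; monotonically growing lag means the consumers cannot keep up at that throughput.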
Testing Schema Evolution
Schema evolution — changing the format of events over time — is inevitable. Testing that old consumers can handle new schemas (backward compatibility) and new consumers can handle old schemas (forward compatibility) prevents production outages during deployment.
# Test backward compatibility: new schema, old consumer
def test_old_consumer_handles_new_event_format():
"""An existing consumer should gracefully handle events with new fields."""
# Old event format (v1)
v1_event = {
"eventType": "OrderPlaced",
"version": 1,
"data": {
"orderId": "ord-123",
"customerId": "cust-456",
"totalAmount": "59.98",
}
}
# New event format (v2) — added 'currency' and 'loyaltyPoints'
v2_event = {
"eventType": "OrderPlaced",
"version": 2,
"data": {
"orderId": "ord-456",
"customerId": "cust-789",
"totalAmount": "99.99",
"currency": "EUR", # new field
"loyaltyPoints": 150, # new field
}
}
# The v1 consumer should handle both
consumer = OrderPlacedConsumerV1()
consumer.handle(v1_event) # should work (same version)
consumer.handle(v2_event) # should work (ignores unknown fields)
# Verify both were processed correctly
assert len(consumer.processed_orders) == 2
# Test forward compatibility: old schema, new consumer
def test_new_consumer_handles_old_event_format():
"""A new consumer should handle events from before the schema change."""
v1_event = {
"eventType": "OrderPlaced",
"version": 1,
"data": {
"orderId": "ord-123",
"customerId": "cust-456",
"totalAmount": "59.98",
# no 'currency' field — it didn't exist in v1
}
}
consumer = OrderPlacedConsumerV2() # expects currency but has a default
consumer.handle(v1_event)
order = consumer.processed_orders[0]
assert order.total_amount == Decimal("59.98")
assert order.currency == "USD" # default value when field is absent
Schema Registry Compatibility Testing
If you're using a schema registry (you should be), test compatibility before registering a new schema:
#!/bin/bash
# Test schema compatibility before deployment
SCHEMA_REGISTRY_URL="http://schema-registry:8081"
SUBJECT="orders-value"
NEW_SCHEMA_FILE="schemas/order-placed-v2.avsc"
# Check compatibility
RESULT=$(curl -s -X POST \
"${SCHEMA_REGISTRY_URL}/compatibility/subjects/${SUBJECT}/versions/latest" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d "{\"schema\": $(cat ${NEW_SCHEMA_FILE} | jq -Rs .)}")
IS_COMPATIBLE=$(echo "$RESULT" | jq -r '.is_compatible')
if [ "$IS_COMPATIBLE" != "true" ]; then
echo "INCOMPATIBLE SCHEMA CHANGE DETECTED"
echo "Details: $RESULT"
exit 1
fi
echo "Schema is compatible. Safe to deploy."
The Test Pyramid for Event-Driven Systems
The traditional test pyramid (many unit tests, fewer integration tests, even fewer E2E tests) doesn't map cleanly onto event-driven systems. A more accurate model:
/\
/ \
/ E2E \ (few, slow, high-confidence)
/--------\
/ Chaos \ (periodic, infrastructure-focused)
/ Engineering \
/----------------\
/ Contract Tests \ (per-consumer, per-producer)
/--------------------\
/ Integration Tests \ (with real broker via Testcontainers)
/------------------------\
/ Unit Tests (handlers) \ (fast, many, logic-focused)
/----------------------------\
- Unit tests: Test event construction, handler logic, serialization/deserialization. Fast. Many. No broker.
- Integration tests: Test event flow through a real broker. Producer -> broker -> consumer. Testcontainers. Slower. Fewer.
- Contract tests: Verify producer-consumer agreements. Can be run independently. Medium speed.
- Chaos tests: Verify resilience. Periodic, not on every commit. Slow.
- E2E tests: Verify complete business flows. Few. Slow. High-maintenance. Essential.
Coverage Guidance
| What to Test | Level | How |
|---|---|---|
| Event payload construction | Unit | Mock the publisher |
| Event handler business logic | Unit | Pass in events directly |
| Serialization/deserialization | Unit | Round-trip test with schema |
| Event flow through broker | Integration | Testcontainers |
| Producer-consumer contract | Contract | Pact or schema compatibility |
| Ordering guarantees | Integration | Produce N events, verify order |
| Idempotency | Integration | Process same event twice, verify state |
| Error handling (poison pill) | Integration | Produce invalid event, verify DLQ |
| Schema evolution | Integration + Contract | Both old and new formats |
| Broker failure recovery | Chaos | Toxiproxy, container stop/start |
| Consumer lag under load | Load | Sustained traffic test |
| Complete business flow | E2E | API to final state, polling or event-based |
Anti-Patterns: Testing the Broker and Flaky Async Assertions
Anti-Pattern 1: Testing the Broker
# DON'T DO THIS
def test_kafka_retains_messages():
"""Verify Kafka retains messages for the configured retention period."""
# ...
# This is Kafka's job. Kafka has its own tests. Test YOUR code.
You're not testing Kafka. You're not testing RabbitMQ. You didn't write them. They have their own test suites. Test your code's interaction with the broker, not the broker's behavior.
Anti-Pattern 2: Fixed Sleep in Assertions
# DON'T DO THIS
def test_event_is_consumed():
    producer.produce("orders", event)
    time.sleep(10)  # hope and pray
    assert consumer.last_event == event
Use polling with a timeout. Use event-based verification. Use Awaitility (Java) or tenacity (Python). Never use a fixed sleep as your synchronization mechanism.
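A minimal polling helper makes the same point in code. This is a sketch in plain Python (the `consumer` in the usage comment is hypothetical), standing in for what Awaitility or tenacity give you out of the box:

```python
import time

def await_condition(check, timeout=10.0, interval=0.1):
    """Poll `check` until it returns a truthy value or the timeout expires.

    Returns the truthy value, or raises TimeoutError. `check` is any
    zero-argument callable, e.g. one that peeks at a consumer's state.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage in the test above (consumer is hypothetical):
# event = await_condition(lambda: consumer.last_event, timeout=10)
# assert event == expected_event
```

The test now finishes as soon as the event arrives, instead of always paying the full sleep, and fails with a clear timeout instead of a confusing assertion error.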
Anti-Pattern 3: Shared State Between Tests
# DON'T DO THIS
TOPIC = "orders"  # all tests share this topic

def test_order_placed():
    produce_to(TOPIC, order_placed_event)
    # might consume an event from a different test
Use unique topic names per test. Use unique consumer groups per test. Tests should be independent.
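One way to get that isolation is a tiny naming helper; the pytest fixture shown in the comment is an assumption about your test setup, not a requirement:

```python
import uuid

def unique_name(base: str) -> str:
    """Return a broker-safe unique name, e.g. 'orders-3f2a9c1d'."""
    return f"{base}-{uuid.uuid4().hex[:8]}"

# With pytest, each test can get its own topic and consumer group:
# @pytest.fixture
# def topic():
#     return unique_name("orders")
#
# @pytest.fixture
# def group_id():
#     return unique_name("test-group")
```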
Anti-Pattern 4: Testing Only the Happy Path
If your test suite doesn't include tests for: malformed events, duplicate events, events arriving out of order, broker unavailability, and schema mismatches — your test suite is a wish list, not a verification.
Anti-Pattern 5: Not Testing Consumer Offset Management
# DO THIS — verify your consumer resumes correctly after restart
def test_consumer_resumes_from_last_committed_offset():
    """After restart, consumer should process only new events."""
    # Produce 5 events
    for i in range(5):
        producer.produce("orders", f'{{"seq": {i}}}')
    producer.flush()
    # Consume all 5 and commit
    consumed_first = consume_n_events(consumer, n=5, timeout=10)
    consumer.commit()
    consumer.close()
    # Produce 3 more events
    for i in range(5, 8):
        producer.produce("orders", f'{{"seq": {i}}}')
    producer.flush()
    # Create a new consumer with the same group ID
    new_consumer = create_consumer(group_id=consumer_group_id)
    consumed_second = consume_n_events(new_consumer, n=3, timeout=10)
    # Should get events 5, 6, 7 — not 0-4 again
    sequences = [json.loads(e)["seq"] for e in consumed_second]
    assert sequences == [5, 6, 7]
Summary
Testing event-driven systems requires a fundamentally different approach from testing synchronous systems. The key principles:
- Separate business logic from infrastructure. Extract event handlers into testable units that don't depend on the broker.
- Use contract tests to verify producer-consumer agreements without running both simultaneously.
- Use Testcontainers for integration tests with real brokers. Embedded fakes are acceptable fallbacks, not first choices.
- Never use fixed sleeps. Use polling, event-based verification, or awaiting libraries.
- Test idempotency explicitly. Process every event at least twice in your tests.
- Test failure modes. Chaos engineering isn't optional for production event-driven systems.
- Test schema evolution. Old consumers with new events. New consumers with old events. Both directions.
- Use unique topics and consumer groups per test. Shared state between tests is the fastest path to a flaky test suite.
The goal isn't 100% coverage — it's confidence that your system behaves correctly under normal conditions and degrades gracefully under abnormal ones. In an event-driven system, "abnormal conditions" includes most of the conditions you'll encounter in production.
Anti-Patterns and Pitfalls
Every architectural style has its failure modes — the predictable ways that well-intentioned teams turn a good idea into a bad system. Event-driven architecture is no exception. In fact, because EDA is genuinely powerful and genuinely different from what most teams are used to, its failure modes are often more spectacular. A poorly designed monolith is slow. A poorly designed event-driven system is slow, inconsistent, impossible to debug, and occasionally loses data in ways that take weeks to discover.
This chapter is a catalog of the ways things go wrong. Some of these are mistakes you make during initial design. Others are diseases that develop gradually, like architectural arthritis, until one day the system can barely move. All of them are easier to prevent than to cure.
Event Soup — When Everything Is an Event and Nothing Makes Sense
The pattern: The team discovers events and goes all-in. Every action, every state change, every internal implementation detail becomes an event. The system produces hundreds of event types, many of which are consumed by nobody, and the event stream becomes an incomprehensible torrent of noise.
What it looks like:
Topic: user-events
Events:
UserLoggedIn
UserLoggedOut
UserClickedButton
UserScrolledPage
UserMovedMouse <- really?
UserHoveredOverLink
UserResizedBrowser
UserSessionHeartbeat
UserPreferencesLoaded
UserCacheWarmed
UserDatabaseQueryExecuted <- this is not a domain event
UserThreadPoolAdjusted <- this is an internal implementation detail
Why it happens: Teams conflate domain events with technical events. A domain event represents something meaningful that happened in the business domain — "a customer placed an order." A technical event represents something that happened inside a system — "the connection pool was resized." These are fundamentally different things and should not live in the same event stream or, in most cases, in an event stream at all.
The damage: Consumer teams can't find the events they care about amid the noise. Topic size grows rapidly, increasing storage costs and replay times. New developers stare at the event catalog and give up trying to understand it. Monitoring alerts fire on meaningless events.
How to fix it:
- Apply the "would a domain expert care?" test. If you told a product manager about this event, would they care? "Customer placed an order" — yes. "Thread pool resized to 50" — no.
- Separate domain events from operational telemetry. Operational data belongs in metrics, logs, and traces — not in domain event streams.
- Require at least one consumer before publishing an event. If nobody consumes it, don't produce it. Events without consumers are just disk usage.
- Maintain an event catalog with ownership, purpose, and known consumers for each event type.
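The catalog itself can be as simple as reviewed data files in a repository. A minimal in-code sketch (all names illustrative), which also makes the "no consumers, no event" rule mechanically checkable:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    name: str        # event type, e.g. "OrderPlaced"
    owner: str       # team accountable for the schema
    purpose: str     # one sentence a domain expert would recognize
    consumers: tuple # known consuming services

CATALOG = {
    "OrderPlaced": CatalogEntry(
        name="OrderPlaced",
        owner="checkout-team",
        purpose="A customer completed checkout and an order now exists.",
        consumers=("payments", "inventory", "analytics"),
    ),
}

def orphaned_events(catalog):
    """Events with no known consumers — candidates for removal."""
    return [e.name for e in catalog.values() if not e.consumers]
```

Run `orphaned_events` in CI and the event catalog stops rotting quietly.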
God Events — The 47-Field Event That Knows Too Much
The pattern: A single event type carries an enormous payload containing everything any consumer might ever need. Instead of publishing focused events that represent specific things that happened, the producer dumps its entire internal state into every event.
What it looks like:
{
  "eventType": "OrderUpdated",
  "orderId": "ord-123",
  "customerId": "cust-456",
  "customerName": "Alice Smith",
  "customerEmail": "alice@example.com",
  "customerPhone": "+1-555-0123",
  "customerLoyaltyTier": "GOLD",
  "customerLifetimeValue": 12450.00,
  "customerSignupDate": "2019-03-15",
  "shippingStreet": "123 Main St",
  "shippingCity": "Springfield",
  "shippingState": "IL",
  "shippingZip": "62701",
  "shippingCountry": "US",
  "billingStreet": "456 Oak Ave",
  "billingCity": "Springfield",
  "billingState": "IL",
  "billingZip": "62702",
  "billingCountry": "US",
  "items": [...],
  "subtotal": 59.98,
  "taxRate": 0.0825,
  "taxAmount": 4.95,
  "shippingCost": 5.99,
  "totalAmount": 70.92,
  "currency": "USD",
  "paymentMethod": "CREDIT_CARD",
  "paymentLast4": "4242",
  "paymentBrand": "VISA",
  "warehouseId": "wh-east-1",
  "fulfillmentPriority": "STANDARD",
  "estimatedDelivery": "2025-11-20",
  "internalNotes": "Customer called about delivery window",
  "createdAt": "2025-11-15T10:30:00Z",
  "updatedAt": "2025-11-15T14:22:00Z",
  "updatedBy": "system",
  "previousStatus": "PENDING",
  "currentStatus": "CONFIRMED",
  "statusReason": "Payment verified",
  "version": 17
}
Why it happens: The producer team doesn't know what consumers need, so they include everything. Or a single generic event type like OrderUpdated replaces what should be multiple specific events (OrderConfirmed, OrderShipped, OrderCancelled).
The damage:
- Tight coupling. Every consumer is coupled to the producer's internal data model. When the producer renames `customerLoyaltyTier` to `loyaltyLevel`, every consumer breaks.
- PII sprawl. The event contains customer PII (name, email, phone, addresses) even when the consumer only needed the order status. Now every consumer of this topic has access to PII, regardless of whether they need it.
- Schema evolution hell. Evolving a 47-field schema is exponentially harder than evolving a 7-field schema. Every field is a potential breaking change.
- Bandwidth waste. The shipping service needs `orderId`, the shipping address, and `items`. It receives 2KB of data it ignores.
How to fix it:
- Use specific event types instead of generic ones. `OrderConfirmed` is better than `OrderUpdated` with `currentStatus: "CONFIRMED"`. Each event type carries only the fields relevant to what happened.
- Apply interface segregation. An event should contain only the data needed to understand what happened. Consumers that need more context can look it up by ID.
- Separate the event from the entity. An event is not a database row notification. It's a record of something that happened, with enough context to be meaningful but not so much that it's a data dump.
// BETTER: Focused event
{
  "eventType": "OrderConfirmed",
  "orderId": "ord-123",
  "customerId": "cust-456",
  "confirmedAt": "2025-11-15T14:22:00Z",
  "totalAmount": 70.92,
  "currency": "USD",
  "estimatedDelivery": "2025-11-20"
}
The Distributed Monolith — Temporal Coupling Through Events
The pattern: Services are technically separate, deployed independently, and communicate through events. But they're coupled so tightly through event dependencies that you can't change, deploy, or operate any of them independently. You've achieved the worst of both worlds: the complexity of distribution with none of the benefits.
What it looks like:
OrderService (1) OrderPlaced →
PaymentService (2) PaymentProcessed →
InventoryService (3) InventoryReserved →
ShippingService (4) ShipmentCreated →
NotificationService (5) CustomerNotified →
AnalyticsService (6) OrderAnalyticsUpdated →
LoyaltyService (7) LoyaltyPointsAwarded
Every service waits for the previous service to complete before it can act. Changing the order of operations requires changing multiple services. A failure in step 3 cascades to steps 4-7. You have a sequential pipeline disguised as an event-driven architecture.
The test: Can you deploy InventoryService without coordinating with PaymentService and ShippingService? If not, you have a distributed monolith.
Why it happens: Teams take an existing sequential workflow and replace synchronous calls with events without rethinking the workflow. The arrows change from HTTP calls to events, but the dependencies don't.
The damage:
- Deploy coupling. Services must be deployed in a specific order. A schema change in
OrderPlacedtriggers a cascade of changes across every downstream service. - Fragile chains. The reliability of the chain is the product of the reliability of each link. If each service is 99.9% reliable, a 7-service chain is 99.3% reliable — and that's before accounting for the broker.
- Debugging nightmares. An end-to-end operation touches 7 services. Finding where something went wrong requires correlating events across all of them.
How to fix it:
- Identify truly independent reactions. In the example above, `ShippingService` genuinely needs to know the payment succeeded. But does `AnalyticsService` need to wait for `ShipmentCreated`? Can it react directly to `OrderPlaced`?
- Fan out, don't chain. Multiple services reacting to the same event is good (fan-out). Services forming a daisy chain where each depends on the previous one's output is often bad (pipeline).
- Choreography doesn't mean sequential. The beauty of event-driven choreography is that independent actions happen in parallel. If everything is serial, you might want an orchestrator (a saga) that coordinates explicitly.
// BETTER: Fan-out from the originating event
OrderService: OrderPlaced →
├── PaymentService (reacts to OrderPlaced)
├── InventoryService (reacts to OrderPlaced)
├── AnalyticsService (reacts to OrderPlaced)
└── NotificationService (reacts to OrderPlaced)
PaymentService: PaymentProcessed →
├── ShippingService (reacts to PaymentProcessed + InventoryReserved)
└── LoyaltyService (reacts to PaymentProcessed)
Chatty Services — Death by a Thousand Events
The pattern: Services communicate every minor internal state change as an event. A single user action generates dozens of events, each triggering further processing, which generates more events, which triggers more processing. The system is drowning in its own verbosity.
What it looks like:
User clicks "Place Order" -> the system produces:
OrderInitiated
OrderValidationStarted
OrderAddressValidated
OrderItemsValidated
OrderPricingValidated
OrderValidationCompleted
OrderPaymentInitiated
OrderPaymentAuthorized
OrderPaymentCaptured
OrderPaymentCompleted
OrderInventoryCheckStarted
OrderInventoryAvailable
OrderInventoryReserved
OrderInventoryCheckCompleted
OrderConfirmed
OrderConfirmationEmailQueued
OrderConfirmationEmailSent
OrderAnalyticsRecorded
Eighteen events for one user action. And if the notification service fails and retries, you get more events for the retry. And if the analytics service processes events in batches, it might re-emit its own events for each batch.
Why it happens: Over-decomposition. The team has internalized "events are good" and concluded "more events are more good." Or each team is independently logging their internal state transitions as events, not realizing that the aggregate volume is crushing the broker.
The damage:
- Broker overload. Event volume grows superlinearly with user activity.
- Consumer lag. Consumers spend most of their time processing noise events they don't care about.
- Increased latency. More events means more broker writes, more network traffic, more consumer processing.
- Storage costs. All those events are stored. Retained. Replicated. Backed up.
How to fix it:
- Distinguish between internal and external events. `OrderInventoryCheckStarted` is an internal state transition of the `InventoryService`. It should be a log line, not an event on a shared topic.
- Publish outcome events, not step events. `OrderConfirmed` is an outcome. `OrderAddressValidated` is a step. The outside world cares about outcomes.
- Batch related state changes. Instead of 5 events for the payment lifecycle, publish one `PaymentCompleted` event with the relevant details.
- Measure your event-to-business-action ratio. If one user action generates more than 3-5 external events, question whether all of them need to exist.
Premature Event Sourcing — "We Might Need the History Someday"
The pattern: The team adopts event sourcing — storing all state as a sequence of events rather than as current state — for every service, regardless of whether the benefits justify the costs. The justification is usually some variation of "it gives us a complete audit trail" or "we might need to replay events someday."
Why it's a problem: Event sourcing is a powerful technique with genuine use cases: financial systems where you need a complete audit trail, collaborative editing where you need to merge divergent histories, domains with complex temporal queries. But it's also one of the most operationally expensive architectural patterns in existence.
The costs nobody mentions in the conference talk:
- Event store management. You now have an append-only log that grows forever. Snapshots help but add complexity. Compaction has different semantics than in a traditional database.
- Projection rebuilds. When a read model projection has a bug, you need to replay all events to rebuild it. For a mature system with millions of events, this can take hours or days.
- Schema evolution is brutal. Every version of every event format must be deserializable forever. You can't just run a database migration. You need upcasters that convert old event formats to new ones on the fly.
- Debugging difficulty. "What's the current state?" requires replaying events. You can't just `SELECT * FROM orders WHERE id = 'ord-123'`.
- Eventual consistency everywhere. Read models are projections of the event stream, and they're always at least slightly behind. This is fine for most cases and terrible for others (like showing a user the order they just placed).
# When event sourcing is warranted
class BankAccount:
    """
    Financial regulations require a complete, immutable audit trail.
    Event sourcing is a natural fit.
    """
    def apply(self, event):
        match event:
            case Deposited(amount=amount):
                self.balance += amount
            case Withdrawn(amount=amount):
                self.balance -= amount
            case Frozen(reason=reason):
                self.is_frozen = True

# When event sourcing is NOT warranted
class UserPreferences:
    """
    User changed their notification settings.
    Nobody needs the history of notification preference changes.
    Just use a database.
    """
    pass  # Use a regular UPDATE statement. Seriously.
How to fix it:
- Use event sourcing only where the history is the feature. If the business requirement is "show me the current state," a database is simpler, faster, and easier to operate.
- Event sourcing per aggregate, not per system. The `BankAccount` aggregate might be event-sourced. The `UserPreferences` aggregate should not be.
- CQRS without event sourcing. You can have separate read and write models (CQRS) without storing the write model as events. Many of the benefits of CQRS come from the separation of concerns, not from the event store.
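To make that last point concrete, here is a minimal in-memory sketch of CQRS without an event store (all names illustrative): the write model holds plain current state, and a denormalized read model is updated from the events the write side publishes.

```python
class OrderWriteModel:
    """Current state only — a stand-in for a normal database table."""
    def __init__(self):
        self.orders = {}  # order_id -> status

    def confirm(self, order_id):
        # Update current state, then emit an event describing what happened.
        self.orders[order_id] = "CONFIRMED"
        return {"type": "OrderConfirmed", "orderId": order_id}

class OrderCountProjection:
    """A read model optimized for one query, fed by published events."""
    def __init__(self):
        self.confirmed_count = 0

    def apply(self, event):
        if event["type"] == "OrderConfirmed":
            self.confirmed_count += 1
```

The read and write sides are separated, yet nothing here stores history: state lives as state, and events are notifications, not the source of truth.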
The Event-Driven Bandwagon — Using EDA Because It's Trendy
The pattern: The team adopts event-driven architecture not because the problem demands it, but because it's what the industry is talking about, it looks good on a resume, or someone went to a conference.
Symptoms:
- The system has fewer than five services and traffic that a single PostgreSQL database handles comfortably.
- Events are consumed by exactly one consumer (in which case, why not a direct call?).
- The team spent three months setting up Kafka for a system that processes 100 events per day.
- Every architecture discussion includes the phrase "but what if we need to scale?" about a product that has 200 users.
The honest truth: Most software systems don't need event-driven architecture. A well-designed monolith with a relational database handles an enormous range of requirements. EDA adds value when you have genuine decoupling requirements, multiple independent consumers for the same data, high throughput demands, or complex workflows that benefit from choreography.
The test: Would a synchronous API call between these two services work? Is there a reason it can't be synchronous? If the only reason for using events is "events are better," you don't have a reason.
How to fix it: Be honest about your requirements. If you've already deployed the infrastructure, consider whether you can simplify by replacing some event-driven communication with direct calls where appropriate. There's no shame in synchronous communication. It's been powering the internet since before most of your team was born.
Synchronous Disguised as Asynchronous — Request-Reply Over Events
The pattern: A service publishes an event and then blocks waiting for a response event. The producer has a correlation ID, a timeout, and a temporary reply topic. Congratulations, you've reinvented HTTP but worse.
# This is a synchronous call pretending to be asynchronous
class OrderService:
    def place_order(self, order):
        correlation_id = str(uuid.uuid4())
        # Publish "request" event
        self.producer.produce("payment-requests", {
            "correlationId": correlation_id,
            "orderId": order.id,
            "amount": order.total,
        })
        # Block waiting for "response" event
        response = self.reply_consumer.wait_for(
            topic="payment-responses",
            correlation_id=correlation_id,
            timeout=30,  # seconds
        )
        if response is None:
            raise TimeoutError("Payment service didn't respond")
        if response["status"] == "APPROVED":
            return OrderConfirmation(order.id)
        else:
            raise PaymentDeclinedException(response["reason"])
Why it's a problem:
- You've added the latency of two broker hops (request + response) to what could have been one HTTP call.
- You've added the complexity of correlation IDs, reply topics, and timeout handling.
- You've lost the benefits of asynchronous communication (temporal decoupling, independent scaling) because the producer is blocking anyway.
- If the broker is down, the "synchronous" call fails — and you don't even get a clear HTTP error code.
When request-reply over events IS appropriate: When the request and response genuinely occur at different times (hours, days), when you need the request to be durably queued, or when you need multiple services to see the request (fan-out with response aggregation). These are rare cases.
How to fix it: Use a synchronous call (HTTP, gRPC) for request-response interactions. Use events for fire-and-forget notifications and reactions. If you need the call to be resilient, add retries and a circuit breaker to the HTTP call. That's what those patterns are for.
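Those two patterns are small enough to sketch in plain Python. This is an illustrative, minimal circuit breaker with retries wrapped around a synchronous call (`fn` stands in for your HTTP or gRPC client call), not a substitute for a hardened library:

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Open after `max_failures` consecutive failures; fail fast
    until `reset_after` seconds have passed, then allow one retry."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, retries=2, backoff=0.1):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open, failing fast")
            self.opened_at = None  # half-open: allow one attempt
        last_exc = None
        for attempt in range(retries + 1):
            try:
                result = fn()
                self.failures = 0  # success resets the failure count
                return result
            except Exception as exc:
                last_exc = exc
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                    raise CircuitOpenError("too many failures") from exc
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
        raise last_exc
```

With this in front of the payment call, the order service degrades cleanly when the payment service is down, without reinventing request-reply over the broker.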
Schema Anarchy — No Governance, No Contracts, No Hope
The pattern: There is no schema registry, no schema validation, and no governance over event formats. Each producer publishes whatever JSON it feels like. Consumers parse with json.loads() and hope.
What it looks like in production:
# Producer A's idea of an OrderPlaced event
{"type": "order_placed", "order_id": "123", "amount": 59.98}
# Producer B's idea of an OrderPlaced event
{"eventType": "OrderPlaced", "orderId": "ORD-123", "totalAmount": "59.98", "currency": "USD"}
# Producer C's idea (after a Friday afternoon refactor)
{"event": "ORDER_PLACED", "id": "123", "total": 5998} # amount in cents, because why not
# The consumer
def handle_order(event):
    order_id = event.get("order_id") or event.get("orderId") or event.get("id")
    amount = event.get("amount") or event.get("totalAmount") or event.get("total")
    if isinstance(amount, str):
        amount = float(amount)
    if amount > 1000:  # probably cents?
        amount = amount / 100
    # I hate my life
Why it happens: Schema governance is unglamorous work. Nobody gets promoted for setting up a schema registry. The team moves fast in the early days, shipping features without formal schemas, and by the time the pain is unbearable, there are 200 event types in production with no consistent format.
The damage:
- Consumer fragility. Consumers break on every producer change because there's no contract.
- Silent data corruption. A producer changes a field from dollars to cents. The consumer doesn't know. Reports are wrong for three weeks.
- Onboarding difficulty. New developers cannot understand the system because there's no authoritative documentation of event formats.
- Impossible schema evolution. You can't evolve what you haven't defined.
How to fix it:
- Deploy a schema registry. Confluent Schema Registry, Apicurio, or even a Git repository with reviewed schema files.
- Enforce schema validation on produce. Events that don't match the registered schema are rejected. No exceptions.
- Establish naming conventions. camelCase or snake_case, pick one. `eventType` or `type`, pick one. Document it. Enforce it in code review.
- Require schema review for new event types. Like database migration review, but for events.
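Even before a full schema registry exists, a thin producer wrapper can enforce registered formats at produce time. A deliberately simple sketch with a hand-rolled field check (a real setup would validate against JSON Schema or Avro via a registry client; all names here are illustrative):

```python
import json

# "Registry": event type -> required field names and expected types
SCHEMAS = {
    "OrderPlaced": {"eventType": str, "orderId": str, "totalAmount": float},
}

class ValidatingProducer:
    """Wraps the real broker producer; rejects off-contract events."""
    def __init__(self, inner):
        self.inner = inner

    def produce(self, topic, event):
        schema = SCHEMAS.get(event.get("eventType"))
        if schema is None:
            raise ValueError(f"unregistered event type: {event.get('eventType')!r}")
        for field_name, field_type in schema.items():
            if not isinstance(event.get(field_name), field_type):
                raise ValueError(
                    f"field {field_name!r} missing or not {field_type.__name__}"
                )
        self.inner.produce(topic, json.dumps(event))
```

Nothing off-contract reaches the topic, so the consumer's pile of defensive `get()` fallbacks becomes unnecessary.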
The Dual-Write Problem — Writing to DB and Broker Without Coordination
The pattern: A service writes to its database and publishes an event in two separate operations without coordination. If one succeeds and the other fails, the database and the event stream are inconsistent.
# THE BUG
def place_order(self, order):
    # Step 1: Write to database
    self.db.insert(order)  # succeeds
    # Step 2: Publish event
    self.producer.produce(  # FAILS (broker is down)
        "orders",
        OrderPlaced(order)
    )
    # Result: order exists in DB but no event was published.
    # Downstream services don't know the order exists.
    # The customer gets charged but shipping never starts.

# THE OTHER BUG (reversing the order doesn't help)
def place_order(self, order):
    # Step 1: Publish event
    self.producer.produce(  # succeeds
        "orders",
        OrderPlaced(order)
    )
    # Step 2: Write to database
    self.db.insert(order)  # FAILS (unique constraint violation)
    # Result: event was published but order doesn't exist in DB.
    # Downstream services try to process a phantom order.
Why it happens: Developers are accustomed to transactional databases, where two writes either both succeed or both fail. Databases and message brokers are separate systems and don't share transactions (in general).
How to fix it:
- Transactional outbox pattern. Write the event to an "outbox" table in the same database transaction as the business data. A separate process reads the outbox and publishes to the broker.
def place_order(self, order):
    with self.db.transaction() as tx:
        tx.insert("orders", order)
        tx.insert("outbox", {
            "topic": "orders",
            "key": order.id,
            "payload": json.dumps(OrderPlaced(order).to_dict()),
            "created_at": datetime.utcnow(),
        })
    # Both writes succeed or both fail — atomic.
    # A separate outbox relay process publishes the events.
- Change Data Capture (CDC). Use Debezium or a similar tool to capture changes from the database's transaction log and publish them as events. The database is the source of truth; events are derived.
- Event sourcing. If the event IS the write (event sourcing), there's no dual write — there's only one write.
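The relay half of the transactional outbox can be sketched as a single polling pass (the three callables are illustrative stand-ins for your DB queries and broker producer; note the relay itself is at-least-once, which is another reason consumers must be idempotent):

```python
def relay_outbox(fetch_unpublished, publish, mark_published, batch_size=100):
    """One pass of an outbox relay loop.

    fetch_unpublished(n) -> list of outbox rows (dicts: id, topic, key, payload)
    publish(topic, key, payload) -> sends to the broker, raises on failure
    mark_published(row_id) -> records that the row was sent

    Returns the number of rows relayed. A crash between publish and
    mark_published re-sends the row on the next pass: at-least-once.
    """
    rows = fetch_unpublished(batch_size)
    for row in rows:
        publish(row["topic"], row["key"], row["payload"])
        mark_published(row["id"])  # only after a successful publish
    return len(rows)
```

In production this loop runs on a timer or tails the outbox table; tools like Debezium's outbox event router implement the same idea off the transaction log.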
Missing Idempotency — "It Worked in Dev"
The pattern: Consumers process events without any deduplication or idempotency mechanism. In development, with at-most-once delivery and a single consumer, everything looks fine. In production, with at-least-once delivery, rebalancing, and retries, customers get charged twice.
# NOT IDEMPOTENT — will charge the customer for every delivery attempt
class PaymentConsumer:
    def handle(self, event):
        if event["eventType"] == "OrderPlaced":
            self.payment_gateway.charge(
                customer_id=event["data"]["customerId"],
                amount=event["data"]["totalAmount"],
            )
            # If the consumer crashes AFTER charging but BEFORE committing
            # the offset, the event will be redelivered and the customer
            # will be charged again.
Why it happens: At-least-once delivery semantics mean that under normal operation, most events are delivered exactly once. The duplicates appear during edge cases: consumer rebalancing, broker failover, network hiccups, process crashes. These don't happen in local development. They happen at 3 AM on Saturday in production.
How to fix it:
# IDEMPOTENT — safe to process multiple times
class PaymentConsumer:
    def handle(self, event):
        if event["eventType"] == "OrderPlaced":
            event_id = event["eventId"]
            # Check if we've already processed this event
            if self.dedup_store.has_been_processed(event_id):
                logger.info(f"Skipping duplicate event {event_id}")
                return
            # Process the event
            self.payment_gateway.charge(
                customer_id=event["data"]["customerId"],
                amount=event["data"]["totalAmount"],
                idempotency_key=event_id,  # payment gateway also deduplicates
            )
            # Record that we've processed this event
            self.dedup_store.mark_processed(event_id)
Better yet, use natural idempotency keys. Instead of a generic eventId, use a domain-specific key that makes duplicates naturally harmless:
-- Idempotent via unique constraint
INSERT INTO payments (order_id, amount, status)
VALUES ('ord-123', 59.98, 'CHARGED')
ON CONFLICT (order_id) DO NOTHING;
-- Second insert is silently ignored. No duplicate charge.
Ignoring Back-Pressure — The Firehose Problem
The pattern: A producer publishes events at a rate far exceeding what consumers can process. There's no mechanism to slow the producer down, no monitoring of consumer lag, and no alerting until the broker's disk fills up.
What it looks like:
Producer: 50,000 events/sec
Consumer: 5,000 events/sec
Lag growth: 45,000 events/sec
Time to fill broker disk: ~4 hours
Time until alert fires: never (nobody set one up)
Time until on-call page: when the broker crashes
Why it happens: The producer and consumer are developed by different teams. The producer team load-tested their producer. The consumer team load-tested their consumer. Nobody tested them together at production-grade volumes.
How to fix it:
- Monitor consumer lag. This is the single most important metric for any event-driven system. Alert when lag exceeds a threshold.
- Set broker-side quotas. Limit per-producer and per-consumer throughput.
# Kafka producer quotas
quota.producer.default=10485760 # 10 MB/sec per producer
quota.consumer.default=10485760 # 10 MB/sec per consumer
- Right-size your consumer parallelism. If a single consumer can handle 5,000 events/sec and you're producing 50,000 events/sec, you need at least 10 consumer instances (and at least 10 partitions).
- Implement backpressure in the producer when possible. If the producer is ingesting from an external source, it may be able to slow down or buffer when the downstream system is overwhelmed.
- Set retention limits that match your consumer's ability to catch up. If your consumer can never process a month of events in a reasonable time, a month-long retention policy gives you false confidence.
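The lag calculation itself is simple arithmetic: end offset minus committed offset, summed over partitions. A sketch (the offset dicts would come from your broker client's admin and consumer-group APIs, which vary by library):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag across partitions: how far behind the consumer group is.

    Both arguments map partition -> offset. A partition with no committed
    offset is treated as lagging by its full end offset.
    """
    return sum(
        max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    )

def lag_alert(end_offsets, committed_offsets, threshold):
    """Return the lag if it exceeds the threshold, else None."""
    lag = consumer_lag(end_offsets, committed_offsets)
    return lag if lag > threshold else None
```

Wire `lag_alert` into whatever fires your pages; the important part is that *something* fires before the broker's disk does.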
Over-Partitioning and Under-Partitioning
Over-Partitioning
The pattern: "We might need to scale to 100 consumers someday, so let's create 100 partitions now." The system has 3 consumers and 100 partitions. 97 partitions sit idle. Broker metadata overhead increases. Rebalancing takes longer. Leader election after a broker failure is slower.
The costs of too many partitions:
- Each partition has a leader and replicas. More partitions = more metadata, more leader election overhead.
- Consumer rebalancing time increases linearly with partition count.
- End-to-end latency increases because the broker batches by partition, and more partitions mean smaller batches.
- File descriptor usage on the broker increases (each partition has at least one open segment file per replica).
Under-Partitioning
The pattern: The topic has 1 partition. The consumer cannot be parallelized. When load increases, the only option is to process faster — you cannot add more consumers.
The costs of too few partitions:
- Maximum consumer parallelism equals the partition count. One partition = one consumer.
- You can increase the partition count later, but you can't decrease it (in Kafka). And increasing it breaks key-based ordering guarantees for existing keys.
How to fix it: Start with a partition count based on your expected peak throughput and consumer parallelism, with modest headroom. A common heuristic: max(throughput_mbps / consumer_throughput_per_partition_mbps, expected_max_consumers). For most workloads, 6-30 partitions per topic is reasonable. 100+ is almost always premature. 1 is almost always insufficient.
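The heuristic is easy to encode, which makes it easy to revisit when the numbers change. A sketch with an assumed 1.5x headroom factor (the inputs are estimates you supply, not measured facts):

```python
import math

def suggested_partitions(peak_throughput_mbps,
                         consumer_throughput_per_partition_mbps,
                         expected_max_consumers,
                         headroom=1.5):
    """Partition count from the heuristic above, with modest headroom."""
    by_throughput = math.ceil(
        peak_throughput_mbps / consumer_throughput_per_partition_mbps
    )
    return math.ceil(max(by_throughput, expected_max_consumers) * headroom)

# e.g. 50 MB/s peak, 5 MB/s per partition, up to 12 consumers:
# suggested_partitions(50, 5, 12) -> 18
```

Eighteen partitions, not 100 — and comfortably more than 1.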
The "Just Replay Everything" Fallacy
The pattern: "If anything goes wrong with a consumer's state, we'll just replay all events from the beginning." This sounds reasonable when you have 10,000 events. It stops sounding reasonable when you have 10 billion.
The problems:
- Replay time. Replaying a year of events for a single consumer can take days. The consumer is unavailable during replay, or serving stale data.
- Side effects. If the consumer's event handler has side effects (sending emails, charging credit cards, calling external APIs), replaying events re-triggers those side effects. You now need to distinguish between "live" processing and "replay" processing, which adds complexity to every handler.
- Schema evolution. Events from a year ago might be in a format that the current consumer doesn't support. You need event upcasters or versioned handlers.
- Resource consumption. Replaying generates enormous read load on the broker. If other consumers are sharing the same broker, their performance degrades.
How to fix it:
- Take periodic snapshots. Instead of replaying from the beginning, replay from the last known good snapshot. This bounds the replay window.
- Build idempotent consumers. If replay is a recovery mechanism, the consumer must handle replayed events safely (see "Missing Idempotency" above).
- Design handlers to detect replay mode. Suppress side effects during replay (no emails, no API calls, no charges).
- Set realistic retention policies. If you can't replay more than a week's worth of events in a reasonable time, a 90-day retention policy is giving you 83 days of false comfort.
- Monitor replay progress. If you're replaying, know how long it will take. "It's replaying" is not a status. "It's replayed 4.2 billion of 7.8 billion events, estimated completion in 9 hours" is a status.
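One way to suppress side effects during replay is an explicit mode flag threaded into the handler. A minimal sketch (class and field names are illustrative):

```python
class OrderNotifier:
    def __init__(self, email_client, replaying=False):
        self.email_client = email_client
        self.replaying = replaying  # set True while rebuilding state
        self.last_status = None

    def handle(self, event):
        # State updates run in both modes (and must be idempotent)...
        self.last_status = event["currentStatus"]
        # ...but external side effects are suppressed during replay.
        if not self.replaying:
            self.email_client.send(event["customerId"], self.last_status)
```

The flag is crude but effective; more elaborate schemes compare the event timestamp against a replay cutoff, at the cost of every handler knowing about time.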
How to Recognize You're in Trouble and How to Dig Out
Warning Signs
You might already be living with some of these anti-patterns. Here's how to tell:
Symptoms of Event Soup:
- Your event catalog has more than 100 event types and nobody can explain what half of them are for.
- New team members take more than a week to understand the event flow.
- You have topics with no active consumers.
Symptoms of a Distributed Monolith:
- Deploying one service requires coordinating with three other teams.
- A failure in one service cascades to multiple downstream services within seconds.
- Your deployment pipeline has a specific service ordering.
Symptoms of Missing Idempotency:
- Customers report duplicate charges, duplicate notifications, or duplicate orders "sometimes."
- The bugs are never reproducible in development.
- Your on-call incidents cluster around broker maintenance windows.
Symptoms of Schema Anarchy:
- Consumer code is full of `try/except` blocks around deserialization.
- Field names change without warning.
- Nobody knows the authoritative format for any event type.
Symptoms of the Firehose Problem:
- Consumer lag grows during business hours and shrinks overnight.
- Broker disk usage grows monotonically.
- End-to-end latency increases throughout the day.
Digging Out
If you recognize these symptoms, here's the uncomfortable truth: fixing anti-patterns in a running system is harder than preventing them. But it's not impossible.
Step 1: Observe. Before changing anything, instrument the system. Add consumer lag monitoring. Build an event flow diagram. Catalog every event type and its producers/consumers. You can't fix what you can't see.
Step 2: Prioritize by blast radius. The dual-write problem that occasionally loses orders is more urgent than the chatty service that wastes bandwidth. Fix the things that cause data loss or incorrect behavior first.
Step 3: Introduce governance incrementally. You don't need to boil the ocean. Start with a schema registry and require schemas for new event types. Existing unschematized events can be migrated gradually.
Step 4: Fix idempotency. This is the single highest-value improvement for most event-driven systems. Make every consumer idempotent. Use the outbox pattern for producers. This doesn't fix the architecture, but it prevents the architecture's problems from reaching customers.
Step 5: Consolidate event types. Kill the event types nobody consumes. Merge the event types that differ by one field. Replace the god events with focused events. This is slow, thankless work, but it is also among the most impactful architectural improvements you can make.
Step 6: Establish ownership. Every topic has an owning team. Every event type has an owning producer. Every schema has a reviewer. Without ownership, entropy wins.
Summary
Event-driven architecture is not inherently better or worse than other architectural styles. It's a set of trade-offs. The anti-patterns in this chapter all share a common root cause: adopting the style without fully understanding the trade-offs, or understanding them in theory but not in the operational reality of a production system.
The good news: every anti-pattern here has been encountered, diagnosed, and survived by teams before you. The bad news: many of those teams encountered it in production, diagnosed it under pressure, and survived it by the narrowest of margins.
Read this chapter before you build. Reread it six months after you ship. The anti-patterns you recognize the second time will be different — and more personally relevant — than the ones you recognized the first time.
Evaluation Framework
Every message broker's marketing page says the same thing: fast, reliable, scalable, easy to operate. This is approximately as useful as a restaurant describing its food as "delicious." You need a framework — a structured set of criteria that lets you compare brokers on dimensions that actually matter for your workload, your team, and your budget. Otherwise you are choosing infrastructure based on blog post popularity and conference talk charisma, which is how organisations end up running Apache Kafka for a system that processes twelve messages per hour.
This chapter defines the evaluation framework we will use for every broker in Part 2. We are not going to rank brokers on a single axis, because single-axis rankings are how you end up with headlines like "Kafka vs RabbitMQ: Which Is Better?" — a question roughly as answerable as "Hammer vs Screwdriver: Which Is Better?" The answer, as always, is: it depends. But it depends on specific, measurable things, and that is what this chapter is about.
Why a Framework Matters
The problem with choosing a message broker is not a lack of options. It is a surplus of options combined with a deficit of honest, apples-to-apples comparison. Every broker occupies a slightly different point in the design space. Some optimise for throughput. Some optimise for routing flexibility. Some optimise for operational simplicity. Some optimise for the VC pitch deck and will figure out the rest later.
Without a framework, broker evaluation degenerates into one of the following failure modes:
- The "My Last Job" heuristic. You used Kafka at your previous company. It worked. You use Kafka again. This is fine until it is not — your new workload may have completely different characteristics.
- The benchmarketing trap. You read a vendor benchmark showing 2 million messages per second. You did not notice it was running on 96 cores with messages the size of a TCP ACK, no replication, no durability, and consumers that discard every message immediately. In production you will get a tenth of that. Maybe.
- The "what does Google use?" fallacy. Google uses a custom-built system you cannot buy. But even if you could, you are not Google. You do not have Google's traffic, Google's budget, or Google's army of SREs. Stop optimising for problems you do not have.
- The feature checkbox. The broker supports exactly-once delivery. Great. Except "supports" means "there is a configuration option that, when combined with idempotent consumers, transactional producers, and a very specific set of operational practices, approximates exactly-once semantics under conditions that rarely hold in practice." The checkbox does not capture this nuance.
A framework forces you to ask the right questions, weight them according to your actual needs, and make trade-offs explicitly rather than accidentally.
Dimension 1: Throughput
Throughput is the most frequently cited and most frequently misunderstood broker metric. It comes in two flavours:
Messages per second — how many discrete messages the broker can handle. This matters when your messages are small and your bottleneck is per-message overhead (serialisation, routing decisions, acknowledgment processing).
Bytes per second — how much raw data the broker can move. This matters when your messages are large (images, documents, fat JSON blobs) and your bottleneck is I/O bandwidth.
The distinction is critical. A broker that excels at 1KB messages may choke on 1MB messages, and vice versa. Always benchmark with messages that resemble your actual workload.
Burst vs Sustained
Most systems do not produce traffic at a constant rate. They have peaks — Black Friday, market open, batch job completion, the moment a popular notification goes out. You need to know two things:
- Sustained throughput: what the broker can handle indefinitely without degradation.
- Burst throughput: what it can absorb for short periods before backpressure kicks in, latency spikes, or things start falling over.
A broker with excellent sustained throughput but no burst capacity will punish you during traffic spikes. A broker with excellent burst capacity but mediocre sustained throughput is living on borrowed time during prolonged load.
What to Watch For
- Replication cost. Most benchmarks quote throughput with replication factor 1 (i.e., no replication). In production, you will run replication factor 3. This typically cuts throughput significantly — how much depends on the broker's replication protocol.
- Acknowledgment mode. "Fire and forget" throughput is always higher than "wait for durable acknowledgment" throughput. Make sure the benchmark matches your durability requirements.
- Consumer throughput vs producer throughput. They are not always symmetric. Some brokers are write-optimised; some are read-optimised.
- Partition/queue count. Throughput often scales with the number of partitions or queues, but at some point the overhead of managing many partitions exceeds the parallelism benefit.
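The replication-cost point deserves back-of-envelope arithmetic: every producer byte is written `replication_factor` times across the cluster, so usable ingest is roughly aggregate disk write bandwidth divided by the replication factor. The figures below are illustrative assumptions:

```python
# Back-of-envelope: how replication factor eats cluster write bandwidth.
# All figures are illustrative assumptions, not measurements.

def usable_ingest_mbps(brokers: int, disk_write_mbps_per_broker: float,
                       replication_factor: int) -> float:
    aggregate = brokers * disk_write_mbps_per_broker
    return aggregate / replication_factor

# A 6-broker cluster with 200 MB/s of disk write bandwidth per broker:
print(usable_ingest_mbps(6, 200, 1))   # 1200.0 — the benchmark number
print(usable_ingest_mbps(6, 200, 3))   # 400.0  — the production number
```

This ignores network replication traffic, compaction, and consumer reads, all of which compete for the same disks, so the real production number is lower still.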
Dimension 2: Latency
If throughput is about how much, latency is about how fast. And the number that matters is almost never the average.
Percentile Latency
- p50 (median): Half your messages are faster than this, half are slower. This is what most people think of as "latency," and it is the least interesting number.
- p95: 1 in 20 messages is slower than this. This is where user-visible pain begins.
- p99: 1 in 100 messages is slower than this. This is where SLAs live and die.
- p99.9 and beyond (tail latency): The worst-case scenario, excluding extreme outliers. Tail latency is caused by garbage collection pauses, disk flushes, rebalancing, leader elections, and other events that are rare individually but nearly guaranteed to happen at scale. If you process a million messages per day, your p99.9 latency is what a thousand of those messages experience every single day.
Why Tail Latency Matters
In a microservices architecture, a single user request often fans out to multiple services. If each service adds a little tail latency, the overall request latency is dominated by the slowest component. This is the "tail at scale" problem. A broker with excellent p50 latency but terrible p99.9 will poison your entire request path.
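The fan-out effect can be quantified. If each of n services independently has probability p of returning a slow (worse-than-p99) response, the chance that a request touching all n hits at least one slow component is 1 - (1 - p)^n. A sketch, with p = 0.01 as the defining assumption:

```python
# Sketch of the "tail at scale" effect: probability that a fan-out request
# hits at least one slow component, assuming independent 1%-slow services.

def chance_of_slow_request(n_services: int, p_slow: float = 0.01) -> float:
    return 1 - (1 - p_slow) ** n_services

for n in (1, 10, 50):
    print(n, round(chance_of_slow_request(n), 3))
# 1  -> 0.01  : 1% of requests see a p99 event
# 10 -> 0.096 : nearly 10%
# 50 -> 0.395 : your components' p99 is now roughly your request-level p60
```

Real services are not independent (shared brokers, shared databases, correlated GC), which usually makes the picture worse, not better.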
What to Watch For
- GC pauses. JVM-based brokers (Kafka, Pulsar) are susceptible to garbage collection pauses that show up as latency spikes. Tuning reduces them; nothing eliminates them.
- Batching trade-offs. Many brokers batch messages for throughput efficiency. Batching improves throughput at the cost of latency — your message waits in a buffer until the batch fills or a timeout expires.
- Network round trips. How many round trips does it take to publish a message and receive an acknowledgment? The answer varies dramatically between brokers and protocols.
- Coordinated omission. A common benchmarking error where the measurement tool slows down during broker slowdowns, making latency look better than it actually is. If a vendor's latency benchmark does not mention coordinated omission, be sceptical.
Dimension 3: Durability
Durability is the broker's answer to the question: "If something goes wrong, will my messages survive?"
Failure Scenarios
"Something goes wrong" comes in degrees:
- Process crash. The broker process dies and restarts. Messages in memory may be lost unless they were flushed to disk.
- Disk failure. The physical storage device fails. Messages on that disk are gone unless replicated elsewhere.
- Node failure. An entire machine goes down — power loss, hardware failure, kernel panic. Same as disk failure, but also affects any in-flight state.
- Network partition. Nodes are alive but cannot communicate. The broker must decide whether to remain available (accepting writes that may diverge) or consistent (refusing writes until the partition heals). This is the CAP theorem in action, and every broker makes a different choice — or, more accurately, gives you a configuration knob to make the choice yourself.
- Datacenter failure. An entire availability zone or region goes dark. Your messages survive only if they were replicated to another datacenter.
What to Watch For
- Default configuration. Many brokers ship with durability settings optimised for performance, not safety. Kafka's long-standing default of `acks=1` means the producer gets an acknowledgment after one broker writes to its page cache — not to disk. If that broker crashes before flushing, the message is gone. You want `acks=all` in most production scenarios, but you have to know to set it.
- Replication protocol. Synchronous replication is safer but slower. Asynchronous replication is faster but allows data loss during failover. Understand which one your broker uses and under what conditions.
- fsync behaviour. Does the broker actually call `fsync`, or does it rely on the OS page cache? The answer has enormous implications for durability after power loss.
- Fencing and split-brain. When a leader fails and a follower is promoted, can the old leader come back and accept writes that conflict with the new leader? Proper fencing prevents this; not all brokers implement it correctly.
Dimension 4: Ordering Guarantees
"Messages arrive in order" is a statement that sounds simple and is anything but. Order is always relative to something:
No ordering guarantee. Messages may arrive in any order. This is the simplest model and the easiest to scale, but your consumers must be idempotent and order-independent.
Partition-level ordering. Messages within a single partition (or queue) are ordered. Messages across partitions are not. This is the Kafka model. If you need ordering for a specific entity (e.g., all events for order #1234), you route all events for that entity to the same partition using a partition key.
Topic-level ordering. All messages within a topic are totally ordered. This is simpler to reason about but limits throughput to what a single writer can produce, since total ordering requires a single serialisation point.
Global ordering. All messages across all topics are totally ordered. This is extremely expensive and almost never offered by distributed brokers. If you think you need global ordering, you probably need to reconsider your design.
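The partition-key routing that underpins partition-level ordering can be sketched in a few lines. Kafka's default partitioner actually uses a murmur2 hash; this sketch substitutes a CRC32 stand-in to show the idea, which is simply that the same key deterministically lands on the same partition:

```python
# Sketch: key-based partition routing (stand-in hash, not Kafka's murmur2).
# Same key -> same partition -> per-key ordering is preserved.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

events = ["order-1234"] * 3 + ["order-5678"] * 2
assignments = [partition_for(k, 6) for k in events]
# All events for order-1234 share one partition; order-5678 shares another.
print(assignments)
```

Note the corollary from the anti-patterns discussion: `partition_for(key, 6)` and `partition_for(key, 12)` generally disagree, which is why changing the partition count after the fact breaks key-based ordering.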
What to Watch For
- Ordering and parallelism are in tension. Stronger ordering guarantees mean fewer opportunities for parallel processing. A single partition gives you perfect ordering and zero parallelism. Pick your poison.
- Redelivery breaks ordering. If a message fails processing and is redelivered, it will arrive after messages that were originally behind it. Your "ordered" stream is now out of order. This is a fundamental tension in any system with retries.
- Consumer concurrency. Even if the broker delivers messages in order, if your consumer processes them concurrently (multiple threads pulling from the same partition), you have destroyed ordering at the application level.
Dimension 5: Delivery Semantics
The holy trinity of messaging delivery guarantees:
At-most-once. The message is delivered zero or one times. Simple: fire and forget. If anything goes wrong, the message is lost. Appropriate for metrics, telemetry, and anything where losing a fraction of messages is acceptable.
At-least-once. The message is delivered one or more times. The broker retries until the consumer acknowledges receipt. This means duplicates are possible — your consumer must be idempotent, meaning processing the same message twice produces the same result as processing it once. This is the most common production setting and the one you should default to unless you have a specific reason not to.
Exactly-once. The message is delivered exactly one time. This is the white whale of distributed messaging. True exactly-once delivery in a distributed system is, by the laws of physics and distributed computing theory, impossible in the general case. What brokers actually offer is exactly-once semantics — the system behaves as if each message was delivered exactly once, through a combination of idempotent producers, transactional writes, and consumer offset management. It works. It is also slower, more complex, and more fragile than at-least-once with idempotent consumers. Use it when you genuinely need it (financial transactions, inventory updates). Do not use it because it sounds nice.
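The idempotency requirement of at-least-once delivery is easiest to see in code. A sketch with hypothetical event and account names: a raw increment applied twice corrupts state, while the same increment guarded by an event-ID dedup check does not:

```python
# Sketch: why at-least-once demands idempotency. An unguarded increment
# applied twice corrupts state; a deduplicated one does not.

balance = {"acct-1": 100}

def non_idempotent_apply(event: dict) -> None:
    balance[event["acct"]] += event["delta"]      # replay double-charges

def idempotent_apply(event: dict, applied: set) -> None:
    if event["id"] in applied:                    # dedup on event ID
        return
    balance[event["acct"]] += event["delta"]
    applied.add(event["id"])

applied = set()
evt = {"id": "evt-9", "acct": "acct-1", "delta": -25}
idempotent_apply(evt, applied)
idempotent_apply(evt, applied)    # broker redelivery: no effect
print(balance["acct-1"])          # 75, not 50
```

In production the `applied` set lives in a durable store and is written in the same transaction as the state change; keeping them separate just reintroduces the dual-write problem.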
What to Watch For
- The scope of "exactly-once." Kafka's exactly-once semantics apply within the Kafka ecosystem — from Kafka topic to Kafka topic via Kafka Streams. The moment you write to an external database, you are back to at-least-once unless you implement your own deduplication. No broker can guarantee exactly-once delivery to an arbitrary external system.
- The cost of exactly-once. Transactional producers and consumers add overhead — additional round trips, coordinator involvement, and reduced throughput. Benchmark with transactions enabled, not disabled.
- Idempotency is your actual safety net. Regardless of what the broker promises, design your consumers to be idempotent. It costs you almost nothing in most cases and protects you from an entire category of bugs.
Dimension 6: Operational Complexity
This is the dimension that separates conference talks from production incidents. Every broker is easy to run in a Docker Compose file on your laptop. The question is what happens when you run it in production with real data, real traffic, and real failure modes.
Deployment
- How many components do you need to deploy? Kafka historically required ZooKeeper — a separate distributed system with its own failure modes. Pulsar requires ZooKeeper and BookKeeper. RabbitMQ is a single binary with an Erlang runtime.
- Can you run it on Kubernetes? Is there a mature operator? Or does the broker have stateful requirements that fight Kubernetes's abstractions?
- What is the minimum viable cluster for production? Three nodes? Five? One node with a prayer?
Monitoring
- What are the key metrics? Every broker has its own set of critical indicators — consumer lag, under-replicated partitions, queue depth, memory alarms.
- Is there a built-in dashboard, or do you need to configure Prometheus/Grafana/Datadog from scratch?
- How easy is it to correlate broker metrics with application-level symptoms?
Upgrades
- Can you do rolling upgrades with zero downtime? This seems like a basic requirement, but the reality varies. Some brokers handle it gracefully. Others require careful partition leadership migration, client compatibility checks, and possibly ritual sacrifice.
- How frequently are new versions released? Is the upgrade path well-documented?
- Are there protocol version negotiations between clients and brokers?
Staffing
This is the one nobody talks about in the evaluation spreadsheet. If your broker requires specialised expertise to operate, you need to hire or train people with that expertise. Kafka operations is a genuine specialisation. Erlang debugging is a genuine specialisation. BookKeeper tuning is a genuine specialisation. If you are a team of five and nobody has touched your broker's underlying technology, factor in the learning curve — or the consulting bill.
Dimension 7: Ecosystem
A message broker does not exist in isolation. It lives in an ecosystem of:
- Client libraries. Are there official clients for your language? Are they well-maintained? Are there community clients, and if so, are they production-quality or weekend projects?
- Connectors. Can you stream data to and from your databases, data lakes, search indices, and cloud services without writing custom code? Kafka Connect has hundreds of connectors. Other brokers have fewer.
- Stream processing. Can you do lightweight transformations on the broker itself, or do you need a separate processing framework? Kafka has Kafka Streams and ksqlDB. Pulsar has Pulsar Functions. RabbitMQ has... a plugin for that, probably.
- Schema management. Is there a schema registry for enforcing contracts between producers and consumers?
- Tooling. Command-line tools, admin UIs, debugging utilities, performance testing tools.
- Integration with observability stacks. OpenTelemetry support, distributed tracing propagation, structured logging.
A broker with a rich ecosystem lets you build faster and integrate more easily. A broker with a thin ecosystem means you are writing more glue code and building more tooling yourself.
Dimension 8: Cost
Cost has three layers, and most evaluations only look at the first one.
Layer 1: Licensing and Infrastructure
- Licensing. Is the broker open source? Truly open source (Apache 2.0, MIT) or source-available with restrictions (BSL, SSPL)? Does the vendor offer a commercial edition with features you actually need?
- Infrastructure. How much compute, memory, and storage does the broker require? A broker that requires SSDs for acceptable performance costs more than one that is happy on spinning disks. A broker that requires 32GB of heap per node costs more than one that runs in 2GB.
- Managed service pricing. If you use a managed offering, what is the pricing model? Per message? Per byte? Per partition? Per hour? Managed services shift cost from ops headcount to cloud bills, but the total cost may be higher or lower depending on your usage patterns.
Layer 2: Operational Cost
- Staffing. How many people does it take to keep this thing running? A complex broker that requires a dedicated team is more expensive than a simple one that your existing platform team can manage alongside other services.
- Incident cost. When things go wrong — and they will — how expensive is the outage? A broker that is hard to debug, hard to recover, or hard to failover extends your MTTR and increases the cost of every incident.
Layer 3: Opportunity Cost
- Time to market. A broker with a steep learning curve delays your project. A broker with a rich ecosystem accelerates it.
- Lock-in. How much effort does it take to switch brokers later? If you have built your entire architecture around Kafka Streams and Schema Registry, migrating to Pulsar is not a weekend project. It is a quarter. Maybe two.
Dimension 9: Community and Longevity
Will this broker exist in five years? This is not a trivial question.
- Apache Foundation projects (Kafka, Pulsar, ActiveMQ) have the backing of a foundation and a contributor community that outlives any single company. But foundation governance can also mean slow progress and design-by-committee.
- Corporate-backed projects (RabbitMQ under Broadcom, Redis under Redis Ltd) depend on the continued investment of their corporate steward. Corporate priorities shift. Acquisitions happen. Licence changes happen.
- VC-funded startups (various newer brokers) exist as long as the funding lasts and the business model works. Some will thrive. Some will pivot. Some will acqui-hire their engineering team and shut down the product.
Look at the contributor graph, the release cadence, the mailing list or forum activity, and the job market. If nobody is hiring for your broker, that is a signal — either it is so simple that nobody needs specialists (good) or nobody is using it (bad).
Dimension 10: Cloud-Native Readiness
Like it or not, most new deployments target cloud environments, and increasingly Kubernetes.
- Managed offerings. Does a major cloud provider or the vendor offer a fully managed service? Managed services trade control for convenience, and for many teams, the trade is worth it.
- Kubernetes operators. Is there a mature, actively maintained operator? Operators handle the stateful lifecycle management that makes running distributed systems on Kubernetes bearable.
- Tiered storage / cloud-native storage. Can the broker offload cold data to object storage (S3, GCS)? This dramatically reduces storage costs for high-retention workloads.
- Elasticity. Can you scale the broker up and down in response to load? Or is it sized for peak and wasting resources the rest of the time?
Dimension 11: Multi-Tenancy
If multiple teams or applications share a broker cluster — and they will, because running dedicated clusters for every team is expensive — you need multi-tenancy support.
- Namespace isolation. Can you create logical namespaces with separate access controls, quotas, and policies? Pulsar has first-class multi-tenancy with tenants and namespaces. Kafka has ACLs and quotas but no formal namespace concept. RabbitMQ has virtual hosts.
- Resource quotas. Can you limit the throughput, storage, or connection count for a specific tenant? Without quotas, one noisy team can starve everyone else.
- Topic/queue policies. Can you set retention, replication, and other policies per-tenant rather than cluster-wide?
- Observability per tenant. Can you see metrics broken down by tenant, or is everything aggregated?
Weak multi-tenancy means you end up running multiple clusters anyway, at which point you are paying the operational cost of multi-tenancy without the resource efficiency benefit.
The Framework at a Glance
Here is the complete evaluation framework in table form. In subsequent chapters, we will score each broker against these dimensions.
| Dimension | Key Questions | Why It Matters |
|---|---|---|
| Throughput | Messages/sec? Bytes/sec? Burst vs sustained? | Can it handle your volume? |
| Latency | p50, p95, p99, p99.9? | Can it handle your speed requirements? |
| Durability | Replication? fsync? Datacenter failure? | Will you lose messages? |
| Ordering | None, partition, topic, global? | Can your consumers process correctly? |
| Delivery Semantics | At-most-once, at-least-once, exactly-once? | What happens when things fail? |
| Operational Complexity | Components, monitoring, upgrades, staffing? | What does it cost to keep running? |
| Ecosystem | Clients, connectors, tooling, stream processing? | What can you build without custom code? |
| Cost | Licensing, infrastructure, ops, opportunity? | What is the total cost of ownership? |
| Community & Longevity | Contributors, releases, governance, job market? | Will it outlive your project? |
| Cloud-Native Readiness | Managed services, K8s operators, tiered storage? | Does it fit your deployment model? |
| Multi-Tenancy | Namespaces, quotas, per-tenant policies? | Can teams share safely? |
How to Use This Framework
The framework is not a scorecard where the highest total wins. Different workloads weight these dimensions differently:
- High-throughput event streaming (clickstream, IoT telemetry): weight throughput and cost heavily; latency and ordering may be less critical.
- Financial transaction processing: weight durability, ordering, and delivery semantics heavily; cost is secondary to correctness.
- Microservice command/event bus: weight operational complexity and ecosystem heavily; raw throughput is rarely the bottleneck.
- Multi-team platform: weight multi-tenancy and ecosystem heavily; you need a broker that scales organisationally, not just technically.
The chapters that follow will evaluate each broker honestly against this framework. Some brokers will shine in certain dimensions and stumble in others. That is not a failure of the broker — it is a reflection of the trade-offs inherent in distributed systems design. The goal is not to find the "best" broker. It is to find the best broker for you.
Let us begin.
Apache Kafka
If event-driven architecture has a mascot, it is Apache Kafka. Not because it was first — it was not — but because it achieved something rare in infrastructure software: it became the default. When someone says "we need a message broker," the next sentence is usually "so, Kafka?" regardless of whether Kafka is appropriate for the workload in question. This is a testament to both its genuine capabilities and its formidable marketing apparatus.
Kafka deserves its reputation as a powerful, battle-tested platform for event streaming at scale. It also deserves an honest assessment of its sharp edges, operational demands, and the significant gap between a "Hello World" producer and a production-grade deployment. This chapter provides both.
Overview
What It Is
Apache Kafka is a distributed event streaming platform. At its core, it is a distributed, partitioned, replicated commit log with pub/sub semantics bolted on top. It was designed to be the central nervous system for data — a unified platform that handles real-time event streams and historical data replay with equal competence.
Brief History
Kafka was born at LinkedIn in 2010, created by Jay Kreps, Neha Narkhede, and Jun Rao. LinkedIn needed to move massive amounts of data — user activity events, metrics, logs — between systems in real time, and nothing on the market did what they needed at the scale they needed it.
The key insight was deceptively simple: model the message broker as an append-only log. Instead of the traditional message queue model (message arrives, consumer processes it, message disappears), Kafka retains messages for a configurable period. Consumers track their own position (offset) in the log. This means multiple consumers can read the same data independently, consumers can rewind and replay, and the broker does not need to track per-consumer state.
Kreps named it after Franz Kafka, the author, because "it is a system optimised for writing." Whether the existential dread of operating it in production was intentional homage is left as an exercise for the reader.
Kafka was open-sourced under the Apache License in 2011, became an Apache Top-Level Project in 2012, and Kreps, Narkhede, and Rao founded Confluent in 2014 to build a commercial platform around it. Confluent has been enormously successful, going public in 2021, and has become the primary driver of Kafka's development — for better and for worse, as the line between "open source Kafka" and "Confluent Platform" can be blurry.
Who Runs It
The Apache Software Foundation governs the open-source project. Confluent employs most of the core committers. This creates the usual tension of a corporate-backed open-source project: Confluent has a financial incentive to add premium features to their commercial offering rather than to the open-source core. To date, this has not crippled the community edition, but it is worth being aware of.
Architecture
The Commit Log
Everything in Kafka flows from a single abstraction: the append-only commit log. A Kafka topic is a named log. Producers append records to the end. Consumers read from a position (offset) and move forward. Records are immutable once written. The log is retained for a configurable period (time-based or size-based) and then old segments are deleted or compacted.
This is powerful because:
- Decoupling in time. Consumers do not need to be running when messages are produced. They catch up later.
- Multiple consumers. Different consumer groups can read the same topic independently at different speeds.
- Replay. Reset a consumer's offset to the beginning and reprocess everything. This is invaluable for bug fixes, new consumer deployments, and data reprocessing.
- Ordering. Within a single partition, records are strictly ordered by offset.
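All four properties fall out of one tiny data structure. This is a toy sketch of the commit-log abstraction, not Kafka's implementation: append-only storage, per-group offsets tracked by the consumer side, and replay as a simple offset reset:

```python
# Sketch: the commit-log abstraction in miniature.

class MiniLog:
    def __init__(self):
        self.records = []      # append-only; records are immutable once written
        self.offsets = {}      # consumer group -> next offset to read

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1          # the record's offset

    def poll(self, group: str):
        offset = self.offsets.get(group, 0)
        batch = self.records[offset:]
        self.offsets[group] = len(self.records)
        return batch

    def seek(self, group: str, offset: int) -> None:
        self.offsets[group] = offset          # replay from anywhere

log = MiniLog()
for e in ["OrderPlaced", "PaymentProcessed", "OrderShipped"]:
    log.append(e)

print(log.poll("billing"))     # all three records on first read
print(log.poll("billing"))     # [] — caught up
log.seek("billing", 0)
print(log.poll("billing"))     # replay: all three again
print(log.poll("analytics"))   # a different group reads independently
```

Everything Kafka adds — partitions, replication, retention, persistence — is engineering on top of exactly this shape.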
Brokers and Clusters
A Kafka cluster consists of multiple broker nodes. Each broker is a JVM process that handles read/write requests, manages partitions, and replicates data. Brokers are stateful — they store data on local disks.
A production cluster typically runs at least three brokers for replication. Large deployments run dozens or hundreds.
Partitions
Each topic is divided into one or more partitions. A partition is a single, ordered, append-only log. Partitions are the unit of parallelism in Kafka:
- Producers can write to different partitions concurrently.
- Each partition in a consumer group is consumed by exactly one consumer instance, so more partitions means more consumer parallelism.
- Partitions are distributed across brokers for load balancing.
The partition count is set at topic creation and is very difficult to change later. Increasing partitions is possible but reshuffles key-based routing. Decreasing partitions is not supported without recreating the topic. Choose wisely, or more realistically, over-provision slightly and hope for the best.
Replication and ISR
Each partition has one leader replica and zero or more follower replicas on different brokers. All reads and writes go through the leader. Followers replicate by fetching from the leader.
The In-Sync Replica (ISR) set contains the leader and all followers that are "caught up" within a configurable lag threshold. When a producer sends a message with acks=all, the broker waits until all ISR members have acknowledged the write before confirming to the producer. If a follower falls behind, it is removed from the ISR. If the leader fails, a new leader is elected from the ISR.
This design gives you a tunable trade-off between durability and latency. acks=all with min.insync.replicas=2 on a replication factor 3 topic means you can lose one broker without data loss and without downtime. Losing two brokers loses the partition (it becomes unavailable, not corrupted — assuming unclean.leader.election.enable=false, which it should be).
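The arithmetic behind that claim can be checked directly. A minimal sketch (the helper name is invented, not Kafka code):

```python
def writes_available(replicas_alive: int, min_insync_replicas: int) -> bool:
    # With acks=all, a write succeeds only while the ISR still contains
    # at least min.insync.replicas members.
    return replicas_alive >= min_insync_replicas


REPLICATION_FACTOR = 3
MIN_ISR = 2

# Lose one broker: 2 replicas alive, writes still accepted, data intact.
assert writes_available(REPLICATION_FACTOR - 1, MIN_ISR)

# Lose two brokers: the partition becomes unavailable (not corrupted).
assert not writes_available(REPLICATION_FACTOR - 2, MIN_ISR)
```

The general rule: with replication factor N and min.insync.replicas M, you tolerate N - M broker failures before writes stop being accepted.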
ZooKeeper and KRaft
Historically, Kafka depended on Apache ZooKeeper for cluster metadata management: broker registration, controller election, topic configuration, and partition assignments. ZooKeeper worked, but it was a separate distributed system with its own operational requirements, failure modes, and scaling limits. Running ZooKeeper well is its own skillset, and many Kafka operational issues were actually ZooKeeper issues.
KRaft (Kafka Raft) is the long-awaited replacement, moving metadata management into the Kafka brokers themselves using a Raft-based consensus protocol. KRaft was marked production-ready in Kafka 3.3 (late 2022), and ZooKeeper support was formally deprecated. Kafka 4.0 (early 2025) removed ZooKeeper support entirely, so new clusters run KRaft, full stop. Migration from ZooKeeper to KRaft is supported on the 3.x line but non-trivial — it involves running both systems in parallel during the transition.
KRaft eliminates the ZooKeeper dependency, simplifies deployment, and removes the metadata scaling bottleneck that limited Kafka clusters to hundreds of thousands of partitions. It also removes one of the most common "I set up Kafka in 10 minutes" blog post lies, since those 10 minutes never included ZooKeeper tuning.
Producer Semantics
The Basics
A Kafka producer sends records to topics. Each record has a key (optional), a value, a timestamp, and optional headers. The key determines partition assignment: records with the same key go to the same partition (assuming the partition count does not change), giving you per-key ordering.
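The effect of key-based assignment is easy to illustrate with a simplified partitioner. Kafka's default partitioner actually hashes the key bytes with murmur2; the CRC32 stand-in below is only there to keep the sketch dependency-free:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner, which uses
    # murmur2 rather than CRC32. The structure is the same: hash mod N.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same key always lands on the same partition, which is what gives
# you per-key ordering...
assert partition_for("order-7829", 12) == partition_for("order-7829", 12)

# ...and it is also exactly why changing the partition count reshuffles
# keys: once the modulus changes, many keys map to different partitions.
mapping_12 = {k: partition_for(k, 12) for k in ("order-1", "order-2", "order-3")}
mapping_16 = {k: partition_for(k, 16) for k in ("order-1", "order-2", "order-3")}
```

This is the mechanical reason the partition count is so hard to change after the fact.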
Acknowledgment Modes
The acks configuration controls durability:
- acks=0: Fire and forget. The producer does not wait for any acknowledgment. Maximum throughput, maximum data loss potential.
- acks=1: The leader writes to its local log and acknowledges. If the leader crashes before followers replicate, the message is lost. This was the default before Kafka 3.0, which was a somewhat aggressive choice.
- acks=all (or acks=-1): The leader waits for all ISR members to replicate before acknowledging. Combined with min.insync.replicas=2, this is the safe production setting, and it has been the default since Kafka 3.0.
Idempotent Producer
Enabling enable.idempotence=true (the default since Kafka 3.0) assigns each producer a unique ID and sequence number per partition. The broker deduplicates messages with the same producer ID and sequence number. This closes the classic retry gap: the producer sends a message, the broker writes it, the acknowledgment is lost in transit, the producer retries — and the broker recognises and discards the duplicate.
Idempotent production is free in terms of configuration and nearly free in terms of performance. There is no good reason to disable it.
Transactional Producer
For atomically writing to multiple partitions and topics — "either all of these messages are committed or none of them are" — Kafka provides transactional producers. This is the foundation of Kafka's exactly-once semantics (EOS).
The transactional producer coordinates with a transaction coordinator (a broker) to begin transactions, send messages, and commit or abort atomically. Combined with the idempotent producer and transactional consumers (using read_committed isolation), this provides exactly-once semantics within the Kafka ecosystem.
The catch: exactly-once applies to Kafka-to-Kafka pipelines. The moment you write to an external system (a database, an API), you are outside the transaction boundary. You need your own deduplication or two-phase commit mechanism for end-to-end exactly-once.
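The usual way to close that gap is consumer-side deduplication keyed on a stable event ID: the consumer records which IDs it has already applied to the external system and skips repeats. A minimal in-memory sketch — in production the seen-IDs would be persisted atomically with the side effect (e.g., in the same database transaction), and all names here are invented:

```python
class DeduplicatingHandler:
    """Applies each event to an external system at most once per event ID."""

    def __init__(self, side_effect):
        self.side_effect = side_effect  # e.g. a database write or API call
        self.applied_ids = set()        # in production: a durable store

    def handle(self, event: dict) -> bool:
        event_id = event["eventId"]     # must be stable across redeliveries
        if event_id in self.applied_ids:
            return False                # duplicate delivery: skip
        self.side_effect(event)
        self.applied_ids.add(event_id)
        return True


writes = []
handler = DeduplicatingHandler(side_effect=writes.append)

event = {"eventId": "evt-001", "type": "OrderPlaced", "orderId": "ord-7829"}
assert handler.handle(event) is True    # first delivery: applied
assert handler.handle(event) is False   # redelivery: suppressed
assert len(writes) == 1                 # the external system saw it once
```

With at-least-once delivery from Kafka plus idempotent application on the consumer side, the end-to-end result is effectively exactly-once — which is all most systems actually need.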
Consumer Groups and Rebalancing
Consumer Groups
Consumers are organised into consumer groups. Each partition in a topic is assigned to exactly one consumer in a group. If you have 6 partitions and 3 consumers in a group, each consumer gets 2 partitions. If you have 6 partitions and 8 consumers, 2 consumers sit idle. If you have 6 partitions and 1 consumer, that consumer handles all 6.
This is simple, elegant, and the source of a great deal of operational misery.
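The arithmetic is easy to sketch. Below is a deliberately simplified round-robin distribution — Kafka's real assignors (Range, RoundRobin, Sticky) differ in detail — reproducing the three scenarios above:

```python
def assign(partitions: int, consumers: list) -> dict:
    # Simplified round-robin distribution of partitions over consumers.
    # Illustrative only; not Kafka's actual assignor implementations.
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# 6 partitions, 3 consumers: 2 partitions each.
a = assign(6, ["c1", "c2", "c3"])
assert all(len(parts) == 2 for parts in a.values())

# 6 partitions, 8 consumers: 2 consumers sit idle.
b = assign(6, [f"c{i}" for i in range(8)])
assert sum(1 for parts in b.values() if not parts) == 2

# 6 partitions, 1 consumer: it handles all 6.
c = assign(6, ["c1"])
assert c["c1"] == [0, 1, 2, 3, 4, 5]
```

The idle-consumer case is worth internalising: adding consumers beyond the partition count buys you nothing except warm standbys.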
Rebalancing
When a consumer joins or leaves a group — by starting up, crashing, or failing to send a heartbeat within session.timeout.ms — the group rebalances. During a rebalance, all consumers in the group stop processing, partitions are redistributed, and consumers resume from their last committed offset.
The problem: rebalancing is stop-the-world. For the duration of the rebalance, no messages are processed. In a large consumer group with many partitions, rebalancing can take seconds to minutes. If your consumers are slow to start, it takes even longer. If a consumer is flapping (repeatedly joining and leaving), you get rebalance storms — a cascade of rebalances that can effectively halt processing.
Mitigation Strategies
- Static group membership (group.instance.id): Assigns a persistent identity to each consumer so that temporary disconnections do not trigger rebalances.
- Cooperative rebalancing (CooperativeStickyAssignor): Instead of stop-the-world, only the affected partitions are revoked and reassigned. This dramatically reduces rebalance impact.
- Incremental rebalancing (Kafka 3.x+): Further improvements to minimize disruption.
These mitigations work well but require configuration. The default rebalancing behaviour is the stop-the-world "eager" protocol, because Kafka respects backward compatibility more than your uptime.
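With the confluent-kafka Python client (librdkafka), the two main mitigations are a two-line configuration change. A hedged sketch — the instance ID value is invented, and you should check the configuration reference for your client version:

```python
# Consumer configuration enabling static membership and cooperative
# rebalancing with the confluent-kafka (librdkafka) client.
conf = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'order-processing-group',

    # Static membership: a stable, unique ID per consumer instance
    # (e.g. derived from the pod or host name), so a restart within
    # session.timeout.ms does not trigger a rebalance.
    'group.instance.id': 'order-consumer-1',

    # Cooperative rebalancing: only affected partitions are revoked,
    # instead of the stop-the-world eager protocol.
    'partition.assignment.strategy': 'cooperative-sticky',
}
```

For the Java client, the equivalents are ConsumerConfig.GROUP_INSTANCE_ID_CONFIG and the CooperativeStickyAssignor.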
Partition Assignment Strategies
- RangeAssignor: Assigns contiguous partition ranges to consumers. Can be uneven.
- RoundRobinAssignor: Distributes partitions evenly across consumers.
- StickyAssignor: Tries to minimize partition movement during rebalances.
- CooperativeStickyAssignor: Sticky + cooperative rebalancing. This is what you want.
The Kafka Ecosystem
Kafka Streams
A Java library for building stream processing applications. Not a separate cluster — it runs inside your application. Kafka Streams provides stateful operations (aggregations, joins, windowing) backed by local state stores (RocksDB) with changelog topics for fault tolerance.
Kafka Streams is genuinely excellent for Kafka-centric stream processing. It is also Java-only, which limits its audience. If your team writes Python or Go, Kafka Streams is not an option unless you want to maintain a separate Java service.
ksqlDB
SQL-like syntax on top of Kafka Streams. Write SELECT * FROM orders WHERE amount > 1000 EMIT CHANGES and get a streaming query. Powerful for prototyping and simple transformations. Less suitable for complex business logic. It is a Confluent product, and its licensing has shifted over time — check the current terms before building on it.
Kafka Connect
A framework for streaming data between Kafka and external systems using pre-built connectors. Source connectors pull data into Kafka (e.g., from a database via CDC). Sink connectors push data from Kafka to external systems (e.g., to Elasticsearch, S3, a data warehouse).
The connector ecosystem is Kafka's most underrated asset. There are hundreds of connectors — some from Confluent, some from the community, some from vendors. The quality varies, but the top-tier connectors (Debezium for CDC, S3 sink, JDBC source/sink) are production-grade and save enormous amounts of custom integration code.
Connect runs as a separate cluster of worker nodes, which means it is another thing to deploy, monitor, and scale. But the alternative — writing and maintaining custom producer/consumer code for every data integration — is worse.
Schema Registry
Confluent Schema Registry stores Avro, Protobuf, and JSON Schema definitions and enforces compatibility rules (backward, forward, full) when schemas evolve. Producers and consumers negotiate schemas via the registry, and serialisation/deserialisation happens automatically.
Schema Registry is not part of Apache Kafka — it is a Confluent project. There is an open-source version under the Confluent Community License (not Apache 2.0) and alternatives like Apicurio Registry and AWS Glue Schema Registry. Schema management is essential for any serious Kafka deployment; which registry you use is a practical choice.
Strengths
Throughput
Kafka was built for throughput. A well-tuned cluster on modern hardware can sustain millions of messages per second. The sequential I/O design, zero-copy transfer (using sendfile to stream data directly from page cache to network socket without copying through the JVM), and batching at every layer make it extraordinarily efficient at moving large volumes of data.
Ecosystem
Nothing else comes close. Client libraries in every mainstream language. Hundreds of connectors. Kafka Streams, ksqlDB, Schema Registry. Integration with every major data tool, cloud platform, and monitoring system. If you choose Kafka, you will rarely be the first person to solve a particular integration problem.
Battle-Tested at Scale
Kafka runs at LinkedIn (7 trillion messages per day as of their last public disclosure), Netflix, Uber, Apple, and thousands of other companies. It has been hammered, broken, patched, and hardened by the most demanding workloads on the planet. The failure modes are well-documented. The operational practices are well-established. There is a large community of experienced operators.
The Log Abstraction
The commit log model is simply a better abstraction than the traditional message queue for many workloads. Replay capability, consumer group independence, and time-based retention make it natural for event sourcing, stream processing, CDC, and analytics pipelines.
Exactly-Once Semantics
Kafka is one of the few systems that offers genuine exactly-once semantics (within the Kafka ecosystem). The combination of idempotent producers, transactional writes, and consumer offset commits inside transactions is well-engineered and works reliably.
Weaknesses
Operational Complexity
Running Kafka well is a full-time job. A medium-sized deployment requires attention to broker configuration (there are over 200 configuration parameters), partition management, replication monitoring, consumer group health, disk management, JVM tuning, and network configuration. The learning curve from "it works on my laptop" to "it is reliable in production" is steep and expensive.
JVM Tuning
Kafka runs on the JVM, and GC pauses are a real concern. Long GC pauses can cause brokers to drop out of the ISR, trigger leader elections, and increase tail latency. Tuning the garbage collector (G1 or ZGC), sizing the heap, and enabling GC logging are part of every serious Kafka deployment. Modern low-pause collectors such as ZGC have improved things significantly, but GC remains a factor.
Rebalance Storms
Consumer group rebalancing, as discussed above, is Kafka's most annoying operational issue. It has improved dramatically with cooperative rebalancing and static membership, but legacy consumers using the default eager protocol will still experience it.
Partition Management
Partitions are Kafka's unit of parallelism, but they are also its unit of operational pain. Each partition has a leader, followers, and metadata. More partitions means more metadata, more file handles, more recovery time after a broker failure, and more complex rebalancing. There is a practical upper limit to partition count per broker (historically tens of thousands, improved with KRaft), and getting the partition count wrong at topic creation is a mistake that haunts you forever — or at least until you recreate the topic.
No Built-In Message Routing
Kafka has topics and partitions. If you want to route messages based on content (this order goes to the fraud detector, that order goes to the warehouse), you build that routing logic in your producers or stream processors. There is no equivalent to RabbitMQ's exchange-based routing. This is by design — Kafka is a log, not a router — but it means more application-level code for routing-heavy workloads.
Cost at Scale
Kafka clusters need fast disks (SSDs for latency-sensitive workloads), plenty of memory (page cache is critical for performance), and enough network bandwidth for replication and consumer traffic. A three-broker cluster with replication factor 3, reasonable retention, and production-grade monitoring is not cheap. It is cheaper than the alternatives at very high throughput, but it is expensive at low to moderate throughput.
Ideal Use Cases
- High-throughput event streaming: clickstream, IoT telemetry, log aggregation, metrics pipelines
- Event sourcing and CQRS: the log is a natural event store
- Stream processing: when combined with Kafka Streams, ksqlDB, or Flink
- Change data capture: with Debezium and Kafka Connect
- Data integration hub: centralized pipeline between operational and analytical systems
- Microservice event bus (at scale): when you have enough traffic to justify the operational overhead
Operational Reality
Minimum Viable Cluster
Three brokers with KRaft (three controller nodes, which can be co-located with brokers for small clusters). Replication factor 3, min.insync.replicas=2. This gives you single-node failure tolerance. For development, a single broker works, but do not mistake development for production.
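A minimal KRaft server.properties for a combined broker-and-controller node looks roughly like the following. The hostnames and node IDs are illustrative, and you should consult the Kafka documentation for your exact version:

```properties
# KRaft mode: this node acts as both broker and controller.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093

listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER

# The durability settings discussed above.
default.replication.factor=3
min.insync.replicas=2

# Never let an out-of-sync replica become leader.
unclean.leader.election.enable=false
```

Each of the other two nodes gets its own node.id; the quorum voter list is the same everywhere.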
Key Monitoring Metrics
- Under-replicated partitions: Any value above 0 is a red alert. It means data is at risk.
- Consumer lag: How far behind are your consumers? Growing lag means consumers cannot keep up.
- Request latency (produce, fetch): p99 latency increasing? Time to investigate.
- ISR shrink/expand rate: Frequent ISR changes indicate broker instability.
- Controller metrics: Leader elections, active controller count.
- Disk usage and I/O: Kafka is I/O-bound. Watch for disk saturation.
- JVM GC pauses: Long pauses directly impact broker responsiveness.
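Consumer lag in particular is worth understanding precisely: per partition, it is the log end offset minus the group's committed offset. A toy calculation with invented numbers, no Kafka client involved:

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    # Lag per partition = log end offset - committed consumer offset.
    # A missing committed offset means the group has read nothing yet.
    return sum(end - committed.get(partition, 0)
               for partition, end in end_offsets.items())


end_offsets = {0: 1_500, 1: 1_480, 2: 1_510}  # latest offset per partition
committed   = {0: 1_500, 1: 1_200, 2: 1_505}  # where the group has got to

assert total_lag(end_offsets, committed) == 285  # 0 + 280 + 5
# Flat or shrinking lag is healthy; monotonically growing lag means the
# consumers cannot keep up with the producers.
```

Alert on the trend, not the absolute number: a steady lag of a few thousand messages on a busy topic is normal, a lag that only ever grows is an outage in slow motion.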
Use Prometheus with the JMX Exporter, plus Grafana dashboards. There are well-established community dashboards. Confluent Control Center provides a richer view but requires a Confluent licence.
Upgrades
Kafka supports rolling upgrades with zero downtime, but the process requires care:
- Update broker configurations for the new inter-broker protocol version.
- Roll brokers one at a time, waiting for each to rejoin the ISR before proceeding.
- Update the inter-broker protocol version cluster-wide.
- Update the log message format version.
Client library versions must be compatible with the broker version. Kafka maintains backward compatibility for several major versions, but testing is essential.
Multi-Datacenter
Kafka is not natively multi-datacenter. Options:
- MirrorMaker 2: Replicates topics between clusters. It works, it is asynchronous (so some data loss during failover is possible), and managing the replication topology requires attention.
- Confluent Replicator: A commercial alternative with more features.
- Stretched clusters: Running a single cluster across datacenters with rack-aware replica placement. This works but requires low-latency inter-datacenter links and careful configuration. Latency between DCs directly impacts produce latency with acks=all.
Managed Offerings
- Confluent Cloud: The most feature-rich managed Kafka. Serverless and dedicated options. Schema Registry, ksqlDB, Connect managed. Not cheap, but eliminates operational burden.
- Amazon MSK: AWS-managed Kafka. Less opinionated, gives you the raw brokers. MSK Serverless is simpler but has limitations. You still manage topics, consumers, and monitoring.
- Aiven for Apache Kafka: Multi-cloud managed Kafka with a clean interface. Good support for open-source tooling.
- Azure Event Hubs (Kafka-compatible): Not actually Kafka, but implements the Kafka protocol. Works for basic use cases; do not expect full Kafka feature parity.
- Redpanda Cloud: A Kafka-compatible alternative, not Apache Kafka. Covered in its own chapter.
Code Examples
Java Producer
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
public class OrderEventProducer {
public static void main(String[] args) {
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// Production settings — do not skip these
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
ProducerRecord<String, String> record = new ProducerRecord<>(
"order-events", // topic
"order-7829", // key (partition routing)
"{\"type\":\"OrderPlaced\",\"orderId\":\"order-7829\"}"
);
producer.send(record, (metadata, exception) -> {
if (exception != null) {
System.err.println("Send failed: " + exception.getMessage());
} else {
System.out.printf("Sent to partition %d at offset %d%n",
metadata.partition(), metadata.offset());
}
});
}
}
}
Java Consumer
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
public class OrderEventConsumer {
public static void main(String[] args) {
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
// Start from earliest offset if no committed offset exists
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// Disable auto-commit — commit manually after processing
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
// Use cooperative rebalancing to avoid stop-the-world rebalances
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
CooperativeStickyAssignor.class.getName());
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(List.of("order-events"));
while (true) {
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("Received: key=%s, value=%s, partition=%d, offset=%d%n",
record.key(), record.value(),
record.partition(), record.offset());
// Process the record here...
}
// Commit offsets after processing the batch
consumer.commitSync();
}
}
}
}
Python Producer (confluent-kafka)
from confluent_kafka import Producer
conf = {
'bootstrap.servers': 'localhost:9092',
'acks': 'all',
'enable.idempotence': True,
'retries': 10000000,
}
producer = Producer(conf)
def delivery_callback(err, msg):
if err:
print(f'Delivery failed: {err}')
else:
print(f'Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}')
producer.produce(
topic='order-events',
key='order-7829',
value='{"type": "OrderPlaced", "orderId": "order-7829"}',
callback=delivery_callback,
)
# flush() blocks until all messages are delivered or timeout
producer.flush(timeout=10)
Python Consumer (confluent-kafka)
from confluent_kafka import Consumer
conf = {
'bootstrap.servers': 'localhost:9092',
'group.id': 'order-processing-group',
'auto.offset.reset': 'earliest',
'enable.auto.commit': False,
}
consumer = Consumer(conf)
consumer.subscribe(['order-events'])
try:
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
print(f'Consumer error: {msg.error()}')
continue
print(f'Received: key={msg.key()}, value={msg.value()}, '
f'partition={msg.partition()}, offset={msg.offset()}')
# Process the message here...
# Commit the offset after successful processing
consumer.commit(asynchronous=False)
finally:
consumer.close()
Verdict
Kafka is the obvious choice for high-throughput event streaming — if you can afford its operational cost. "Afford" here means both money and expertise. If your organisation has a platform team that can dedicate time to Kafka operations, or if you are willing to pay for a managed service, Kafka's ecosystem, throughput, and battle-tested reliability make it the strongest option for data-intensive workloads.
Pick Kafka when:
- You need sustained high throughput (hundreds of thousands to millions of messages per second)
- You need event replay and stream processing
- You need a rich connector ecosystem for data integration
- You have the team (or the budget for managed services) to operate it properly
- You are building a centralised event streaming platform for multiple teams
Avoid Kafka when:
- Your throughput is modest and a simpler broker would suffice
- You need complex message routing (look at RabbitMQ)
- You are a small team without dedicated infrastructure engineers and do not want to pay for managed Kafka
- Your use case is simple task queues or request-reply patterns
- You need native multi-tenancy with strong isolation (look at Pulsar)
Kafka is not the answer to every messaging problem. But for the problems it was designed to solve — high-volume, durable, replayable event streaming — it remains the benchmark against which everything else is measured.
RabbitMQ
If Kafka is the event streaming platform that conquered the world through sheer throughput and ambition, RabbitMQ is the message broker that quietly kept the world running while Kafka was getting all the conference keynotes. It is older, more traditional in its design, and refreshingly honest about what it is: a message broker. Not an event streaming platform. Not a distributed commit log. A broker. It accepts messages, routes them according to rules you define, and delivers them to consumers. It does this reliably, flexibly, and with a routing model that remains unmatched in the industry.
RabbitMQ is also a project in an interesting phase of its life — mature, widely deployed, and navigating the transition from its traditional queue-based model toward event streaming capabilities with the introduction of Streams. Whether that transition succeeds in keeping RabbitMQ relevant against Kafka and its competitors is one of the more interesting questions in the messaging space.
Overview
What It Is
RabbitMQ is an open-source message broker that implements the Advanced Message Queuing Protocol (AMQP). It provides a flexible routing model based on exchanges and bindings, supports multiple messaging protocols, and offers strong delivery guarantees through publisher confirms and consumer acknowledgments.
Brief History
RabbitMQ was created in 2007 by Rabbit Technologies, a small company founded by Alexis Richardson and Matthias Radestock. The founding premise was to build a proper implementation of the AMQP specification — a protocol designed by JP Morgan Chase and a consortium of financial institutions who were tired of expensive proprietary messaging middleware (IBM MQ, TIBCO).
The implementation language choice was Erlang/OTP, which was unusual then and remains unusual now. Erlang was designed by Ericsson for telecommunications switching systems — highly concurrent, fault-tolerant, soft real-time systems. For a message broker, this is an almost suspiciously good fit. Erlang's lightweight process model, pattern matching, and "let it crash" supervision philosophy translate directly into broker capabilities: millions of concurrent connections, isolated failure handling, and hot code upgrades.
Rabbit Technologies was acquired by VMware in 2010. VMware spun it into Pivotal in 2013. VMware re-acquired Pivotal in 2019. VMware was acquired by Broadcom in 2023. If you are keeping score, that is four corporate parents in fifteen years, which is enough to make any open-source project nervous. Broadcom's acquisition, in particular, raised concerns — Broadcom has a reputation for aggressive cost optimisation of acquired software businesses, and the RabbitMQ community watched carefully for signs of reduced investment. As of this writing, development continues, but the stewardship question is a legitimate factor in long-term planning.
Who Runs It
RabbitMQ is open source under the Mozilla Public License 2.0. The core team is employed by Broadcom (via the VMware Tanzu division). There is an active community of contributors, but the core development is heavily concentrated in the Broadcom-employed team.
Architecture
The AMQP Model: Exchanges, Queues, and Bindings
RabbitMQ's routing model is its defining feature and the reason it excels at use cases where Kafka struggles. The model has three components:
Producers publish messages to exchanges. An exchange is not a queue — it does not store messages. It is a routing engine that examines each incoming message and decides which queues should receive a copy based on bindings — rules that connect exchanges to queues.
Queues store messages until consumers retrieve them. Unlike Kafka's log, a traditional RabbitMQ queue removes messages once they are acknowledged by a consumer. The message lifecycle is: produced → routed → queued → delivered → acknowledged → gone.
Bindings define the routing rules between exchanges and queues. The binding key and the exchange type together determine how messages are routed.
Exchange Types
This is where RabbitMQ shines:
Direct exchange: Routes messages to queues whose binding key exactly matches the message's routing key. Simple, predictable. Use it when you know exactly where a message should go. Think of it as a precise address.
Topic exchange: Routes messages based on wildcard pattern matching on the routing key. Routing keys are dot-delimited strings (e.g., order.placed.us-east), and binding patterns can use * (match one word) and # (match zero or more words). So a binding of order.*.us-east matches order.placed.us-east but not order.placed.eu-west, while order.# matches everything starting with order.. This is extremely powerful for building flexible event routing topologies.
Fanout exchange: Routes messages to all bound queues, ignoring the routing key entirely. Every queue gets a copy. This is your pub/sub broadcast mechanism.
Headers exchange: Routes based on message header attributes rather than the routing key. Less commonly used but valuable when routing decisions depend on multiple attributes (e.g., "route to this queue if content-type is application/json AND priority is high").
Default exchange: A special direct exchange where every queue is automatically bound with a binding key equal to the queue name. Publishing to the default exchange with routing key "my-queue" delivers directly to the queue named "my-queue". This makes RabbitMQ feel like a simple point-to-point queue system when you want it to.
This routing model means you can implement complex event distribution topologies — fan-out to multiple consumers, content-based routing, topic hierarchies — without writing any application-level routing code. The broker handles it. In Kafka, all of this is your problem.
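The topic-exchange matching rules in particular are simple enough to sketch. The function below is an illustrative re-implementation of the semantics described above, not RabbitMQ's actual matching code:

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Illustrative topic-exchange matching: '*' matches exactly one word,
    '#' matches zero or more words, words are separated by dots."""
    def match(p, k):
        if not p:
            return not k            # pattern exhausted: key must be too
        if p[0] == '#':
            # '#' can absorb zero words (skip it) or one more word (consume).
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False            # words left in pattern, none in key
        if p[0] == '*' or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split('.'), routing_key.split('.'))


assert topic_matches('order.*.us-east', 'order.placed.us-east')
assert not topic_matches('order.*.us-east', 'order.placed.eu-west')
assert topic_matches('order.#', 'order.placed.us-east')
assert topic_matches('order.#', 'order')      # '#' matches zero words
assert not topic_matches('order.*', 'order')  # '*' needs exactly one word
```

In RabbitMQ this evaluation happens inside the broker, once per message against every binding on the exchange — which is the whole point: your producers and consumers never see it.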
Queue Types
RabbitMQ has evolved from a single queue implementation to three distinct types, each with different trade-offs.
Classic Queues
The original queue type. Messages are stored in memory (with overflow to disk) on a single node. In a cluster, classic queues can be mirrored (replicated) to other nodes for high availability, but classic mirrored queues are deprecated as of RabbitMQ 3.13 and should not be used for new deployments.
Classic mirrored queues had several well-documented problems: synchronisation during initial mirroring blocked the queue, adding mirrors to a loaded queue caused backlogs, and the promotion logic during node failures had edge cases that could lose messages. They worked, mostly, but they were a source of operational anxiety.
Quorum Queues
The modern replacement for mirrored queues, introduced in RabbitMQ 3.8. Quorum queues use the Raft consensus protocol for replication. They require a majority (quorum) of nodes to be available for writes and guarantee data safety through replicated, durable logs.
Quorum queues are the recommended queue type for any workload that requires high availability and data safety. They are more predictable than mirrored queues, handle node failures more gracefully, and have better performance characteristics under normal operation.
The trade-off: quorum queues use more disk I/O (every message is written to a write-ahead log on all replicas) and do not support some features of classic queues (message TTL per message, queue length limits via drop-head, priorities). For most workloads, these limitations are acceptable.
Streams
Introduced in RabbitMQ 3.9, streams are RabbitMQ's answer to the "but can it do what Kafka does?" question. A stream is an append-only log — messages are not removed when consumed. Consumers can read from any point in the stream, replay from the beginning, or start from a timestamp. Sound familiar?
Streams give RabbitMQ event streaming capabilities without requiring a separate system. They support high fan-out (many consumers reading the same data), time-based retention, and offset tracking. The implementation is optimised for sequential disk I/O, borrowing ideas from — you guessed it — Kafka's log design.
We will cover streams in more detail below. The short version: they work, they are improving rapidly, and they make RabbitMQ viable for use cases that previously required Kafka. But they are younger and less battle-tested than Kafka's log.
The AMQP Protocol
RabbitMQ's native protocol is AMQP 0-9-1, the "original" AMQP that the broker was built to implement. Despite the version number suggesting it is a pre-release specification, AMQP 0-9-1 is a mature, well-defined protocol with broad client support.
RabbitMQ also supports AMQP 1.0 (the OASIS standard, which is a substantially different protocol despite sharing a name), MQTT (for IoT workloads), and STOMP (for text-based simplicity). The multi-protocol support is a genuine differentiator — you can have IoT devices publishing via MQTT and backend services consuming via AMQP from the same broker.
As of RabbitMQ 4.0 (late 2024), AMQP 1.0 has become a first-class citizen alongside 0-9-1, with native support for streams and quorum queues over the 1.0 protocol. This is significant because AMQP 1.0 is the protocol that cloud providers and enterprise middleware vendors have standardised on.
Acknowledgments and Publisher Confirms
Consumer Acknowledgments
When a consumer receives a message, it must acknowledge (ack) it to tell the broker the message was successfully processed. Until the ack is received, the message stays in the queue and will be redelivered if the consumer disconnects. This is the foundation of at-least-once delivery.
Consumers can also reject (nack) a message, optionally requesting requeue (put it back in the queue for another attempt) or dead-lettering (route it to a designated dead-letter exchange for error handling).
Manual acknowledgment with basic.ack after successful processing is the safe default. Auto-acknowledgment (the broker considers the message delivered as soon as it sends it) is at-most-once delivery and appropriate only for non-critical messages.
Publisher Confirms
The producer-side equivalent of consumer acks. When publisher confirms are enabled on a channel, the broker sends a confirmation (or negative confirmation) to the producer after the message has been durably stored. This closes the "I published a message but don't know if the broker actually received it" gap.
Without publisher confirms, a message could be lost between the producer sending it and the broker persisting it — network failure, broker crash, or just a full queue with a reject policy. Publisher confirms are essential for any workflow where message loss is unacceptable.
Clustering
RabbitMQ clustering connects multiple nodes into a single logical broker. Cluster metadata (exchange definitions, queue definitions, bindings, users, policies) is replicated to all nodes. Queue data (the actual messages) is not automatically replicated — you need quorum queues or streams for data replication.
Clustering gives you:
- Horizontal scaling: Distribute queues across nodes to spread load.
- High availability: With quorum queues, survive node failures without message loss.
- Unified management: A single management interface for the entire cluster.
Clustering requires reliable, low-latency networking. RabbitMQ clusters should be deployed within a single datacenter or availability zone. For cross-datacenter replication, use federation or shovel.
Federation and Shovel
Federation links exchanges or queues across RabbitMQ clusters (or individual nodes) that may be geographically distributed. Federated exchanges forward messages to downstream exchanges based on bindings. Federated queues allow consumers on one cluster to consume from a queue on another.
Federation is asynchronous, tolerates WAN latency and intermittent connectivity, and does not require the clusters to share the same Erlang cookie (authentication secret). It is designed for cross-datacenter and cross-region scenarios.
Shovel is a simpler mechanism: it is a built-in plugin that acts as a consumer on one broker and a producer on another, forwarding messages between them. Less intelligent than federation, but simpler and more flexible — you can shovel between any two AMQP endpoints, including non-RabbitMQ brokers.
Strengths
Routing Flexibility
No other mainstream broker matches RabbitMQ's routing model. Topic exchanges with wildcard bindings, headers-based routing, and exchange-to-exchange bindings give you content-based message routing that would require custom application code in any other system. If your use case involves directing messages to different consumers based on message attributes, RabbitMQ is the natural choice.
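To make the topic-exchange semantics concrete, here is a small sketch of the wildcard matching rules (`*` matches exactly one dot-separated word, `#` matches zero or more). This mirrors the AMQP behaviour for illustration; it is not RabbitMQ's implementation:

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP topic binding pattern matches a routing key."""
    def match(p, k):
        if not p:
            return not k                      # both exhausted: match
        if p[0] == '#':
            # '#' may consume zero or more remaining words
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False                      # pattern left over, key empty
        if p[0] == '*' or p[0] == k[0]:
            return match(p[1:], k[1:])        # word-by-word comparison
        return False
    return match(pattern.split('.'), routing_key.split('.'))

print(topic_matches('order.placed.*', 'order.placed.us-east'))  # True
print(topic_matches('order.*', 'order.placed.us-east'))         # False
```

A binding of `order.#` on a topic exchange would therefore receive every order event regardless of depth, while `order.placed.*` selects exactly the placed-order events per region.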
Mature Protocol Support
AMQP 0-9-1 is a well-understood, widely implemented protocol. The multi-protocol support (AMQP 1.0, MQTT, STOMP) means RabbitMQ can serve as a polyglot messaging layer for heterogeneous environments. This is particularly valuable in enterprise settings where different systems speak different protocols.
Ease of Getting Started
RabbitMQ has one of the best out-of-box experiences of any message broker. Install it, start it, open the management UI, and you have a working broker with a web-based dashboard for creating exchanges, queues, bindings, publishing test messages, and monitoring. The learning curve from "nothing installed" to "processing messages" is measured in minutes, not hours.
Management UI and Observability
The built-in management UI is genuinely useful — not just a toy dashboard, but a tool that operators use daily. It shows queue depths, message rates, connection counts, consumer utilisation, and node health. It also exposes an HTTP API for programmatic management.
Prometheus metrics are available via a built-in plugin, with well-maintained Grafana dashboards. The observability story is solid.
Plugin Ecosystem
RabbitMQ has a mature plugin system. Notable plugins include the management UI, Prometheus metrics, federation, shovel, MQTT support, STOMP support, tracing, and delayed message exchange. The plugin architecture means you can extend the broker's functionality without forking it.
Quorum Queues
The introduction of quorum queues was a turning point for RabbitMQ's reliability story. Raft-based replication provides predictable, well-understood behaviour during node failures. If you are deploying RabbitMQ today, quorum queues should be your default for any queue that matters.
Weaknesses
Throughput Ceiling
RabbitMQ is not designed for the same throughput as Kafka. A well-tuned RabbitMQ cluster can handle tens of thousands of messages per second per queue — respectable, but an order of magnitude less than what Kafka achieves. The routing layer, per-message acknowledgment overhead, and queue-based storage model all contribute to this ceiling.
For many workloads, tens of thousands of messages per second is plenty. But if you need hundreds of thousands or millions, RabbitMQ is not the right tool, and no amount of tuning will change that.
Queue Depth Problems
When consumers fall behind and queues grow deep, RabbitMQ suffers. Large queues increase memory usage, slow down message delivery (because the broker is managing more state), and can trigger memory alarms that block publishers. The broker is designed for queues that are relatively short — messages flow in and out quickly. Long queues with millions of messages are a sign that something is wrong.
This is a fundamental design difference from Kafka, where retaining millions of messages is the normal operating mode. RabbitMQ's storage model is optimised for messages flowing through quickly, not for long-term retention.
Erlang Operational Expertise
Erlang is a fantastic language for building RabbitMQ. It is a less fantastic language for debugging RabbitMQ. When things go wrong at the system level — processes accumulating, memory growing, nodes failing to cluster — understanding what is happening requires familiarity with Erlang's process model, OTP supervision trees, and the Erlang VM's (BEAM) behaviour under stress.
You do not need to write Erlang to operate RabbitMQ. But when you need to read Erlang crash logs, interpret process dump output, or understand why the Erlang distribution protocol is rejecting connections, you will wish you had someone on the team who speaks the language.
No Native Message Replay
Traditional RabbitMQ queues delete messages after acknowledgment. If you need to reprocess historical messages — because you found a bug, deployed a new consumer, or want to rebuild state — those messages are gone. You either need a separate archival system, or you use RabbitMQ Streams (which do support replay, but are a different thing from queues).
This is a significant limitation for event-driven architectures that rely on replay capability. It is also the primary reason teams choose Kafka over RabbitMQ for event sourcing and stream processing workloads.
Classic Mirrored Queue Legacy
If you are running an older RabbitMQ deployment with classic mirrored queues, you are carrying technical debt. Mirrored queues are deprecated and will be removed in a future release. Migration to quorum queues is well-supported but requires planning — the queue types have different semantics, and some applications may depend on features that quorum queues do not support.
Ideal Use Cases
- Task queues and work distribution: Distribute tasks to a pool of workers with acknowledgment-based reliability. This is RabbitMQ's bread and butter.
- Complex routing topologies: Route messages based on content, headers, or topic patterns without custom code.
- Request-reply patterns: RabbitMQ has first-class support for RPC-style messaging with reply-to queues and correlation IDs.
- Multi-protocol environments: IoT devices on MQTT, backend services on AMQP, legacy systems on STOMP — all on one broker.
- Microservice command bus: Distributing commands (not events) to specific services, with routing and acknowledgment.
- Moderate-throughput event distribution: When you need pub/sub but do not need Kafka-scale throughput or log-based retention.
Operational Reality
Memory and Disk Alarms
RabbitMQ has a built-in flow control mechanism tied to resource limits:
- Memory alarm: When the broker's memory usage exceeds the configured threshold (default 40% of system RAM), all publishers are blocked. Consumers continue to drain queues, but no new messages are accepted until memory drops below the threshold. This is aggressive but effective — it prevents the broker from running out of memory and crashing.
- Disk alarm: When free disk space drops below the configured threshold (default 50MB, which is far too low for production — set it higher), the same publisher blocking occurs.
These alarms are your early warning system. If they fire frequently, your queues are too deep, your consumers are too slow, or your cluster is undersized.
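Both thresholds live in rabbitmq.conf. A sketch with illustrative values (tune to your hardware; the 50MB disk default in particular should be raised):

```ini
# rabbitmq.conf
# Block publishers when the broker exceeds 40% of system RAM.
vm_memory_high_watermark.relative = 0.4
# Block publishers when free disk falls below 5GB (default 50MB is
# far too low for production).
disk_free_limit.absolute = 5GB
```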
Queue Depth Monitoring
The single most important metric in RabbitMQ operations is queue depth per queue. A queue that is growing means consumers are not keeping up. A queue with millions of messages means you have a problem that is getting worse. Unlike Kafka, where large backlogs are normal and expected, a deep RabbitMQ queue is a symptom that needs attention.
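Queue depth is exposed by the management plugin's HTTP API at `/api/queues`, which returns one JSON object per queue including a `messages` count. The sketch below checks a payload for deep queues; the sample data and threshold are hypothetical, and in production you would fetch the payload with an authenticated GET against the broker's management port:

```python
def deep_queues(queues, threshold=10_000):
    """Return names of queues whose backlog exceeds 'threshold' messages."""
    return [q['name'] for q in queues if q.get('messages', 0) > threshold]

# Hypothetical /api/queues response, trimmed to the relevant field.
sample = [
    {'name': 'order-processing', 'messages': 42},
    {'name': 'email-notifications', 'messages': 1_250_000},
]
print(deep_queues(sample))
```

Wiring a check like this into your alerting, with a per-queue threshold, catches consumer lag long before the memory alarm does.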
Upgrade Strategies
RabbitMQ supports rolling upgrades within certain version ranges. The process is:
- Stop one node.
- Upgrade the binary.
- Start the node and let it rejoin the cluster.
- Wait for quorum queues to synchronise.
- Repeat for each node.
Major version upgrades (e.g., 3.x to 4.x) may require feature flags to be enabled sequentially and can involve more significant changes to configuration and plugin compatibility. Read the release notes. All of them.
The Erlang version also matters — RabbitMQ has specific Erlang version requirements, and upgrading RabbitMQ sometimes requires upgrading Erlang first.
Cluster Sizing
A typical production RabbitMQ cluster is 3 nodes with quorum queues (replication factor 3). This gives you majority-based fault tolerance — you can lose one node without losing data or availability.
For higher throughput, add nodes and distribute queues across them. Unlike Kafka, where partitions provide automatic parallelism within a topic, RabbitMQ requires you to manage queue distribution yourself (or use consistent hash exchange plugins for automatic sharding).
Memory sizing depends heavily on queue depth and message size. A node processing messages quickly (short queues) needs less memory than one with deep backlogs. Start with 4-8GB per node for moderate workloads and monitor from there.
Managed Offerings
- CloudAMQP: The most established RabbitMQ-as-a-service provider. Multi-cloud, solid management interface, good support.
- Amazon MQ for RabbitMQ: AWS-managed, limited configuration options, simpler but less flexible.
- VMware Tanzu RabbitMQ: Commercial distribution with additional features for enterprise environments.
- Azure Service Bus: Not RabbitMQ, but supports AMQP and is often considered as an alternative in Azure environments.
RabbitMQ Streams in Depth
Streams deserve special attention because they represent RabbitMQ's strategic response to the event streaming trend.
A stream is an append-only, immutable log — the same abstraction as a Kafka topic partition. Messages are written once and retained based on time or size limits. Consumers can attach at any point in the stream, read forward, and track their offset.
Streams use a custom binary protocol (separate from AMQP) optimised for high throughput and low overhead. They leverage sequential disk I/O, memory-mapped files, and a purpose-built storage engine.
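The abstraction itself can be sketched in a few lines. This is a toy in-memory log, not the streams implementation, but it shows why replay is trivial: messages are retained, and each consumer's position is just an offset it tracks independently:

```python
class Stream:
    """Toy append-only log illustrating the stream abstraction."""
    def __init__(self):
        self.log = []

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1      # offset of the new entry

    def read_from(self, offset):
        return self.log[offset:]      # replay is just a re-read

s = Stream()
for m in ['a', 'b', 'c']:
    s.append(m)
print(s.read_from(0))   # a new consumer replays everything
print(s.read_from(2))   # another consumer attaches mid-stream
```

Contrast with a queue, where consuming (and acking) message 'a' removes it for everyone.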
When to Use Streams vs Queues
- Streams: Large fan-out (many consumers reading the same data), replay requirements, high-throughput write-once-read-many workloads.
- Queues: Task distribution (each message processed by one consumer), complex routing via exchanges, message-level TTL and priority.
Streams are not a replacement for queues. They are a complementary tool for a different set of use cases. The power of modern RabbitMQ is that you can use both in the same cluster, with the same management tools, and even route between them using exchanges.
Limitations
Streams are younger than Kafka's log implementation and have fewer features. Server-side filtering within streams is limited. The ecosystem around streams (connectors, stream processing) is not comparable to Kafka's. They are improving with each release, but as of now, if event streaming is your primary use case, Kafka remains the more complete solution.
Code Examples
Python Producer (pika)
import pika
import json

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost')
)
channel = connection.channel()

# Declare exchange and queue
channel.exchange_declare(exchange='order-events', exchange_type='topic', durable=True)
channel.queue_declare(queue='order-processing', durable=True)
channel.queue_bind(
    queue='order-processing',
    exchange='order-events',
    routing_key='order.placed.*'  # Match all regions
)

# Enable publisher confirms for durability
channel.confirm_delivery()

message = json.dumps({
    'type': 'OrderPlaced',
    'orderId': 'order-7829',
    'region': 'us-east',
})

try:
    channel.basic_publish(
        exchange='order-events',
        routing_key='order.placed.us-east',
        body=message,
        properties=pika.BasicProperties(
            delivery_mode=2,  # Persistent message
            content_type='application/json',
        ),
        mandatory=True,  # Required for UnroutableError on unroutable messages
    )
    print('Message published and confirmed')
except pika.exceptions.UnroutableError:
    print('Message could not be routed')

connection.close()
Python Consumer (pika)
import pika
import json

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost')
)
channel = connection.channel()
channel.queue_declare(queue='order-processing', durable=True)

# Fair dispatch — don't send more than one message to a worker at a time
channel.basic_qos(prefetch_count=1)

def on_message(channel, method, properties, body):
    event = json.loads(body)
    print(f"Processing order: {event['orderId']}")
    # Process the message...
    # Acknowledge after successful processing
    channel.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(
    queue='order-processing',
    on_message_callback=on_message,
    auto_ack=False,  # Manual acknowledgment
)

print('Waiting for messages...')
channel.start_consuming()
Java Producer (Spring AMQP)
import org.springframework.amqp.core.*;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

    @Bean
    public TopicExchange orderExchange() {
        return new TopicExchange("order-events");
    }

    @Bean
    public Queue orderProcessingQueue() {
        return QueueBuilder.durable("order-processing")
                .quorum() // Use quorum queue for replication
                .build();
    }

    @Bean
    public Binding orderBinding(Queue orderProcessingQueue,
                                TopicExchange orderExchange) {
        return BindingBuilder
                .bind(orderProcessingQueue)
                .to(orderExchange)
                .with("order.placed.*");
    }
}

// In your service:
@Service
public class OrderEventPublisher {

    private final RabbitTemplate rabbitTemplate;

    public OrderEventPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public void publishOrderPlaced(String orderId, String region) {
        String routingKey = "order.placed." + region;
        String message = String.format(
            "{\"type\":\"OrderPlaced\",\"orderId\":\"%s\"}", orderId
        );
        rabbitTemplate.convertAndSend("order-events", routingKey, message);
    }
}
Java Consumer (Spring AMQP)
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class OrderEventListener {

    @RabbitListener(queues = "order-processing")
    public void handleOrderPlaced(String message) {
        System.out.println("Received: " + message);
        // Process the order event...
        // Acknowledgment is automatic with Spring's default settings
        // (acknowledged after the method returns without exception)
    }
}
Verdict
RabbitMQ is the right choice when your problem is routing messages rather than streaming events. Its exchange-binding-queue model is the most expressive routing system in the messaging world, and for task distribution, RPC patterns, and complex event routing, nothing matches it.
Pick RabbitMQ when:
- You need flexible message routing based on content, topics, or headers
- Your primary pattern is task distribution (work queues with acknowledgment)
- You need request-reply / RPC messaging
- You want a broker that is easy to set up, manage, and understand
- You need multi-protocol support (AMQP, MQTT, STOMP)
- Your throughput requirements are moderate (tens of thousands of messages/sec)
- You need a broker that your team can operate without a PhD in distributed systems
Avoid RabbitMQ when:
- You need high-throughput event streaming (hundreds of thousands+ messages/sec)
- You need log-based message retention and replay as a core feature (though Streams are closing this gap)
- You need built-in stream processing (Kafka Streams, ksqlDB)
- Your primary workload is event sourcing with long-term retention
- You need a massive connector ecosystem for data integration
RabbitMQ is not trying to be Kafka, and that is its greatest strength. It is a message broker — arguably the best general-purpose message broker available — and for the workloads it was designed for, it remains an excellent choice. The addition of Streams extends its relevance into event streaming territory, but that feature is still maturing. If your workload is primarily event streaming, evaluate RabbitMQ Streams honestly against your requirements rather than assuming the name alone is sufficient.
Apache Pulsar
Apache Pulsar is what happens when you look at Kafka and think, "What if we separated the compute from the storage, added multi-tenancy from day one, made geo-replication a first-class feature, and accepted that the result would be three distributed systems in a trenchcoat?" The answer is a platform that is genuinely impressive in its capabilities and genuinely demanding in its operational requirements.
Pulsar occupies a fascinating position in the messaging landscape. It addresses real limitations of Kafka — multi-tenancy, tiered storage, geo-replication, and the unified queuing-plus-streaming model — with architectural choices that are technically sound but operationally expensive. Whether those capabilities justify that expense depends entirely on your specific needs. This chapter will help you figure out if you are one of the organisations for whom Pulsar is the right answer.
Overview
What It Is
Apache Pulsar is a cloud-native, distributed messaging and event streaming platform. It provides pub/sub messaging, event streaming with replay, and traditional message queuing — all within a single system. Its distinguishing architectural feature is the separation of serving (brokers) and storage (Apache BookKeeper), which enables independent scaling of compute and storage.
Brief History
Pulsar was created at Yahoo! in 2013. Yahoo! needed a messaging platform that could serve multiple business units (Yahoo Mail, Yahoo Finance, Flickr, Tumblr) on shared infrastructure without one team's misbehaving workload taking down another team's production service. Nothing on the market did this. Kafka had no meaningful multi-tenancy. RabbitMQ was not designed for Yahoo!'s scale. So they built their own.
The key design decisions were made early: separate compute from storage, build multi-tenancy into the core rather than bolting it on later, and support both queuing and streaming semantics. The implementation leveraged Apache BookKeeper — a distributed, write-ahead log storage system originally developed at Yahoo! for HDFS NameNode journaling — as the durable storage layer.
Yahoo! open-sourced Pulsar in 2016 and donated it to the Apache Software Foundation, where it became a Top-Level Project in 2018. StreamNative, founded by members of the original Yahoo! Pulsar team (including Sijie Guo and Jia Zhai), became the primary commercial entity behind Pulsar, offering managed services and enterprise features.
The project's journey has not been entirely smooth. Community governance disputes, competition with the well-funded Confluent ecosystem, and the inherent complexity of the system have all created headwinds. The community is smaller than Kafka's but dedicated, and development continues actively.
Who Runs It
The Apache Software Foundation governs the project. StreamNative employs many of the core committers. DataStax (through its Astra Streaming offering) also contributes to the ecosystem. The project has a more diverse committer base than Kafka (which is dominated by Confluent employees), though the total contributor count is smaller.
Architecture
The Two-Layer Architecture
This is the fundamental difference between Pulsar and Kafka, and it cascades into everything else.
Kafka's architecture: Brokers serve requests and store data. A Kafka broker is a stateful node — it owns partitions on its local disks. Scaling storage means adding brokers. Rebalancing data means moving partitions between brokers, which is an expensive, bandwidth-intensive operation.
Pulsar's architecture: Brokers serve requests. BookKeeper bookies store data. The two layers are independent.
Pulsar Brokers are stateless (mostly — they cache data, but the source of truth is in BookKeeper). They handle producer and consumer connections, manage topic ownership, and serve read/write requests. Because they are stateless, they can be added, removed, and replaced without data movement. Scaling the serving layer is fast and minimally disruptive.
Apache BookKeeper Bookies are the storage nodes. Each bookie manages a set of ledgers (sequential log segments). When a Pulsar topic receives messages, they are written to a set of bookies in parallel. Bookies are stateful, but their data is managed in smaller units (ledgers/segments) than Kafka's partitions, which makes storage operations more granular.
ZooKeeper (yes, ZooKeeper again) manages metadata for both Pulsar and BookKeeper: broker registration, topic ownership, ledger metadata, and configuration. Pulsar's ZooKeeper dependency is more extensive than Kafka's was, because both the broker layer and the storage layer rely on it.
Segments vs Partitions
This is a nuance worth understanding. In Kafka, a partition is a single log file (or set of segment files) on a single broker. The entire partition lives on one broker (plus its replicas). This means:
- The maximum size of a partition is limited by the broker's local disk.
- Rebalancing a partition means moving the entire thing to another broker.
- If a broker fails, its partitions are unavailable until a new leader is elected from the replicas.
In Pulsar, a topic partition is divided into segments (BookKeeper ledgers). Each segment is stored across multiple bookies (striped, not residing on a single node). When the current segment reaches a size or time threshold, a new segment is created on potentially different bookies. This means:
- No single-broker storage bottleneck: A topic's data is distributed across many bookies.
- Faster recovery: When a bookie fails, only its segments need to be recovered, and the recovery reads come from multiple surviving bookies in parallel.
- Tiered storage: Old segments can be offloaded to cheap object storage (S3, GCS, Azure Blob) while hot segments remain on bookies. This is built into Pulsar's design, not an afterthought.
The downside: this architecture means more components, more metadata, more coordination, and more things that can go wrong.
Tiered Storage
Pulsar's tiered storage model is one of its most compelling features for cost-conscious, high-retention workloads.
Messages flow through three tiers:
- BookKeeper (hot storage): Recent messages on bookie disks (SSD or HDD). Fast read/write access.
- Object storage (cold storage): Older segments offloaded to S3, GCS, or Azure Blob Storage. Dramatically cheaper per GB.
- (Optional) Local cache on brokers: Frequently accessed data cached in broker memory for low-latency reads.
This means you can retain months or years of event history at object storage prices, while recent data remains on fast storage for low-latency access. Kafka can do this too (tiered storage landed in open-source Kafka via KIP-405, and Confluent offers its own implementation), but Pulsar had it first and it is more deeply integrated.
Multi-Tenancy
Multi-tenancy is Pulsar's headline feature and the reason it was built.
The Hierarchy
Pulsar organises resources in a three-level hierarchy:
- Tenants: Top-level organisational unit. Typically maps to a team, business unit, or application. Each tenant has its own admin permissions and resource policies.
- Namespaces: A grouping of topics within a tenant. Policies (retention, backlog limits, replication, schema enforcement) are set at the namespace level.
- Topics: The actual message streams, living within a namespace.
The full topic name looks like: persistent://tenant/namespace/topic-name
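A tiny helper (example names hypothetical) makes the three-level composition concrete:

```python
def parse_topic(full_name):
    """Split a Pulsar topic name into its hierarchy components."""
    scheme, rest = full_name.split('://', 1)        # persistent or non-persistent
    tenant, namespace, topic = rest.split('/', 2)   # the three-level hierarchy
    return {'persistence': scheme, 'tenant': tenant,
            'namespace': namespace, 'topic': topic}

print(parse_topic('persistent://billing/invoices/invoice-created'))
```

Every policy discussion in this section hangs off this structure: permissions at the tenant level, retention and replication at the namespace level.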
Isolation Mechanisms
- Authentication and authorisation: Per-tenant access control. One team cannot access another team's topics without explicit permission.
- Resource quotas: Limits on message rate, bandwidth, storage, and number of topics per namespace or tenant.
- Namespace isolation policies: You can designate specific brokers for specific namespaces, ensuring that a noisy tenant's traffic does not compete with a latency-sensitive tenant's brokers.
- Backlog quotas: Limits on how much unconsumed data a topic can accumulate before producers are throttled or the oldest data is dropped.
This is genuinely more sophisticated than what Kafka or RabbitMQ offer. Kafka has ACLs and quotas, but no formal tenant abstraction. RabbitMQ has vhosts, which provide isolation but not the policy richness of Pulsar's namespace system.
Geo-Replication
Pulsar's geo-replication is built into the core, not bolted on as an external tool.
How It Works
Configure two (or more) Pulsar clusters in different regions. Set a replication policy on a namespace. Pulsar's brokers automatically replicate messages between clusters, using dedicated replication connections. Each cluster maintains its own copy of the data with its own consumer offsets.
Replication is asynchronous (so there is a lag window during which a disaster in one region can lose the most recent messages), but the mechanism is integrated, monitored via standard Pulsar metrics, and configured through the standard admin API.
Comparison
- Kafka: Requires MirrorMaker 2 or Confluent Replicator — separate processes that consume from one cluster and produce to another. It works, but it is more operational moving parts.
- RabbitMQ: Federation and shovel provide cross-cluster replication, but with less sophistication and no namespace-level policy control.
- Pulsar: Built-in, policy-driven, per-namespace. The cleanest implementation among the three.
Subscription Types
Pulsar supports four subscription types, giving you more consumer patterns than either Kafka or RabbitMQ in a single system.
Exclusive: One consumer on the subscription. If another consumer tries to subscribe, it is rejected. This is the simplest and provides strict ordering.
Shared: Multiple consumers share the subscription. Messages are distributed round-robin across consumers. This is the traditional work queue pattern. Ordering is not guaranteed because different consumers process messages at different speeds.
Failover: One active consumer, one or more standby consumers. If the active consumer disconnects, a standby takes over. Ordering is preserved (within a partition) during normal operation.
Key_Shared: Messages with the same key are delivered to the same consumer, while messages with different keys can be distributed across consumers. This gives you per-key ordering with parallelism across keys — the same pattern as Kafka's partition-key-based consumer model, but without requiring you to pre-configure partition counts.
The availability of all four subscription types on the same topic is a significant advantage. In Kafka, you get exclusive-per-partition (roughly equivalent to failover) and that is it — shared consumption requires architectural workarounds. In RabbitMQ, you get shared consumption from queues but not the Kafka-style exclusive-per-partition model (without manual coordination).
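The Key_Shared guarantee can be illustrated with a deterministic key-to-consumer assignment. Note this modulo scheme is only an illustration; Pulsar's actual implementation distributes murmur3 hash ranges across consumers with sticky or auto-split policies:

```python
import zlib

def assign(key, consumers):
    """Map a message key to a consumer deterministically."""
    return consumers[zlib.crc32(key.encode()) % len(consumers)]

consumers = ['consumer-0', 'consumer-1', 'consumer-2']
for key in ['cust-441', 'cust-442', 'cust-441']:
    print(key, '->', assign(key, consumers))
# 'cust-441' maps to the same consumer both times, so per-key ordering
# holds while different keys spread across the group.
```

The practical win over Kafka's model is that parallelism is bounded by the number of keys and consumers, not by a partition count you had to choose up front.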
Pulsar Functions and Pulsar IO
Pulsar Functions
Lightweight, serverless-style compute functions that run inside the Pulsar cluster. A Pulsar Function consumes messages from one or more topics, processes them, and produces results to another topic. Supported languages: Java, Python, Go.
Pulsar Functions are useful for simple transformations, routing, and enrichment — the kind of glue logic that does not justify a separate stream processing framework. They run as threads within the broker, in separate processes, or in Kubernetes pods.
They are not a replacement for Kafka Streams or Flink. For complex stateful stream processing (windowed aggregations, multi-way joins, exactly-once stateful operations), you need an external framework. But for lightweight processing, they reduce the operational surface area by keeping the logic close to the broker.
Pulsar IO
Pulsar's equivalent of Kafka Connect. Source connectors pull data into Pulsar; sink connectors push data from Pulsar to external systems. The connector ecosystem is smaller than Kafka Connect's — significantly smaller. This is a meaningful practical limitation.
Available connectors cover the common cases (JDBC, Elasticsearch, Cassandra, Kafka adapter, S3), but the long tail of specialised connectors that Kafka Connect offers is not there. If your integration needs are standard, Pulsar IO works. If you need a connector for an obscure system, you may be writing it yourself.
Schema Registry
Pulsar includes a schema registry as a built-in feature — no separate service to deploy.
Schemas are associated with topics. Producers declare their schema when connecting, and the broker validates messages against the registered schema. Schema compatibility is enforced (backward, forward, full, or none). Supported formats include Avro, Protobuf, JSON Schema, and primitive types.
Having the schema registry built in (rather than as a separate service like Confluent Schema Registry) simplifies deployment and ensures that schema enforcement is always available. The trade-off is that Pulsar's schema registry is less feature-rich than Confluent's — fewer compatibility modes, less tooling around it.
Strengths
Multi-Tenancy
The best multi-tenancy implementation in the open-source messaging world. If you are building a shared messaging platform for multiple teams with different requirements, Pulsar handles this natively where other brokers require operational gymnastics or multiple clusters.
Geo-Replication
Built-in, policy-driven, per-namespace. Simpler to configure and operate than Kafka's MirrorMaker 2 and more capable than RabbitMQ's federation.
Tiered Storage
Native support for offloading old data to object storage. This makes Pulsar significantly cheaper for high-retention workloads compared to keeping everything on broker-local SSDs.
Unified Messaging Model
Queuing and streaming from the same platform, with four subscription types. You do not need to run RabbitMQ for your work queues and Kafka for your event streams — Pulsar can do both. Whether the operational complexity of Pulsar is less than the operational complexity of two separate systems is a calculation worth doing carefully.
Scalability Architecture
The separation of brokers and bookies means you can scale serving and storage independently. Adding read capacity (more brokers) does not require moving data. Adding storage capacity (more bookies) does not require migrating topic ownership. This is architecturally elegant and practically valuable at large scale.
Built-In Schema Registry
One less component to deploy and manage. Schema enforcement is always available.
Weaknesses
Three Distributed Systems in a Trenchcoat
This is the elephant in the room. A Pulsar deployment consists of:
- Pulsar brokers — a distributed, stateful (caching) service.
- Apache BookKeeper bookies — a distributed storage system with its own replication, journaling, and garbage collection.
- Apache ZooKeeper — a distributed coordination service.
Each of these is a production system that needs to be deployed, configured, monitored, scaled, and upgraded. Each has its own failure modes, its own performance characteristics, and its own operational expertise requirements.
When critics say Pulsar is "more complex" than Kafka, this is what they mean. It is not that any single component is harder than a Kafka broker. It is that you are operating three production distributed systems instead of one (or two, back when Kafka still depended on ZooKeeper). The total operational surface area is larger.
With KRaft, Kafka has eliminated its ZooKeeper dependency. Pulsar has not — both its broker layer and its storage layer depend on ZooKeeper. There is ongoing work to reduce this dependency (using the upcoming Oxia metadata store as an alternative), but as of this writing, ZooKeeper remains a hard requirement.
Smaller Ecosystem
Pulsar's client libraries are good for Java. They are adequate for Python, Go, and C++. They are less mature for other languages. The community-maintained clients vary in quality.
The connector ecosystem (Pulsar IO) is a fraction of Kafka Connect's. The stream processing options (Pulsar Functions) are lightweight compared to Kafka Streams. The tooling ecosystem is thinner — fewer monitoring dashboards, fewer management tools, fewer blog posts explaining how to solve specific problems.
This matters in practice. When you hit an obscure issue with Kafka, someone has probably blogged about it. When you hit an obscure issue with Pulsar, you may be reading the source code.
BookKeeper Expertise
BookKeeper is a powerful storage system, but it is not widely known. The operational knowledge base is small. Tuning BookKeeper — journal device configuration, ledger device configuration, compaction settings, read/write quorum sizes — requires understanding its internals. Finding engineers with BookKeeper experience is harder than finding engineers with Kafka experience. Significantly harder.
When a Pulsar cluster misbehaves, the root cause is often in BookKeeper. Diagnosing slow writes, compaction backlogs, or ledger metadata issues requires a different skillset than diagnosing Kafka partition issues.
Community Size and Momentum
Pulsar's community is active but smaller than Kafka's. Fewer contributors means slower bug fixes for non-critical issues, fewer third-party integrations, and a smaller pool of operational knowledge. The project has had some governance turbulence — disputes between corporate contributors, concerns about the PMC composition — that have consumed energy that could have gone into development.
Pulsar is not at risk of abandonment. It is a viable, actively developed project. But the community tailwinds that Kafka enjoys are not there to the same degree.
Learning Curve
The concept count is high. Tenants, namespaces, topics (persistent and non-persistent), partitioned topics, subscriptions (four types), cursors, ledgers, segments, bookies, brokers, functions, IO connectors. A new engineer needs to understand more concepts before they can operate Pulsar confidently than they would for Kafka or RabbitMQ.
Ideal Use Cases
- Multi-tenant messaging platforms: The use case Pulsar was designed for. Shared infrastructure for multiple teams with isolation and per-tenant policies.
- Global, geo-replicated messaging: When you need active-active messaging across regions with built-in replication.
- High-retention event streaming: When you need months or years of retention without paying SSD prices for cold data.
- Mixed workloads: When the same platform needs to handle both event streaming (Kafka-style) and traditional queuing (RabbitMQ-style).
- Large-scale deployments: Where the ability to scale brokers and storage independently provides meaningful operational advantages.
Operational Reality
Minimum Viable Cluster
A production Pulsar deployment requires, at minimum:
- 3 ZooKeeper nodes (for metadata)
- 3 BookKeeper bookies (for data storage, with write quorum 3, ack quorum 2)
- 2+ Pulsar brokers (for serving, stateless so you want at least 2 for availability)
That is 8 nodes minimum. Compare this to Kafka's 3 nodes (with KRaft). The infrastructure cost of entry is higher.
For development and testing, you can run everything on a single machine with Pulsar's standalone mode, which bundles a broker, bookie, and ZooKeeper in one process. Do not confuse standalone mode with production readiness.
Key Monitoring Metrics
Broker metrics:
- Throughput (messages in/out per topic, namespace, tenant)
- Publish latency (p50, p99)
- Subscription backlog (messages and bytes)
- Topic count and memory usage
- Connection count
BookKeeper metrics:
- Write latency (journal and ledger)
- Read latency
- Disk usage and I/O per bookie
- Compaction progress (garbage collection of deleted data)
- Ledger count and open ledger count
ZooKeeper metrics:
- Request latency
- Outstanding requests
- Connection count
- Leader election events
You need monitoring for all three layers. A Pulsar cluster is only as healthy as its least healthy component. BookKeeper degradation is often the first sign of trouble and the hardest to diagnose.
Prometheus metrics are available for all components, and there are community Grafana dashboards. StreamNative offers enhanced monitoring through their commercial tools.
Upgrade Strategy
Upgrading Pulsar involves upgrading three systems, in order:
- ZooKeeper (if required by the new version)
- BookKeeper bookies (rolling upgrade, one at a time, waiting for replication to catch up)
- Pulsar brokers (rolling upgrade, simpler because they are quasi-stateless)
The ordering matters — brokers depend on bookies, and both depend on ZooKeeper. You cannot upgrade brokers first.
Protocol compatibility is generally maintained within minor versions, but major version upgrades require careful testing. The documentation for upgrade paths is adequate but not as exhaustive as Kafka's.
The ZooKeeper + BookKeeper Dependency Chain
This deserves its own section because it is the source of most Pulsar operational pain.
ZooKeeper is a mature system, but it is sensitive to disk latency (it writes to a transaction log synchronously) and has a well-known throughput limit on the number of writes per second. In a large Pulsar deployment, the metadata operations from both brokers and bookies can push ZooKeeper to its limits. Symptoms include increasing request latency, session timeouts, and — in the worst case — ZooKeeper ensemble instability that cascades into both broker and bookie failures.
BookKeeper's dependency on ZooKeeper for ledger metadata adds another failure coupling. If ZooKeeper becomes slow, BookKeeper cannot create new ledgers, which means Pulsar cannot roll segments, which means writes eventually stall.
Mitigations:
- Dedicated ZooKeeper nodes (do not co-locate with brokers or bookies in production)
- Fast SSDs for ZooKeeper transaction logs
- Separate ZooKeeper ensembles for Pulsar and BookKeeper (adds more nodes but isolates failure domains)
- Monitor ZooKeeper latency obsessively
Managed Offerings
- StreamNative Cloud: The primary managed Pulsar offering, from the team that built it. Fully managed, multi-cloud (AWS, GCP, Azure). Includes management UI, monitoring, and support. This is the easiest way to run Pulsar if you want the capabilities without the operational burden.
- DataStax Astra Streaming: Pulsar-based managed service, part of the DataStax Astra platform. Positioned alongside Astra DB (Cassandra). The Pulsar integration is solid, and the pricing model is consumption-based.
The managed offerings are the honest recommendation for most teams considering Pulsar. The operational complexity of self-managed Pulsar is high enough that the managed service premium is often worth it — unless you have a dedicated platform team with distributed systems expertise and the specific need for self-hosted infrastructure.
Code Examples
Java Producer
import org.apache.pulsar.client.api.*;

public class OrderEventProducer {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/order-events")
                .producerName("order-service")
                .sendTimeout(10, java.util.concurrent.TimeUnit.SECONDS)
                .blockIfQueueFull(true)
                .create();

        try {
            MessageId messageId = producer.newMessage()
                    .key("order-7829")
                    .value("{\"type\":\"OrderPlaced\",\"orderId\":\"order-7829\"}")
                    .property("region", "us-east")
                    .send();
            System.out.println("Published message: " + messageId);
        } finally {
            producer.close();
            client.close();
        }
    }
}
Java Consumer (Key_Shared Subscription)
import org.apache.pulsar.client.api.*;

public class OrderEventConsumer {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/order-events")
                .subscriptionName("order-processing")
                .subscriptionType(SubscriptionType.Key_Shared) // Per-key ordering
                .ackTimeout(30, java.util.concurrent.TimeUnit.SECONDS)
                .subscribe();

        try {
            while (true) {
                Message<String> msg = consumer.receive();
                try {
                    System.out.printf("Received: key=%s, value=%s, messageId=%s%n",
                            msg.getKey(), msg.getValue(), msg.getMessageId());
                    // Process the message...
                    // Acknowledge after successful processing
                    consumer.acknowledge(msg);
                } catch (Exception e) {
                    // Negative acknowledge — message will be redelivered
                    consumer.negativeAcknowledge(msg);
                }
            }
        } finally {
            consumer.close();
            client.close();
        }
    }
}
Python Producer
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

producer = client.create_producer(
    topic='persistent://public/default/order-events',
    producer_name='order-service-python',
    block_if_queue_full=True,
)

message_id = producer.send(
    content='{"type": "OrderPlaced", "orderId": "order-7829"}'.encode('utf-8'),
    partition_key='order-7829',
    properties={'region': 'us-east'},
)
print(f'Published message: {message_id}')

producer.close()
client.close()
Python Consumer
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

consumer = client.subscribe(
    topic='persistent://public/default/order-events',
    subscription_name='order-processing',
    consumer_type=pulsar.ConsumerType.KeyShared,
)

try:
    while True:
        msg = consumer.receive()
        try:
            data = msg.data().decode('utf-8')
            print(f'Received: key={msg.partition_key()}, value={data}')
            # Process the message...
            consumer.acknowledge(msg)
        except Exception as e:
            print(f'Processing failed: {e}')
            consumer.negative_acknowledge(msg)
finally:
    consumer.close()
    client.close()
Verdict
Pulsar is the right choice for a specific set of problems, and for those problems, it is arguably the best option available. Multi-tenant messaging platforms, geo-replicated deployments, and high-retention workloads with tiered storage are areas where Pulsar's architecture provides genuine advantages over the alternatives.
The cost is complexity. You are operating three distributed systems. You need expertise that is harder to hire for. The ecosystem is thinner. The community is smaller. These are not theoretical concerns — they are the daily reality of running Pulsar in production.
Pick Pulsar when:
- Multi-tenancy is a hard requirement, not a nice-to-have
- You need built-in geo-replication across regions
- You need cost-effective long-term retention via tiered storage
- You want both queuing and streaming patterns from a single platform
- You have a platform team with distributed systems expertise (or you are using a managed service)
- You need to scale serving and storage independently
Avoid Pulsar when:
- Your deployment is single-tenant (Kafka is simpler and more mature for this)
- You do not need geo-replication or tiered storage (you are paying complexity tax for features you are not using)
- Your team is small and cannot absorb the operational overhead of three distributed systems
- You need a large connector ecosystem (Kafka Connect wins)
- You need extensive stream processing (Kafka Streams and Flink integration are more mature in the Kafka ecosystem)
Pulsar is not Kafka's replacement. It is Kafka's alternative for workloads that need what Kafka does not offer natively. If you need multi-tenancy or geo-replication badly enough, Pulsar earns its complexity. If you do not, you are choosing the harder path for no reason — which is not engineering, it is masochism.
Amazon SNS/SQS and EventBridge
Every cloud provider eventually builds a message broker. Amazon built three of them, gave them names that sound like government agencies, and then told you to use all of them together. Welcome to the AWS-native path to event-driven architecture, where the infrastructure is invisible, the scaling is automatic, and the bill is... educational.
This chapter covers Simple Notification Service (SNS), Simple Queue Service (SQS), and EventBridge — the trio that forms the backbone of serverless event-driven design on AWS. Each solves a different problem. Together, they form something surprisingly coherent, provided you can navigate the configuration surface area. We will also touch on EventBridge Scheduler, because AWS apparently believes no service is complete until it can trigger a Lambda on a cron expression.
Overview
SNS: The Town Crier
Amazon Simple Notification Service launched in 2010, making it one of the oldest managed pub/sub services still in active use. Its job is straightforward: you publish a message to a topic, and SNS delivers it to every subscriber. Fan-out. That is the pitch, and it delivers on it reliably.
SNS supports multiple subscriber types — SQS queues, Lambda functions, HTTP/S endpoints, email, SMS, and mobile push notifications. This makes it the Swiss Army knife of notification delivery, though like most Swiss Army knives, some of the tools are more useful than others. (The email subscriber, for instance, is fine for ops alerts and catastrophically wrong for anything customer-facing.)
SQS: The Patient Queue
SQS is older than SNS — it launched in 2004, making it one of AWS's first services, period. It is a fully managed message queue. You put messages in, you take messages out, and in between SQS handles durability, scaling, and the existential dread of distributed systems. It was one of the services described in the famous Werner Vogels "everything fails all the time" era, and its design reflects that philosophy: simple, durable, relentlessly boring.
EventBridge: The Smart Router
EventBridge arrived in 2019, born from the ashes of CloudWatch Events (which still exists underneath, a fact that will confuse you exactly once during a debugging session). EventBridge is a serverless event bus with content-based routing. Where SNS says "here is a message, deliver it to everyone who subscribed to this topic," EventBridge says "here is an event, deliver it to everyone whose rule matches its content." This is a meaningful distinction. EventBridge is the opinionated one.
EventBridge also integrates with over 90 AWS services as event sources and over 20 as targets. It has a schema registry. It has archive and replay. It has Pipes for point-to-point integrations. AWS has made it clear that EventBridge is the future of event-driven architecture on their platform, and they are investing accordingly.
Architecture
SNS Internals
SNS is a push-based system. When you publish a message to a topic, SNS fans it out to all subscribers in parallel. There is no consumer polling, no long-lived connections. SNS pushes and moves on.
Topic types:
- Standard topics: Best-effort ordering, at-least-once delivery, nearly unlimited throughput (the published limit is 100,000 messages per second per topic for API calls, though the actual delivery fan-out is higher). Messages may be delivered out of order and may be delivered more than once.
- FIFO topics: Strict ordering within a message group, exactly-once delivery (within the deduplication window), but throughput is capped at 300 publishes per second (or 10 MB/s), and subscribers are limited to SQS FIFO queues. The trade-off is exactly what you would expect: you get ordering guarantees in exchange for throughput and flexibility.
Message filtering is one of SNS's underappreciated features. Instead of creating a topic per event type (a pattern that scales poorly), you can publish all events to a single topic and attach filter policies to subscriptions. Filters can match on message attributes using exact values, prefix matching, numeric ranges, and even exists/not-exists checks. This moves routing logic from your application code into infrastructure configuration, which is either a win for separation of concerns or a debugging nightmare, depending on how well you document your filter policies.
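Concretely, a filter policy exercising those operators might look like this. A sketch: the attribute names and values here are illustrative, not drawn from any example elsewhere in this chapter.

```python
import json

# Hypothetical SNS filter policy showing the matching operators described
# above. Each key targets a message attribute; values in a list are OR-ed.
filter_policy = {
    "event_type": ["OrderPlaced", "OrderCancelled"],         # exact match, OR-ed
    "region": [{"prefix": "eu-"}],                           # prefix match
    "total_amount": [{"numeric": [">=", 100, "<", 10000]}],  # numeric range
    "promo_code": [{"exists": False}],                       # attribute must be absent
}

# The JSON string is what goes into the subscription's FilterPolicy attribute.
print(json.dumps(filter_policy, indent=2))
```

A subscriber with this policy receives only high-value EU order events that carry no promo code; everything else published to the topic is filtered out before delivery.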
Delivery policies control retry behaviour for HTTP/S subscribers. You can configure the number of retries, backoff functions (linear, geometric, exponential), and the time between retries. For SQS and Lambda subscribers, delivery is effectively guaranteed by the underlying integration — SNS will retry until SQS accepts the message.
SQS Internals
SQS is a pull-based system. Messages sit in a queue until a consumer fetches them. This is the fundamental architectural difference from SNS: SQS is about buffering, not broadcasting.
Queue types:
- Standard queues: Nearly unlimited throughput, at-least-once delivery, best-effort ordering. Messages are stored redundantly across multiple availability zones. The "best-effort ordering" part means that messages usually come out in roughly the order they went in, but you must not depend on this. If you are depending on this, you will discover your mistake at 2 AM on a Saturday when a partition rebalance shuffles your messages.
- FIFO queues: Exactly-once processing (via deduplication), strict ordering within message groups, but throughput is limited to 300 messages per second without batching (3,000 with batching and high-throughput mode). Message groups are the key concept — ordering is guaranteed within a group, not across groups, which lets you partition your ordering requirements and get some parallelism back.
Visibility timeout is SQS's answer to the question "what happens if a consumer takes a message and then dies?" When a consumer receives a message, SQS hides it from other consumers for a configurable period (default 30 seconds, maximum 12 hours). If the consumer does not delete the message within that window, it becomes visible again for another consumer to pick up. Get this value wrong and you get either duplicate processing (too short) or long delays after failures (too long). Most teams get it wrong at least once.
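The hide-and-redeliver behaviour can be sketched with a toy model. This is an illustration of the semantics, not the SQS API; the class and method names are made up.

```python
class ToyQueue:
    """Toy model of SQS visibility-timeout semantics (illustration only)."""

    def __init__(self, visibility_timeout: int = 30):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message_id -> (body, invisible_until)

    def send(self, message_id: str, body: str):
        self.messages[message_id] = (body, 0)  # visible immediately

    def receive(self, now: int):
        for mid, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                # Hide the message from other consumers for the timeout window
                self.messages[mid] = (body, now + self.visibility_timeout)
                return mid, body
        return None  # nothing visible right now

    def delete(self, message_id: str):
        self.messages.pop(message_id, None)


q = ToyQueue(visibility_timeout=30)
q.send("m1", "payload")
print(q.receive(now=0))   # ('m1', 'payload'), hidden until t=30
print(q.receive(now=10))  # None, still invisible to other consumers
print(q.receive(now=35))  # ('m1', 'payload'), not deleted in time, redelivered
```

The third receive is the failure path in miniature: the first consumer never called delete, so the message reappeared. Set the timeout shorter than your processing time and you get this redelivery even when the first consumer is still working.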
Long polling is the answer to the question "why is my SQS consumer burning money on empty ReceiveMessage calls?" Instead of returning immediately when no messages are available, long polling waits up to 20 seconds for a message to arrive. This reduces empty responses and costs. There is no good reason not to use it, and yet I have reviewed production systems where it was not enabled.
Dead letter queues (DLQ) catch messages that have been received but not successfully processed after a configurable number of attempts (the maxReceiveCount). When a message exceeds the receive count, SQS moves it to the DLQ. This is your safety net — without it, poison messages will cycle through your consumer forever, consuming resources and generating alerts. Every production SQS queue should have a DLQ. This is not a suggestion.
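Attaching a DLQ is a queue attribute rather than consumer code. A minimal sketch of building that attribute follows; the ARN is a placeholder, and the resulting dict is what you would pass as Attributes to boto3's sqs.set_queue_attributes.

```python
import json

def redrive_policy_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    # Attributes dict for sqs.set_queue_attributes. The SQS API takes the
    # policy as a JSON string; AWS examples show maxReceiveCount as a
    # string inside that JSON, so that is the form used here.
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }

attrs = redrive_policy_attributes(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)
print(attrs["RedrivePolicy"])
```

With this in place, a message received five times without being deleted moves to orders-dlq instead of cycling forever.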
Message deduplication in FIFO queues uses either a content-based hash (SHA-256 of the message body) or an explicit deduplication ID. The deduplication window is five minutes. If you send the same message twice within five minutes, the second one is silently dropped. After five minutes, all bets are off. This means FIFO queues provide exactly-once delivery within a window, not for all time. Plan accordingly.
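The window semantics can be sketched with a toy model, using the same SHA-256-of-body derivation described above. This is an illustration, not the SQS implementation.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 300  # SQS FIFO deduplication window: 5 minutes

def content_dedup_id(body: str) -> str:
    """SHA-256 of the message body, as in content-based deduplication."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

class DedupSimulator:
    """Toy model of the FIFO dedup window (illustration only)."""

    def __init__(self):
        self.seen = {}  # dedup_id -> timestamp when last accepted

    def accept(self, body: str, now: float) -> bool:
        dedup_id = content_dedup_id(body)
        first = self.seen.get(dedup_id)
        if first is not None and now - first < DEDUP_WINDOW_SECONDS:
            return False  # duplicate within the window: silently dropped
        self.seen[dedup_id] = now
        return True

sim = DedupSimulator()
print(sim.accept('{"orderId": "ord-7829"}', now=0))    # True
print(sim.accept('{"orderId": "ord-7829"}', now=60))   # False, within window
print(sim.accept('{"orderId": "ord-7829"}', now=400))  # True, window expired
```

The third call is the trap the text warns about: the identical message, sent again after the window closes, is accepted as new. If you need deduplication beyond five minutes, you need your own idempotency layer.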
The SNS+SQS Fan-Out Pattern
This is the bread and butter of AWS event-driven architecture. You publish to an SNS topic, and the topic delivers to multiple SQS queues. Each queue feeds a different consumer. This gives you:
- Fan-out: One event reaches multiple consumers.
- Buffering: Each consumer processes at its own pace.
- Failure isolation: If one consumer fails, the others are unaffected.
- Replay from DLQ: Failed messages land in per-consumer DLQs.
Producer → SNS Topic → SQS Queue A → Consumer A
                     → SQS Queue B → Consumer B
                     → SQS Queue C → Consumer C
This pattern is so common that AWS has special integrations for it. SNS can deliver directly to SQS with no intermediate HTTP call, and the IAM policies basically write themselves (or more accurately, the CloudFormation templates do).
The limitation is that SNS+SQS gives you topic-based routing. If you need content-based routing — "send this event to Consumer A only if the region field is eu-west-1" — you either use SNS message filtering or you reach for EventBridge.
EventBridge Architecture
EventBridge is structured around three core concepts:
Event buses are the pipelines. Every AWS account has a default event bus that receives events from AWS services (EC2 state changes, S3 notifications, etc.). You can create custom event buses for your application events. Events are JSON objects with a specific envelope structure:
{
  "version": "0",
  "id": "12345678-1234-1234-1234-123456789012",
  "detail-type": "OrderPlaced",
  "source": "com.mycompany.orders",
  "account": "123456789012",
  "time": "2025-11-14T09:32:17Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "orderId": "ord-7829",
    "totalAmount": 149.99
  }
}
Rules are the routing logic. Each rule has an event pattern (a JSON filter) and one or more targets. Event patterns support exact matching, prefix matching, numeric ranges, exists/not-exists, and boolean operators. Rules can match on any field in the event envelope or the detail payload. You can have up to 300 rules per event bus (a soft limit that is raiseable but indicates roughly where AWS expects you to hit design problems).
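For the envelope shown above, a pattern exercising exact and numeric matching might look like this. A sketch: the dict would be serialised to JSON and passed as the EventPattern parameter when creating a rule.

```python
import json

# Hypothetical rule pattern: match OrderPlaced events from the orders
# service where the order total exceeds 100. Any field not mentioned in
# the pattern is ignored; all fields that are mentioned must match.
event_pattern = {
    "source": ["com.mycompany.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {
        "totalAmount": [{"numeric": [">", 100]}]
    }
}
print(json.dumps(event_pattern))
```

The list syntax is an OR within a field, and separate fields are AND-ed together, which is how a single flat JSON document expresses reasonably rich routing logic.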
Schema registry is EventBridge's attempt to bring some order to the chaos. It can automatically discover schemas from events flowing through your bus, and you can also register schemas manually. This integrates with code generation tools so your consumers can work with typed objects instead of raw JSON. In practice, the auto-discovery is useful for exploration and the manual registry is useful for contracts.
Archive and replay lets you store events and replay them later. You can archive all events or filter by pattern. Replay sends the archived events back through the bus, where they match rules just like fresh events. This is genuinely useful for testing, disaster recovery, and populating new consumers with historical data. The limitation is that replay replays through the same rules, so if your rules have changed since the events were archived, you may get different routing behaviour. Also, replay is not instantaneous — large archives take time to replay, and the throughput is not documented.
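As a sketch of what kicking off a replay involves: the ARNs below are placeholders, and the dict maps onto the keyword arguments of boto3's events.start_replay as I understand that API (the call itself is left to you).

```python
from datetime import datetime, timezone

def start_replay_args(archive_arn: str, bus_arn: str,
                      start: datetime, end: datetime) -> dict:
    # Keyword arguments for events.start_replay. Replay is all-or-nothing
    # within [start, end]; there is no per-event or per-offset access.
    return {
        "ReplayName": f"replay-{start:%Y%m%d-%H%M%S}",
        "EventSourceArn": archive_arn,
        "EventStartTime": start,
        "EventEndTime": end,
        "Destination": {"Arn": bus_arn},
    }

args = start_replay_args(
    "arn:aws:events:us-east-1:123456789012:archive/order-events-archive",
    "arn:aws:events:us-east-1:123456789012:event-bus/orders",
    datetime(2025, 11, 1, tzinfo=timezone.utc),
    datetime(2025, 11, 14, tzinfo=timezone.utc),
)
print(args["ReplayName"])  # replay-20251101-000000
```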
EventBridge Pipes are point-to-point integrations with optional filtering, enrichment, and transformation. A pipe connects a source (SQS, Kinesis, DynamoDB Streams, Kafka, etc.) to a target through an optional filter, enrichment step (Lambda, Step Functions, API Gateway, API destination), and transformation. Pipes are the answer to "I just need to get data from A to B with a bit of transformation" — a use case that previously required writing a Lambda function, which always felt like overkill.
EventBridge Scheduler is a managed scheduler that can invoke targets on a schedule (cron or rate) or at a specific time. The "specific time" part is the interesting bit — you can schedule a one-time invocation at a future time, which is useful for delayed processing, reminders, and time-based workflows. It supports millions of schedules per account, which puts it in a different league from CloudWatch Events rules with cron expressions (which are limited to 300 per account).
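A one-time schedule is a single API call. Here is a sketch of the argument shape for boto3's scheduler.create_schedule; the schedule name, ARNs, and payload are all hypothetical.

```python
import json

def one_time_schedule_args(name: str, when_iso: str,
                           target_arn: str, role_arn: str,
                           payload: dict) -> dict:
    # Keyword arguments for scheduler.create_schedule. The at(...)
    # expression fires exactly once at the given time.
    return {
        "Name": name,
        "ScheduleExpression": f"at({when_iso})",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire at the exact time
        "Target": {
            "Arn": target_arn,
            "RoleArn": role_arn,
            "Input": json.dumps(payload),
        },
    }

args = one_time_schedule_args(
    "order-7829-reminder", "2025-12-01T09:00:00",
    "arn:aws:lambda:us-east-1:123456789012:function:send-reminder",
    "arn:aws:iam::123456789012:role/scheduler-invoke",
    {"orderId": "ord-7829"},
)
print(args["ScheduleExpression"])  # at(2025-12-01T09:00:00)
```

Note the execution role: Scheduler assumes the role to invoke the target, which is one more IAM policy to write but also what makes cross-service invocation work without embedded credentials.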
Strengths
Zero operational overhead. There are no servers to manage, no clusters to resize, no brokers to patch. SNS, SQS, and EventBridge are fully managed. You do not SSH into anything. You do not think about disk space. You do not get paged because a Kafka broker ran out of file descriptors. For teams that want to focus on application logic rather than infrastructure, this is genuinely liberating.
Deep AWS integration. Over 90 AWS services emit events to EventBridge natively. SQS integrates as a Lambda event source with automatic scaling of consumer concurrency. Step Functions can wait for events. API Gateway can publish to SNS. The integration surface area is enormous, and it mostly works without custom glue code.
Pay-per-use pricing. You pay for what you use, not for what you provision. An SQS queue that processes ten messages a day costs fractions of a cent. This is transformative for development and staging environments, where a Kafka cluster sits idle burning money.
Managed scaling. SQS scales to millions of messages per second with no configuration changes. SNS standard topics have no practical throughput ceiling. EventBridge scales to handle event bursts. You do not pre-provision capacity or monitor partition hotspots.
Durability. SQS stores messages redundantly across multiple availability zones. Message loss is, for practical purposes, not a concern. The documented durability is not published as a number of nines (unlike S3), but the engineering behind it is serious.
Security model. IAM policies, VPC endpoints, encryption at rest (KMS), encryption in transit (TLS). The security tooling integrates with the rest of AWS security. Resource policies on SNS topics and SQS queues enable cross-account access without sharing credentials.
Weaknesses
Vendor lock-in. Let us not dance around this. If you build on SNS, SQS, and EventBridge, you are locked into AWS. The APIs are proprietary. The event format (especially EventBridge's envelope) is AWS-specific. Migrating to another cloud or to a self-hosted solution means rewriting your event infrastructure. Some teams accept this trade-off willingly. Some discover it was a trade-off after it is too late to change.
Throughput limits on FIFO resources. SQS FIFO queues max out at 300 TPS without batching (3,000 with batching). SNS FIFO topics max at 300 publishes per second. If you need ordering guarantees and high throughput, you will hit these limits faster than you expect. The workaround — sharding across multiple message groups — adds complexity that undermines the "simple" part of SQS.
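The sharding workaround mentioned above can be sketched as follows: derive the MessageGroupId from a hash of the ordering key. Names and the group count are illustrative; the trade is strict global ordering for per-key ordering plus parallelism.

```python
import hashlib

NUM_GROUPS = 16  # parallelism ceiling: at most 16 concurrent message groups

def message_group_for(order_id: str) -> str:
    # Hash the ordering key to a stable shard. Events for the same order
    # always land in the same group (so they stay ordered); different
    # orders spread across up to NUM_GROUPS groups (so they parallelise).
    digest = hashlib.sha256(order_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % NUM_GROUPS
    return f"orders-{shard}"

print(message_group_for("order-7829"))
```

The value returned here is what you would pass as MessageGroupId in send_message. The complexity the text warns about shows up when you later need to change NUM_GROUPS: in-flight messages keep their old groups, so ordering across the resize is not guaranteed.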
Cross-account and cross-region complexity. Sending events across AWS accounts requires resource policies, IAM roles, and often EventBridge cross-account event bus configurations. It works, but the IAM policy debugging experience is not fun. Cross-region event routing with EventBridge requires setting up rules that forward events to buses in other regions, and each hop adds latency and cost.
Latency characteristics. SQS long polling adds up to 20 seconds of latency by design (you can set it lower, but then you pay for more API calls). SNS push delivery is fast but not sub-millisecond. EventBridge rule evaluation adds processing time. If you need single-digit millisecond end-to-end latency, AWS managed services are not your answer. You are looking at Kafka, NATS, or Redis Streams.
Limited replay. EventBridge archive and replay exists, but it is not a first-class event log. You cannot randomly access events by offset. Replay is all-or-nothing within a time range. SQS has no replay at all — once a message is deleted, it is gone. If you need event sourcing with full replay capability, you are in Kafka territory, and no amount of AWS marketing can change that.
Message size limits. SQS messages max out at 256 KB. SNS messages max at 256 KB. EventBridge events max at 256 KB. If your events are larger, you need the "claim check" pattern (store the payload in S3, put the reference in the message). This is a well-known pattern but it adds complexity, latency, and failure modes.
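The claim-check pattern reduces to two small functions. A sketch, with a plain dict standing in for the S3 bucket; the key scheme and helper names are made up for illustration.

```python
import json
import uuid

MAX_MESSAGE_BYTES = 256 * 1024  # the SQS/SNS/EventBridge payload ceiling

fake_s3 = {}  # stands in for an S3 bucket in this sketch

def to_message(payload: dict) -> str:
    # Small payloads travel in the message itself; large ones are parked
    # in object storage and replaced with a reference (the "claim check").
    body = json.dumps(payload)
    if len(body.encode("utf-8")) <= MAX_MESSAGE_BYTES:
        return body
    key = f"payloads/{uuid.uuid4()}"
    fake_s3[key] = body
    return json.dumps({"claimCheck": key})

def from_message(body: str) -> dict:
    # Consumers resolve the reference transparently.
    msg = json.loads(body)
    if isinstance(msg, dict) and "claimCheck" in msg:
        return json.loads(fake_s3[msg["claimCheck"]])
    return msg

print(from_message(to_message({"orderId": "ord-7829"})))  # {'orderId': 'ord-7829'}
```

The added failure modes the text mentions live in the gap between the two calls: the object can expire or be deleted before the consumer reads it, and the object store is now on the critical path for message delivery.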
Observability gaps. CloudWatch metrics for SQS give you queue depth, age of oldest message, and approximate message counts. They do not give you consumer lag per consumer, message throughput broken down by message type, or end-to-end latency distributions. You can build these, but you are building them from scratch with custom metrics. After using Kafka's consumer group lag monitoring, SQS feels like flying blind.
Ideal Use Cases
- Serverless applications where Lambda is the primary compute. The SQS-Lambda integration is excellent, and the pay-per-use model matches Lambda's pricing.
- Fan-out patterns where one event needs to reach multiple independent consumers. SNS+SQS is purpose-built for this.
- AWS-to-AWS integration where you need to react to infrastructure events (EC2 state changes, S3 uploads, CodePipeline completions).
- Low-to-medium throughput workloads where the simplicity of managed services outweighs the throughput limitations of FIFO ordering.
- Event routing with complex rules where EventBridge's content-based filtering saves you from maintaining routing logic in application code.
- Teams that want to ship features, not operate infrastructure. This is the real value proposition, and it is genuine.
Operational Reality
Operating SNS, SQS, and EventBridge is, by design, uneventful. There are no clusters to manage, no rebalancing events to monitor, no broker failovers to orchestrate. Your operational concerns shift from "is the broker healthy?" to "is my consumer keeping up?"
Monitoring centres on CloudWatch metrics. For SQS: ApproximateNumberOfMessagesVisible (queue depth), ApproximateAgeOfOldestMessage (consumer lag proxy), and NumberOfMessagesSent/NumberOfMessagesReceived. For SNS: NumberOfMessagesPublished and NumberOfNotificationsFailed. For EventBridge: Invocations, FailedInvocations, and MatchedEvents. Set alarms on queue depth and message age. These are your canaries.
DLQ management is the operational task that most teams underestimate. Messages land in DLQs for a reason — usually bugs in consumer code, unexpected message formats, or transient downstream failures that outlasted the retry count. You need a process for inspecting DLQ messages, understanding why they failed, fixing the underlying issue, and replaying them. SQS DLQ redrive (which moves messages from the DLQ back to the source queue) was added in 2021 and works well, but "works well" assumes you have fixed whatever caused the failures in the first place.
Cost management is straightforward at low scale and surprising at high scale. SQS charges per API call ($0.40 per million requests for standard, $0.50 for FIFO). Each ReceiveMessage call is a request, whether or not it returns a message (long polling helps here). Each SendMessage is a request. Each DeleteMessage is a request. A consumer processing 1,000 messages per second is making roughly 2,000 API calls per second (receive + delete), which is about 173 million requests per day — roughly $69 per day, or about $2,070 per month per queue. Scale that to ten queues and you are spending real money. EventBridge charges $1.00 per million events. SNS charges $0.50 per million publishes for standard topics.
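That arithmetic generalises into a back-of-the-envelope helper. A sketch using the standard-queue price quoted above; it ignores batching, empty receives, and the free tier, so treat it as an upper bound.

```python
def monthly_sqs_cost(msgs_per_second: float,
                     price_per_million: float = 0.40,
                     calls_per_message: int = 2) -> float:
    # Two API calls per message (receive + delete), 30-day month.
    # Batching would divide this by up to 10; empty receives add to it.
    calls_per_month = msgs_per_second * calls_per_message * 86_400 * 30
    return calls_per_month / 1_000_000 * price_per_million

print(round(monthly_sqs_cost(1000)))  # 2074
```

Run it for your actual throughput before committing. The cliff is real: the same code that costs pennies in staging costs thousands per month at sustained production volume.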
Infrastructure as Code is non-negotiable. CloudFormation, CDK, Terraform, or Pulumi — pick one and use it. The number of resources involved in a production SNS+SQS setup (topics, queues, DLQs, subscriptions, filter policies, IAM policies, CloudWatch alarms, KMS keys) grows quickly, and managing them through the console is a recipe for configuration drift and 3 AM incidents.
Code Examples
SNS + SQS: Publishing and Consuming (Python/boto3)
import json
import boto3

sns = boto3.client('sns', region_name='us-east-1')
sqs = boto3.client('sqs', region_name='us-east-1')

# --- Publishing to SNS ---
def publish_order_event(order_id: str, total: float, region: str):
    """Publish an order event to SNS with message attributes for filtering."""
    response = sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:order-events',
        Message=json.dumps({
            'orderId': order_id,
            'totalAmount': total,
            'region': region,
            'timestamp': '2025-11-14T09:32:17Z'
        }),
        MessageAttributes={
            'event_type': {
                'DataType': 'String',
                'StringValue': 'OrderPlaced'
            },
            'region': {
                'DataType': 'String',
                'StringValue': region
            }
        }
    )
    print(f"Published message: {response['MessageId']}")
    return response

# --- SNS Subscription with filter policy ---
# This would typically be in your IaC, but for illustration:
def create_filtered_subscription(topic_arn: str, queue_arn: str):
    """Subscribe an SQS queue to SNS with a filter for EU orders only."""
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={
            'FilterPolicy': json.dumps({
                'region': ['eu-west-1', 'eu-central-1']
            }),
            'FilterPolicyScope': 'MessageAttributes'
        }
    )

# --- Consuming from SQS ---
def consume_orders(queue_url: str, max_messages: int = 10):
    """
    Long-poll SQS for messages. In production, this runs in a loop
    (or better, use SQS as a Lambda event source).
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=20,  # Long polling - always use this
        MessageAttributeNames=['All'],
        AttributeNames=['All']
    )
    messages = response.get('Messages', [])
    for message in messages:
        try:
            # SNS wraps the original message in an envelope
            sns_envelope = json.loads(message['Body'])
            order_event = json.loads(sns_envelope['Message'])
            print(f"Processing order: {order_event['orderId']}")
            process_order(order_event)
            # Delete the message only after successful processing
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle']
            )
        except Exception as e:
            # Don't delete the message — it will return to the queue
            # after the visibility timeout expires, and eventually
            # land in the DLQ after maxReceiveCount attempts.
            print(f"Failed to process message: {e}")
    return len(messages)

def process_order(event: dict):
    """Your business logic here."""
    pass

# --- SQS FIFO: Sending with ordering ---
def send_fifo_message(queue_url: str, order_id: str, event: dict):
    """
    Send to a FIFO queue with message group ID for ordering.
    All events for the same order are processed in order.
    """
    sqs.send_message(
        QueueUrl=queue_url,  # Must end in .fifo
        MessageBody=json.dumps(event),
        MessageGroupId=order_id,  # Ordering key
        MessageDeduplicationId=f"{order_id}-{event['type']}-{event['version']}"
    )
EventBridge: Publishing Events and Rule Patterns (Python/boto3)
import json
import boto3
from datetime import datetime
events = boto3.client('events', region_name='us-east-1')
# --- Publishing to EventBridge ---
def publish_to_eventbridge(order_id: str, total: float, region: str):
"""Put an event on a custom EventBridge bus."""
response = events.put_events(
Entries=[
{
'Source': 'com.mycompany.orders',
'DetailType': 'OrderPlaced',
'Detail': json.dumps({
'orderId': order_id,
'totalAmount': total,
'region': region,
'timestamp': datetime.utcnow().isoformat()
}),
'EventBusName': 'orders-bus'
}
]
)
# Always check FailedEntryCount — put_events does NOT throw
# on partial failures. This is a footgun.
if response['FailedEntryCount'] > 0:
for entry in response['Entries']:
if 'ErrorCode' in entry:
print(f"Failed to publish: {entry['ErrorCode']}: "
f"{entry['ErrorMessage']}")
return response
# --- Batch publishing (up to 10 events per call) ---
def publish_batch(order_events: list[dict]):
"""
Publish multiple events in a single API call.
Max 10 entries per put_events call. Max total size 256 KB.
"""
entries = [
{
'Source': 'com.mycompany.orders',
'DetailType': event['type'],
'Detail': json.dumps(event['data']),
'EventBusName': 'orders-bus'
}
for event in order_events
]
# put_events accepts max 10 entries, so chunk if needed
for i in range(0, len(entries), 10):
chunk = entries[i:i+10]
response = events.put_events(Entries=chunk)
if response['FailedEntryCount'] > 0:
handle_failures(response, chunk)
EventBridge Rule Patterns
// Match all OrderPlaced events from the orders service
{
"source": ["com.mycompany.orders"],
"detail-type": ["OrderPlaced"]
}
// Match high-value orders (total > 1000) from EU regions
{
"source": ["com.mycompany.orders"],
"detail-type": ["OrderPlaced"],
"detail": {
"totalAmount": [{ "numeric": [">", 1000] }],
"region": [{ "prefix": "eu-" }]
}
}
// Match any event EXCEPT from the test source
{
"source": [{ "anything-but": "com.mycompany.test" }]
}
// Match events that have a "priority" field (regardless of value)
{
"detail": {
"priority": [{ "exists": true }]
}
}
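To make the matching semantics concrete, here is a deliberately simplified, toy re-implementation of a subset of EventBridge's pattern language (exact values, `prefix`, single-operator `numeric`, `exists`, and single-value `anything-but`). It is illustrative only — the real service supports more operators (numeric ranges, `suffix`, `equals-ignore-case`, and so on):

```python
def matches(pattern: dict, event: dict) -> bool:
    """Toy matcher: every pattern field must match (AND);
    the alternatives inside each array are OR-ed."""
    for field, alternatives in pattern.items():
        value = event.get(field)
        if isinstance(alternatives, dict):
            # Nested pattern, e.g. the "detail" object
            if not isinstance(value, dict) or not matches(alternatives, value):
                return False
            continue
        if not any(_match_one(alt, value, field in event) for alt in alternatives):
            return False
    return True

def _match_one(alt, value, present: bool) -> bool:
    if isinstance(alt, dict):
        if "prefix" in alt:
            return isinstance(value, str) and value.startswith(alt["prefix"])
        if "numeric" in alt:
            op, threshold = alt["numeric"]
            ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b,
                   ">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b,
                   "=": lambda a, b: a == b}
            return isinstance(value, (int, float)) and ops[op](value, threshold)
        if "exists" in alt:
            return present == alt["exists"]
        if "anything-but" in alt:
            return value != alt["anything-but"]
        return False
    return value == alt  # plain exact match
```

Two properties of the real engine are preserved here and worth internalising: fields you omit from the pattern match anything, and a pattern field whose array matches nothing vetoes the whole event.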
EventBridge Scheduler: One-Time Delayed Processing
import boto3
import json
from datetime import datetime, timedelta
scheduler = boto3.client('scheduler', region_name='us-east-1')
def schedule_payment_reminder(order_id: str, reminder_time: datetime):
"""
Schedule a one-time event to fire at a specific time.
Useful for payment reminders, SLA checks, delayed notifications.
"""
scheduler.create_schedule(
Name=f"payment-reminder-{order_id}",
ScheduleExpression=f"at({reminder_time.strftime('%Y-%m-%dT%H:%M:%S')})",
FlexibleTimeWindow={'Mode': 'OFF'},
Target={
'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:payment-reminder',
'RoleArn': 'arn:aws:iam::123456789012:role/scheduler-invoke-role',
'Input': json.dumps({
'orderId': order_id,
'action': 'send_payment_reminder'
})
},
ActionAfterCompletion='DELETE' # Clean up after firing
)
Lambda Consumer (SQS Event Source)
import json
def lambda_handler(event, context):
"""
Lambda function triggered by SQS event source mapping.
AWS handles polling, batching, and scaling consumer concurrency.
This is the recommended way to consume SQS in serverless architectures.
"""
batch_item_failures = []
for record in event['Records']:
try:
# If the SQS queue is subscribed to SNS, unwrap the envelope
body = json.loads(record['body'])
if 'Message' in body and 'TopicArn' in body:
# SNS envelope
message = json.loads(body['Message'])
else:
# Direct SQS message
message = body
process_event(message)
except Exception as e:
# Report individual item failure for partial batch response
# Requires ReportBatchItemFailures in the event source mapping
batch_item_failures.append({
'itemIdentifier': record['messageId']
})
return {
'batchItemFailures': batch_item_failures
}
def process_event(event: dict):
"""Your business logic. Raise an exception to signal failure."""
print(f"Processing: {json.dumps(event)}")
Cost Analysis at Various Scales
The following estimates assume us-east-1 pricing as of 2025 and use the SNS+SQS fan-out pattern with three consumers.
| Scale | Events/month | SNS cost | SQS cost (3 queues) | EventBridge cost | Estimated total |
|---|---|---|---|---|---|
| Startup | 1M | $0.50 | ~$2.40 | $1.00 | ~$4/mo |
| Growth | 100M | $50 | ~$240 | $100 | ~$390/mo |
| Scale | 1B | $500 | ~$2,400 | $1,000 | ~$3,900/mo |
| Enterprise | 10B | $5,000 | ~$24,000 | $10,000 | ~$39,000/mo |
At startup scale, AWS managed services are essentially free. At enterprise scale, you are paying $39,000/month for what a self-hosted Kafka cluster might cost $15,000/month to run (including engineer time, which is the number people always conveniently forget when comparing). The breakeven point depends entirely on how you value your team's time spent not operating infrastructure.
The SQS cost is the largest line item because of the per-request pricing model. Each message consumed involves at minimum a ReceiveMessage and a DeleteMessage call. With three consumer queues, each event generates at least six SQS API calls plus the original SNS publish. At 10 billion events per month, that is 60+ billion SQS API calls.
Hidden costs to watch:
- CloudWatch Logs from Lambda consumers (can dwarf the messaging costs)
- KMS charges if using customer-managed keys ($1/month per key + $0.03 per 10,000 requests)
- Data transfer if sending events cross-region ($0.01–0.02 per GB)
Integration with Lambda, Step Functions, and ECS
Lambda is the natural consumer for all three services. SQS event source mappings handle polling, batching, and scaling automatically. SNS can invoke Lambda directly (though routing through SQS first gives you better error handling and DLQ support). EventBridge can target Lambda as a rule target. The Lambda service manages consumer concurrency — it will scale up invocations as queue depth increases, up to your concurrency limit.
Step Functions integrate with EventBridge through the .waitForTaskToken pattern, allowing long-running workflows to pause and resume based on events. Step Functions can also publish events to EventBridge as a workflow step. This combination is powerful for orchestrating multi-step processes that span multiple services and require human approval, external callbacks, or time-based waits.
ECS (and EKS) consumers poll SQS directly using the SDK, typically running as long-lived services with multiple threads or processes. This is the right model when your consumer requires persistent connections to downstream systems, heavy initialisation, or more control over concurrency than Lambda provides. ECS auto-scaling based on SQS queue depth (using ApproximateNumberOfMessagesVisible as a CloudWatch metric for target tracking scaling) works well, though the scaling response time is measured in minutes, not seconds.
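A common refinement of raw queue-depth scaling is "backlog per task": compute how many tasks are needed to drain the visible backlog within a target window, and feed that number to the scaling policy instead of the raw metric. A sketch of the sizing math — the function name and defaults are hypothetical, and the per-task throughput figure must come from measuring your actual consumer:

```python
import math

def desired_task_count(visible_messages: int,
                       msgs_per_task_per_minute: int,
                       target_drain_minutes: int = 5,
                       min_tasks: int = 1,
                       max_tasks: int = 50) -> int:
    """Size an ECS service from SQS backlog: enough tasks to drain
    the current visible backlog within the target window, clamped
    to the service's configured min/max."""
    if visible_messages == 0:
        return min_tasks
    capacity_per_task = msgs_per_task_per_minute * target_drain_minutes
    needed = math.ceil(visible_messages / capacity_per_task)
    return max(min_tasks, min(max_tasks, needed))
```

In practice you would publish this as a custom CloudWatch metric and target-track on it; just remember the minutes-scale reaction time mentioned above applies regardless of how clever the formula is.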
Verdict
SNS, SQS, and EventBridge are the sensible default for event-driven architecture on AWS. Not the fastest. Not the most flexible. Not the cheapest at massive scale. But the most practical for the overwhelming majority of workloads.
The zero-ops model is real and valuable. The pay-per-use pricing makes experimentation cheap. The integration with the rest of the AWS ecosystem is unmatched. And the durability and availability of these services — battle-tested over nearly two decades in SQS's case — is not something you replicate easily with self-hosted alternatives.
The trade-offs are equally real. You are locked into AWS. Your throughput ceilings for ordered messaging are low. Your replay capabilities are limited. Your end-to-end latency will be measured in tens or hundreds of milliseconds, not single digits. And at truly massive scale, the per-request pricing model makes the cost conversation interesting.
The honest recommendation: if you are building on AWS (and you have already made that decision), start with SNS+SQS for fan-out and EventBridge for routing. You will get a functioning, production-grade event-driven architecture in days rather than weeks, and your on-call engineers will thank you for not giving them a Kafka cluster to babysit. If and when you hit the limits — and you will know when you hit them — you can evaluate Kafka, Kinesis, or MSK for the specific workloads that need more muscle.
Do not over-engineer your messaging layer. The most reliable message broker is the one you do not have to operate.
Google Pub/Sub and Azure Event Hubs
Not every organisation runs on AWS, and not every organisation should. Google Cloud Platform and Microsoft Azure each built their own managed event infrastructure, and while neither commands the same market share as AWS, both have engineering depth that deserves serious examination rather than the "also-ran" treatment they sometimes get in broker comparisons.
This chapter covers two services that solve similar problems in meaningfully different ways. Google Pub/Sub is a topic-and-subscription messaging service with a focus on simplicity and horizontal scaling. Azure Event Hubs is a partitioned log service with a focus on high-throughput ingestion and Kafka wire protocol compatibility. They are not interchangeable, and understanding where each shines — and where each quietly falls apart — will save you from making an expensive mistake.
Google Cloud Pub/Sub
Overview
Google Cloud Pub/Sub launched in 2015, though its roots go back much further into Google's internal messaging infrastructure. If you have read the original MillWheel paper (2013), you have seen the ancestry. Pub/Sub was designed to be a globally distributed, fully managed messaging service that "just works" — and to a surprising degree, it does.
Pub/Sub is built on the same infrastructure that handles messaging inside Google. This is not mere marketing. The service runs on the distributed messaging backbone that Google's own products use, which means it inherits properties like global message routing, synchronous replication across zones, and throughput that scales without any capacity planning from the user. You create a topic, publish messages, and Google handles the rest. The simplicity is the product.
Architecture
Pub/Sub's architecture revolves around two primitives: topics and subscriptions.
A topic is a named resource to which publishers send messages. Unlike Kafka topics (which are partitioned logs), Pub/Sub topics are logical channels. There is no partition concept visible to the user. Google handles sharding internally, which means you never think about partition counts, key distribution, or rebalancing. This is either a feature or a limitation, depending on how much control you want.
A subscription represents a subscriber's interest in a topic. Crucially, a single topic can have multiple subscriptions, and each subscription receives an independent copy of every message. This is Pub/Sub's fan-out mechanism. Within a subscription, messages are delivered to consumer instances in a load-balanced fashion.
Pull delivery is the default model. Consumers call pull (or the more efficient StreamingPull which maintains a long-lived bidirectional gRPC connection) to receive messages. After processing, the consumer sends an acknowledge (ack). Unacknowledged messages are redelivered after the ackDeadline (configurable from 10 seconds to 600 seconds). This is conceptually similar to SQS's visibility timeout.
Push delivery flips the model. Pub/Sub sends messages to a configured HTTPS endpoint. The endpoint returns a 2xx status code to acknowledge receipt. Push subscriptions are useful when your consumer is a Cloud Function, Cloud Run service, or any HTTP endpoint that can handle webhooks. The push subscriber does not need to maintain a connection or poll.
Ordering keys provide within-key ordering guarantees. When you publish messages with the same ordering key, Pub/Sub guarantees that a single subscriber receives them in order. Without ordering keys, message ordering is best-effort. This is the closest equivalent to Kafka's partition-key ordering, but the implementation is different: ordering is per-subscription and per-key, and enabling ordering on a subscription caps throughput for a given key at about 1 MB/s. Publish with ordering keys only when you need ordering. Do not use them "just in case."
Exactly-once delivery was added in 2022 and applies within a subscription. When enabled, Pub/Sub guarantees that an acknowledged message will not be delivered again, within the bounds of the ack deadline. This is implemented through server-side deduplication of acks, not deduplication of publishes. The distinction matters: the publisher can still retry a publish and create duplicates; it is the delivery to the subscriber that is deduplicated. To get true end-to-end exactly-once, your publisher needs its own deduplication (or you use idempotent processing, which you should be doing anyway).
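Idempotent processing is worth sketching, since it is the backstop whether or not exactly-once delivery is enabled. A minimal in-memory version — `IdempotentProcessor` is a hypothetical name, and a production implementation would back the seen-set with Redis or a database, with a TTL at least as long as the maximum redelivery window:

```python
class IdempotentProcessor:
    """Wrap a handler so that events with already-seen IDs are skipped."""

    def __init__(self, handler):
        self.handler = handler
        self.seen: set[str] = set()  # production: external store with TTL

    def process(self, event_id: str, payload: dict) -> bool:
        """Run the handler once per event ID.
        Returns True if the handler ran, False if deduplicated."""
        if event_id in self.seen:
            return False
        self.handler(payload)
        # Record the ID only after the handler succeeds: a crash
        # mid-processing then causes a retry, not a silently lost event.
        self.seen.add(event_id)
        return True
```

The ordering of "process, then record" gives you at-least-once semantics inside the deduplicator itself, which composes correctly with every delivery guarantee discussed in this chapter.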
Dead lettering routes messages that have been delivered but not acknowledged beyond a configurable maxDeliveryAttempts to a dead letter topic. The dead letter topic is itself a Pub/Sub topic with its own subscriptions, so you can process dead-lettered messages however you like. This works well.
Message retention defaults to 7 days but can be set up to 31 days. Acknowledged messages can also be retained (for replay purposes) for up to 31 days. Unacknowledged messages that exceed the retention period are dropped.
Seek and replay: You can seek a subscription to a timestamp, which redelivers all messages from that point forward. You can also seek to a snapshot, which captures the acknowledgement state of a subscription at a point in time. This gives you a limited form of replay — not as powerful as Kafka's offset-based replay, but significantly more than what SQS offers (which is nothing).
Strengths
Radical simplicity. No partitions to manage. No rebalancing. No broker sizing. You create a topic, create subscriptions, publish messages. The operational surface area is so small that there is almost nothing to get wrong at the infrastructure level. This is a genuine competitive advantage over self-managed alternatives.
Auto-scaling that actually works. Pub/Sub scales from zero to millions of messages per second without any configuration changes. You do not pre-provision throughput. You do not monitor shard utilization. Google's internal infrastructure handles scaling transparently. For bursty workloads, this is invaluable.
Global availability. Pub/Sub is a global service — topics and subscriptions are not region-scoped (though you can configure message storage policies to restrict data residency). Messages are replicated synchronously across zones within a region. This gives you multi-zone durability without any additional configuration.
Dataflow integration. Google Cloud Dataflow (the managed Apache Beam runner) integrates deeply with Pub/Sub. The Pub/Sub I/O connector handles watermarking, windowing, and exactly-once processing out of the box. If your event processing pipeline involves stream processing (windowed aggregations, sessionization, complex event processing), the Pub/Sub + Dataflow combination is one of the smoothest paths available.
Reasonable pricing at moderate scale. Pub/Sub charges $40 per TiB of data published and delivered. For moderate throughput with small messages, this is competitive with self-hosted alternatives when you factor in operational costs.
Weaknesses
Cost at high scale. That $40/TiB pricing adds up. If you are ingesting 1 TB of messages per day, your monthly Pub/Sub bill for data alone is roughly $1,200 for publishing plus $1,200 for each subscription that receives the data. With three subscriptions, you are at $4,800/month just for data movement. A Kafka cluster handling the same throughput on GKE would cost less, though the comparison is unfair until you add the cost of the engineer who operates it.
Ordering limitations. Ordering keys limit throughput to about 1 MB/s per key. If you need strict ordering on a high-throughput stream, you need to shard your ordering keys carefully. If you need total ordering across all messages in a topic, Pub/Sub cannot give you that. Kafka can (within a single partition), though it comes with its own trade-offs.
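Sharding a hot ordering key usually means deriving a stable sub-key from some entity identifier: per-entity ordering survives, but total ordering under the original key does not. A sketch, with a hypothetical helper name:

```python
import hashlib

def shard_ordering_key(key: str, entity_id: str, shards: int) -> str:
    """Spread one hot ordering key across N sub-keys.

    Events for the same entity always hash to the same sub-key, so
    per-entity ordering is preserved; ordering across entities that
    previously shared the single key is deliberately given up.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{key}-{shard}"
```

Pick the shard count up front and treat it as immutable: changing it moves entities between sub-keys and breaks the per-entity ordering guarantee during the transition.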
No partition-level control. The lack of user-visible partitions is a strength for simplicity but a weakness for use cases where you want partition-level assignment, consumer-partition affinity, or compaction. Pub/Sub does not support log compaction — if you need the "latest value per key" pattern, Pub/Sub is the wrong tool.
GCP lock-in. Pub/Sub's API is proprietary. There is an open-source emulator for local development, and the Pub/Sub Lite product offered cost optimization (though it was deprecated in 2024 in favour of standard Pub/Sub). But if you move off GCP, your Pub/Sub code needs a full rewrite.
Eventual delivery. Pub/Sub does not guarantee delivery latency. Under normal conditions, latency is tens of milliseconds. Under load or during internal rebalancing, it can spike. The SLA guarantees availability, not latency. For latency-sensitive applications, this indeterminism can be a problem.
No native stream processing. Pub/Sub is a messaging layer, not a stream processing engine. For windowed aggregations, joins, or complex event processing, you need Dataflow, Flink, or application-level code. This is a design choice, not a flaw — but Kafka Streams and ksqlDB spoil you into thinking your broker should do everything.
Ideal Use Cases
- Event-driven microservices on GCP where simplicity and managed scaling are priorities.
- Bursty ingest workloads (IoT telemetry, user clickstreams, log aggregation) where pre-provisioning capacity is impractical.
- Pub/Sub + Dataflow pipelines for real-time stream processing.
- Global event distribution where multi-region publishing is a requirement.
- Teams with limited infrastructure expertise who need a messaging layer that requires near-zero operational investment.
Operational Reality
Running Pub/Sub is delightfully boring. You monitor subscription backlog (num_undelivered_messages) and oldest unacked message age (oldest_unacked_message_age). When backlog grows, you scale your consumers. When message age grows, you investigate why consumers are slow. That is essentially the entire operational story.
Monitoring is done through Cloud Monitoring (formerly Stackdriver). The built-in metrics are adequate for most use cases. Alerting on oldest_unacked_message_age is the most important operational practice — if this number climbs, your consumers are falling behind, and eventually messages will exceed the retention period and be lost.
Cost surprises usually come from three places: (1) multiple subscriptions multiplying data delivery costs, (2) retained acknowledged messages accumulating storage charges, and (3) message attribute sizes counting toward data volume. A message with a 100-byte body and 400 bytes of attributes is billed for 500 bytes.
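Both the attribute effect and the subscription multiplier are easy to estimate up front. A rough calculator — hypothetical helpers; ignores per-request billing minimums and free-tier allowances:

```python
def billable_bytes(body: bytes, attributes: dict[str, str]) -> int:
    """Approximate billed message size: body plus attribute keys and
    values. A 100-byte body with 400 bytes of attributes bills ~500."""
    attr_bytes = sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
                     for k, v in attributes.items())
    return len(body) + attr_bytes

def monthly_throughput_cost(messages: int, avg_billable: int,
                            subscriptions: int,
                            price_per_tib: float = 40.0) -> float:
    """Publish cost plus one delivery cost per subscription, at $/TiB.
    Each subscription receives (and is billed for) its own copy."""
    tib = messages * avg_billable / 2**40
    return tib * price_per_tib * (1 + subscriptions)
```

The `(1 + subscriptions)` factor is the one that surprises people: adding a fourth subscription to a busy topic is not free observability, it is a 25% bump on the data-movement bill.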
Code Examples (Python)
from google.cloud import pubsub_v1
from concurrent.futures import TimeoutError
import json
project_id = "my-project"
topic_id = "order-events"
subscription_id = "order-processor"
# --- Publisher ---
# Publishing with ordering keys requires enabling ordering on the client,
# otherwise the client library rejects the publish.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path(project_id, topic_id)
def publish_order_event(order_id: str, total: float, region: str):
"""Publish with ordering key for per-order ordering."""
data = json.dumps({
"orderId": order_id,
"totalAmount": total,
"region": region,
}).encode("utf-8")
# ordering_key ensures all events for the same order
# are delivered in order to a single subscriber
future = publisher.publish(
topic_path,
data,
ordering_key=order_id,
event_type="OrderPlaced", # custom attribute for filtering
)
message_id = future.result()
print(f"Published {message_id} for order {order_id}")
# --- Subscriber (streaming pull) ---
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)
def callback(message: pubsub_v1.subscriber.message.Message):
"""Process a single message. Ack on success, nack on failure."""
try:
event = json.loads(message.data.decode("utf-8"))
event_type = message.attributes.get("event_type", "unknown")
print(f"Received {event_type}: order {event['orderId']}")
process_order(event)
message.ack()
except Exception as e:
print(f"Processing failed: {e}")
message.nack() # Message will be redelivered after ack deadline
def consume():
"""Start streaming pull subscriber. Blocks until interrupted."""
streaming_pull_future = subscriber.subscribe(
subscription_path,
callback=callback,
flow_control=pubsub_v1.types.FlowControl(
max_messages=100, # Max outstanding messages
max_bytes=10 * 1024 * 1024 # Max outstanding bytes (10 MB)
),
)
print(f"Listening on {subscription_path}...")
try:
streaming_pull_future.result() # Blocks forever
except TimeoutError:
streaming_pull_future.cancel()
streaming_pull_future.result() # Wait for cleanup
def process_order(event: dict):
pass
Verdict: Google Pub/Sub
Pub/Sub is the messaging service for teams who want to think about their application, not their infrastructure. Its simplicity-to-capability ratio is the best in the cloud-native messaging space. You trade fine-grained control (no partitions, no compaction, limited ordering control) for a service that scales effortlessly and requires almost no operational investment.
If you are on GCP, Pub/Sub is the default choice for event-driven architecture, and you need a specific reason not to use it. If you are choosing a cloud, Pub/Sub alone is not a reason to choose GCP — but if you are already there, it is one of the better reasons to stay.
Azure Event Hubs
Overview
Azure Event Hubs launched in 2014 as Microsoft's answer to high-throughput event ingestion. Where Google built a messaging service that hides its internals behind simplicity, Microsoft built a partitioned log service that wears its architecture on its sleeve. If you squint, Event Hubs looks a lot like Kafka. This is not a coincidence — the partitioned append-only log model is the same, and since 2018, Event Hubs has supported the Kafka wire protocol directly.
Event Hubs is positioned as the entry point for big data pipelines on Azure. It sits in front of Azure Stream Analytics, Azure Functions, Azure Data Lake, and the rest of Microsoft's data ecosystem. It is also a legitimate Kafka replacement for teams that want the Kafka programming model without operating Kafka infrastructure.
Architecture
Event Hubs is organised around a hierarchy of concepts that will be familiar if you have used Kafka.
Namespaces are the top-level container. A namespace maps to a cluster of brokers and defines the pricing tier (Basic, Standard, Premium, or Dedicated). Think of it as a Kafka cluster equivalent.
Event Hubs (confusingly, the resource shares its name with the service) are the equivalent of Kafka topics. Each event hub has a configurable number of partitions (2–32 on Standard, up to 2,000 on Dedicated).
Partitions are the unit of parallelism and ordering. Events within a partition are strictly ordered and appended to an immutable log. Producers can specify a partition key (hashed to a partition) or publish to a specific partition. This is the Kafka model, nearly 1:1.
Consumer groups define independent views of the partition log. Each consumer group maintains its own offset per partition. Events are not deleted after consumption; they are retained for a configurable period (1–7 days on Standard, up to 90 days on Premium and Dedicated). This is Kafka's consumer group model.
Capture is Event Hubs' killer feature for data pipeline use cases. It automatically writes events to Azure Blob Storage or Azure Data Lake Store in Avro format, at configurable time and size intervals. No consumer code needed. The events just show up in your data lake, partitioned by time. This is operationally lovely — you get a durable archive of every event with zero application code.
Kafka protocol compatibility (available on Standard tier and above) means you can point a Kafka client at Event Hubs and it works. Your existing Kafka producers, consumers, Kafka Connect connectors, and even Kafka Streams applications can target Event Hubs with configuration changes only. The compatibility is not 100% — some admin APIs are not supported, consumer group management has some differences, and compacted topics are not available — but for the core produce/consume workflow, it works well enough that teams have migrated off self-hosted Kafka to Event Hubs with minimal code changes.
Event Hubs Premium (added in 2021) provides dedicated compute resources within a shared infrastructure. It supports up to 100 partitions per event hub, offers better isolation than Standard tier, and has higher throughput and message size limits (1 MB vs 256 KB on Standard). Premium is positioned as "most of the benefits of Dedicated, without the commitment."
Event Hubs Dedicated gives you an entire cluster. You can have up to 2,000 partitions per event hub, retention up to 90 days, and throughput limited only by the cluster capacity (which you control). Dedicated is for organisations processing millions of events per second who need predictable performance and complete isolation.
Strengths
Kafka wire protocol compatibility. This is the headline feature, and it is genuinely useful. Teams with existing Kafka expertise and codebases can migrate to a managed service without rewriting clients. Kafka Connect works. Kafka Streams works (with caveats). The learning curve for Kafka engineers moving to Event Hubs is measured in hours, not weeks.
Capture to blob storage. Automatic archival to Azure Blob Storage in Avro format is effortlessly useful. No Lambda functions, no custom consumers, no cron jobs. Data lands in your data lake, properly partitioned, ready for batch processing with Spark, Databricks, or Synapse Analytics. For organisations that need both real-time processing and batch analytics on the same event streams, Capture closes a gap that Kafka requires third-party tools (Kafka Connect, S3 sink connector) to fill.
Azure ecosystem integration. Event Hubs feeds directly into Azure Functions (with scaling based on partition count), Stream Analytics (for SQL-like real-time queries), Azure Data Explorer (for log analytics), and Synapse Analytics (for big data processing). If your organisation is on Azure, the integration plumbing is already built.
Managed scaling on Premium/Dedicated tiers. Premium tier auto-scales processing units based on load. You do not manage broker instances or worry about disk capacity. On Dedicated, you manage cluster capacity but not individual brokers.
Long retention. Up to 90 days on Premium and Dedicated tiers. This is longer than most managed messaging services (Pub/Sub maxes at 31 days, SQS at 14 days) and makes Event Hubs viable for replay-heavy workloads.
Weaknesses
Partition limit on Standard tier. Standard tier limits you to 32 partitions per event hub. This caps your consumer parallelism and throughput. For high-throughput workloads, you need Premium (100 partitions) or Dedicated (2,000 partitions), which are significantly more expensive.
No log compaction. This is the most significant gap in the Kafka compatibility story. Kafka's compacted topics are foundational for the "event store as a database" pattern and for maintaining latest-state-per-key tables. Event Hubs simply does not support it. If you need compaction, you need actual Kafka (or Redpanda, or you build your own compaction pipeline to a database, which is ugly).
Azure lock-in. The Kafka protocol compatibility mitigates this somewhat — your client code is portable. But the Capture, monitoring, IAM, and networking integrations are Azure-specific. Moving from Event Hubs Capture to S3 requires new infrastructure.
Pricing complexity. Event Hubs pricing involves throughput units (Standard), processing units (Premium), or capacity units (Dedicated), plus per-million-events ingress charges, plus storage costs, plus Capture storage costs. Comparing the total cost to alternatives requires a spreadsheet. This is not unique to Event Hubs — Azure pricing is consistently the most opaque of the three major clouds — but it makes cost estimation harder than it should be.
Consumer group limit. Standard tier allows 20 consumer groups per event hub. Premium allows 100. This is generous for most use cases but can be limiting for organisations that use consumer groups heavily (one per microservice, one per analytics pipeline, one per testing environment, etc.).
Throughput units on Standard tier. Each throughput unit provides 1 MB/s ingress and 2 MB/s egress. You can have up to 40 throughput units with auto-inflate enabled. If you need more, you are on Premium or Dedicated. The throughput unit model requires capacity planning, which partially undermines the "managed service" value proposition.
Ideal Use Cases
- High-throughput event ingestion for IoT, telemetry, clickstream, and log aggregation on Azure.
- Kafka migration to managed infrastructure where teams want to retain Kafka clients and patterns without operating Kafka clusters.
- Data pipeline ingestion where Capture provides automatic archival to the data lake.
- Real-time + batch analytics where the same event stream feeds both Stream Analytics (real-time) and Synapse (batch).
- Organisations with existing Azure investment where the ecosystem integration reduces glue code.
Operational Reality
Operating Event Hubs on Standard tier involves monitoring throughput unit utilization and partition health. Enable auto-inflate for throughput units to handle bursts. Monitor incoming/outgoing messages, throttled requests, and consumer lag.
On Premium and Dedicated tiers, the operational burden is lower — auto-scaling handles throughput management, and you focus on partition distribution and consumer health.
Monitoring uses Azure Monitor metrics. Key metrics: IncomingMessages, OutgoingMessages, ThrottledRequests (your canary for capacity issues), IncomingBytes/OutgoingBytes, and consumer lag (available through the Kafka consumer group protocol or the Event Hubs SDK's PartitionProperties.LastEnqueuedSequenceNumber minus your current offset).
Capture management is operationally simple — you configure the destination container, the time window (1–15 minutes), and the size window (10–500 MB), and Event Hubs handles the rest. Monitor for capture failures and be aware that small time windows with low throughput will produce many small Avro files, which is suboptimal for downstream Spark processing. A common pattern is to set a 5-minute capture window and run a periodic compaction job on the resulting files.
Partition management is the main planning exercise. Unlike Pub/Sub, partition count matters and affects performance. You cannot decrease partition count after creation (only increase it on Premium/Dedicated). Choosing the right initial partition count requires estimating your peak throughput and consumer parallelism.
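That estimate can be reduced to a back-of-the-envelope calculation. A hedged sketch follows; the 1 MB/s-per-partition sizing and the 2x headroom factor are heuristics of mine, not Azure guidance:

```python
import math

def estimate_partitions(peak_ingress_mb_s: float,
                        max_consumers_per_group: int,
                        headroom: float = 2.0) -> int:
    """Initial partition count: enough partitions for peak ingress at roughly
    1 MB/s per partition (with headroom), and never fewer than your widest
    consumer group, since consumers beyond the partition count sit idle."""
    by_throughput = math.ceil(peak_ingress_mb_s * headroom)
    return max(by_throughput, max_consumers_per_group)

print(estimate_partitions(5.0, 8))   # 10: throughput-bound
print(estimate_partitions(1.0, 12))  # 12: parallelism-bound
```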
Code Examples (Python)
```python
from azure.eventhub import EventHubProducerClient, EventData, EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore
import json

CONNECTION_STR = "Endpoint=sb://my-namespace.servicebus.windows.net/;..."
EVENTHUB_NAME = "order-events"

# --- Producer ---

def publish_order_events(events: list[dict]):
    """
    Publish events with partition keys for ordering.
    Each batch is scoped to a single partition key, so all events
    sharing a key land on the same partition, in order.
    """
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STR,
        eventhub_name=EVENTHUB_NAME,
    )
    with producer:
        for event in events:
            # One batch per event here, because each order has its own
            # partition key. If many events share a key, group them into
            # a single batch for better throughput.
            event_data_batch = producer.create_batch(
                partition_key=event["orderId"]
            )
            event_data_batch.add(EventData(json.dumps(event)))
            producer.send_batch(event_data_batch)
            print(f"Sent event for order {event['orderId']}")

# --- Consumer with checkpointing ---

STORAGE_CONNECTION_STR = "DefaultEndpointsProtocol=https;..."
BLOB_CONTAINER_NAME = "eventhub-checkpoints"

def on_event(partition_context, event):
    """Process a single event and checkpoint."""
    body = event.body_as_str()
    order_event = json.loads(body)
    print(f"Partition {partition_context.partition_id}: "
          f"order {order_event['orderId']}")
    process_order(order_event)
    # Checkpoint after processing — stores offset in blob storage
    partition_context.update_checkpoint(event)

def on_error(partition_context, error):
    """Handle errors during event processing."""
    if partition_context:
        print(f"Error on partition {partition_context.partition_id}: {error}")
    else:
        print(f"Error: {error}")

def consume():
    """
    Start consuming from all partitions.
    BlobCheckpointStore persists consumer offsets in Azure Blob Storage,
    enabling consumer restarts without reprocessing.
    """
    checkpoint_store = BlobCheckpointStore.from_connection_string(
        STORAGE_CONNECTION_STR,
        BLOB_CONTAINER_NAME,
    )
    consumer = EventHubConsumerClient.from_connection_string(
        conn_str=CONNECTION_STR,
        consumer_group="$Default",
        eventhub_name=EVENTHUB_NAME,
        checkpoint_store=checkpoint_store,
    )
    with consumer:
        consumer.receive(
            on_event=on_event,
            on_error=on_error,
            starting_position="-1",  # Start from beginning
        )

# --- Using Kafka protocol instead ---
# pip install confluent-kafka
from confluent_kafka import Producer as KafkaProducer

def kafka_producer_example():
    """
    Same Event Hubs, but using the Kafka wire protocol.
    Your existing Kafka code works with a config change.
    """
    config = {
        'bootstrap.servers': 'my-namespace.servicebus.windows.net:9093',
        'security.protocol': 'SASL_SSL',
        'sasl.mechanism': 'PLAIN',
        'sasl.username': '$ConnectionString',
        'sasl.password': CONNECTION_STR,
    }
    producer = KafkaProducer(config)
    producer.produce(
        topic=EVENTHUB_NAME,
        key=b"ord-7829",
        value=json.dumps({"orderId": "ord-7829", "total": 149.99}).encode(),
    )
    producer.flush()

def process_order(event: dict):
    pass
```
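The consumer side of the Kafka bridge is the same story: only configuration changes. Here is a sketch of the config mapping, assuming the hypothetical namespace from the examples above. Note that `group.id` maps onto an Event Hubs consumer group, which must already exist:

```python
def eventhubs_kafka_config(namespace: str, connection_str: str, group: str) -> dict:
    """Build a confluent-kafka consumer config for the Event Hubs Kafka endpoint."""
    return {
        'bootstrap.servers': f'{namespace}.servicebus.windows.net:9093',
        'security.protocol': 'SASL_SSL',
        'sasl.mechanism': 'PLAIN',
        'sasl.username': '$ConnectionString',  # literal marker, not your username
        'sasl.password': connection_str,       # the full connection string
        'group.id': group,                     # an existing Event Hubs consumer group
        'auto.offset.reset': 'earliest',
    }

# from confluent_kafka import Consumer
# consumer = Consumer(eventhubs_kafka_config('my-namespace', CONNECTION_STR, '$Default'))
# consumer.subscribe(['order-events'])
# msg = consumer.poll(5.0)
```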
Verdict: Azure Event Hubs
Event Hubs is the right answer for organisations that are committed to Azure and need high-throughput event ingestion. The Kafka protocol compatibility is its trump card — it gives you the Kafka programming model with managed infrastructure, which is a genuine value proposition for teams that know Kafka but do not want to run it.
Capture is the other standout feature. Automatic archival to blob storage, in a structured format, with no application code, solves a problem that every event-driven system eventually faces. If your architecture involves both real-time processing and batch analytics on the same event data, Event Hubs + Capture is one of the cleanest solutions available.
The limitations are real but predictable. No compaction means Event Hubs cannot fully replace Kafka for all use cases. The partition and consumer group limits on Standard tier push high-throughput workloads to more expensive tiers. And Azure's pricing model rewards patience and a good spreadsheet.
Comparing the Cloud-Native Options
Here is a direct comparison of all three cloud-native options covered in this chapter and the previous one:
| Dimension | AWS SNS+SQS / EventBridge | Google Cloud Pub/Sub | Azure Event Hubs |
|---|---|---|---|
| Model | Queue + pub/sub + event bus | Topic + subscription | Partitioned log |
| Ordering | FIFO queues (300 TPS) | Ordering keys (~1 MB/s per key) | Per-partition (Kafka model) |
| Replay | EventBridge archive (limited) | Seek to timestamp (31 days) | Offset-based (up to 90 days) |
| Compaction | No | No | No |
| Kafka compat | No (use MSK) | No | Yes (wire protocol) |
| Auto-scaling | Fully automatic | Fully automatic | Throughput units / auto-inflate |
| Max retention | 14 days (SQS) / configurable (EB) | 31 days | 90 days (Premium/Dedicated) |
| Max message size | 256 KB | 10 MB | 1 MB (Standard) / 256 KB (Basic) |
| Dead lettering | SQS DLQ | Dead letter topic | No native DLQ (handle in consumer) |
| Ecosystem | Deepest AWS integration | GCP + Dataflow | Azure + Kafka ecosystem |
| Pricing model | Per-request (SQS) / per-event (EB) | Per-data-volume | Per-throughput-unit + per-event |
| Ops burden | Near zero | Near zero | Low (Standard) to near zero (Premium) |
When to Choose Which
Choose AWS SNS+SQS/EventBridge when you are on AWS, want zero operational overhead, and your throughput requirements are moderate. The EventBridge rule engine is the most sophisticated content-based router among the three.
Choose Google Pub/Sub when you are on GCP, want the simplest possible mental model, or your workloads are bursty and unpredictable. Pub/Sub's auto-scaling is the most transparent of the three — you truly never think about capacity.
Choose Azure Event Hubs when you are on Azure, need high-throughput ordered event streams, want Kafka compatibility without operating Kafka, or need long retention with Capture for data lake integration.
Choose none of them when you need sub-millisecond latency, log compaction, cross-cloud portability, or you have the team and mandate to operate your own infrastructure. In those cases, look at Kafka, Redpanda, or NATS.
Final Verdict
Every cloud provider's managed messaging service is good enough for most workloads. That is a deliberately boring statement, and it is true. The differences between them matter at the margins — ordering guarantees, pricing models, ecosystem integration, replay capabilities — but the core proposition is the same: you get a durable, scalable messaging layer without operating it yourself.
The most important factor in choosing between them is which cloud you are already on. If you are on GCP, use Pub/Sub. If you are on Azure, use Event Hubs. If you are on AWS, use SNS/SQS/EventBridge. Cross-cloud messaging is a solvable problem, but it is not a problem you want to solve unless you genuinely have multi-cloud workloads.
Do not let the choice of messaging service drive your choice of cloud provider. That tail is far too small to wag that dog.
Redis Streams
Every few years, a technology that is good at one thing gets ambitious and tries to be good at a second thing. Sometimes this works brilliantly — PostgreSQL adding JSONB, for instance. Sometimes it produces something awkward. Redis Streams falls somewhere in the middle: it is a genuinely useful data structure that solves real problems, but it carries the DNA of an in-memory cache into a domain that traditionally demands durable, disk-backed storage. Understanding where that tension matters — and where it does not — is the key to using Redis Streams well.
Overview
Redis Streams were introduced in Redis 5.0, released in October 2018. The feature was designed and implemented by Salvatore Sanfilippo (antirez), Redis's creator, who described it as "a new data type modeling a log data structure in a more abstract way." If that sounds like he was thinking about Kafka, he was — though he was careful to position Streams as a Redis-native feature, not a Kafka replacement.
The design intent was to add a proper event log data structure to Redis that supported consumer groups, acknowledgement, and persistence. Before Streams, Redis developers cobbled together pub/sub (which has no persistence and no consumer groups), Lists (which have no fan-out and awkward consumer group semantics), and Sorted Sets (which work but are a contortion). Streams gave Redis a first-class log structure, and it is the right tool for a specific category of problems.
Redis is maintained by Redis Ltd. (formerly Redis Labs), which provides the commercial Redis Enterprise product. The open-source Redis project uses a dual-license model (RSALv2 and SSPLv1 since Redis 7.4 in 2024), and the community fork Valkey (maintained by the Linux Foundation) also supports Streams. For the purposes of this chapter, everything applies equally to Redis and Valkey unless noted.
Architecture
The Data Structure
A Redis Stream is, at its core, an append-only log of entries, each identified by a unique ID. The implementation uses a radix tree of macro-nodes, where each macro-node contains a listpack (a compact serialised list of entries). This structure is optimised for two access patterns: appending new entries (fast, O(1) amortised) and reading entries by ID range (fast, O(log N) for the seek plus O(M) for the range scan).
Each stream entry has:
- An ID: By default, a millisecond timestamp plus a sequence number, formatted as `<millisecondsTime>-<sequenceNumber>` (e.g., `1699958537443-0`). IDs are monotonically increasing and auto-generated by default, though you can provide custom IDs if they are greater than the last entry's ID. The timestamp-based ID is elegant — it means you can seek to a point in time without maintaining a separate index.
- A body: A sequence of field-value pairs, like a flat hash. Every entry in a stream can have different fields, though in practice you will want consistency for consumer sanity.
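Because the ID embeds a wall-clock timestamp, recovering an entry's creation time is a one-liner. A small illustration in Python (no Redis connection required):

```python
from datetime import datetime, timezone

def parse_entry_id(entry_id: str) -> tuple[datetime, int]:
    """Split a stream entry ID into its timestamp and sequence number."""
    ms, seq = entry_id.split('-')
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc), int(seq)

ts, seq = parse_entry_id('1699958537443-0')
# ts is the entry's creation time in UTC; seq disambiguates entries
# created within the same millisecond.
```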
The Command Vocabulary
Redis Streams expose a small, well-designed API:
XADD appends an entry to a stream. Returns the auto-generated ID. Optionally trims the stream (more on this below).
XADD orders * orderId ord-7829 totalAmount 149.99 region us-east-1
The * tells Redis to auto-generate the ID. The response is something like 1699958537443-0.
XLEN returns the number of entries in a stream. O(1). Useful for monitoring.
XRANGE and XREVRANGE read entries by ID range. This is how you read historical data:
XRANGE orders 1699958537443-0 + COUNT 10
XREAD reads new entries from one or more streams, optionally blocking until entries are available. This is the simple consumer model — no consumer groups, just "give me everything since this ID":
XREAD BLOCK 5000 STREAMS orders 1699958537443-0
XREADGROUP is the consumer group read. This is where Streams gets interesting:
XREADGROUP GROUP order-processors worker-1 COUNT 10 BLOCK 5000 STREAMS orders >
The > means "give me new entries that have not been delivered to any consumer in this group." You can also re-read entries that were delivered but not acknowledged by using a specific ID instead of >.
XACK acknowledges processing of one or more entries:
XACK orders order-processors 1699958537443-0
XCLAIM transfers ownership of a pending entry from one consumer to another. This is for failure recovery — if consumer A received a message and then died, consumer B can claim it:
XCLAIM orders order-processors worker-2 3600000 1699958537443-0
The 3600000 is the minimum idle time in milliseconds. XCLAIM will only claim entries that have been idle for at least this long, preventing you from stealing work from a consumer that is merely slow.
XAUTOCLAIM (added in Redis 6.2) automates the claim process by scanning the pending entries list for entries that have exceeded the idle threshold and claiming them in one operation. This is the command you actually want in production — it replaces the manual "scan PEL, filter by idle time, claim individually" loop:
XAUTOCLAIM orders order-processors worker-2 3600000 0-0 COUNT 10
XPENDING inspects the pending entries list — entries that have been delivered but not acknowledged:
XPENDING orders order-processors
This returns the total pending count, the range of pending IDs, and per-consumer counts. Invaluable for debugging consumer health.
XINFO provides metadata about streams, groups, and consumers. Use XINFO STREAM, XINFO GROUPS, and XINFO CONSUMERS for operational visibility.
Consumer Groups and the Pending Entries List
Consumer groups are the mechanism that transforms Redis Streams from a simple log into a workable message broker. A consumer group:
- Maintains a last-delivered ID — the cursor tracking which entries have been dispatched to consumers in this group.
- Maintains a pending entries list (PEL) — a list of entries that have been delivered to consumers but not yet acknowledged.
- Distributes new entries across consumers within the group in a round-robin fashion (roughly — the distribution depends on which consumer calls XREADGROUP first).
The PEL is the crucial data structure. It tracks, for each pending entry: the entry ID, the consumer name, the delivery timestamp, and the delivery count. This enables:
- At-least-once delivery: Entries stay in the PEL until acknowledged. If a consumer crashes, the entries remain pending and can be claimed by another consumer.
- Failure detection: Entries that have been pending for a long time indicate a dead or stuck consumer.
- Redelivery tracking: The delivery count tells you how many times an entry has been delivered, enabling dead-letter logic in your application code (Redis does not have native DLQ support for Streams — you implement it yourself).
Creating a consumer group:
XGROUP CREATE orders order-processors $ MKSTREAM
The $ means "start consuming from new entries only." Use 0 to start from the beginning of the stream.
Memory Management and Trimming
Redis Streams live in memory. This is simultaneously the source of their speed and their primary constraint. An unbounded stream will grow until Redis runs out of memory, at which point Bad Things Happen (eviction or OOM, depending on your configuration).
Trimming strategies:
MAXLEN caps the stream at a maximum number of entries. You can specify it on every XADD:
XADD orders MAXLEN ~ 1000000 * orderId ord-7829 totalAmount 149.99
The ~ is an important detail — it tells Redis to trim approximately to the max length, allowing it to trim only when it can remove an entire macro-node from the radix tree. This is significantly faster than exact trimming and is what you should use in production.
MINID trims entries with IDs less than the specified minimum. This is time-based trimming:
XADD orders MINID ~ 1699872137443 * orderId ord-7829 totalAmount 149.99
This removes entries older than the specified timestamp, which is often a more natural retention policy than a count-based limit.
In production, choose a trimming strategy based on your use case:
- Use `MAXLEN ~` when you care about bounding memory usage predictably.
- Use `MINID ~` when you care about retaining a time window of entries.
- Run trimming on every `XADD` (with `~`) or periodically from a maintenance process. Do not forget to trim. Seriously.
Persistence: RDB + AOF
Redis Streams are persisted through the same mechanisms as all Redis data structures:
RDB (Redis Database) snapshots are point-in-time dumps of the entire dataset to disk. They are taken periodically (configurable) and are compact and fast to load. The catch: data written between snapshots is lost on crash. For a message broker, this means messages can be lost. The window of potential loss equals the time since the last snapshot.
AOF (Append-Only File) logs every write operation. With appendfsync always, every write is fsynced to disk before returning to the client. This provides the strongest durability guarantee Redis can offer, at the cost of higher latency (every XADD waits for disk I/O). With appendfsync everysec (the default), you get at most one second of data loss on crash.
The durability reality check: Even with AOF appendfsync always, Redis's durability guarantees are weaker than Kafka's. Kafka replicates writes to multiple brokers before acknowledging them. Redis with AOF writes to a single disk on a single machine. Redis Sentinel and Redis Cluster add replication, but replication is asynchronous by default — the primary acknowledges the write before the replica receives it. You can use WAIT to block until replicas confirm, but this is per-command, not a cluster-wide guarantee.
If the phrase "we can tolerate losing the last second of messages" makes your compliance team nervous, Redis Streams is not your primary event store.
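If you do need to narrow that replication window, `WAIT` can be wrapped around the append. A sketch, assuming any client object with redis-py-style `xadd` and `wait` methods; this shrinks the loss window but does not close it:

```python
def xadd_with_replica_ack(r, stream: str, fields: dict,
                          replicas: int = 1, timeout_ms: int = 1000):
    """Append an entry, then block until `replicas` replicas have the write.

    Returns (entry_id, acked_count). If acked_count < replicas, the write
    reached the primary but replication lagged: pick your own policy
    (retry, alert, or accept the risk).
    """
    entry_id = r.xadd(stream, fields)
    acked = r.wait(replicas, timeout_ms)  # number of replicas that confirmed
    return entry_id, acked
```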
Redis Cluster and Streams
In a Redis Cluster deployment, each stream lives on a single shard (determined by the stream's key hash). This means:
- A single stream's throughput is limited to a single Redis instance's throughput.
- Consumer groups for a stream operate on a single shard.
- You can scale by sharding across multiple stream keys (e.g., `orders:{region}` using hash tags).
Redis Cluster does not shard a single stream across multiple nodes. If you need a single logical stream with throughput beyond one Redis instance, you need to partition at the application level — which is exactly the problem Kafka solves with its partition model.
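Application-level partitioning can be as simple as hashing the entity key into one of N stream keys. A sketch: the `{...}` hash tag pins each shard's stream to a stable cluster slot, and note that, as with Kafka partitions, the shard count is painful to change later:

```python
import zlib

def stream_for(base: str, entity_key: str, shards: int) -> str:
    """Route an entity to one of `shards` streams, preserving per-entity order."""
    shard = zlib.crc32(entity_key.encode()) % shards
    return f"{base}:{{{shard}}}"

# r.xadd(stream_for('orders', 'ord-7829', 8), {'orderId': 'ord-7829'})
# Consumers then run one consumer group per shard: orders:{0} ... orders:{7}
```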
Strengths
Sub-millisecond latency. Redis Streams inherits Redis's in-memory performance. XADD and XREADGROUP operations execute in microseconds under normal conditions. If your use case demands the lowest possible latency for event production and consumption, Redis Streams is in a class shared only by NATS Core and Chronicle Queue. Kafka's median end-to-end latency is measured in single-digit milliseconds; a Redis round trip completes in well under one, often in the tens of microseconds on a local network.
You already have Redis. This is the pragmatic argument, and it is the strongest one. If your infrastructure already includes Redis for caching, session storage, or rate limiting, adding Streams requires zero new infrastructure. No new clusters to provision, no new operational playbooks, no new monitoring dashboards. The marginal cost of adding an event streaming capability to an existing Redis deployment is almost nothing.
Simple, elegant API. The Streams command set is compact and well-designed. You can learn the core commands (XADD, XREADGROUP, XACK, XPENDING) in an hour. Compare this to the Kafka client API surface area or the RabbitMQ concept vocabulary. Redis Streams is refreshingly small.
Consumer groups that work. The consumer group implementation — with the PEL, XCLAIM, XAUTOCLAIM, and per-consumer tracking — is surprisingly complete. It provides at-least-once delivery, failure recovery, and per-consumer monitoring. It is not as feature-rich as Kafka's consumer group protocol (no automatic rebalancing, no cooperative sticky assignment), but for many use cases it is sufficient.
Built-in time-based indexing. The timestamp-based entry IDs mean you can efficiently query "give me all entries from the last five minutes" without maintaining a secondary index. This is a small thing, but it is a nice thing.
Lua scripting integration. You can interact with Streams in Lua scripts that execute atomically on the Redis server. This enables transactional patterns (read from one stream, write to another, atomically) that are difficult to achieve with external brokers.
Weaknesses
Memory-bound. Everything is in memory. A stream with a million entries of 1 KB each consumes roughly 1 GB of RAM (with overhead). At $10–20/GB/month for cloud instances, storing large event histories in Redis is expensive compared to disk-based brokers. Kafka can retain terabytes of events on cheap disk; Redis retains them in the most expensive tier of the storage hierarchy.
Not designed for high-durability event sourcing. The persistence model (RDB snapshots + AOF) provides reasonable durability for cache and session use cases but falls short of what regulated industries and event sourcing patterns require. Asynchronous replication means acknowledged writes can be lost during failover. There is no equivalent of Kafka's acks=all with min.insync.replicas=2.
Limited replay. You can read historical entries with XRANGE, but there is no concept of "reset consumer group offset to yesterday at noon" — you would need to use XREADGROUP with a specific entry ID, and the consumer group semantics around re-reading acknowledged entries are awkward. Kafka's consumer group offset reset is a first-class operation; in Redis Streams, it requires manual PEL management.
Single-threaded execution. Redis processes commands on a single thread (Redis 7's multi-threaded I/O handles network processing, but command execution remains single-threaded). A CPU-intensive Lua script or a large XRANGE scan blocks all other operations on that shard. For a general-purpose cache, this is rarely a problem. For a message broker handling thousands of consumers, it can become one.
No native dead letter queue. When a message cannot be processed, your application code must track delivery counts (available in the PEL) and move entries to a separate dead letter stream manually. This is not hard, but it is one more thing to implement and test, and one more thing that can have bugs.
No built-in schema registry. Stream entries are flat field-value pairs of strings. There is no schema enforcement, no schema evolution, no type system. Your producers and consumers agree on message formats through convention and hope. For small teams with good discipline, this is fine. For large organisations, it is a governance gap.
Consumer group rebalancing is manual. When a consumer in a group dies, its pending entries are not automatically redistributed. You must implement XAUTOCLAIM polling or a manual XCLAIM process. Kafka handles this automatically with its consumer group rebalancing protocol. In Redis, you build it yourself.
No cross-node stream. A stream lives on one Redis node. Throughput is bounded by that node's capacity. Horizontal scaling requires application-level partitioning across multiple stream keys, which moves partitioning complexity from the broker (where Kafka handles it) into your application (where you handle it).
Ideal Use Cases
Lightweight event bus between microservices. When your event throughput is moderate (thousands to low tens of thousands per second), your retention requirements are short (minutes to hours, not days to months), and your infrastructure already includes Redis. This is the sweet spot.
Real-time activity feeds. Social media timelines, notification streams, activity logs where recency matters and historical depth does not. Redis Streams' sub-millisecond latency and built-in trimming make it natural for "what happened in the last N minutes" queries.
Task queues with visibility. When you need a task queue with consumer groups, acknowledgement, and the ability to inspect pending work, Redis Streams is a compelling alternative to dedicated task queue systems (Celery, Sidekiq, Bull). The PEL gives you complete visibility into what is in flight.
Inter-service communication in a monorepo. When services are co-located and you want a simple event bus without the operational overhead of a dedicated broker. Redis Streams lets you add event-driven communication incrementally.
Rate-limited processing pipelines. Using consumer groups with controlled XREADGROUP COUNT, you can build processing pipelines that naturally throttle throughput. This is useful for feeding rate-limited APIs, managing database write pressure, or smoothing bursty workloads.
When NOT to Use It
Long-term event storage. If you need to retain events for weeks, months, or years, Redis Streams is the wrong tool. The memory cost is prohibitive, and the durability guarantees are insufficient.
Regulated environments requiring guaranteed delivery. If losing a message has compliance implications (financial transactions, healthcare records, audit logs), Redis Streams' persistence model is not strong enough. Use Kafka with acks=all, or a database-backed solution.
High-throughput event sourcing. If you are building a system where the event log is the source of truth and you need to replay from the beginning of time, you need Kafka, Pulsar, or a database. Redis Streams is a buffer, not a ledger.
Multi-datacenter replication. Redis Cluster does not support multi-datacenter deployment natively. Redis Enterprise offers active-active geo-replication, but it comes at a significant cost premium. If your event infrastructure must span regions, look elsewhere.
When you don't already have Redis. If Redis is not already in your infrastructure, deploying it solely for Streams is hard to justify when purpose-built brokers exist. The "you already have Redis" argument works in reverse — if you don't have it, the calculus changes.
Operational Reality
Operating Redis Streams means operating Redis, which means you are either running Redis yourself (on VMs or Kubernetes), using a managed service (AWS ElastiCache, GCP Memorystore, Azure Cache for Redis, Redis Cloud), or running Redis Enterprise.
Memory monitoring is the single most critical operational practice. A stream that grows faster than your trimming policy will consume all available memory. Monitor used_memory and stream lengths (via XLEN), and set alerts aggressively. Redis's maxmemory-policy does not apply to Streams in a useful way — if Redis evicts your stream to free memory, you have lost your event log. Use the noeviction policy and trim your streams explicitly.
Consumer health monitoring requires polling XPENDING regularly. Build a monitoring job that checks each consumer group's pending count and the idle time of the oldest pending entry. If a consumer has entries pending for longer than your expected processing time, it is dead or stuck, and you need to trigger XAUTOCLAIM from a healthy consumer.
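The watchdog check itself is a small pure function over XPENDING output. Here is a sketch shaped around redis-py's `xpending_range` records; the field names match that client, and the thresholds are illustrative:

```python
def stale_work(pending_records: list[dict], max_idle_ms: int) -> dict[str, list]:
    """Group overdue entry IDs by the consumer holding them.

    `pending_records` are dicts like redis-py's xpending_range output:
    {'message_id', 'consumer', 'time_since_delivered', 'times_delivered'}.
    Anything returned here is a candidate for XAUTOCLAIM.
    """
    overdue: dict[str, list] = {}
    for rec in pending_records:
        if rec['time_since_delivered'] > max_idle_ms:
            overdue.setdefault(rec['consumer'], []).append(rec['message_id'])
    return overdue

sample = [
    {'message_id': '1-0', 'consumer': 'worker-1',
     'time_since_delivered': 45_000, 'times_delivered': 1},
    {'message_id': '2-0', 'consumer': 'worker-2',
     'time_since_delivered': 1_200, 'times_delivered': 1},
]
print(stale_work(sample, max_idle_ms=30_000))  # {'worker-1': ['1-0']}
```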
Persistence configuration depends on your durability requirements. For streams where data loss is acceptable (real-time feeds, ephemeral notifications): RDB snapshots every 60 seconds are fine. For streams where data loss is merely tolerable: AOF with appendfsync everysec. For streams where you are pretending Redis is a durable message broker: AOF with appendfsync always, and accept the latency penalty — then seriously reconsider whether Redis is the right tool.
Upgrading is straightforward. Redis is a single binary with excellent backward compatibility. Streams are part of the core data model, not a plugin. Rolling upgrades in a Cluster or Sentinel deployment work as expected.
Code Examples
Python (redis-py)
```python
import redis
import time
import signal

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# --- Producer ---

def publish_order_event(order_id: str, total: float, region: str):
    """
    Append an event to the orders stream.
    MAXLEN ~ 100000 ensures the stream doesn't grow without bound.
    """
    entry_id = r.xadd(
        'orders',
        {
            'orderId': order_id,
            'totalAmount': str(total),  # Redis stores strings
            'region': region,
            'eventType': 'OrderPlaced',
            'timestamp': str(int(time.time() * 1000)),
        },
        maxlen=100000,
        approximate=True,
    )
    print(f"Published order {order_id} as {entry_id}")
    return entry_id

# --- Consumer group setup ---

def setup_consumer_group(stream: str, group: str):
    """Create a consumer group, starting from new entries."""
    try:
        r.xgroup_create(stream, group, id='$', mkstream=True)
        print(f"Created consumer group '{group}' on stream '{stream}'")
    except redis.ResponseError as e:
        if 'BUSYGROUP' in str(e):
            print(f"Consumer group '{group}' already exists")
        else:
            raise

# --- Consumer ---

def consume_orders(group: str, consumer_name: str):
    """
    Consume from a stream using consumer groups.
    Handles new messages, pending recovery, and graceful shutdown.
    """
    stream = 'orders'
    setup_consumer_group(stream, group)
    running = True

    def shutdown(sig, frame):
        nonlocal running
        running = False

    signal.signal(signal.SIGTERM, shutdown)
    signal.signal(signal.SIGINT, shutdown)

    # First, recover any pending entries from a previous crash
    recover_pending(stream, group, consumer_name)

    while running:
        try:
            # Read new entries (> means undelivered entries only)
            entries = r.xreadgroup(
                groupname=group,
                consumername=consumer_name,
                streams={stream: '>'},
                count=10,
                block=5000,  # Block for 5 seconds
            )
            if not entries:
                continue
            for stream_name, messages in entries:
                for msg_id, fields in messages:
                    try:
                        process_order(fields)
                        r.xack(stream, group, msg_id)
                    except Exception as e:
                        print(f"Failed to process {msg_id}: {e}")
                        # Don't ack — entry stays in PEL for recovery
        except redis.ConnectionError:
            print("Redis connection lost, reconnecting...")
            time.sleep(1)
    print("Consumer shut down gracefully")

def recover_pending(stream: str, group: str, consumer_name: str):
    """
    Claim and reprocess entries that were pending for too long.
    This handles the case where a previous consumer instance crashed.
    """
    while True:
        # Claim entries idle for more than 30 seconds
        claimed = r.xautoclaim(
            name=stream,
            groupname=group,
            consumername=consumer_name,
            min_idle_time=30000,  # 30 seconds
            start_id='0-0',
            count=10,
        )
        # xautoclaim returns (next_start_id, claimed_entries, deleted_ids)
        next_id, entries, deleted = claimed
        for msg_id, fields in entries:
            try:
                process_order(fields)
                r.xack(stream, group, msg_id)
                print(f"Recovered and processed pending entry {msg_id}")
            except Exception as e:
                print(f"Failed to recover {msg_id}: {e}")
                # Check delivery count for dead-letter logic
                pending_info = r.xpending_range(
                    stream, group, min=msg_id, max=msg_id, count=1
                )
                if pending_info and pending_info[0]['times_delivered'] > 5:
                    dead_letter(stream, group, msg_id, fields)
        if next_id in ('0-0', b'0-0') or not entries:
            break

def dead_letter(stream: str, group: str, msg_id: str, fields: dict):
    """Move a poison message to a dead letter stream."""
    r.xadd(f'{stream}:dead-letter', {
        **fields,
        'original_stream': stream,
        'original_id': msg_id,
        'dead_lettered_at': str(int(time.time() * 1000)),
    })
    r.xack(stream, group, msg_id)
    print(f"Dead-lettered {msg_id}")

def process_order(fields: dict):
    """Your business logic here."""
    print(f"Processing order {fields.get('orderId')}: "
          f"${fields.get('totalAmount')} in {fields.get('region')}")

# --- Monitoring helper ---

def stream_health(stream: str, group: str):
    """Print stream and consumer group health metrics."""
    length = r.xlen(stream)
    info = r.xinfo_groups(stream)
    print(f"\nStream '{stream}': {length} entries")
    for g in info:
        print(f"  Group '{g['name']}': "
              f"{g['pending']} pending, "
              f"{g['consumers']} consumers, "
              f"last-delivered: {g['last-delivered-id']}")
        consumers = r.xinfo_consumers(stream, g['name'])
        for c in consumers:
            print(f"    Consumer '{c['name']}': "
                  f"{c['pending']} pending, "
                  f"idle {c['idle']}ms")
```
Node.js (ioredis)
```javascript
const Redis = require('ioredis');
const redis = new Redis();

// --- Producer ---
async function publishOrderEvent(orderId, total, region) {
  const id = await redis.xadd(
    'orders',
    'MAXLEN', '~', '100000',
    '*',
    'orderId', orderId,
    'totalAmount', String(total),
    'region', region,
    'eventType', 'OrderPlaced',
    'timestamp', String(Date.now()),
  );
  console.log(`Published order ${orderId} as ${id}`);
  return id;
}

// --- Consumer with consumer group ---
async function consume(group, consumerName) {
  // Create group if it doesn't exist
  try {
    await redis.xgroup('CREATE', 'orders', group, '$', 'MKSTREAM');
  } catch (err) {
    if (!err.message.includes('BUSYGROUP')) throw err;
  }
  console.log(`Consumer ${consumerName} starting in group ${group}`);
  while (true) {
    try {
      const results = await redis.xreadgroup(
        'GROUP', group, consumerName,
        'COUNT', '10',
        'BLOCK', '5000',
        'STREAMS', 'orders', '>'
      );
      if (!results) continue;
      for (const [stream, messages] of results) {
        for (const [id, fields] of messages) {
          // fields is a flat array: ['orderId', 'ord-7829', 'totalAmount', '149.99', ...]
          const event = {};
          for (let i = 0; i < fields.length; i += 2) {
            event[fields[i]] = fields[i + 1];
          }
          try {
            await processOrder(event);
            await redis.xack('orders', group, id);
          } catch (err) {
            console.error(`Failed to process ${id}: ${err.message}`);
            // Leave unacked for recovery
          }
        }
      }
    } catch (err) {
      if (err.message.includes('NOGROUP')) {
        console.error('Consumer group does not exist');
        break;
      }
      console.error(`Consumer error: ${err.message}`);
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }
}

async function processOrder(event) {
  console.log(`Processing order ${event.orderId}: $${event.totalAmount}`);
}

// --- Run ---
// publishOrderEvent('ord-7829', 149.99, 'us-east-1');
// consume('order-processors', 'worker-1');
```
Verdict
Redis Streams is the messaging solution for pragmatists who already have Redis and do not need a purpose-built message broker. It is not a Kafka replacement, and calling it one does both technologies a disservice. It is a lightweight, high-performance event log that lives where your data already lives, and for the right use cases, it is the simplest path to event-driven communication.
The right use cases are: short-retention event buses, real-time activity feeds, task queues with observability, and inter-service communication at moderate scale. The wrong use cases are: long-term event storage, event sourcing as a system-of-record pattern, high-durability regulated workloads, and anything that needs more throughput than a single Redis node can provide.
The API is small and well-designed. Consumer groups work. The PEL provides genuine operational visibility. Sub-millisecond latency is real. And the total cost of adoption — when Redis is already in your stack — is a few hours of reading documentation and writing a consumer loop.
The honest framing is this: Redis Streams turns your cache into a capable lightweight event bus. Whether that is a good idea depends entirely on whether "lightweight event bus" is what you need. If it is, there is nothing simpler. If it is not, no amount of Redis affection will change the physics of in-memory storage or the mathematics of single-node throughput. Know what you need, and choose accordingly.
NATS and JetStream
Most message brokers accumulate complexity the way old houses accumulate extensions. A feature here, a subsystem there, and before long you have something that requires a team of three to operate and a configuration file the length of a novella. NATS took the opposite approach. It started simple, stayed simple, and bet that simplicity itself is the feature that matters most. Fifteen years into the experiment, the bet is paying off.
Overview
NATS was created by Derek Collison, who started the project around 2010 while at Apcera (a company that built a cloud platform, was ahead of its time, and no longer exists — as is the tradition). The original implementation was in Ruby, then rewritten in Go in 2012 for performance. The Go heritage is not incidental — NATS embodies Go's philosophy of doing less, doing it well, and fitting in a single binary.
Collison went on to found Synadia, the company that stewards NATS development. Synadia provides a commercial offering (Synadia Cloud and Synadia Control Plane) but the core NATS server is fully open-source under the Apache 2.0 license. The governance model is straightforward: Synadia employs most of the core maintainers, but the project is genuinely open and the community is active, if smaller than Kafka's or RabbitMQ's.
The NATS story has two chapters. Core NATS (the original) is a fire-and-forget pub/sub system with no persistence, no guaranteed delivery, and extraordinary speed. JetStream (added in NATS Server 2.2, released 2021) adds persistence, exactly-once semantics, and stream processing primitives on top of Core NATS. Understanding both — and the boundary between them — is essential.
Architecture
Core NATS: The Foundation
Core NATS is a publish-subscribe messaging system built on a single idea: subjects. A subject is a string like orders.placed or payments.us.processed. Publishers send messages to subjects. Subscribers express interest in subjects. The NATS server routes messages from publishers to subscribers based on subject matching. That is it. There is no queue, no log, no persistence. Messages are delivered to connected subscribers in real time. If no subscriber is connected, the message is gone.
This sounds like a limitation, and it is — but it is a deliberate limitation. Core NATS optimises for a specific point in the design space: the lowest possible latency with the simplest possible programming model, for workloads where ephemeral messaging is acceptable. Think of it as UDP for application messaging — fast, simple, no guarantees.
Publish-subscribe is the basic pattern. A publisher sends to a subject, and all subscribers on that subject receive the message:
Publisher → [orders.placed] → Subscriber A
                            → Subscriber B
                            → Subscriber C
Request-reply is built into the protocol. A requester publishes a message with a reply subject (an inbox), and a responder subscribes to that subject and sends a response. NATS handles the inbox creation and routing:
// Requester
msg, err := nc.Request("orders.validate", orderData, 2*time.Second)

// Responder
nc.Subscribe("orders.validate", func(msg *nats.Msg) {
    result := validateOrder(msg.Data)
    msg.Respond(result)
})
This is a natural fit for microservices that need synchronous-style communication over an asynchronous transport. Multiple responders can subscribe to the same subject, and NATS will route each request to one of them (load-balanced), making it a built-in service mesh primitive.
Queue groups provide load balancing across multiple instances of a subscriber. Subscribers that join the same queue group receive messages in a distributed fashion — each message goes to exactly one member of the group:
nc.QueueSubscribe("orders.placed", "order-processors", func(msg *nats.Msg) {
    processOrder(msg.Data)
})
Queue groups require no server-side configuration. The subscriber simply declares its group membership at subscription time. This is characteristic of NATS's design philosophy: push complexity to the edges, keep the server simple.
Subject-Based Routing and Wildcards
NATS subjects use dot-separated tokens. The routing model supports two wildcards:
* (single-token wildcard): Matches exactly one token in the subject hierarchy.

    orders.*  →  matches orders.placed, orders.cancelled
              →  does NOT match orders.us.placed

> (multi-token wildcard): Matches one or more tokens at the end of a subject.

    orders.>  →  matches orders.placed, orders.us.placed,
                 orders.us.east.placed.confirmed
Subject-based routing with wildcards is powerful because it lets you build flexible routing topologies without server-side configuration. Want all events from the US East region? Subscribe to events.us.east.>. Want all order events regardless of region? Subscribe to orders.>. Want a specific event type across all regions? Subscribe to *.*.orders.placed. The subject hierarchy is your routing table, and it is defined by convention between publishers and subscribers.
This is a different paradigm from Kafka's topics + partitions, RabbitMQ's exchanges + bindings, or EventBridge's rules + patterns. It is simpler, which means it is faster to set up and easier to understand, but it also means you lack the more sophisticated routing features (content-based routing, header-based filtering, priority queues). Whether this is a limitation depends on whether you need those features.
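The matching rules are simple enough to capture in a few lines. Here is an illustrative Python sketch of the subject-matching algorithm (a teaching model, not the server's actual implementation):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subject matches a pattern.

    '*' matches exactly one token; '>' matches one or more
    trailing tokens. Tokens are dot-separated.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and must cover at least one token
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

print(subject_matches("orders.*", "orders.placed"))          # True
print(subject_matches("orders.*", "orders.us.placed"))       # False
print(subject_matches("orders.>", "orders.us.east.placed"))  # True
```

Running the examples from the text through this function confirms the rules: `*` stops at a token boundary, while `>` consumes everything that follows.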
JetStream: The Persistence Layer
JetStream was built to answer the question "what if NATS messages needed to survive a server restart?" It adds a persistence layer on top of Core NATS without changing the fundamental architecture.
Streams are the persistence primitive. A stream captures messages published to one or more subjects and stores them durably. Streams have configurable retention policies:
- Limits-based: Retain up to N messages, N bytes, or N time duration. Oldest messages are discarded when limits are exceeded.
- Interest-based: Retain messages only while there are active consumers. Once all consumers have acknowledged a message, it can be removed. This is closer to traditional queue semantics.
- Work queue: Each message is consumed by exactly one consumer. Once acknowledged, it is removed. This is a single-consumer queue, not a log.
Streams can be stored in memory or on disk (file storage, with optional compression). File-based streams survive restarts. Memory-based streams provide higher throughput at the cost of durability — the same trade-off Redis makes, but opt-in rather than the default.
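To internalise the limits-based policy, here is a toy Python model (illustrative only; the class and field names are invented, not JetStream's) of a stream that discards the oldest messages once a message-count limit is exceeded:

```python
from collections import deque

class LimitsStream:
    """Toy model of JetStream limits-based retention with DiscardOld:
    when max_msgs is exceeded, the oldest message is dropped."""
    def __init__(self, max_msgs: int):
        self.msgs = deque()
        self.max_msgs = max_msgs
        self.first_seq = 1  # sequence of the oldest retained message
        self.last_seq = 0

    def publish(self, data):
        self.last_seq += 1
        self.msgs.append((self.last_seq, data))
        # Enforce the limit by discarding from the head (the oldest end)
        while len(self.msgs) > self.max_msgs:
            self.msgs.popleft()
            self.first_seq += 1

s = LimitsStream(max_msgs=3)
for i in range(5):
    s.publish(f"order-{i}")
print(s.first_seq, s.last_seq)     # 3 5
print([d for _, d in s.msgs])      # ['order-2', 'order-3', 'order-4']
```

The key property this makes visible: sequence numbers never reset, so a consumer can always tell whether messages were discarded out from under it (its next expected sequence is below `first_seq`).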
Consumers are the subscription mechanism for JetStream. Unlike Core NATS subscriptions (which are ephemeral), JetStream consumers maintain state: their position in the stream, their acknowledgement records, and their delivery tracking.
Consumers can be:
- Durable: Server-side state survives consumer disconnection. The consumer resumes from where it left off.
- Ephemeral: State is discarded when the consumer disconnects.
Independently of durability, consumers can be:
- Push-based: The server delivers messages to the consumer as they become available.
- Pull-based: The consumer requests messages when ready. Pull consumers are preferred for most use cases because they provide natural backpressure.
Exactly-once semantics in JetStream are achieved through a combination of message deduplication at publish time (using a Nats-Msg-Id header and a configurable deduplication window) and double acknowledgement at consume time. The double-ack protocol ensures that neither the server nor the client processes a message more than once, at the cost of additional round trips.
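The publish-side deduplication window can be sketched with a toy model. This is not the NATS client API, just a few lines of Python showing the semantics: a repeated message ID inside the window is reported as a duplicate and not stored again, while the same ID after the window expires is treated as new.

```python
import time

class DedupWindow:
    """Toy model of JetStream publish deduplication."""
    def __init__(self, window_seconds=120.0):
        self.window = window_seconds
        self.seen = {}    # msg_id -> time first published
        self.stored = []  # messages actually appended to the stream

    def publish(self, msg_id, data, now=None):
        """Returns True if the message was a duplicate (and dropped)."""
        now = time.monotonic() if now is None else now
        # Expire ids that have fallen out of the window
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if msg_id in self.seen:
            return True
        self.seen[msg_id] = now
        self.stored.append(data)
        return False

w = DedupWindow(window_seconds=120)
print(w.publish("ord-7829-placed", b"...", now=0.0))    # False (stored)
print(w.publish("ord-7829-placed", b"...", now=10.0))   # True  (duplicate)
print(w.publish("ord-7829-placed", b"...", now=200.0))  # False (window expired)
```

The last line is the operational caveat worth remembering: deduplication is only as good as the window. A retry that arrives after the window has passed is stored again, which is why idempotent consumers are still recommended.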
Acknowledgement modes for consumers:
- AckExplicit: Consumer must acknowledge each message individually. The safe default.
- AckAll: Acknowledging message N implicitly acknowledges all messages before N. Higher throughput, riskier.
- AckNone: No acknowledgement required. Messages are considered delivered when sent. Fire-and-forget.
Clustering and Super-Clusters
NATS clustering is based on Raft consensus. A cluster of three or more NATS servers provides high availability and fault tolerance. JetStream data is replicated across cluster members (configurable replication factor of 1, 2, or 3).
Super-clusters connect multiple NATS clusters across regions or data centres using gateway connections. Gateways route messages between clusters transparently — a subscriber in Cluster A receives messages published in Cluster B without any application-level awareness. This creates a global messaging fabric with local-first performance (messages stay local when possible, cross the gateway only when needed).
Leaf nodes are NATS servers that connect to a cluster as a client rather than a full member. They extend the messaging fabric to edge locations, remote sites, or isolated environments without the overhead of full cluster membership. A leaf node in a factory floor, a retail store, or an IoT gateway can participate in the global NATS subject space while maintaining local autonomy.
┌─────────────────────────────────────────────────────┐
│                    Super-Cluster                    │
│                                                     │
│  ┌───────┐           ┌───────┐           ┌───────┐  │
│  │Cluster│◄─Gateway─►│Cluster│◄─Gateway─►│Cluster│  │
│  │  US   │           │  EU   │           │ APAC  │  │
│  └──┬────┘           └───────┘           └──┬────┘  │
└─────┼───────────────────────────────────────┼───────┘
      │                                       │
   ┌──┴───┐                               ┌──┴───┐
   │ Leaf │                               │ Leaf │
   │ Node │                               │ Node │
   │(Edge)│                               │(Edge)│
   └──────┘                               └──────┘
The leaf node architecture is NATS's genuine competitive advantage for edge computing. Deploying a 20 MB binary to an edge device that automatically connects to the nearest cluster, participates in subject-based routing, and gracefully handles disconnection and reconnection — this is something no other broker does as elegantly.
NATS KV and NATS Object Store
JetStream's persistence layer enables two additional abstractions:
NATS KV (Key-Value store) provides a distributed key-value store built on JetStream streams. It supports put, get, delete, watch (real-time notifications on key changes), and history (retrieve previous values of a key). It is not a replacement for Redis or etcd, but it is useful for configuration distribution, feature flags, and service discovery within the NATS ecosystem. One less dependency.
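The KV semantics (latest value per key, with history) fall naturally out of an append-only stream. A toy Python model makes the relationship explicit; the class and method names are invented for illustration and are not the nats-py API:

```python
class ToyKV:
    """Toy model of NATS KV built on an append-only log:
    each put appends a revision; get scans backwards for the
    latest; history returns every revision for a key."""
    def __init__(self):
        self.entries = []  # (revision, key, value) in append order
        self.rev = 0

    def put(self, key, value):
        self.rev += 1
        self.entries.append((self.rev, key, value))
        return self.rev

    def get(self, key):
        # Latest value wins: scan from the newest entry backwards
        for rev, k, v in reversed(self.entries):
            if k == key:
                return rev, v
        raise KeyError(key)

    def history(self, key):
        return [(rev, v) for rev, k, v in self.entries if k == key]

kv = ToyKV()
kv.put("feature.dark_mode", "off")
kv.put("feature.dark_mode", "on")
print(kv.get("feature.dark_mode"))      # (2, 'on')
print(kv.history("feature.dark_mode"))  # [(1, 'off'), (2, 'on')]
```

The real implementation stores each key as a subject in a JetStream stream and uses per-subject message limits to bound history, but the mental model above is the right one.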
NATS Object Store provides storage for large objects (files, binaries, anything that does not fit in a single NATS message). Objects are chunked and stored in a JetStream stream. This is useful for distributing configuration files, ML models, or other artefacts through the same infrastructure that handles your messaging.
Both features follow NATS's philosophy: if you already have NATS, you should not need a separate system for these common patterns.
Strengths
Simplicity is the product. NATS is a single binary (the nats-server). The configuration file for a basic cluster fits on a screen. The client libraries are small and well-documented. The concept count is low: subjects, publish, subscribe, request, reply. JetStream adds streams, consumers, and acknowledgements. That is the entire vocabulary. Compare this to Kafka's brokers, ZooKeeper (or KRaft), topics, partitions, consumer groups, offsets, ISR, replication factors, segment files, log compaction, and the extensive configuration surface — NATS is refreshingly minimal.
Tiny footprint. The NATS server binary is roughly 20 MB. It starts in milliseconds. It runs on a Raspberry Pi. It runs in a 32 MB container. The memory footprint under moderate load is measured in tens of megabytes. Kafka requires a JVM, gigabytes of heap, and a meaningful amount of disk I/O infrastructure. NATS requires almost nothing. For edge computing, IoT, and resource-constrained environments, this is not a nice-to-have — it is a hard requirement.
Incredible latency. Core NATS message delivery is measured in microseconds on a local network. JetStream adds disk I/O and replication overhead, but end-to-end latency is still typically sub-millisecond for persisted messages. NATS consistently benchmarks as one of the fastest messaging systems available.
Leaf node architecture. The ability to extend the NATS mesh to edge locations with lightweight leaf nodes is architecturally elegant and practically useful. A leaf node at the edge publishes to local subjects, which are transparently routed to the cloud cluster. The edge node works independently during network partitions and reconnects seamlessly. This is a genuine differentiator.
Multi-tenancy built in. NATS accounts provide native multi-tenancy with subject-level isolation, import/export between accounts, and resource limits (connections, data, subscriptions). This is built into the server, not bolted on. For SaaS platforms and shared infrastructure, this reduces the need for separate broker deployments per tenant.
Single binary deployment. Download, configure, run. No JVM. No ZooKeeper. No additional dependencies. The upgrade process is "replace the binary and restart." The operational overhead is, in absolute terms, the lowest of any production-grade message broker.
Weaknesses
Smaller ecosystem. Kafka has hundreds of connectors, a stream processing framework (Kafka Streams), a SQL interface (ksqlDB), a schema registry, and a massive community producing blog posts, conference talks, and Stack Overflow answers. NATS has good client libraries for many languages and a growing tool ecosystem (the nats CLI is excellent), but the breadth of third-party integrations, monitoring tools, and community knowledge is significantly smaller. When you hit a novel problem with Kafka, someone has probably written a blog post about it. With NATS, you may be writing that blog post yourself.
JetStream maturity. JetStream was released in 2021. It is stable and production-ready, but it has not had the decade-plus of battle-testing that Kafka's persistence layer has. Edge cases in consumer acknowledgement, cluster failover, and stream recovery are still being discovered and fixed. The pace of improvement is fast, but "fast improvement" implies there were things to improve.
Community size vs Kafka. The NATS community is passionate and helpful, but it is orders of magnitude smaller than Kafka's. This affects hiring (fewer engineers know NATS), consulting (fewer firms specialise in it), and tooling (fewer third-party monitoring and management tools).
No built-in stream processing. NATS does not have a Kafka Streams equivalent. If you need windowed aggregations, stream joins, or complex event processing, you need an external framework (Flink, Benthos, custom code). JetStream provides the storage and delivery guarantees, but the processing logic is your responsibility.
No log compaction. Like Event Hubs and Pub/Sub, NATS JetStream does not support log compaction. The NATS KV store provides latest-value-per-key semantics, but it is not the same as Kafka's compacted topics, which maintain a full changelog of a table.
Message size limits. The default maximum message size is 1 MB (configurable up to 64 MB, though large messages are not what NATS is optimised for). For large payloads, you need the Object Store or a claim-check pattern.
JetStream resource planning. While Core NATS is effectively "just run it," JetStream requires thinking about storage capacity, replication factors, and retention policies. A JetStream stream with replication factor 3 on file storage consumes three times the disk space. This is expected, but it means JetStream is not quite as "zero planning" as Core NATS.
Operational Reality
This is the section where NATS shines brightest, and it is worth lingering on because operational simplicity is NATS's core value proposition.
Deployment is downloading a binary and running it. On Kubernetes, the NATS Helm chart creates a StatefulSet, a headless Service, and a ConfigMap. That is the entire deployment. Compare this to the Kafka Helm chart, which creates ZooKeeper (or KRaft controllers), brokers, PersistentVolumeClaims, ConfigMaps, headless Services, and optionally a schema registry, Connect workers, and a REST proxy.
Configuration is a single file. A production NATS cluster configuration is roughly 50 lines. Here is a representative example:
# nats-server.conf
listen: 0.0.0.0:4222

jetstream {
    store_dir: /data/jetstream
    max_mem: 1G
    max_file: 100G
}

cluster {
    name: production
    listen: 0.0.0.0:6222
    routes: [
        nats-route://nats-0:6222
        nats-route://nats-1:6222
        nats-route://nats-2:6222
    ]
}

accounts {
    ORDERS {
        jetstream: enabled
        users: [{ user: orders_svc, password: $ORDERS_PASSWORD }]
    }
}
Monitoring is excellent for a project this size. The NATS server exposes a monitoring endpoint (HTTP on port 8222 by default) with JSON endpoints for connections, routes, gateways, JetStream, and health. The nats CLI tool provides real-time dashboards, stream inspection, and consumer management. Prometheus exporters exist and work well. The information you need to assess cluster health is readily available without custom tooling.
Upgrades are rolling restarts. Replace the binary, restart each node in sequence. JetStream streams are replicated, so a single-node restart does not cause data unavailability (with replication factor ≥ 2). The NATS team maintains excellent backward compatibility — the protocol has been stable for years.
Troubleshooting benefits from the simplicity of the system. When something is wrong, the surface area of possible causes is small. The server logs are clear. The nats CLI can inspect subjects, streams, consumers, and cluster state interactively. The mental model fits in your head, which means debugging fits in your head. This is an underappreciated property. The fastest incident resolution comes not from the best tooling, but from the smallest gap between "something is wrong" and "I understand what could be wrong."
Scaling follows different patterns for Core NATS and JetStream:
- Core NATS scales by adding servers to the cluster. Message routing is distributed, and the cluster handles millions of messages per second.
- JetStream scales by distributing streams across cluster members. Each stream has a leader that handles writes, so write throughput for a single stream is bounded by a single server's I/O. For higher throughput, partition across multiple streams (application-level sharding).
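Application-level sharding usually means hashing a key to pick one of N partitioned subjects, each captured by its own stream so that writes spread across stream leaders. A minimal sketch follows; the subject scheme and partition count are assumptions for illustration, not NATS conventions:

```python
import zlib

NUM_PARTITIONS = 4

def partition_subject(order_id: str) -> str:
    """Pick a partitioned subject for this order. With one stream per
    partition (e.g. ORDERS_0 capturing orders.p0.>), write throughput
    scales across leaders instead of funnelling through one server."""
    # A stable hash keeps the same key on the same partition,
    # preserving per-order message ordering.
    p = zlib.crc32(order_id.encode()) % NUM_PARTITIONS
    return f"orders.p{p}.placed"

print(partition_subject("ord-7829"))
```

The trade-off is the usual one: you gain parallel write paths but lose total ordering across partitions, and a consumer that needs every order event must now read from N streams.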
Ideal Use Cases
Edge computing and IoT. The leaf node architecture, tiny footprint, and built-in reconnection handling make NATS the natural choice for edge-to-cloud messaging. A 20 MB binary on an edge gateway that publishes sensor data to telemetry.factory.line3.temperature and receives commands on commands.factory.line3.> — this is NATS's happy path.
Kubernetes-native microservices. NATS runs beautifully in Kubernetes. The StatefulSet deployment is minimal, the resource requirements are low, and the subject-based routing model maps naturally to service communication patterns. For teams that want an in-cluster message bus without the operational weight of Kafka, NATS is the answer.
Request-reply service communication. NATS's built-in request-reply pattern provides a lightweight alternative to HTTP-based service-to-service communication. It is faster than REST, simpler than gRPC (no code generation, no proto files), and naturally load-balanced across service instances through queue groups.
Real-time messaging and signalling. Chat systems, presence updates, live dashboards, gaming backends — any use case where sub-millisecond message delivery matters and message loss is tolerable (or handled at the application level). Core NATS is purpose-built for this.
Multi-region and multi-cloud messaging. The super-cluster and gateway architecture makes NATS one of the simplest options for building a global messaging fabric. Each region runs its own cluster, gateways connect them, and messages route transparently.
Lightweight event-driven architectures. For teams that do not need Kafka's throughput or storage capabilities but want more than a REST-based architecture, NATS + JetStream provides event-driven communication with persistence and delivery guarantees at a fraction of the operational cost.
Code Examples
Go
package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log"
    "os"
    "os/signal"
    "time"

    "github.com/nats-io/nats.go"
    "github.com/nats-io/nats.go/jetstream"
)

// --- Core NATS: Pub/Sub and Request-Reply ---
func coreNATSExample() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Simple pub/sub
    nc.Subscribe("orders.>", func(msg *nats.Msg) {
        fmt.Printf("Received on %s: %s\n", msg.Subject, string(msg.Data))
    })

    // Queue group subscription (load-balanced across group members)
    nc.QueueSubscribe("orders.placed", "order-processors", func(msg *nats.Msg) {
        fmt.Printf("Processing order: %s\n", string(msg.Data))
    })

    // Request-reply
    nc.Subscribe("orders.validate", func(msg *nats.Msg) {
        // Responder: validate the order and reply
        msg.Respond([]byte(`{"valid": true}`))
    })

    // Requester: send a request and wait for a reply
    resp, err := nc.Request("orders.validate",
        []byte(`{"orderId": "ord-7829"}`),
        2*time.Second,
    )
    if err != nil {
        log.Printf("Request failed: %v", err)
    } else {
        fmt.Printf("Validation response: %s\n", string(resp.Data))
    }

    // Publish (fire-and-forget)
    order := map[string]interface{}{
        "orderId":     "ord-7829",
        "totalAmount": 149.99,
        "region":      "us-east-1",
    }
    data, _ := json.Marshal(order)
    nc.Publish("orders.placed", data)
}

// --- JetStream: Persistent Streams and Consumers ---
func jetStreamExample() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := jetstream.New(nc)
    if err != nil {
        log.Fatal(err)
    }
    ctx := context.Background()

    // Create or update a stream
    stream, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
        Name:      "ORDERS",
        Subjects:  []string{"orders.>"},
        Storage:   jetstream.FileStorage,
        Replicas:  3,
        Retention: jetstream.LimitsPolicy,
        MaxAge:    7 * 24 * time.Hour,      // 7 days retention
        MaxBytes:  10 * 1024 * 1024 * 1024, // 10 GB
        Discard:   jetstream.DiscardOld,
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Stream %s: %d messages\n", stream.CachedInfo().Config.Name,
        stream.CachedInfo().State.Msgs)

    // Publish with acknowledgement
    data, _ := json.Marshal(map[string]interface{}{
        "orderId":     "ord-7829",
        "totalAmount": 149.99,
        "region":      "us-east-1",
    })
    ack, err := js.Publish(ctx, "orders.placed", data,
        jetstream.WithMsgID("ord-7829-placed"), // For deduplication
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Published to stream %s, seq %d\n", ack.Stream, ack.Sequence)

    // Create a durable pull consumer
    consumer, err := js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
        Durable:       "order-processor",
        AckPolicy:     jetstream.AckExplicitPolicy,
        FilterSubject: "orders.placed",
        MaxDeliver:    5, // Max redelivery attempts
        AckWait:       30 * time.Second,
        DeliverPolicy: jetstream.DeliverAllPolicy,
    })
    if err != nil {
        log.Fatal(err)
    }

    // Consume messages
    iter, err := consumer.Messages()
    if err != nil {
        log.Fatal(err)
    }

    // Graceful shutdown
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt)

    go func() {
        for {
            msg, err := iter.Next()
            if err != nil {
                log.Printf("Consumer error: %v", err)
                return
            }
            var order map[string]interface{}
            if err := json.Unmarshal(msg.Data(), &order); err != nil {
                log.Printf("Invalid message: %v", err)
                msg.Term() // Terminate — do not redeliver
                continue
            }
            meta, _ := msg.Metadata()
            fmt.Printf("Processing order %s (attempt %d)\n",
                order["orderId"], meta.NumDelivered)
            if err := processOrder(order); err != nil {
                msg.Nak() // Negative ack — redeliver
            } else {
                msg.Ack() // Success
            }
        }
    }()

    <-sigCh
    iter.Stop()
    fmt.Println("Shut down gracefully")
}

func processOrder(order map[string]interface{}) error {
    fmt.Printf("Order %s: $%.2f\n", order["orderId"], order["totalAmount"])
    return nil
}
Python (nats-py)
import asyncio
import json
import signal

import nats
from nats.js.api import StreamConfig, ConsumerConfig, RetentionPolicy, AckPolicy

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # --- Create a stream ---
    await js.add_stream(
        StreamConfig(
            name="ORDERS",
            subjects=["orders.>"],
            retention=RetentionPolicy.LIMITS,
            max_age=7 * 24 * 60 * 60,  # 7 days, in seconds (the client converts)
            max_bytes=10 * 1024 * 1024 * 1024,  # 10 GB
            num_replicas=3,
        )
    )

    # --- Publish ---
    order = {
        "orderId": "ord-7829",
        "totalAmount": 149.99,
        "region": "us-east-1",
    }
    ack = await js.publish(
        "orders.placed",
        json.dumps(order).encode(),
        headers={"Nats-Msg-Id": "ord-7829-placed"},  # Deduplication
    )
    print(f"Published to {ack.stream}, seq {ack.seq}")

    # --- Subscribe with durable consumer ---
    sub = await js.pull_subscribe(
        "orders.placed",
        durable="order-processor",
        config=ConsumerConfig(
            ack_policy=AckPolicy.EXPLICIT,
            max_deliver=5,
            ack_wait=30,
        ),
    )

    running = True

    def shutdown():
        nonlocal running
        running = False

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGINT, shutdown)

    while running:
        try:
            messages = await sub.fetch(batch=10, timeout=5)
            for msg in messages:
                try:
                    event = json.loads(msg.data.decode())
                    print(f"Processing order {event['orderId']}: "
                          f"${event['totalAmount']}")
                    await process_order(event)
                    await msg.ack()
                except Exception as e:
                    print(f"Processing failed: {e}")
                    await msg.nak()
        except nats.errors.TimeoutError:
            continue  # No messages available, loop back

    await nc.close()
    print("Consumer shut down")

async def process_order(event: dict):
    pass

# --- Core NATS: Simple pub/sub and request-reply ---
async def core_nats_example():
    nc = await nats.connect("nats://localhost:4222")

    # Subscribe with queue group
    async def handler(msg):
        data = json.loads(msg.data.decode())
        print(f"Received: {data}")

    await nc.subscribe("orders.placed", queue="order-processors", cb=handler)

    # Request-reply
    async def validator(msg):
        order = json.loads(msg.data.decode())
        response = {"valid": True, "orderId": order["orderId"]}
        await msg.respond(json.dumps(response).encode())

    await nc.subscribe("orders.validate", cb=validator)
    response = await nc.request(
        "orders.validate",
        json.dumps({"orderId": "ord-7829"}).encode(),
        timeout=2.0,
    )
    print(f"Validation: {response.data.decode()}")

    # Publish
    await nc.publish(
        "orders.placed",
        json.dumps({"orderId": "ord-7829", "total": 149.99}).encode(),
    )
    await asyncio.sleep(1)  # Let the subscriber process
    await nc.close()

if __name__ == "__main__":
    asyncio.run(main())
Verdict
NATS is the message broker for people who have operated other message brokers and are tired. Tired of JVM tuning. Tired of ZooKeeper. Tired of configuration files that require a manual. Tired of "simple" deployments that involve twelve Helm charts and a prayer.
That sounds like damning-with-faint-praise, but it is the highest compliment you can pay an infrastructure component. The best infrastructure is the infrastructure you do not think about, and NATS comes closer to that ideal than any other message broker in this book. It deploys in minutes, runs on resources that Kafka would consider a rounding error, and provides a programming model that a junior engineer can learn in an afternoon.
JetStream elevates NATS from "interesting but limited" to "serious contender." The addition of persistence, exactly-once semantics, and stream processing primitives means NATS can serve as the primary messaging layer for many workloads that previously required Kafka. It is not a Kafka replacement for all use cases — Kafka's throughput at scale, ecosystem breadth, and log compaction are not matched — but for the majority of teams whose event throughput is measured in thousands or tens of thousands per second rather than millions, NATS provides the same guarantees with dramatically less operational overhead.
The leaf node architecture is a genuine innovation. For organisations with edge computing requirements — IoT, retail, manufacturing, distributed offices — NATS provides a unified messaging fabric from edge to cloud that no other broker matches in simplicity or footprint.
The honest risk assessment: NATS is a smaller project with a smaller community. If you need a connector for every database, a managed offering on every cloud, and the confidence that comes from hundreds of conference talks about production deployments, Kafka is the safer bet. If you need a message broker that you can deploy, understand, operate, and debug with a small team, NATS is the bet worth making.
The best technology choice is the one that fits your team, your workload, and your tolerance for operational complexity. For a growing number of teams, that choice is NATS — not because it does the most, but because it demands the least while delivering what matters.
Apache ActiveMQ and Artemis
Every technology ecosystem has its veterans — the projects that were solving real problems before the current generation of engineers had written their first Hello, World. In the Java messaging world, Apache ActiveMQ is that veteran. It has been faithfully shuttling JMS messages since 2004, has survived multiple hype cycles, outlasted several "next big things," and remains in production at more enterprises than most people realise. It is not glamorous. It does not trend on Hacker News. It just works, mostly, for a very specific and enduring set of use cases.
Then there is Artemis — the younger, faster, architecturally superior successor that everyone agrees is the future but that still lives in the shadow of Classic's installed base. The relationship between the two is a case study in how difficult it is to deprecate enterprise software, even when you have built something unambiguously better.
This chapter covers both, because in practice you cannot understand one without the other.
Overview
ActiveMQ Classic
ActiveMQ was created in 2004 by a group of developers who needed an open-source JMS broker. At the time, the alternatives were proprietary and expensive — IBM MQ (then WebSphere MQ), TIBCO EMS, SonicMQ. The Java Message Service specification existed, but open-source implementations were thin on the ground. ActiveMQ filled that gap.
It became an Apache top-level project in 2005 and quickly established itself as the open-source JMS broker. If you were building enterprise Java applications in the mid-2000s and needed messaging, ActiveMQ was likely on your shortlist. It was embedded in Apache ServiceMix, used by Apache Camel, and became a cornerstone of the enterprise integration landscape.
Classic is built on a traditional architecture: a broker process that accepts connections, routes messages to queues and topics, and manages persistence through KahaDB (its default storage engine). It works. It has worked for twenty years. But the architecture has limits that become apparent at scale, and those limits are why Artemis exists.
Apache ActiveMQ Artemis
Artemis has a more interesting lineage than its name suggests. It started life as JBoss Messaging, was rewritten as HornetQ by the JBoss/Red Hat team, and was then donated to the Apache Foundation in 2015 to become the next generation of ActiveMQ. The person most associated with the project is Clebert Suconic, who led HornetQ and continued to shepherd Artemis.
The donation was not purely altruistic. Red Hat had a message broker (HornetQ) that was excellent but had a small community relative to ActiveMQ's name recognition. Apache had a message broker (ActiveMQ Classic) with massive name recognition but an aging architecture. The merger made sense on paper, and — unusually for these things — it has largely worked in practice.
Artemis is not a patched version of Classic. It is a ground-up rewrite with a fundamentally different architecture. The two share a name, a community, and a set of goals, but very little code.
Architecture
ActiveMQ Classic Architecture
Classic follows the traditional enterprise broker pattern. A JVM process runs the broker. Clients connect via one of several supported protocols. Messages are received, persisted (if durable), routed to the appropriate destination, and delivered to consumers. The storage engine (KahaDB by default, though JDBC-backed storage is available) handles persistence.
The architecture is straightforward but single-threaded in some critical paths. KahaDB uses a transaction journal plus B-tree index approach that works well at moderate volumes but can become a bottleneck under heavy load. The broker maintains an in-memory dispatch queue and pages to disk when memory limits are reached, which can introduce unpredictable latency spikes.
Classic's networking model supports "networks of brokers" — multiple broker instances connected in a mesh or tree topology to distribute load. This works, but it is operationally complex, the forwarding logic has historically been a source of subtle bugs, and the semantics of message ordering across a network of brokers are... best described as "approximate."
Artemis Architecture
Artemis takes a meaningfully different approach, and the differences matter.
Journal-Based Storage. Artemis uses an append-only journal for persistence — a design that should sound familiar to anyone who has spent time with Kafka or any modern log-structured storage system. The journal writes are sequential, which means they can saturate disk I/O bandwidth far more efficiently than the random I/O patterns of Classic's KahaDB. The journal supports both NIO (Java NIO file channels) and AIO (Linux asynchronous I/O via libaio) backends. On Linux with AIO, write performance is exceptional.
<!-- Artemis journal configuration (broker.xml) -->
<journal-type>ASYNCIO</journal-type>
<journal-directory>data/journal</journal-directory>
<journal-min-files>2</journal-min-files>
<journal-pool-files>10</journal-pool-files>
<journal-file-size>10M</journal-file-size>
<journal-buffer-timeout>4000</journal-buffer-timeout>
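The append-only idea is worth internalising, because it is the same design that underpins Kafka's log. A minimal sketch in Python (illustrative only; the names and record format are invented here, not Artemis's actual journal format):

```python
import json
import os
import tempfile

# Minimal append-only journal sketch: every operation is a sequential
# write, and current state is rebuilt by replaying the log from the start.
# Illustrative only -- not Artemis's on-disk format.
class Journal:
    def __init__(self, path):
        self.f = open(path, "a+", encoding="utf-8")

    def append(self, op, msg_id, body=None):
        self.f.write(json.dumps({"op": op, "id": msg_id, "body": body}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # the durability point, like a journal sync

    def replay(self):
        self.f.seek(0)
        live = {}
        for line in self.f:
            rec = json.loads(line)
            if rec["op"] == "ADD":
                live[rec["id"]] = rec["body"]
            elif rec["op"] == "DEL":  # an acknowledgment is itself a record
                live.pop(rec["id"], None)
        return live

path = os.path.join(tempfile.mkdtemp(), "journal.log")
j = Journal(path)
j.append("ADD", "m1", "hello")
j.append("ADD", "m2", "world")
j.append("DEL", "m1")   # consumption appends a delete record; nothing is rewritten
print(j.replay())       # {'m2': 'world'}
```

The point of the sketch: deletes never touch earlier records, so every write is sequential, which is what lets the real journal saturate disk bandwidth.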
Non-Blocking I/O. Artemis uses Netty for all network I/O, which means it can handle far more concurrent connections than Classic on the same hardware. The threading model is designed around a small number of I/O threads feeding work to a configurable thread pool, avoiding the thread-per-connection pattern that limited Classic.
Paging. When destinations exceed their configured memory limits, Artemis pages messages to disk transparently. Unlike Classic's approach (which could lead to unpredictable behaviour), Artemis paging is a first-class feature with well-defined semantics: messages are paged in order, consumers drain the in-memory messages first, and paged messages are then brought back in the same order. It is not magic — paging adds latency — but it is predictable.
Large Message Support. Messages that exceed a configurable threshold (default 100KB) are stored outside the journal in separate files. This prevents large payloads from bloating the journal and affecting throughput for smaller messages. A practical feature that reflects real-world usage patterns where the occasional 50MB message coexists with millions of 1KB messages.
Address Model. Artemis uses a unified address model that is more flexible than the traditional JMS queue/topic distinction. An address is a named endpoint. Attached to each address are one or more queues. The routing type determines behaviour:
- Anycast: messages are distributed across queues in a round-robin fashion (queue semantics)
- Multicast: messages are copied to all queues (topic semantics)
This model cleanly maps to JMS queues and topics, AMQP links, MQTT subscriptions, and STOMP destinations. It is one of the reasons Artemis can support so many protocols simultaneously without awkward impedance mismatches.
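To make the two routing types concrete, here is a toy model in Python (the class and queue names are invented for illustration; this mirrors the semantics, not Artemis internals):

```python
import itertools

# Toy model of the Artemis address model: an address routes to its
# attached queues either anycast (round-robin across queues) or
# multicast (a copy to every queue). Illustrative only.
class Address:
    def __init__(self, routing_type, queue_names):
        self.routing_type = routing_type
        self.queues = {name: [] for name in queue_names}
        self._rr = itertools.cycle(queue_names)

    def route(self, message):
        if self.routing_type == "anycast":       # queue semantics
            self.queues[next(self._rr)].append(message)
        elif self.routing_type == "multicast":   # topic semantics
            for q in self.queues.values():
                q.append(message)

orders = Address("anycast", ["worker-a", "worker-b"])
for m in ["m1", "m2", "m3", "m4"]:
    orders.route(m)
print(orders.queues)   # each worker receives every other message

audit = Address("multicast", ["billing", "analytics"])
audit.route("evt-1")
print(audit.queues)    # every queue receives a copy
```

A JMS queue is simply an address with one anycast queue; a JMS topic is an address with one multicast queue per subscriber.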
Protocol Support
This is one of ActiveMQ's genuine differentiators — both Classic and Artemis are protocol polyglots.
Classic Protocols
- OpenWire: The native ActiveMQ wire protocol, used by the ActiveMQ JMS client. Efficient for Java-to-Java communication.
- STOMP: Simple text-based protocol. Good for non-Java clients.
- AMQP 1.0: Added later, but functional.
- MQTT: For IoT use cases.
- WebSocket: For browser-based clients.
Artemis Protocols
- Core: Artemis's native high-performance protocol.
- OpenWire: For backward compatibility with existing ActiveMQ Classic clients.
- AMQP 1.0: First-class support, not an afterthought.
- STOMP: Text-based simplicity.
- MQTT: Versions 3.1, 3.1.1, and 5.0.
- HornetQ: For migration from HornetQ installations.
Each protocol runs on its own acceptor (port), or multiple protocols can share a single port with auto-detection. The fact that you can have a Java JMS producer sending via OpenWire, a Python consumer receiving via AMQP, and an IoT device publishing via MQTT — all through the same broker, all interoperating on the same address — is genuinely useful in heterogeneous environments. It is also the kind of thing that makes debugging exciting in ways you did not ask for.
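A multi-protocol acceptor in broker.xml looks roughly like this (a sketch following the Artemis acceptor URL syntax; verify the exact protocol names and parameters against your broker version):

```xml
<!-- One shared acceptor with protocol auto-detection, plus a dedicated
     MQTT port -- ports and names here are illustrative -->
<acceptors>
<acceptor name="all-protocols">tcp://0.0.0.0:61616?protocols=CORE,AMQP,STOMP,MQTT,OPENWIRE</acceptor>
<acceptor name="mqtt-only">tcp://0.0.0.0:1883?protocols=MQTT</acceptor>
</acceptors>
```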
Clustering
Classic: Network of Brokers
Classic's clustering model involves connecting multiple brokers via "network connectors." Brokers forward messages to each other based on consumer demand. If broker A has a message for a queue, and broker B has a consumer for that queue, the message is forwarded.
This model has the advantage of conceptual simplicity and the disadvantage of operational complexity. Message ordering across the network is not guaranteed. Duplicate detection requires care. Network splits can lead to message duplication or loss, depending on your configuration. The "advisory message" system (used for internal broker-to-broker communication) can itself become a performance bottleneck. Enterprises have made this work, often with dedicated messaging teams who understand the failure modes intimately.
Artemis: Live-Backup Pairs and Clustering
Artemis takes a different approach to high availability and clustering:
Live-Backup Pairs. A primary (live) server has one or more backup servers. The backup replicates the journal from the primary. If the primary fails, the backup activates and takes over. Failover is automatic for clients using the Artemis or OpenWire client libraries. This is straightforward HA — no split-brain if you configure the quorum correctly, deterministic failover, and the backup has a warm copy of the data.
<!-- Primary broker configuration -->
<ha-policy>
    <replication>
        <primary>
            <group-name>my-pair</group-name>
            <vote-on-replication-failure>true</vote-on-replication-failure>
            <quorum-size>1</quorum-size>
        </primary>
    </replication>
</ha-policy>

<!-- Backup broker configuration -->
<ha-policy>
    <replication>
        <backup>
            <group-name>my-pair</group-name>
            <allow-failback>true</allow-failback>
        </backup>
    </replication>
</ha-policy>
Clustering (Symmetric Cluster). Multiple live-backup pairs can be connected in a cluster. Unlike Classic's network of brokers, Artemis clusters use a more formal message redistribution mechanism. Messages are redistributed between nodes when consumers exist on other nodes, and the redistribution delay is configurable to allow local consumers a chance to process messages first.
The clustering model works well but is not a replacement for something like Kafka's partition-based parallelism. Artemis clustering is designed for HA and moderate scale-out, not for the kind of massive horizontal scaling that Kafka enables. Know what it is designed for, and you will not be disappointed.
Strengths
JMS Compliance. If you need a JMS broker, ActiveMQ (either variant) is one of the most complete implementations available. Artemis passes the JMS TCK, supports JMS 2.0, and handles the full range of JMS features — selectors, message groups, scheduled delivery, last-value queues, transactions (including XA). This matters less than it used to, but in environments where JMS is a requirement (and there are more of these than Twitter would have you believe), it matters a lot.
Protocol Polyglot. Supporting AMQP, MQTT, STOMP, OpenWire, and a native protocol on the same broker is genuinely useful. You do not need separate infrastructure for your Java services, your Python scripts, and your IoT devices. One broker, multiple protocols, shared destinations.
Enterprise Integration. ActiveMQ is a natural fit with Apache Camel, Spring Boot, and the broader Java enterprise ecosystem. The integration is deep, well-documented, and battle-tested. Spring's JmsTemplate, Camel's JMS component, and Java EE's MDB (Message-Driven Bean) pattern all work seamlessly.
Maturity. Twenty years of production deployments have surfaced and fixed a lot of bugs. The failure modes are well-understood. The documentation, while not always exciting, is comprehensive. The mailing list archives are a treasure trove of solutions to problems you have not encountered yet.
Artemis Performance. Artemis is genuinely fast for a traditional broker. The journal-based storage, non-blocking I/O, and efficient threading model combine to deliver throughput and latency numbers that are competitive with anything in the traditional broker category. It will not match Kafka for raw throughput on append-only workloads, but for transactional, routed messaging it is excellent.
Weaknesses
Java Ecosystem Lock-In. ActiveMQ is a Java application. Its management tools are Java. Its plugin system is Java. Its best client library is Java. Yes, other protocols provide non-Java access, but the centre of gravity is firmly in the JVM world. If your organisation is primarily Python, Go, or Rust, ActiveMQ will feel like a foreign object.
Throughput Ceiling. For all of Artemis's improvements, it is still a traditional broker — every message passes through the broker, is persisted, and is dispatched. For workloads in the millions-of-messages-per-second range, you need Kafka, Redpanda, or a log-based system. Artemis targets the tens-of-thousands to low-hundreds-of-thousands messages per second range (per broker), which is plenty for most enterprise workloads but not enough for high-volume event streaming.
Community Momentum. The messaging conversation has shifted to Kafka, Pulsar, NATS, and cloud-native alternatives. ActiveMQ development continues, but the community is smaller and less active than it was a decade ago. Finding experienced ActiveMQ operators under the age of forty is increasingly challenging. This is not a technical problem, but it is a practical one.
Classic's Technical Debt. Classic's architecture shows its age. KahaDB has known performance limitations. The thread model does not scale well with connection count. The network of brokers feature, while functional, is complex and fragile. If you are starting a new project, there is no technical reason to choose Classic over Artemis.
Configuration Complexity. Artemis is configured via XML files (broker.xml, bootstrap.xml, etc.) that are comprehensive but verbose. The number of tunables is vast, and the defaults are not always optimal for your workload. You will spend time reading documentation about journal buffer sizes, thread pool configurations, and address settings. This is the price of flexibility.
Ideal Use Cases
Existing Java/JMS Ecosystems. If you have Java services that speak JMS, ActiveMQ (particularly Artemis) is a natural choice. The migration path from other JMS brokers (IBM MQ, TIBCO EMS) is well-documented.
Protocol Bridge. When you need a single broker that speaks AMQP, MQTT, STOMP, and JMS, Artemis is one of the few options that handles all of them competently.
Enterprise Integration Patterns. If your architecture involves Apache Camel routes, Enterprise Service Bus patterns, or traditional request-reply messaging, ActiveMQ was designed for this world.
Transactional Messaging. XA transactions, JMS transactions, and the full machinery of transactional messaging are first-class features. If your workflow requires "dequeue message, update database, commit both atomically," ActiveMQ has you covered.
Gradual Modernisation. Organisations moving from monolithic to distributed architectures can start with ActiveMQ (a familiar paradigm for enterprise Java developers) and evolve toward event streaming as their needs grow.
Operational Reality
Running ActiveMQ in production is not inherently difficult, but it does require the kind of care and feeding that any stateful system demands.
JMX Monitoring. Both Classic and Artemis expose management and monitoring via JMX. This is the primary monitoring interface, and if you are not running a JMX-aware monitoring stack, you will need to set one up. Prometheus exporters exist (the jmx_exporter from Prometheus works well) but require configuration to expose the metrics you actually care about.
Key metrics to watch:
- Queue depth (message count per destination)
- Enqueue/dequeue rates
- Memory usage (broker heap, journal pages)
- Connection count
- Consumer count per destination
- DLQ (dead-letter queue) depth — this is your canary
Memory Tuning. Artemis is a JVM application, which means you are in the business of tuning garbage collection. For latency-sensitive workloads, G1GC or ZGC with appropriately sized heaps is the starting point. The journal buffer size, page size, and global max size all interact with JVM heap settings in ways that are not immediately intuitive. The Artemis documentation covers this well, but expect to spend an afternoon with GC logs and a profiler during initial setup.
# Typical Artemis JVM settings for a production broker
JAVA_ARGS="-Xms4g -Xmx4g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=50 \
-XX:+ParallelRefProcEnabled \
-XX:+UseStringDeduplication \
-Dhawtio.realm=activemq \
-Dhawtio.offline=true"
Journal Management. The Artemis journal will grow as messages accumulate. Paging kicks in when in-memory limits are reached. If consumers fall behind and paged data grows unbounded, you will eventually run out of disk. Monitoring disk usage and setting address-level limits (with appropriate address-full-policy — PAGE, DROP, BLOCK, or FAIL) is essential.
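A sketch of per-address limits in broker.xml (element names follow the Artemis address-settings schema; treat the match pattern and sizes as placeholder assumptions to adapt to your workload):

```xml
<!-- Illustrative per-address limits: page to disk once "orders.*"
     addresses hold more than 100MB in memory -->
<address-settings>
<address-setting match="orders.#">
<max-size-bytes>104857600</max-size-bytes>
<page-size-bytes>10485760</page-size-bytes>
<address-full-policy>PAGE</address-full-policy>
</address-setting>
</address-settings>
```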
Upgrades. ActiveMQ upgrades are generally straightforward — the project takes backward compatibility seriously. Artemis supports rolling upgrades within minor versions. Major version upgrades require more care but are well-documented.
Hawtio Console. Artemis ships with the Hawtio web console for management and monitoring. It is functional if somewhat dated in appearance. You can view queue depths, browse messages, send test messages, and manage addresses. It is adequate for development and debugging but should not be your primary production monitoring tool.
Code Examples
Java JMS 2.0 (Artemis)
import javax.jms.*;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class ArtemisJmsExample {
    public static void main(String[] args) throws Exception {
        // Producer
        try (ActiveMQConnectionFactory factory =
                 new ActiveMQConnectionFactory("tcp://localhost:61616");
             JMSContext context = factory.createContext()) {
            JMSProducer producer = context.createProducer();
            Queue queue = context.createQueue("orders");
            // Send with properties for routing/filtering
            TextMessage message = context.createTextMessage(
                "{\"orderId\": \"ord-7829\", \"amount\": 149.99}");
            message.setStringProperty("eventType", "OrderPlaced");
            message.setStringProperty("region", "eu-west");
            producer.send(queue, message);
            System.out.println("Sent: " + message.getText());
        }
        // Consumer
        try (ActiveMQConnectionFactory factory =
                 new ActiveMQConnectionFactory("tcp://localhost:61616");
             JMSContext context = factory.createContext()) {
            Queue queue = context.createQueue("orders");
            // Selector: only receive EU orders
            JMSConsumer consumer = context.createConsumer(queue,
                "region = 'eu-west'");
            Message received = consumer.receive(5000);
            if (received instanceof TextMessage) {
                System.out.println("Received: " +
                    ((TextMessage) received).getText());
            }
        }
    }
}
Spring Boot with JMS
// Configuration
@Configuration
public class ArtemisConfig {
    @Bean
    public ConnectionFactory connectionFactory() {
        return new ActiveMQConnectionFactory("tcp://localhost:61616");
    }

    @Bean
    public JmsTemplate jmsTemplate(ConnectionFactory connectionFactory) {
        JmsTemplate template = new JmsTemplate(connectionFactory);
        template.setDefaultDestinationName("events");
        return template;
    }
}

// Producer Service
@Service
public class OrderEventPublisher {
    private final JmsTemplate jmsTemplate;

    public OrderEventPublisher(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    public void publishOrderPlaced(Order order) {
        jmsTemplate.convertAndSend("orders", order, message -> {
            message.setStringProperty("eventType", "OrderPlaced");
            message.setStringProperty("correlationId",
                UUID.randomUUID().toString());
            return message;
        });
    }
}

// Consumer - Message-Driven POJO
@Component
public class OrderEventConsumer {
    private static final Logger log =
        LoggerFactory.getLogger(OrderEventConsumer.class);

    @JmsListener(destination = "orders",
                 selector = "eventType = 'OrderPlaced'")
    public void handleOrderPlaced(Order order,
                                  @Header("correlationId") String corrId) {
        log.info("Processing order {} (correlationId: {})",
            order.getOrderId(), corrId);
        // Process the order event
    }
}
AMQP 1.0 (Python, using python-qpid-proton)
from proton import Message
from proton.handlers import MessagingHandler
from proton.reactor import Container
class OrderProducer(MessagingHandler):
    """Sends order events to Artemis via AMQP 1.0."""
    def __init__(self, url, queue, messages):
        super().__init__()
        self.url = url
        self.queue = queue
        self.messages = messages
        self.sent = 0

    def on_start(self, event):
        conn = event.container.connect(self.url)
        self.sender = event.container.create_sender(conn, self.queue)

    def on_sendable(self, event):
        while event.sender.credit and self.sent < len(self.messages):
            msg = Message(
                body=self.messages[self.sent],
                properties={"event_type": "OrderPlaced"},
                content_type="application/json"
            )
            event.sender.send(msg)
            self.sent += 1
        if self.sent == len(self.messages):
            event.sender.close()
            event.connection.close()
class OrderConsumer(MessagingHandler):
    """Receives order events from Artemis via AMQP 1.0."""
    def __init__(self, url, queue, expected):
        super().__init__()
        self.url = url
        self.queue = queue
        self.expected = expected
        self.received = 0

    def on_start(self, event):
        conn = event.container.connect(self.url)
        event.container.create_receiver(conn, self.queue)

    def on_message(self, event):
        print(f"Received: {event.message.body}")
        print(f"Event type: {event.message.properties.get('event_type')}")
        self.received += 1
        if self.received == self.expected:
            event.connection.close()  # stop once the batch is drained

if __name__ == "__main__":
    # Artemis AMQP port is 5672 by default
    url = "amqp://localhost:5672"
    orders = [
        '{"orderId": "ord-001", "amount": 99.99}',
        '{"orderId": "ord-002", "amount": 249.50}',
    ]
    Container(OrderProducer(url, "orders", orders)).run()
    Container(OrderConsumer(url, "orders", len(orders))).run()
Classic vs Artemis: The Migration Question
If you are running Classic in production, the question is not whether to migrate to Artemis but when. Classic is in maintenance mode — it receives security fixes and critical bug fixes, but active development has moved to Artemis. The ActiveMQ project has been clear about this.
The migration is not trivial but it is well-supported:
- OpenWire Compatibility. Artemis speaks OpenWire, so existing ActiveMQ Classic clients can connect to Artemis without code changes. This is the most important migration enabler.
- Configuration Translation. The configuration models are different (Classic uses activemq.xml, Artemis uses broker.xml), but there is a migration tool and documentation that maps concepts between the two.
- Feature Parity. Artemis supports nearly all Classic features, with some differences in semantics. Virtual topics from Classic map to Artemis's address model. Network of brokers maps to Artemis clustering (with different semantics). Message groups, selectors, and scheduled delivery all have Artemis equivalents.
- Behavioural Differences. Some things work differently enough to require testing. Message priority handling, redelivery semantics, and memory management all differ in detail. Plan for a testing phase.
The practical advice: if Classic is working and you have no pressing issues, plan the migration but do not rush it. If you are starting a new project, use Artemis. There is no remaining reason to start with Classic.
Verdict
ActiveMQ Artemis is a thoroughly competent message broker that does not get the attention it deserves. It is fast, reliable, supports more protocols than almost anything else in the space, and has the kind of deep JMS support that enterprises actually need. It is not trying to be Kafka — it is not a distributed log, it is not designed for massive horizontal scaling, and it does not pretend to be a streaming platform.
What it is is a very good traditional message broker. If your use case involves routing messages between services, supporting mixed protocols, transactional messaging, or enterprise integration patterns, Artemis is a strong choice. If you are already in the Java ecosystem, it is an excellent one.
The honest risk is organisational, not technical. The community is smaller than Kafka's or RabbitMQ's. Hiring expertise is harder. The project moves at an enterprise pace, which means stability but slower innovation. If you choose Artemis, you are betting on a project with solid technical foundations and a dedicated (if small) maintainer community.
For new Java enterprise projects that need a message broker: Artemis is a top-tier choice. For polyglot microservices architectures: consider NATS or RabbitMQ first. For event streaming at scale: Kafka or Redpanda. For legacy Classic installations: start planning the Artemis migration, and take comfort in the fact that it is one of the smoother migration paths in the messaging world.
ActiveMQ may not be exciting, but in infrastructure, "exciting" is rarely a compliment.
ZeroMQ
Every other chapter in this section describes a broker — a server process that sits between producers and consumers, accepting messages, storing them, and routing them to the right place. ZeroMQ is not that. ZeroMQ is a library. There is no server. There is no broker. There is no daemon to install, configure, monitor, or page you about at 3 AM. This is simultaneously its greatest strength and the thing that confuses people most about it.
ZeroMQ describes itself as "sockets on steroids," which is the kind of tagline that either intrigues you or makes you nervous, depending on your relationship with socket programming. The more precise description is: ZeroMQ is a high-performance asynchronous messaging library that gives you smart sockets with built-in patterns for common distributed computing problems. It handles the transport (TCP, IPC, inproc, multicast), the framing, the reconnection, the buffering, and the routing. You handle everything else.
That "everything else" is the part they put in smaller font.
Overview
Philosophy
ZeroMQ's design philosophy is radical in the messaging world: the network is the broker. Instead of routing messages through a central server, ZeroMQ embeds the messaging logic directly in your application. Each endpoint is both a sender and a receiver, and the library handles the messy details of connection management, message framing, and I/O multiplexing.
This is not a new idea — it is how BSD sockets work, conceptually. What ZeroMQ adds on top of raw sockets is intelligence: automatic reconnection, message queuing, fan-out patterns, load balancing, and a clean API that abstracts away the worst of systems-level network programming. It is the difference between hand-rolling HTTP on top of TCP and using a library that handles the protocol for you — except ZeroMQ gives you building blocks rather than a finished protocol.
The design principles are:
- No broker. The fastest message is the one that does not pass through an intermediary.
- Smart endpoints, dumb network. The complexity lives in the application, not in infrastructure.
- Patterns, not protocols. ZeroMQ provides socket types that encode messaging patterns (request-reply, publish-subscribe, etc.) rather than implementing a specific application protocol.
- Zero-copy where possible. Performance is a first-order design concern, not an afterthought.
History
ZeroMQ was created by iMatix Corporation, primarily by Pieter Hintjens and Martin Sustrik. Hintjens was one of those rare figures in open source — a brilliant programmer, a gifted writer, and an absolute force of nature in community building. His book ZeroMQ: Messaging for Many Applications (commonly known as "the zguide") is not just the best ZeroMQ documentation; it is one of the best pieces of technical writing in the distributed systems space. It is also free to read online.
The project started around 2007, with the first stable release in 2010. The core library (libzmq) is written in C++, which contributes to its performance characteristics and its ability to provide bindings for virtually every programming language in existence.
Hintjens passed away in 2016, but the project continues under community stewardship. The ZeroMQ RFC process (based on the Collective Code Construction Contract, or C4, which Hintjens created) is itself noteworthy as a model for open-source governance.
Martin Sustrik, the other co-creator, left the ZeroMQ project and went on to create nanomsg (and later, nng) — spiritual successors that we will cover briefly at the end of this chapter. The split was philosophical and somewhat acrimonious, which is the natural state of open-source projects started by people with strong opinions.
Architecture
Socket Types
ZeroMQ's architecture is defined by its socket types. Each socket type encodes a specific messaging pattern, and the combinations of socket types create the distributed computing patterns you build with.
REQ/REP (Request-Reply)
The simplest pattern. A REQ socket sends a request and then blocks until it receives a reply. A REP socket receives a request and then sends a reply. Strictly synchronous, strictly alternating. The socket types enforce the send-receive-send-receive cadence — if you try to send two messages in a row on a REQ socket, ZeroMQ will complain.
REQ ──── request ────> REP
REQ <──── reply ────── REP
Useful for RPC-style communication. Fragile in practice because if either side dies mid-exchange, the surviving socket enters a confused state. This is why you rarely use raw REQ/REP in production — you use DEALER/ROUTER instead.
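A minimal REQ/REP round trip, shown here with pyzmq over the inproc transport so it runs in a single process (assumes pyzmq is installed):

```python
import threading
import zmq

# Minimal REQ/REP round trip over inproc. Assumes pyzmq is installed.
ctx = zmq.Context.instance()
rep = ctx.socket(zmq.REP)
rep.bind("inproc://echo")          # bind before connect for inproc
req = ctx.socket(zmq.REQ)
req.connect("inproc://echo")

def serve_one():
    msg = rep.recv()               # REP must receive before it may send
    rep.send(b"reply:" + msg)

t = threading.Thread(target=serve_one)
t.start()
req.send(b"ping")                  # a second send() here would raise an EFSM error
reply = req.recv()                 # blocks until the reply arrives
t.join()
print(reply)                       # b'reply:ping'
req.close()
rep.close()
ctx.term()
```

Note that there is no server process anywhere in this example: both ends are the application, and swapping `inproc://` for `tcp://` moves the same code across machines.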
PUB/SUB (Publish-Subscribe)
A PUB socket sends messages to all connected SUB sockets. SUB sockets subscribe to specific message prefixes (or receive everything). Messages flow one way only — PUB to SUB.
PUB ──── msg ────> SUB (subscribed to "orders.")
──── msg ────> SUB (subscribed to "") // all messages
──── msg ────> SUB (subscribed to "payments.")
The subscription filtering happens at the publisher side (in recent versions), which means unmatched messages are not sent over the wire. This is important for performance but means the publisher bears the CPU cost of filtering.
A critical subtlety: PUB/SUB has no delivery guarantees. If a subscriber is slow, messages are dropped (after the high-water mark is reached). If a subscriber connects after a message was sent, that message is gone. There is no persistence, no replay, no acknowledgment. This is by design.
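The filtering rule itself is simple byte-prefix matching, which this sketch mirrors (plain Python, modelling the semantics rather than the implementation):

```python
# ZeroMQ-style subscription matching: a SUB socket receives a message
# if and only if one of its subscriptions is a byte-prefix of the
# message. The empty subscription b"" matches everything.
def delivered(subscriptions, message: bytes) -> bool:
    return any(message.startswith(prefix) for prefix in subscriptions)

print(delivered([b"orders."], b"orders.placed ord-7829"))   # True
print(delivered([b"orders."], b"payments.settled p-1"))     # False
print(delivered([b""], b"anything at all"))                 # True
```

Because the match is a raw prefix, the "topic" is just the start of the message body; by convention the first frame of a multipart message carries it.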
PUSH/PULL (Pipeline)
PUSH sends messages to connected PULL sockets using round-robin load balancing. Messages flow one way. This is the pattern for distributing work across a pool of workers.
PUSH ──── task ────> PULL (worker 1)
──── task ────> PULL (worker 2)
──── task ────> PULL (worker 3)
No acknowledgment. No redelivery. If a worker dies after receiving a task, that task is lost. You build your own retry logic, or you accept the loss. PUSH/PULL is for throughput, not reliability.
DEALER/ROUTER (Advanced Request-Reply)
DEALER and ROUTER are the asynchronous, more flexible versions of REQ and REP. A DEALER socket does asynchronous round-robin send and fair-queued receive. A ROUTER socket prepends an identity frame to each message, allowing you to route replies back to specific peers.
This pair is the basis for most real-world ZeroMQ architectures. You can build broker-like intermediaries, load balancers, and complex routing topologies using DEALER/ROUTER combinations. The trade-off is complexity — you are managing identity frames and routing logic yourself.
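A minimal DEALER/ROUTER exchange with pyzmq (inproc transport; setting an explicit identity is an illustrative choice for clarity, not a requirement):

```python
import zmq

# ROUTER prepends the peer's identity frame to each message it receives;
# replying means sending that identity frame back first. Assumes pyzmq.
ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.bind("inproc://rr")

dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"worker-1")  # explicit identity for clarity
dealer.connect("inproc://rr")

dealer.send(b"hello")                     # asynchronous: queues and returns
ident, payload = router.recv_multipart()  # [identity frame, body]
router.send_multipart([ident, b"ack"])    # identity frame routes the reply
reply = dealer.recv()                     # DEALER sees only the body

print(ident, payload, reply)              # b'worker-1' b'hello' b'ack'
dealer.close()
router.close()
ctx.term()
```

The identity-frame bookkeeping is exactly the "routing logic you manage yourself" mentioned above; a broker built on ROUTER is essentially a table mapping identities to work.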
PAIR
One-to-one bidirectional communication. Used almost exclusively for inter-thread communication within a single process. Not useful for network communication.
The Wire Protocol: ZMTP
ZeroMQ defines its own wire protocol, ZMTP (ZeroMQ Message Transport Protocol). The current version is ZMTP 3.1. It handles:
- Connection handshake and version negotiation
- Security mechanism negotiation (NULL, PLAIN, CURVE)
- Message framing (messages are composed of one or more frames)
- Heartbeating (ZMTP 3.1 adds protocol-level PING/PONG commands, surfaced to applications through socket options)
ZMTP is simple and efficient. Messages are length-prefixed frames, with a flag byte indicating whether more frames follow (multipart messages). There is no content-type, no headers (beyond what you put in the frames), and no metadata. If you need application-level framing, you build it yourself — typically with Protocol Buffers, MessagePack, or JSON in the message payload.
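A simplified sketch of the data-frame layout (this models the ZMTP 3.x framing rules described above; the real protocol also negotiates a greeting, a security handshake, and command frames):

```python
import struct

# Simplified ZMTP 3.x data-frame encoding: one flags octet (bit 0 = MORE,
# bit 1 = LONG), then a 1-byte or 8-byte network-order length, then the
# body. Sketch only -- greeting, handshake, and commands are omitted.
MORE, LONG = 0x01, 0x02

def encode_frame(body: bytes, more: bool = False) -> bytes:
    flags = MORE if more else 0
    if len(body) > 255:
        return struct.pack("!BQ", flags | LONG, len(body)) + body
    return struct.pack("!BB", flags, len(body)) + body

def encode_message(frames):
    # A multipart message: every frame but the last sets the MORE flag.
    return b"".join(
        encode_frame(f, more=(i < len(frames) - 1)) for i, f in enumerate(frames)
    )

wire = encode_message([b"orders.", b'{"orderId": "ord-7829"}'])
print(wire[:2])  # flags octet with MORE set, then the length of "orders."
```

Notice what is absent: no content-type, no header map, no routing metadata. Anything beyond framing is your application's problem.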
Transport Mechanisms
ZeroMQ supports multiple transports, selectable via URL scheme:
- tcp://: TCP sockets. The workhorse for network communication.
- ipc://: Unix domain sockets. Faster than TCP for same-machine communication. Not available on Windows.
- inproc://: In-process (inter-thread). The fastest option — essentially a lock-free queue passing pointers between threads.
- pgm:// and epgm://: Pragmatic General Multicast. For reliable multicast scenarios. Rarely used in practice.
The ability to use the same API for inter-thread, inter-process, and inter-machine communication is one of ZeroMQ's most elegant features. You can prototype with inproc://, test with ipc://, and deploy with tcp:// — same code, different connection string.
Strengths
Performance. ZeroMQ is breathtakingly fast. With no broker in the path, message latency is bounded by network latency plus a small constant for framing and buffering. Throughput in the millions of messages per second is achievable with inproc:// transport, and hundreds of thousands per second over TCP is routine. For latency-sensitive applications, the absence of a broker hop is not a micro-optimisation — it is a fundamental architectural advantage.
Zero-Copy. ZeroMQ uses zero-copy techniques where possible, particularly for large messages. The zmq_msg_t API allows you to pass data without copying it through the library's internals. Combined with the inproc:// transport, this enables inter-thread messaging with virtually zero overhead.
No Broker to Manage. You cannot have a broker outage if you do not have a broker. You do not need to provision broker instances, monitor their health, manage their storage, plan their capacity, or debug their garbage collection pauses. The operational simplicity of not having infrastructure to manage is profound, and it is often underappreciated by people who have never been woken up by a PagerDuty alert about a broker running out of disk space.
Polyglot Bindings. The C/C++ core library has bindings for essentially every language that matters: Python (pyzmq), Java (JeroMQ or jzmq), Go, Rust, Node.js, C#, Ruby, Erlang, Haskell — the list goes on. The API is consistent across languages, so patterns you learn in Python translate directly to Go.
Embeddable. ZeroMQ is a library, not a service. You can embed it in desktop applications, mobile apps, embedded systems, or anywhere else you can link a C library. This makes it suitable for use cases where running a broker is impractical or impossible.
Pattern Building Blocks. The socket types are composable. You can combine them to build sophisticated distributed computing patterns — brokerless pub-sub with discovery, load-balanced worker pools, multi-hop request routing, service-oriented architectures. The zguide documents dozens of these patterns with working code. The building-block approach gives you more flexibility than any broker, at the cost of more responsibility.
Weaknesses
No Persistence. When a message is sent, it lives in memory buffers at the sender, receiver, or in transit. If a process crashes, buffered messages are lost. If a subscriber is offline, messages sent during its absence are gone. There is no journal, no log, no replay. If you need persistence, you build it yourself (by adding a database, a WAL, or — ironically — a broker).
No Delivery Guarantees. ZeroMQ provides at-most-once delivery by default, and achieving at-least-once requires careful application-level design (acknowledgments, retries, idempotency). Exactly-once is, as always, a distributed systems fairy tale, but ZeroMQ does not even try to approximate it at the library level. You get "best effort" and a pat on the back.
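The zguide calls the basic application-level acknowledgment scheme the "Lazy Pirate" pattern. Here is a minimal sketch, assuming pyzmq: the client retries on timeout with a fresh REQ socket, and the server deduplicates using a client-supplied request id. The endpoint name and message shapes are illustrative.

```python
import threading
import time
import uuid
import zmq

ctx = zmq.Context.instance()

def server() -> None:
    rep = ctx.socket(zmq.REP)
    rep.bind("inproc://reliable")
    seen = set()                        # idempotency: dedupe by request id
    while True:
        req_id, body = rep.recv_multipart()
        if req_id not in seen:
            seen.add(req_id)            # first delivery: actually process body
        rep.send_multipart([req_id, b"ack"])

threading.Thread(target=server, daemon=True).start()
time.sleep(0.1)                         # let the server bind before connecting

def send_reliably(body: bytes, retries: int = 3, timeout_ms: int = 1000) -> bool:
    """At-least-once: retry on timeout; the server dedupes by request id."""
    req_id = uuid.uuid4().hex.encode()
    for _ in range(retries):
        req = ctx.socket(zmq.REQ)
        req.connect("inproc://reliable")
        req.send_multipart([req_id, body])
        if req.poll(timeout_ms, zmq.POLLIN):
            _, status = req.recv_multipart()
            req.close()
            return status == b"ack"
        req.close(linger=0)             # abandon the stuck REQ socket and retry
    return False
```

Note how much machinery even this toy version needs: unique ids, a dedupe set, poll timeouts, socket teardown on retry. That is the "careful application-level design" in miniature.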
"Some Assembly Required." This is the fundamental trade-off. ZeroMQ gives you the lego bricks; you build the house. Need service discovery? Build it. Need dead-letter handling? Build it. Need message schemas? Build it. Need monitoring? Build it. Need authentication beyond basic CURVE? Build it. For small teams or simple use cases, this assembly cost can exceed the cost of just running a broker.
Slow Subscriber Problem. In PUB/SUB, if a subscriber cannot keep up, the publisher's send buffer fills up and messages are dropped. There is no backpressure mechanism beyond the high-water mark (which just controls when dropping begins). For some use cases this is fine — you genuinely want to drop stale market data and only process the latest quote. For others, it is a data loss bug that you discover under load at the worst possible time.
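The relevant knobs are the high-water-mark socket options. A minimal sketch, assuming pyzmq; the ephemeral-port bind is just for the demo:

```python
import zmq

ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.set(zmq.SNDHWM, 1000)              # queue at most ~1000 messages per subscriber
pub.bind("tcp://127.0.0.1:*")          # ephemeral port for the sketch
endpoint = pub.getsockopt_string(zmq.LAST_ENDPOINT)

sub = ctx.socket(zmq.SUB)
sub.set(zmq.RCVHWM, 1000)              # receive-side buffer limit
# sub.set(zmq.CONFLATE, 1)             # alternative: keep only the newest message
sub.connect(endpoint)
sub.subscribe(b"")
```

Once both buffers fill, libzmq silently drops new messages for that subscriber; CONFLATE is the "only the latest quote matters" variant.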
Discovery and Topology Management. ZeroMQ sockets need to know where to connect. There is no built-in service discovery, no registry, no DNS-based resolution beyond what TCP gives you. In static environments this is fine. In dynamic environments (containers, auto-scaling groups), you need an external discovery mechanism. This is a solved problem (DNS, Consul, etcd), but it is one more thing you have to solve.
Debugging Complexity. When something goes wrong in a broker-based system, you can inspect the broker — look at queue depths, examine messages, check consumer lag. With ZeroMQ, the "broker" is distributed across every process. Debugging a misbehaving PUB/SUB topology requires instrumenting your application, because there is no central point to inspect. The library is opaque by design.
Ideal Use Cases
Inter-Process Communication. ZeroMQ's ipc:// and inproc:// transports make it exceptional for communication between processes or threads on the same machine. If you are building a pipeline of processes that need to pass data efficiently, ZeroMQ is the obvious choice.
High-Frequency Trading Infrastructure. The latency characteristics of ZeroMQ — no broker hop, zero-copy, minimal framing overhead — make it popular in financial technology. Market data distribution, order routing, and inter-component communication in trading systems frequently use ZeroMQ or its derivatives.
Custom Protocols. If you are building a bespoke distributed system with specific communication requirements, ZeroMQ provides the transport layer without imposing an application protocol. You design your protocol; ZeroMQ handles the plumbing.
Embedded and Edge Computing. When you cannot or do not want to run a broker — on embedded devices, in edge computing nodes, in desktop applications — ZeroMQ gives you distributed messaging without infrastructure.
Internal Microservice Communication. Within a controlled network where you have stable addressing and can tolerate the operational overhead of managing topologies, ZeroMQ can provide extremely efficient service-to-service communication.
When to Avoid It
When you need persistence. If messages must survive process restarts, use a broker.
When you need delivery guarantees. If losing a message is unacceptable and you do not want to build your own acknowledgment/retry system, use a broker.
When operational simplicity matters. If your team does not want to build and maintain custom messaging infrastructure, use a broker. This is not a criticism — it is a perfectly rational trade-off.
When you need observability out of the box. If you want dashboards showing message rates, consumer lag, and queue depths without building them yourself, use a broker.
Operational Reality
The operational reality of ZeroMQ is paradoxical: there is less infrastructure to manage but more application-level complexity to get right.
Deployment. There is nothing to deploy besides your application and the linked library. This is genuinely liberating. No broker cluster to provision, no configuration to manage, no state to back up. Your CI/CD pipeline deploys your application, and the messaging comes along for free.
Monitoring. There is nothing to monitor besides your application. ZeroMQ does not expose metrics. There is no dashboard. If you want to know message rates, queue depths, or error counts, you instrument your application code. Most teams end up building a small monitoring layer that tracks messages sent/received per socket and exports to Prometheus or similar.
Failure Handling. ZeroMQ handles transient network failures gracefully — sockets automatically reconnect, and messages are queued during brief disconnections. Permanent failures (process crashes, machine failures) result in message loss for anything in the send/receive buffers. Your application needs to handle this, typically through heartbeating, timeouts, and application-level acknowledgments.
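In pyzmq, the heartbeating and timeout machinery is a handful of socket options (the ZMTP PING/PONG options require libzmq 4.2+; the peer address below is hypothetical):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.DEALER)
sock.set(zmq.HEARTBEAT_IVL, 5000)       # send a ZMTP PING every 5 seconds
sock.set(zmq.HEARTBEAT_TIMEOUT, 10000)  # drop the connection if no PONG in 10 s
sock.set(zmq.RCVTIMEO, 2000)            # recv() raises zmq.Again after 2 s
sock.set(zmq.LINGER, 0)                 # don't block process exit on close
sock.connect("tcp://localhost:5555")    # hypothetical peer address
try:
    reply = sock.recv()                 # application-level timeout on the read
except zmq.Again:
    reply = None                        # no data in time: retry, or fail over
sock.close()
```

Acknowledgments on top of this remain your responsibility; the options only bound how long you wait before noticing a problem.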
Security. ZeroMQ supports CurveZMQ (based on CurveCP and NaCl/libsodium) for encryption and authentication. It is actually a well-designed security mechanism — mutual authentication, perfect forward secrecy, and resistance to replay attacks. The downside is that it requires distributing and managing public keys, which is its own operational challenge.
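Setting up CURVE in pyzmq looks roughly like this; it requires libzmq built with libsodium (you can check with zmq.has("curve")), and the ephemeral port is just for the sketch:

```python
import zmq

# Requires libzmq built with CURVE support; check zmq.has("curve") first
server_public, server_secret = zmq.curve_keypair()
client_public, client_secret = zmq.curve_keypair()

ctx = zmq.Context.instance()

server = ctx.socket(zmq.REP)
server.curve_secretkey = server_secret
server.curve_publickey = server_public
server.curve_server = True                  # this end acts as the CURVE server
server.bind("tcp://127.0.0.1:*")            # ephemeral port for the sketch
endpoint = server.getsockopt_string(zmq.LAST_ENDPOINT)

client = ctx.socket(zmq.REQ)
client.curve_secretkey = client_secret
client.curve_publickey = client_public
client.curve_serverkey = server_public      # the client must know the server key
client.connect(endpoint)

# Traffic on this connection is now encrypted and authenticated end to end
client.send(b"hello")
```

Distributing server_public to clients (and client keys to the server, if you restrict clients) is the key-management burden mentioned above.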
Versioning and Compatibility. ZMTP includes version negotiation, so different versions of ZeroMQ can interoperate to a degree. However, the binding libraries vary in quality and version support. PyZMQ is excellent. Some other language bindings lag behind. JeroMQ (the pure Java implementation) and jzmq (the JNI wrapper) have different performance characteristics and compatibility guarantees.
Code Examples
Python (pyzmq) — Publish-Subscribe
# publisher.py
import zmq
import json
import time
context = zmq.Context()
socket = context.socket(zmq.PUB)
socket.bind("tcp://*:5556")
# Give subscribers time to connect (the slow joiner problem)
time.sleep(1)
events = [
    {"type": "OrderPlaced", "orderId": "ord-001", "amount": 99.99},
    {"type": "PaymentProcessed", "orderId": "ord-001", "status": "success"},
    {"type": "OrderPlaced", "orderId": "ord-002", "amount": 249.50},
    {"type": "OrderShipped", "orderId": "ord-001", "carrier": "DHL"},
]
for event in events:
    # Topic is the prefix used for subscription filtering
    topic = event["type"]
    payload = json.dumps(event)
    socket.send_multipart([
        topic.encode("utf-8"),
        payload.encode("utf-8"),
    ])
    print(f"Published: {topic}")
socket.close()
context.term()
# subscriber.py
import zmq
import json
context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.connect("tcp://localhost:5556")
# Subscribe to OrderPlaced events only
socket.subscribe(b"OrderPlaced")
print("Listening for OrderPlaced events...")
while True:
    try:
        topic, payload = socket.recv_multipart()
        event = json.loads(payload.decode("utf-8"))
        print(f"Received [{topic.decode()}]: {event}")
    except KeyboardInterrupt:
        break
socket.close()
context.term()
Python (pyzmq) — Pipeline (PUSH/PULL)
# ventilator.py — distributes tasks to workers
import zmq
import json
context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")
# Wait for workers to connect
input("Press Enter when workers are ready...")
tasks = [
    {"task_id": i, "work": f"process_item_{i}"}
    for i in range(100)
]
for task in tasks:
    sender.send_json(task)
print(f"Sent {len(tasks)} tasks")
sender.close()
context.term()
# worker.py — pulls tasks, processes them, pushes results
import zmq
import time
import os
context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://localhost:5557")
sender = context.socket(zmq.PUSH)
sender.connect("tcp://localhost:5558")
pid = os.getpid()
print(f"Worker {pid} ready")
while True:
    task = receiver.recv_json()
    # Simulate work
    time.sleep(0.01)
    result = {
        "task_id": task["task_id"],
        "worker": pid,
        "status": "complete",
    }
    sender.send_json(result)
    print(f"Worker {pid} completed task {task['task_id']}")
# sink.py — collects results
import zmq
context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.bind("tcp://*:5558")
completed = 0
while completed < 100:
    result = receiver.recv_json()
    completed += 1
    if completed % 10 == 0:
        print(f"Completed {completed}/100 tasks")
print("All tasks complete")
receiver.close()
context.term()
C — Request-Reply
// server.c — REP socket
#include <zmq.h>
#include <string.h>
#include <stdio.h>
int main(void) {
    void *context = zmq_ctx_new();
    void *responder = zmq_socket(context, ZMQ_REP);
    zmq_bind(responder, "tcp://*:5555");
    printf("Server listening on port 5555...\n");
    while (1) {
        char buffer[256];
        int size = zmq_recv(responder, buffer, sizeof(buffer) - 1, 0);
        if (size == -1) break;
        buffer[size] = '\0';
        printf("Received: %s\n", buffer);
        // Process and reply
        const char *reply = "{\"status\": \"ok\"}";
        zmq_send(responder, reply, strlen(reply), 0);
    }
    zmq_close(responder);
    zmq_ctx_destroy(context);
    return 0;
}
// client.c — REQ socket
#include <zmq.h>
#include <string.h>
#include <stdio.h>
int main(void) {
    void *context = zmq_ctx_new();
    void *requester = zmq_socket(context, ZMQ_REQ);
    zmq_connect(requester, "tcp://localhost:5555");
    for (int i = 0; i < 10; i++) {
        char request[128];
        snprintf(request, sizeof(request),
                 "{\"action\": \"lookup\", \"id\": %d}", i);
        printf("Sending: %s\n", request);
        zmq_send(requester, request, strlen(request), 0);
        char reply[256];
        int size = zmq_recv(requester, reply, sizeof(reply) - 1, 0);
        if (size == -1) break;  // interrupted or failed receive
        reply[size] = '\0';
        printf("Reply: %s\n", reply);
    }
    zmq_close(requester);
    zmq_ctx_destroy(context);
    return 0;
}
Compile with:
gcc -o server server.c -lzmq
gcc -o client client.c -lzmq
nanomsg and nng: The Spiritual Successors
After Martin Sustrik departed the ZeroMQ project, he created nanomsg — a reimagining of ZeroMQ's ideas with a cleaner C API, a more permissive license (MIT vs ZeroMQ's LGPL), and a focus on simplicity. The socket types are called "scalability protocols" and are formally specified as RFCs.
nanomsg addressed several ZeroMQ pain points:
- Simpler API (fewer footguns)
- No context object (sockets are self-contained)
- Better error handling semantics
- Pluggable transports
However, nanomsg never achieved ZeroMQ's community size or ecosystem breadth. It was technically interesting but practically niche.
nng (nanomsg Next Generation), created and maintained by Garrett D'Amore, is the successor to nanomsg. It features an asynchronous I/O model, better thread safety, and a more modern architecture. nng is the most actively maintained of the three, but its community remains small relative to ZeroMQ's.
The practical advice: if you are starting a new project and the ZeroMQ philosophy appeals to you, evaluate nng alongside ZeroMQ. nng has a cleaner API and a more modern design. ZeroMQ has a vastly larger ecosystem, more documentation, and more people who know how to use it. In most cases, the ecosystem advantage wins.
Verdict
ZeroMQ is not a message broker, and judging it as one misses the point entirely. It is a networking library that solves a different problem: how do you build distributed applications with efficient, pattern-based messaging without introducing infrastructure dependencies?
If you need a library for blazing-fast inter-process communication, if you are building custom distributed systems, if you are working in an environment where running a broker is impractical, or if you want absolute control over your messaging topology — ZeroMQ is exceptional. It has earned its reputation in HFT, scientific computing, and infrastructure tooling.
If you need persistence, delivery guarantees, consumer groups, dead-letter queues, or operational visibility without building them yourself — ZeroMQ is the wrong tool. Not because it is bad, but because those features require the very infrastructure it was designed to eliminate.
The "some assembly required" warning is real and should be taken seriously. ZeroMQ gives you the components to build anything, but building anything is exactly what you will have to do. For teams with strong systems programming skills and specific performance requirements, this is empowering. For teams that want to focus on business logic and treat messaging as a commodity, it is a burden.
Choose ZeroMQ when you know exactly what you need and are prepared to build it. Choose a broker when you want someone else to have already built it for you. There is no shame in either choice, only in choosing the wrong one for your situation and then complaining about it on the internet.
Pieter Hintjens built something genuinely original. Read the zguide, even if you never use ZeroMQ in production. It will make you a better distributed systems engineer.
Redpanda
There is a particular category of startup pitch that goes like this: "It's like [widely adopted thing], but we rewrote it in [faster language] and removed [thing everyone complains about]." Most of these pitches produce vapourware. Occasionally, one produces something genuinely good. Redpanda is in the latter category.
Redpanda's proposition is simple and audacious: a Kafka-compatible streaming platform, rewritten from scratch in C++, with no JVM, no ZooKeeper, and dramatically simpler operations. It is the kind of project that makes Kafka administrators feel a complex mix of excitement and professional anxiety.
Overview
Redpanda was founded as Vectorized, Inc. by Alexander Gallego in 2019. Gallego, previously at Akamai, had experience with high-performance C++ systems and a conviction that Kafka's performance limitations were largely artefacts of the JVM and ZooKeeper, not fundamental to the log-based streaming model. The company rebranded to Redpanda Data in 2021.
The core thesis: Kafka's design is sound — append-only logs, consumer groups, partitioned topics — but its implementation leaves performance on the table. The JVM introduces GC pauses. ZooKeeper adds operational complexity and a coordination bottleneck. The multi-threaded architecture contends on shared data structures. A C++ implementation using a thread-per-core model (via the Seastar framework) could deliver the same semantics with lower latency, higher throughput, and fewer moving parts.
This is not just a theory. Redpanda has delivered on enough of this promise to be taken seriously, while also encountering the predictable challenges of building a compatible replacement for a system with a decade-long head start.
The company has raised significant venture capital funding, launched a managed cloud offering, and attracted adoption from organisations that want Kafka's model without Kafka's operational overhead. Whether Redpanda will become a dominant platform or remain an excellent niche alternative is still an open question, but it has already influenced the broader ecosystem — Kafka's own move away from ZooKeeper (KRaft) was, shall we say, coincidentally well-timed.
Architecture
Thread-Per-Core (Seastar)
Redpanda is built on the Seastar framework, the same C++ framework that powers ScyllaDB. Seastar's model is thread-per-core: each CPU core runs a single thread (called a "shard"), and each shard owns its own memory, its own network connections, and its own data. Shards communicate via explicit message passing, not shared memory.
This eliminates lock contention, reduces context switching, and makes performance predictable. A 16-core machine runs 16 independent shards, each processing its own subset of partitions. There is no global lock, no shared heap, and no garbage collector deciding that now would be a great time to pause for 200 milliseconds.
The practical result is lower tail latency. Kafka's P99 latency can spike during GC pauses or under load as threads contend on shared data structures. Redpanda's P99 is more consistent because the architecture eliminates the primary sources of jitter.
Raft Consensus (No ZooKeeper)
Where Kafka historically depended on ZooKeeper for metadata management, leader election, and coordination, Redpanda uses Raft consensus internally. Every partition has a Raft group. Metadata is managed by an internal Raft group. There is no external dependency.
This is significant operationally. ZooKeeper was Kafka's most notorious operational pain point — a separate distributed system with its own failure modes, its own monitoring requirements, and its own capacity planning needs. "We upgraded ZooKeeper and the Kafka cluster fell over" is a story that has been told at more post-mortems than anyone would like to admit.
Redpanda runs as a single binary. Start the binary, join the cluster, done. This is not just marketing simplicity — it materially reduces the surface area for operational failures.
It is worth noting that Kafka has been addressing this with KRaft (Kafka Raft), which replaces ZooKeeper with an internal Raft-based metadata quorum. KRaft reached production readiness in Kafka 3.3; ZooKeeper mode was deprecated in Kafka 3.5 and removed entirely in Kafka 4.0. The competitive pressure from Redpanda was almost certainly a factor in accelerating this work, even if no one will say so on the record.
Storage
Redpanda uses a custom storage engine optimised for append-only workloads. Data is written to a write-ahead log, and the storage layer is designed to exploit sequential I/O patterns. The system is self-tuning — it detects the underlying storage characteristics (SSD vs HDD, local vs network-attached) and adjusts its I/O scheduling accordingly.
Tiered Storage allows older segments to be offloaded to object storage (S3, GCS, Azure Blob Storage). This is the same concept as Kafka's tiered storage (KIP-405), and it addresses the cost problem of keeping large retention periods on fast local storage. Hot data stays on local SSDs; cold data moves to cheap object storage and is fetched on demand.
Tiered storage is a significant feature for cost management. If you are retaining 30 days of data but most reads are from the last 2 hours, paying for SSD storage for the full 30 days is wasteful. Tiered storage lets you size local disks for the hot set and let S3 handle the rest.
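Concretely, tiered storage is switched on through cluster properties and per-topic flags. The property and flag names below follow Redpanda's documented configuration, but verify them against your version's docs before relying on them:

```shell
# Enable tiered storage cluster-wide (assumes object-storage credentials
# are configured separately)
rpk cluster config set cloud_storage_enabled true
rpk cluster config set cloud_storage_bucket my-redpanda-archive
rpk cluster config set cloud_storage_region us-east-1

# Opt an individual topic into offloading old segments and remote reads
rpk topic alter-config orders --set redpanda.remote.write=true
rpk topic alter-config orders --set redpanda.remote.read=true
```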
Self-Tuning
Redpanda performs automatic tuning during startup: it benchmarks the disks, detects the CPU topology, configures I/O scheduling parameters, and sets buffer sizes. The rpk (Redpanda Keeper) CLI tool includes a tuning mode that configures the operating system as well — setting CPU governor, disabling IRQ balancing, configuring huge pages.
# Auto-tune the system
sudo rpk redpanda tune all
# Check tuning status
rpk redpanda tune list
This is a direct response to one of Kafka's operational realities: achieving optimal performance requires manual tuning of dozens of parameters (JVM heap, GC settings, num.io.threads, log.flush.interval.messages, OS-level settings). Redpanda's position is that the system should figure this out itself. The auto-tuning is not perfect — you may still need to adjust settings for unusual workloads — but it dramatically reduces the time from installation to acceptable performance.
Kafka API Compatibility
This is Redpanda's most strategically important feature: it implements the Kafka wire protocol. Kafka clients — the Java client, librdkafka, franz-go, Sarama, confluent-kafka-python — connect to Redpanda as though it were Kafka. No code changes required.
What Works
- Core produce/consume operations. The bread and butter. Producing and consuming messages with standard Kafka clients works as expected.
- Consumer groups. Group coordination, offset management, rebalancing — all implemented.
- Transactions. Exactly-once semantics (EOS) with idempotent producers and transactional consumers.
- ACLs. Kafka's authorisation model is supported.
- Schema Registry. Redpanda includes a built-in Schema Registry that is compatible with the Confluent Schema Registry API. Avro, Protobuf, and JSON Schema are supported.
- Admin API. Topic creation, configuration changes, partition reassignment.
- Kafka Connect. Because Connect uses the standard Kafka protocol, existing connectors work with Redpanda. Note that Redpanda does not bundle Connect itself; you run the Kafka Connect framework pointed at Redpanda.
What Does Not (or Did Not, or Requires Caveats)
- Kafka Streams. Kafka Streams is a client library that uses internal topics and specific protocol features. It works with Redpanda, but compatibility is "best effort" rather than guaranteed. Simple Streams applications work fine; complex topologies with state stores may hit edge cases.
- Exactly matching Kafka's behaviour in all edge cases. The protocol specification leaves room for implementation-defined behaviour. Redpanda aims for compatibility, but subtle differences exist — particularly in error handling, retry semantics, and edge cases around partition rebalancing. Most applications will never encounter these differences. If you are relying on undocumented Kafka behaviour, test carefully.
- MirrorMaker 2. Works, but Redpanda also provides its own migration tooling for moving data from Kafka to Redpanda.
- Some newer Kafka protocol versions. Redpanda tracks Kafka protocol versions but sometimes lags the latest release by a few months. Check the compatibility matrix for your specific Kafka client version.
The practical reality is that for most applications, "just point your Kafka clients at Redpanda" works. This is an extraordinary engineering achievement and the primary reason Redpanda has gained traction. The switching cost is near zero for the common case.
Console
Redpanda Console (formerly Kowl, which Redpanda acquired) is a web-based UI for managing and monitoring Redpanda (and Kafka) clusters. It provides:
- Topic browsing and message inspection
- Consumer group monitoring (lag, offsets)
- Schema Registry management
- ACL management
- Cluster health overview
The Console is noticeably more polished than most open-source Kafka UIs. It is included with Redpanda and also works with vanilla Kafka clusters, which is a clever way to let people evaluate Redpanda's ecosystem before migrating.
Strengths
Lower Latency. Redpanda's thread-per-core architecture delivers consistently lower tail latency than Kafka. For workloads where P99 latency matters — real-time applications, interactive systems, latency-sensitive pipelines — this is a genuine advantage. The improvement is most pronounced under load, where Kafka's GC pauses and thread contention become noticeable and Redpanda's architecture avoids them entirely.
Simpler Operations. Single binary. No ZooKeeper. No JVM tuning. Self-tuning I/O. This is not just a convenience — it reduces the mean time to recovery, lowers the skill barrier for operations teams, and shrinks the surface area for configuration-related outages. If you have ever spent a day debugging a Kafka cluster that fell over because ZooKeeper ran out of ephemeral nodes, you will appreciate the simplicity viscerally.
Kafka Compatibility. Using existing Kafka clients with no code changes is a massive advantage for adoption. It means you can evaluate Redpanda without rewriting anything, migrate incrementally, and fall back to Kafka if needed. The risk of trying Redpanda is low, which makes the decision to evaluate it easy.
Resource Efficiency. Redpanda typically requires fewer nodes than Kafka for the same workload, and does not need the additional ZooKeeper nodes. The C++ implementation uses memory more efficiently than the JVM — no GC overhead, no object header overhead, deterministic allocation. For cloud deployments, fewer nodes means lower costs.
Built-in Schema Registry. Having the Schema Registry integrated (rather than requiring a separate Confluent Schema Registry deployment) simplifies the architecture. One fewer service to deploy, monitor, and manage.
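Because the registry speaks the Confluent-compatible HTTP API, registering a schema is a plain POST to /subjects/&lt;subject&gt;/versions. A sketch using only the standard library; the subject name and schema are illustrative, and port 8081 is Redpanda's default:

```python
import json
import urllib.request

def register_schema(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build a Confluent-compatible schema registration request
    (POST /subjects/<subject>/versions)."""
    body = json.dumps({"schemaType": "AVRO", "schema": json.dumps(schema)}).encode()
    return urllib.request.Request(
        f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )

order_schema = {
    "type": "record",
    "name": "OrderPlaced",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Redpanda serves the registry on port 8081 by default:
req = register_schema("http://localhost:8081", "orders-value", order_schema)
# urllib.request.urlopen(req) returns the assigned schema id on success
```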
rpk CLI. The rpk command-line tool is well-designed. Topic management, cluster configuration, performance testing, and system tuning are all accessible from a single, coherent CLI. It is the kind of developer experience that suggests the team has actually used their own product.
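A few representative commands give the flavour; the topic and group names here are illustrative:

```shell
# Everyday cluster and topic management with rpk
rpk cluster health                        # quick cluster status
rpk topic create orders -p 12 -r 3        # 12 partitions, replication factor 3
rpk topic describe orders
rpk topic produce orders                  # type messages on stdin
rpk topic consume orders --offset start   # replay from the beginning
rpk group describe order-processors       # consumer group offsets and lag
```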
Weaknesses
Younger Ecosystem. Kafka has been in production since 2011. It has been battle-tested at LinkedIn, Netflix, Uber, and thousands of other organisations. The Kafka ecosystem includes Kafka Streams, ksqlDB, Kafka Connect with hundreds of connectors, and a vast body of operational knowledge. Redpanda is significantly younger. It has been in production at real companies, but the breadth and depth of real-world experience is smaller. Edge cases that Kafka has encountered and fixed over a decade may still be lurking in Redpanda.
Smaller Community. The community is growing but still a fraction of Kafka's. Fewer blog posts, fewer Stack Overflow answers, fewer consultants, fewer books. When you hit a problem, you are more likely to be the first person to encounter it. The Redpanda team is responsive (their Slack community is active), but community-sourced knowledge is thinner.
Enterprise Features Behind License. Redpanda Community Edition is open source (BSL, which converts to Apache 2.0 after four years). Enterprise features — continuous data balancing, tiered storage (in some configurations), audit logging, role-based access control beyond basic ACLs — require an Enterprise license. This is a reasonable business model, but it means the "free" Redpanda does not include everything you might need for a production deployment. Evaluate which features you need before committing.
Kafka Compatibility Is Not Identity. Despite the impressive compatibility, Redpanda is not Kafka. Behaviour differs in edge cases. Tools that rely on Kafka internals (rather than the public API) may not work. Monitoring tools that use Kafka's JMX metrics will not work — Redpanda exports Prometheus metrics natively (which is arguably better, but different). If you have extensive Kafka-specific operational tooling, migration requires adapting that tooling.
Limited Ecosystem Integration. Kafka Streams and ksqlDB work with Redpanda to varying degrees, but they are not the focus of Redpanda's development. If your architecture relies heavily on Kafka Streams state stores, interactive queries, or ksqlDB materialised views, test thoroughly before committing. Redpanda's own answer to stream processing is to integrate with Flink, Benthos, or other external processors.
Ideal Use Cases
Latency-Sensitive Streaming. When P99 latency matters and Kafka's GC pauses are a problem you have actually measured (not just feared), Redpanda is a compelling alternative.
Operational Simplicity. For teams that want Kafka's semantics without the operational complexity — particularly smaller teams without dedicated Kafka operators — Redpanda's single-binary, self-tuning design is attractive.
Kafka Migration. If you are running Kafka and are frustrated by operational overhead, Redpanda offers a relatively low-risk migration path. The Kafka protocol compatibility means you can test with your actual applications before committing.
Cost Optimisation. If you are running Kafka in the cloud and the node count (Kafka brokers plus ZooKeeper nodes) is driving costs, Redpanda's resource efficiency can provide meaningful savings.
New Streaming Projects. If you are starting a new project that needs Kafka-style semantics, Redpanda is worth evaluating alongside Kafka. The simpler operations and lower resource requirements can accelerate time to production.
Operational Reality
Deployment
Redpanda supports deployment on bare metal, virtual machines, Kubernetes (via a Helm chart and a Kubernetes operator), and Docker. The Kubernetes operator handles cluster provisioning, scaling, and upgrades.
# Install Redpanda on Linux
curl -1sLf \
'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' \
| sudo -E bash
sudo apt install redpanda
# Start and configure
sudo rpk redpanda tune all
sudo systemctl start redpanda
# Or with Docker
docker run -d --name=redpanda \
-p 9092:9092 -p 8081:8081 -p 8082:8082 -p 9644:9644 \
docker.redpanda.com/redpandadata/redpanda:latest \
redpanda start --smp 1 --memory 1G --overprovisioned
Monitoring
Redpanda exports Prometheus metrics natively on port 9644. No JMX exporter, no custom configuration. The metrics cover:
- Throughput (bytes/messages in/out per topic/partition)
- Latency (produce/fetch latency histograms)
- Storage (disk usage, log segment counts)
- Consumer groups (committed offsets, lag)
- Raft (leader elections, replication lag)
- Internal metrics (scheduler utilisation, memory allocation)
Grafana dashboards are available from Redpanda and the community. The metrics naming follows Prometheus conventions, which means they integrate cleanly with existing Prometheus/Grafana stacks.
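Scraping it is a stock Prometheus job. A minimal sketch of the scrape configuration, with hostnames as placeholders:

```yaml
# prometheus.yml -- scrape Redpanda's native metrics endpoint
scrape_configs:
  - job_name: redpanda
    metrics_path: /public_metrics   # broker-level metrics; /metrics is more granular
    static_configs:
      - targets:
          - redpanda-0:9644
          - redpanda-1:9644
          - redpanda-2:9644
```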
The "Just Replace Kafka" Migration Path
The migration story is approximately:
- Deploy a Redpanda cluster alongside your existing Kafka cluster.
- Use MirrorMaker 2 or Redpanda's own migration tool to replicate topics from Kafka to Redpanda.
- Point consumers at Redpanda, verify they work.
- Point producers at Redpanda.
- Decommission Kafka.
Steps 3 and 4 are where you discover whether "Kafka compatible" means compatible enough for your applications. In most cases, it does. But "most cases" provides cold comfort if you are the exception.
The recommended approach is to run both systems in parallel for a meaningful period, comparing outputs, monitoring for discrepancies, and building confidence. Do not do a big-bang migration on a Friday afternoon.
Redpanda Cloud and BYOC
Redpanda Cloud is the fully managed offering — Redpanda runs the infrastructure, you use it as a service. Standard managed service model.
BYOC (Bring Your Own Cloud) is more interesting: the data plane runs in your cloud account (your VPC, your machines), while Redpanda manages the control plane. This addresses the concern that some organisations have about sending data through a third party's infrastructure while still offloading operational management. It is a pragmatic middle ground for security-conscious organisations.
Benchmarks and Performance Claims
Redpanda publishes benchmarks showing superior throughput and latency compared to Kafka. These benchmarks are produced by Redpanda, which is the first thing you should note. Vendor benchmarks are a literary genre, not a scientific discipline.
That said, independent benchmarks generally confirm two things:
- Latency is consistently lower, particularly P99 and P999 tail latency. This is the thread-per-core architecture paying off, and the advantage is most visible under load.
- Throughput is competitive to superior, depending on workload. For produce-heavy workloads with small messages, Redpanda often outperforms Kafka. For very large messages or workloads that are primarily sequential reads from well-cached data, the difference narrows.
The benchmarks you should trust are the ones you run yourself, on your hardware, with your workload patterns, at your scale. Redpanda provides rpk tooling for load testing:
# Produce benchmark
rpk topic produce benchmark-topic --compression none \
--num 1000000 --batch-size 100
# Consume benchmark
rpk topic consume benchmark-topic --num 1000000
Do not make a production decision based on someone else's benchmark, including Redpanda's.
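If you do record your own produce latencies, computing the tail percentiles this chapter keeps mentioning is straightforward. A sketch using the nearest-rank definition — any benchmark harness will do this for you, but it is worth knowing what P99 actually means:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value covering p% of the sample."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]

# Synthetic sample: mostly fast, a handful of slow outliers. Tail
# percentiles exist precisely to surface those outliers.
sample = [1.0] * 989 + [5.0] * 9 + [50.0] * 2
print("P50  :", percentile(sample, 50))
print("P99  :", percentile(sample, 99))
print("P99.9:", percentile(sample, 99.9))
```

The median looks great while the worst 0.1% of requests are fifty times slower — which is exactly the story vendor benchmarks tell or hide depending on which percentile they lead with.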
Code Examples
The entire point of Kafka API compatibility is that existing Kafka clients work unchanged. These examples use standard Kafka client libraries, connecting to Redpanda instead of Kafka.
Python (confluent-kafka-python)
from confluent_kafka import Producer, Consumer, KafkaError
import json

# ---------- Producer ----------
producer_config = {
    'bootstrap.servers': 'localhost:9092',  # Redpanda address
    'client.id': 'order-service',
    'acks': 'all',
    'enable.idempotence': True,
}

producer = Producer(producer_config)

def delivery_callback(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] "
              f"@ offset {msg.offset()}")

event = {
    "type": "OrderPlaced",
    "orderId": "ord-7829",
    "amount": 149.99,
    "currency": "USD"
}

producer.produce(
    topic="orders",
    key="ord-7829",
    value=json.dumps(event).encode("utf-8"),
    callback=delivery_callback
)
producer.flush()

# ---------- Consumer ----------
consumer_config = {
    'bootstrap.servers': 'localhost:9092',  # Same Redpanda address
    'group.id': 'payment-service',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,
}

consumer = Consumer(consumer_config)
consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue
            print(f"Error: {msg.error()}")
            break
        event = json.loads(msg.value().decode("utf-8"))
        print(f"Processing: {event}")
        # Manual commit after successful processing
        consumer.commit(asynchronous=False)
finally:
    consumer.close()
Java (standard Kafka client)
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.*;
import java.util.*;
import java.time.Duration;
public class RedpandaExample {
public static void main(String[] args) {
// Producer — note: connecting to Redpanda, no code change needed
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
StringSerializer.class.getName());
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
try (KafkaProducer<String, String> producer =
new KafkaProducer<>(producerProps)) {
ProducerRecord<String, String> record = new ProducerRecord<>(
"orders", "ord-7829",
"{\"type\":\"OrderPlaced\",\"orderId\":\"ord-7829\"}");
producer.send(record, (metadata, exception) -> {
if (exception != null) {
System.err.println("Send failed: " + exception);
} else {
System.out.printf("Sent to %s[%d]@%d%n",
metadata.topic(),
metadata.partition(),
metadata.offset());
}
});
}
// Consumer — also unchanged
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG,
"payment-service");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
"earliest");
try (KafkaConsumer<String, String> consumer =
new KafkaConsumer<>(consumerProps)) {
consumer.subscribe(List.of("orders"));
while (true) {
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofMillis(1000));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("Received: key=%s value=%s "
+ "partition=%d offset=%d%n",
record.key(), record.value(),
record.partition(), record.offset());
}
consumer.commitSync();
}
}
}
}
Go (franz-go)
package main
import (
"context"
"fmt"
"log"
"github.com/twmb/franz-go/pkg/kgo"
)
func main() {
// Producer
client, err := kgo.NewClient(
kgo.SeedBrokers("localhost:9092"), // Redpanda
kgo.DefaultProduceTopic("orders"),
kgo.RequiredAcks(kgo.AllISRAcks()),
)
if err != nil {
log.Fatal(err)
}
defer client.Close()
record := &kgo.Record{
Key: []byte("ord-7829"),
Value: []byte(`{"type":"OrderPlaced","orderId":"ord-7829"}`),
}
results := client.ProduceSync(context.Background(), record)
for _, pr := range results {
if pr.Err != nil {
log.Printf("Produce error: %v", pr.Err)
} else {
fmt.Printf("Produced to %s[%d]@%d\n",
pr.Record.Topic,
pr.Record.Partition,
pr.Record.Offset)
}
}
// Consumer
consumerClient, err := kgo.NewClient(
kgo.SeedBrokers("localhost:9092"),
kgo.ConsumeTopics("orders"),
kgo.ConsumerGroup("payment-service"),
)
if err != nil {
log.Fatal(err)
}
defer consumerClient.Close()
for {
fetches := consumerClient.PollFetches(context.Background())
fetches.EachRecord(func(r *kgo.Record) {
fmt.Printf("Consumed: key=%s value=%s partition=%d offset=%d\n",
string(r.Key), string(r.Value),
r.Partition, r.Offset)
})
consumerClient.AllowRebalance()
}
}
The point of these examples is not the code — it is that the code is identical to what you would write for Kafka. The only difference is the bootstrap server address. That is the entire value proposition of Kafka API compatibility, demonstrated in three languages.
Verdict
Redpanda is the most credible Kafka alternative on the market. It delivers on its core promises: lower latency, simpler operations, and genuine Kafka compatibility. The single-binary deployment, self-tuning behaviour, and elimination of ZooKeeper are not just marketing talking points — they represent a meaningful reduction in operational burden.
The question is not whether Redpanda is good. It is. The question is whether it is good enough to justify switching from Kafka, or compelling enough to choose over Kafka for a new project.
For new projects: Redpanda is a strong default choice, particularly for smaller teams. The operational simplicity alone may justify the slightly smaller ecosystem. If you do not need Kafka Streams or ksqlDB, and you are not betting on Confluent's specific enterprise features, Redpanda gives you the same programming model with less operational overhead.
For existing Kafka deployments: migrate if you have a concrete problem that Redpanda solves — operational complexity that is consuming your team, latency requirements that Kafka cannot meet, cost pressure from your node count. Do not migrate because Redpanda is newer or because C++ sounds faster. Migration has costs, and "it might be better" is not a business case.
For enterprises with compliance requirements: evaluate the enterprise license carefully. The BSL licensing model means the community edition has restrictions on offering Redpanda as a managed service (which affects cloud providers, not end users), but enterprise features like audit logging and advanced RBAC may require a commercial agreement.
Redpanda has done something rare: it has built a genuinely competitive alternative to a dominant platform without requiring users to learn a new API. That alone deserves respect. Whether it becomes the default choice for stream processing or remains a compelling alternative depends on execution, ecosystem growth, and whether "Kafka without the JVM" is a big enough differentiator to overcome Kafka's massive installed base.
For now, Redpanda is the answer to "I wish Kafka were simpler to run." If that is your wish, your wish has been granted. Check the fine print, as always.
Memphis
Every few years, a new messaging project arrives with a fresh perspective on developer experience. The pitch usually goes something like: "Message brokers are too complicated. What if we made one that developers actually enjoy using?" Memphis is (or was) one of these projects, and it is worth examining both for what it tried to do and for the cautionary tale it represents about the lifecycle of newer open-source infrastructure projects.
Let us be direct from the start: Memphis has had a turbulent trajectory. The company behind it (Memphis.dev) pivoted, the open-source project's future became uncertain, and as of the time of writing, its long-term viability is a genuine question. We will cover the technology honestly — it had real ideas worth understanding — but we will also be honest about the risks. If you are evaluating Memphis for a new project, read the entire chapter, especially the end.
Overview
Memphis was created by Memphis.dev (originally based in Israel) with the stated goal of making message brokers accessible to developers who were not distributed systems specialists. The founding insight was that Kafka, RabbitMQ, and Pulsar all required significant operational knowledge and architectural understanding before they could be used productively. Memphis aimed to lower that barrier.
The project launched around 2021-2022 and gained initial traction through a developer-friendly GUI, built-in schema management, dead-letter handling, and a focus on making common tasks — producing messages, consuming them, handling failures — simple by default rather than simple after reading three books and attending a conference talk.
Memphis was built on top of NATS JetStream, which is a significant architectural decision. Rather than building a storage engine, replication protocol, and networking layer from scratch, Memphis used NATS JetStream as its foundation and added a developer experience layer on top. This gave it the performance and reliability characteristics of NATS while allowing the Memphis team to focus on UX, tooling, and higher-level features.
The Pivot and the Present
Memphis.dev as a company underwent significant changes. The open-source project's development slowed. The company pivoted toward different products. The GitHub repository's commit frequency dropped. Community activity declined.
This is not unusual in the world of venture-backed open-source infrastructure. Companies need revenue, and developer experience layers over existing technology are difficult to monetise when the underlying technology (NATS JetStream) is freely available and well-documented. The business model challenge — "we make NATS easier to use, please pay us" — proved difficult to sustain.
As of this writing, Memphis should be evaluated with caution. The technology works, the ideas are sound, but the project's future maintenance and development are uncertain. If you are building production systems with a multi-year horizon, you need confidence that your messaging infrastructure will be maintained, patched, and improved. That confidence is easier to have with NATS, Kafka, or RabbitMQ than with Memphis.
We cover it here because the ideas deserve examination, and because understanding why projects like Memphis emerge (and struggle) tells you something important about the messaging landscape.
Architecture
Built on NATS JetStream
Memphis is, architecturally, a layer on top of NATS JetStream. If you have read Chapter 17 on NATS, you already understand the foundation:
- NATS provides the core publish-subscribe messaging, connection management, and clustering.
- JetStream adds persistence, exactly-once delivery, consumer groups, and stream management.
- Memphis adds a developer experience layer: a GUI, schema management, dead-letter handling, SDK abstractions, and operational tooling.
The Memphis broker is a modified NATS server with additional components:
- A metadata store (using JetStream internally)
- A REST API for management operations
- A web-based GUI
- Schema enforcement logic
- Dead-letter station management
- SDK-facing protocol handling
Messages produced to Memphis flow through to JetStream streams. Consumers reading from Memphis are reading from JetStream consumers. The Memphis layer intercepts, decorates, and manages the lifecycle, but the heavy lifting — storage, replication, delivery guarantees — is JetStream.
Stations
Memphis uses the concept of a "station" rather than "topic" or "queue." A station is the fundamental unit of message organisation. Under the hood, a station maps to a JetStream stream with specific configuration.
Stations have:
- Retention policy: time-based, size-based, or message-count-based
- Replication factor: how many nodes store copies (inherits from JetStream)
- Schema enforcement: optional schema validation on produce
- Dead-letter station: where failed messages go after retry exhaustion
The station abstraction is Memphis's primary organisational concept, and it is intentionally simpler than the topic/partition/consumer-group hierarchy of Kafka or the exchange/binding/queue model of RabbitMQ. For developers coming to messaging for the first time, this simplicity is a genuine benefit.
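To make the station-to-stream mapping concrete, here is an illustrative translation of a station-style definition into the kind of stream configuration JetStream accepts. The station fields and function are our own sketch, not the Memphis API; the JetStream keys (max_age in nanoseconds, max_bytes, max_msgs, num_replicas) are the ones JetStream's stream configuration actually uses.

```python
# Illustrative sketch: a Memphis-style station definition rendered as a
# JetStream stream config dict. The input side is hypothetical; the
# output keys follow JetStream's documented stream configuration.
def station_to_stream(name, retention_type, retention_value, replicas=1):
    config = {
        "name": name,
        "subjects": [f"{name}.>"],     # subject space for the stream
        "num_replicas": replicas,      # maps to the station's replication factor
    }
    if retention_type == "message_age_sec":
        config["max_age"] = retention_value * 1_000_000_000  # nanoseconds
    elif retention_type == "bytes":
        config["max_bytes"] = retention_value
    elif retention_type == "messages":
        config["max_msgs"] = retention_value
    else:
        raise ValueError(f"unknown retention type: {retention_type}")
    return config

print(station_to_stream("orders", "message_age_sec", 7 * 24 * 3600, replicas=3))
```

The point is how thin the translation is: a station is mostly a friendlier vocabulary over a stream, which is both the appeal and the "why not just use JetStream?" problem discussed later in this chapter.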
Dead-Letter Station
One of Memphis's more thoughtful features. When a consumer fails to process a message after a configurable number of retries, the message is moved to a dead-letter station (DLS). The DLS is visible in the GUI, where operators can:
- Inspect failed messages
- See the failure reason (if the consumer SDK reports it)
- Resend messages to the original station
- Drop messages
This is functionality that you can build with Kafka or RabbitMQ, but Memphis makes it a first-class feature with built-in tooling. For teams that have experienced the joy of discovering that a consumer has been silently failing for three days because nobody checked the dead-letter queue, this is appreciated.
Poison Message Detection
Related to dead-letter handling, Memphis tracks messages that repeatedly cause consumer failures. These "poison messages" — messages that crash consumers, trigger exceptions, or time out repeatedly — are flagged in the UI and can be automatically routed to the DLS.
In a typical Kafka setup, a poison message can be insidious: it arrives, the consumer crashes, the consumer restarts, it reads the same message (because the offset was not committed), crashes again, and enters a restart loop. Memphis's poison message detection breaks this cycle by identifying the problematic message and removing it from the normal processing path.
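The retry-then-dead-letter cycle is simple to express in broker-agnostic Python. Everything here is illustrative — Memphis did this inside the broker and SDK — but the control flow is the point: after a bounded number of delivery attempts, the message leaves the hot path instead of blocking it forever.

```python
# Broker-agnostic sketch of retry exhaustion feeding a dead-letter
# station. All names are ours; Memphis implemented this server-side.
MAX_DELIVERIES = 3

def deliver(message, handler, dead_letter_station):
    """Retry `handler` up to MAX_DELIVERIES times, then dead-letter."""
    for attempt in range(1, MAX_DELIVERIES + 1):
        try:
            handler(message)
            return "acked"
        except Exception:
            continue  # redelivery: same message, next attempt
    dead_letter_station.append(message)  # out of the processing path for good
    return "dead-lettered"

dls = []
poison = {"id": "m-1", "body": "not json"}

def always_fails(msg):
    raise ValueError("cannot parse")  # a poison message fails every attempt

print(deliver(poison, always_fails, dls))
print("messages in DLS:", len(dls))
```

Contrast this with the Kafka restart loop described above: the crucial difference is the delivery counter, which is exactly the state a bare Kafka consumer does not keep for you.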
Key Features
Schema Management (Schemaverse)
Memphis included built-in schema management called "Schemaverse." Schemas could be attached to stations, and the broker would validate messages against the schema on produce. Supported formats included:
- Protobuf
- JSON Schema
- GraphQL (an unusual choice that reflected Memphis's developer-experience focus)
- Avro
Schema enforcement at the broker level — rather than relying on client-side validation or a separate Schema Registry — simplifies the architecture. You do not need a separate Confluent Schema Registry deployment; schema management is part of the broker.
The trade-off is coupling: your schema management is tied to your broker. With Kafka + Confluent Schema Registry, you can swap brokers without losing your schema management. With Memphis, the schema management goes where Memphis goes.
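Broker-side enforcement amounts to "validate before accepting the produce." A minimal sketch, with a trivial required-field/type check standing in for real JSON Schema — the function and schema shape are ours, not the Memphis API:

```python
# Sketch of validate-on-produce in the spirit of Schemaverse. A simple
# required-field/type map stands in for a real schema language.
SCHEMA = {"type": str, "orderId": str, "amount": (int, float)}

def validate_on_produce(event, schema=SCHEMA):
    """Reject the message at the 'broker' if it does not match the schema."""
    for field, expected in schema.items():
        if field not in event:
            raise ValueError(f"missing field: {field}")
        if not isinstance(event[field], expected):
            raise ValueError(f"bad type for field: {field}")
    return True

print(validate_on_produce(
    {"type": "OrderPlaced", "orderId": "ord-7829", "amount": 149.99}
))
```

The operational consequence of doing this in the broker is that a malformed message is rejected at produce time, where the producing service can handle it, rather than detonating later inside some unsuspecting consumer.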
GUI
The Memphis GUI is perhaps the feature that best embodied the project's philosophy. It provided:
- Real-time station monitoring (message rates, consumer lag)
- Message browsing and inspection
- Schema management
- Dead-letter station management
- User and permission management
- System health overview
For a developer who has just set up their first message broker and wants to understand what is happening, a well-designed GUI is enormously valuable. The Kafka ecosystem has various UIs (Kafka UI, AKHQ, Confluent Control Center), but none are as tightly integrated with the broker as Memphis's GUI was with Memphis.
SDK Design
Memphis SDKs were designed for simplicity. A minimal producer/consumer in Memphis required fewer lines of code and fewer concepts than the equivalent in Kafka or RabbitMQ. The SDKs handled connection management, reconnection, and basic error handling internally.
This is a double-edged sword. Simple SDKs are great for getting started and terrible for debugging production issues. When the SDK handles reconnection transparently, you do not know it is happening. When it manages consumer acknowledgment internally, you may not realise that your processing guarantees depend on SDK configuration you never examined.
Strengths
Developer Experience. This was Memphis's genuine differentiator. The combination of a clean GUI, simple SDKs, built-in dead-letter handling, and schema management created a "batteries included" experience that lowered the barrier to entry for event-driven architecture. For teams where nobody has operated a message broker before, this matters.
Built-in Observability. Station metrics, consumer lag, poison message detection, and dead-letter monitoring were all available out of the box. No Prometheus exporter to configure, no Grafana dashboards to import. You could see what was happening in your messaging system by opening a browser.
Low Learning Curve. Memphis required understanding one concept (stations) to get started, versus Kafka's topics/partitions/consumer-groups/offsets or RabbitMQ's exchanges/bindings/queues/acknowledgments. For educational purposes and rapid prototyping, this was valuable.
Poison Message Handling. The automatic detection and routing of problematic messages is a genuinely useful feature that most messaging systems leave to the user to implement.
NATS Foundation. By building on NATS JetStream, Memphis inherited solid performance characteristics, clustering, and persistence without having to build them from scratch. NATS is a proven, well-engineered system.
Weaknesses
Project Viability. This is the elephant in the room. A messaging system is foundational infrastructure — it is not something you swap out easily. Adopting a project with uncertain maintenance means accepting the risk that you may need to migrate to something else in the future. Migration is always more expensive than anyone estimates.
Smaller Community. Even at its peak, Memphis's community was small relative to Kafka, RabbitMQ, or even NATS. Fewer users means fewer bug reports, fewer contributed fixes, less operational knowledge shared in blog posts and conference talks, and fewer people on your team who have experience with it.
NATS Dependency. Memphis's strength (building on NATS) was also a constraint. Memphis was limited by what JetStream provided. Features that required going beyond JetStream's capabilities were difficult to implement. And the question "why not just use NATS JetStream directly?" was always lurking — if you are going to learn the underlying system anyway when things go wrong, why not start there?
Enterprise Features. Advanced features — SSO integration, role-based access control, advanced monitoring — were positioned as enterprise/commercial features. For a project in Memphis's situation, this creates a tension: the open-source version needs to be compelling enough to drive adoption, but the commercial version needs to be differentiated enough to generate revenue.
Scale Ceiling. Memphis targeted small-to-medium workloads. For high-throughput, large-scale event streaming, Kafka, Redpanda, or even raw NATS JetStream were better choices. Memphis's abstraction layer added overhead that, while negligible at moderate scale, became noticeable at high volumes.
Ideal Use Cases
Teams New to Event-Driven Architecture. If your team has never operated a message broker and you want to start exploring event-driven patterns, Memphis (or something like it) lowers the entry barrier. The GUI and simplified concepts help build understanding before graduating to more complex systems.
Rapid Prototyping. When you need messaging for a prototype or proof of concept and you want to focus on application logic rather than infrastructure, Memphis's quick setup and simple SDKs are helpful.
Development and Testing Environments. Memphis's low resource requirements and easy setup make it suitable for local development and CI/CD testing, even if production uses a different system.
Internal Tooling. For internal applications with moderate scale requirements and a premium on developer productivity, Memphis's trade-offs (simplicity over scale) can be the right ones.
Operational Reality
Deployment
Memphis could be deployed via Docker, Docker Compose, Kubernetes (Helm chart), or as a managed cloud service (Memphis Cloud, when it was operational).
# Docker Compose deployment (the simplest path)
curl -s https://memphisdev.github.io/memphis-docker/docker-compose.yml \
-o docker-compose.yml
docker compose up -d
# Kubernetes via Helm
helm repo add memphis https://k8s.memphis.dev/charts/
helm install memphis memphis/memphis --namespace memphis --create-namespace
The Docker Compose deployment spun up the Memphis broker, the GUI (on port 9000 by default), and the required metadata storage. For development, it was genuinely simple. For production, the Kubernetes deployment was the recommended path.
Monitoring
Monitoring was primarily through the built-in GUI. For external monitoring, Memphis exposed metrics that could be scraped by Prometheus. However, the monitoring story was less mature than Kafka's or NATS's extensive metric ecosystems. You had the GUI (good for humans) and basic Prometheus metrics (adequate for alerting), but deep operational introspection required understanding the underlying NATS JetStream internals.
What Happens When Things Go Wrong
This is where Memphis's abstraction layer became a liability. When a station was slow, was it a Memphis issue or a JetStream issue? When consumers were lagging, was the bottleneck in Memphis's SDK layer or in the underlying NATS consumer? Debugging required understanding both Memphis's layer and NATS JetStream, which undermined the "simplicity" value proposition.
Experienced operators would often bypass Memphis's tooling and use NATS's native monitoring (nats CLI, JetStream management API) to diagnose issues. This is the inevitable fate of abstraction layers: they help until they do not, and then you need to understand what is underneath.
Memphis vs NATS JetStream Directly
This is the question that haunted Memphis from the start: when does the abstraction pay for itself?
Choose Memphis when:
- Your team has no NATS experience and wants a gentler on-ramp
- The built-in GUI, dead-letter handling, and schema management save you from building these features yourself
- Your workload is moderate and the abstraction overhead is negligible
- Developer experience is a higher priority than operational transparency
Choose NATS JetStream directly when:
- Your team can invest time learning NATS (it is not that hard)
- You want full control over configuration and behaviour
- You need maximum performance without abstraction overhead
- You want the confidence of a large, active, well-maintained project
- You are building production systems with a multi-year horizon
The honest assessment: for most teams, learning NATS JetStream directly is the better long-term investment. The learning curve is modest, the documentation is good, and you are investing in knowledge of a system that will be maintained and improved for years to come. Memphis's developer experience advantages — while real — are most valuable in the first few weeks of adoption. After that, you need operational depth, and that depth comes from understanding the underlying system.
Code Examples
Python
import asyncio
from memphis import Memphis, MemphisError, MemphisConnectError

async def producer_example():
    memphis = Memphis()
    try:
        await memphis.connect(
            host="localhost",
            username="root",
            password="memphis",
            account_id=1
        )
        producer = await memphis.producer(
            station_name="orders",
            producer_name="order-service"
        )
        event = {
            "type": "OrderPlaced",
            "orderId": "ord-7829",
            "amount": 149.99,
            "currency": "USD"
        }
        # Memphis SDK handles serialization
        await producer.produce(
            message=event,
            headers={"event_type": "OrderPlaced"}
        )
        print("Message produced successfully")
    except (MemphisError, MemphisConnectError) as e:
        print(f"Error: {e}")
    finally:
        await memphis.close()

async def consumer_example():
    memphis = Memphis()
    try:
        await memphis.connect(
            host="localhost",
            username="root",
            password="memphis",
            account_id=1
        )
        consumer = await memphis.consumer(
            station_name="orders",
            consumer_name="payment-service",
            consumer_group="payment-group"
        )
        # Callback-based consumption
        async def message_handler(messages, error, context):
            if error:
                print(f"Consumer error: {error}")
                return
            for msg in messages:
                try:
                    data = msg.get_data()
                    print(f"Received: {data}")
                    # Acknowledge successful processing
                    await msg.ack()
                except Exception as e:
                    print(f"Processing error: {e}")
                    # Message will be redelivered (and eventually
                    # sent to DLS if retries are exhausted)
        # Start consuming
        consumer.consume(message_handler)
        # Keep the consumer running
        await asyncio.sleep(60)
    except (MemphisError, MemphisConnectError) as e:
        print(f"Error: {e}")
    finally:
        await memphis.close()

if __name__ == "__main__":
    asyncio.run(producer_example())
    asyncio.run(consumer_example())
Node.js
const { Memphis } = require("memphis-dev");
async function producerExample() {
const memphis = new Memphis();
try {
await memphis.connect({
host: "localhost",
username: "root",
password: "memphis",
accountId: 1,
});
const producer = await memphis.producer({
stationName: "orders",
producerName: "order-service",
});
const event = {
type: "OrderPlaced",
orderId: "ord-7829",
amount: 149.99,
currency: "USD",
};
await producer.produce({
message: Buffer.from(JSON.stringify(event)),
headers: { event_type: "OrderPlaced" },
});
console.log("Message produced successfully");
} catch (error) {
console.error("Producer error:", error);
} finally {
await memphis.close();
}
}
async function consumerExample() {
const memphis = new Memphis();
try {
await memphis.connect({
host: "localhost",
username: "root",
password: "memphis",
accountId: 1,
});
const consumer = await memphis.consumer({
stationName: "orders",
consumerName: "payment-service",
consumerGroup: "payment-group",
});
consumer.on("message", (message, context) => {
const data = JSON.parse(message.getData().toString());
console.log("Received:", data);
// Acknowledge the message
message.ack();
});
consumer.on("error", (error) => {
console.error("Consumer error:", error);
});
// The consumer runs until explicitly stopped
console.log("Consumer listening...");
} catch (error) {
console.error("Setup error:", error);
await memphis.close();
}
}
// Run
producerExample().then(() => consumerExample());
Note the relative simplicity compared to Kafka client code. No partition configuration, no serializer classes, no consumer group rebalancing callbacks. Memphis's SDK handles these details internally. Whether this simplicity is a benefit (less boilerplate) or a liability (less control) depends on your needs and your comfort with implicit behaviour.
The Broader Lesson
Memphis represents a pattern worth understanding: the developer experience layer over infrastructure. Several projects have attempted this in the messaging space — making brokers more accessible, wrapping complexity in friendly UIs, providing opinionated defaults.
These projects face a structural challenge. The underlying infrastructure (NATS, in Memphis's case) is itself improving its developer experience. NATS added better documentation, a management CLI, and monitoring tools. As the foundation improves, the value of the abstraction layer shrinks. And the abstraction layer has a maintenance cost that the underlying project does not bear.
This does not mean developer experience does not matter — it profoundly does. But it suggests that developer experience improvements are more sustainable when they are part of the core project rather than a separate layer on top. The NATS team investing in better documentation and tooling is more durable than a separate company building a wrapper.
Verdict
Memphis had genuinely good ideas. The developer experience focus, the built-in dead-letter handling, the schema management, the GUI — these were real improvements over the status quo of "here is a broker, good luck." The project demonstrated that messaging infrastructure could be more approachable without being less capable.
However, good ideas are necessary but not sufficient. A messaging system is infrastructure that you commit to for years. It needs sustained development, an active community, responsive security patching, and a credible roadmap. Memphis's uncertain trajectory makes it a risky choice for production systems today.
If you are attracted to what Memphis offered, the practical recommendation is:
- Use NATS JetStream directly. You get the same underlying engine, better long-term viability, and a larger community. The learning curve is modest.
- Build the missing pieces. If you want dead-letter handling, build it on top of JetStream (it is a few dozen lines of code). If you want a GUI, use NATS's management tools or third-party dashboards.
- Watch the space. Developer experience in messaging is an unsolved problem. The next project that tackles it might come from within an established broker's ecosystem rather than as a separate layer, and that would be more durable.
Memphis was a worthwhile experiment. It asked the right questions about developer experience in event-driven architecture. The answers it provided were good but may not outlast the project itself. In infrastructure, the best technology does not always win — the best-maintained technology does. Choose accordingly.
Solace PubSub+
Every messaging technology in this book has an origin story that reveals something about its priorities. Kafka was born from LinkedIn's data firehose. RabbitMQ emerged from the telecom world's need for reliable message queuing. Solace started life as a hardware appliance company selling purpose-built messaging boxes to financial institutions. That origin — silicon, not software — shapes everything about what Solace PubSub+ is today, for better and for worse.
Solace occupies a peculiar position in the messaging landscape. It is simultaneously one of the most feature-rich platforms available and one of the least discussed in developer communities. You will not find it dominating Hacker News threads or Stack Overflow questions. But walk into a large bank, a global logistics company, or an automotive manufacturer's integration team, and there is a reasonable chance Solace is quietly moving millions of messages per second beneath the surface. It is the enterprise dark horse — technically impressive, commercially significant, and almost entirely invisible to the broader developer consciousness.
This chapter examines Solace PubSub+ honestly: what it does well, what it does not, and whether its enterprise-oriented design is a strength or a limitation depending on your context.
Overview
What It Is
Solace PubSub+ is a multi-protocol event broker that supports publish-subscribe messaging, message queuing, request-reply, and event streaming. It is available as a software broker, a managed cloud service (PubSub+ Cloud), and — uniquely among modern brokers — a hardware appliance. The platform's distinguishing feature is its "event mesh" vision: the ability to interconnect multiple brokers across data centres, clouds, and edge locations into a unified messaging fabric.
Brief History
Solace Systems was founded in 2001 in Ottawa, Canada, by Craig Betts. The original premise was that general-purpose servers running software-based message brokers could not deliver the latency and throughput that financial services firms needed. Solace's answer was purpose-built hardware: custom ASICs and FPGAs designed specifically for message routing. If you wanted single-digit microsecond message latency in 2005, your options were essentially Solace appliances, 29West (later acquired by Informatica), or writing your own kernel-bypass solution and hoping your team included someone who understood FPGA programming.
The hardware appliance business was profitable but inherently limited in addressable market. Not everyone needs — or can afford — custom messaging hardware. Around 2014-2015, Solace began a strategic pivot: they ported the broker's functionality to a software-only deployment. The software broker maintained the same API, protocol support, and management interfaces as the appliance but ran on commodity hardware and, crucially, in virtual machines and containers.
In 2019, Solace launched PubSub+ Cloud, their fully managed service offering. This completed the transformation from niche hardware vendor to multi-modal messaging platform. The company rebranded from "Solace Systems" to simply "Solace" and pushed heavily into the "event mesh" narrative — positioning itself as the connective tissue for enterprise event-driven architectures spanning multiple clouds and on-premises data centres.
Solace remains a private company, which means limited public financial data. They have raised venture funding and claim thousands of enterprise customers. The customer base skews heavily toward financial services, logistics, healthcare, and large-scale manufacturing — industries where multi-protocol support, guaranteed delivery, and enterprise-grade management are table stakes, not nice-to-haves.
Who Runs It
Solace the company develops and maintains PubSub+ as a commercial product. There is no open-source core in the Apache Kafka sense. The software broker has a free "Standard" edition with capacity limits, and there are "Enterprise" and "Mission Critical" tiers with progressively more features and capacity. PubSub+ Cloud has a free tier as well, which is genuinely useful for development and small-scale testing.
This is important to understand: Solace is a commercial product first. The community edition exists to lower the adoption barrier, but the business model is enterprise licensing and cloud subscriptions. If you are philosophically committed to open-source infrastructure, Solace is probably not for you. If you are pragmatically committed to solving enterprise messaging problems and have a procurement department, keep reading.
Architecture
The Broker
At its core, PubSub+ is a message broker — a process that accepts connections from clients, receives messages, and routes them to interested consumers. So far, standard fare. What distinguishes Solace architecturally is the breadth of messaging patterns it supports natively and the efficiency of its internal routing engine.
The broker is written in C/C++, not Java. This is relevant because it means PubSub+ does not suffer from JVM garbage collection pauses. For workloads that require consistent low-latency message delivery — financial market data, real-time control systems — GC pauses are not a theoretical concern but a concrete operational headache. Solace avoids this category of problem entirely.
The broker process manages:
- Client connections across multiple protocols simultaneously (more on this below)
- Message routing using a topic-based hierarchy with wildcard subscription support
- Persistent message storage for guaranteed delivery
- Message replay for re-consuming historical messages
- Queue management for competing consumer patterns
A single broker instance can handle tens of thousands of concurrent client connections and millions of messages per second, depending on message size and delivery guarantees. The hardware appliance version pushes these numbers significantly higher — Solace claims sub-microsecond latency on their appliance, which is believable given the dedicated FPGA routing path.
Message VPNs
One of Solace's more useful architectural concepts is the Message VPN (Virtual Private Network). A Message VPN is a virtual partition of the broker that provides complete isolation of messaging resources: topics, queues, client connections, and access controls.
Think of it as multi-tenancy at the broker level. A single PubSub+ broker can host multiple Message VPNs, each with its own:
- Client username/password database or LDAP/RADIUS integration
- Topic space (the same topic name in two different VPNs refers to different logical destinations)
- Queues and subscriptions
- Rate limits and connection quotas
- ACLs and access profiles
This is genuinely valuable in enterprise environments where a single messaging infrastructure needs to serve multiple teams, applications, or business units with isolation guarantees. You can run development, staging, and production traffic on the same broker cluster by separating them into different VPNs — though whether you should is a separate discussion about blast radius and risk appetite.
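To make the isolation concrete, here is a few lines of illustrative Python — a deliberately naive stand-in for the broker, not Solace's implementation — showing that the same topic name in two VPNs is a different logical destination:

```python
class VpnBroker:
    """Naive illustration of Message VPN isolation: each VPN has
    its own topic space, so identical topic names never collide."""

    def __init__(self):
        self.vpns = {}  # vpn name -> {topic -> list of messages}

    def publish(self, vpn, topic, msg):
        self.vpns.setdefault(vpn, {}).setdefault(topic, []).append(msg)


vpn_broker = VpnBroker()
vpn_broker.publish("dev", "orders/new", "test-order")
vpn_broker.publish("prod", "orders/new", "real-order")

# Same topic name, completely separate destinations
assert vpn_broker.vpns["dev"]["orders/new"] == ["test-order"]
assert vpn_broker.vpns["prod"]["orders/new"] == ["real-order"]
```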
Guaranteed and Direct Messaging
Solace distinguishes between two fundamental messaging modes, and understanding this distinction is essential:
Direct Messaging is fire-and-forget. The producer sends a message, the broker routes it to all matching subscribers, and if a subscriber is not connected or cannot keep up, the message is lost. Direct messaging is fast — it avoids the overhead of persistence, acknowledgement, and retry — and is appropriate for data that is time-sensitive but not individually critical. Market data ticks, sensor readings, and real-time metrics are classic direct messaging use cases. If you miss one price update, the next one arrives in milliseconds and supersedes it.
Guaranteed Messaging provides once-and-only-once delivery semantics. Messages are persisted to queues, acknowledged by consumers, and redelivered on failure. The broker writes messages to disk (or to the hardware appliance's memory), maintains delivery state per consumer, and ensures that every message reaches its intended destination even if consumers disconnect or crash.
The two modes can coexist on the same broker, which is useful because most real-world systems have a mix of "best effort is fine" and "every message matters" requirements. A trading system might use direct messaging for streaming quotes and guaranteed messaging for order execution confirmations.
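The behavioural difference between the two modes can be sketched with a toy in-memory model. This is purely illustrative — a real broker persists to disk and tracks per-consumer acknowledgement state — but it captures the essential contract: direct messages vanish if nobody is connected, guaranteed messages wait.

```python
class ToyBroker:
    """Toy model of the two delivery modes: direct messages are
    dropped when no subscriber is connected; guaranteed messages
    are queued until a consumer reads and acknowledges them."""

    def __init__(self):
        self.online = {}   # subscriber id -> inbox of delivered messages
        self.queues = {}   # subscriber id -> pending guaranteed messages

    def bind_queue(self, sub_id):
        # guaranteed messaging requires a durable queue endpoint
        self.queues.setdefault(sub_id, [])

    def connect(self, sub_id):
        self.online[sub_id] = []

    def disconnect(self, sub_id):
        self.online.pop(sub_id, None)

    def publish_direct(self, msg):
        # fire-and-forget: only currently connected subscribers see it
        for inbox in self.online.values():
            inbox.append(msg)

    def publish_guaranteed(self, msg):
        # persisted to every bound queue, connected or not
        for pending in self.queues.values():
            pending.append(msg)


broker = ToyBroker()
broker.bind_queue("orders-service")

# While the subscriber is offline:
broker.publish_direct("price-tick-1")    # lost -- nobody is connected
broker.publish_guaranteed("order-1")     # queued for later delivery

broker.connect("orders-service")
assert broker.queues["orders-service"] == ["order-1"]  # survived the outage
assert broker.online["orders-service"] == []           # the tick is gone
```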
Topic Hierarchy and Subscriptions
Solace uses a hierarchical topic structure with levels separated by slashes:
orders/region/EMEA/currency/EUR
orders/region/APAC/currency/JPY
sensors/building-7/floor-3/temperature
Subscribers can use wildcards:
- * matches a single level: orders/region/*/currency/EUR matches all EUR orders regardless of region
- > matches one or more levels at the end: sensors/building-7/> matches all sensors in building 7
This is more expressive than Kafka's flat topic names and simpler than RabbitMQ's exchange/binding/routing-key model. The hierarchical topic system allows fine-grained subscription without requiring message filtering at the consumer level. Consumers receive only the messages that match their subscription, which reduces bandwidth and processing overhead.
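The matching rules above can be sketched in plain Python. This is a simplified model, not Solace's implementation — among other things, Solace also supports `*` as a prefix within a level (e.g. `ord*`), which this sketch ignores:

```python
def matches(subscription: str, topic: str) -> bool:
    """Simplified Solace-style topic matching: '*' matches exactly
    one level, and a trailing '>' matches one or more remaining
    levels. (Prefix wildcards like 'ord*' are not modelled.)"""
    subs = subscription.split("/")
    tops = topic.split("/")
    for i, level in enumerate(subs):
        if level == ">":
            # only valid as the final level; must consume >= 1 level
            return i == len(subs) - 1 and len(tops) > i
        if i >= len(tops) or (level != "*" and level != tops[i]):
            return False
    return len(subs) == len(tops)


# '*' matches a single level, whatever the region
assert matches("orders/region/*/currency/EUR",
               "orders/region/EMEA/currency/EUR")
assert not matches("orders/region/*/currency/EUR",
                   "orders/region/APAC/currency/JPY")

# '>' matches everything below building-7, but not the bare prefix
assert matches("sensors/building-7/>",
               "sensors/building-7/floor-3/temperature")
assert not matches("sensors/building-7/>", "sensors/building-7")
```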
The broker maintains a subscription routing table that maps topic patterns to connected consumers. The routing engine evaluates incoming messages against this table and delivers copies to all matching subscribers. On the hardware appliance, this routing is performed by custom silicon; on the software broker, it is an optimised C++ implementation.
High Availability and Redundancy
PubSub+ supports an active-standby redundancy model for high availability. Two brokers (or appliances) operate as a redundancy pair:
- The primary broker handles all client traffic and message routing
- The backup broker maintains a synchronised copy of all persistent state
- If the primary fails, the backup takes over with minimal message loss
The failover is automatic and typically completes in seconds. Clients using the Solace SDK (with configured host lists) reconnect automatically. This is simpler than Kafka's partition-leader model or Pulsar's bookie architecture, but it comes with a trade-off: the active-standby model does not provide horizontal scaling for message routing. A single broker (or redundancy pair) handles all traffic. You scale by adding more Message VPNs, using event mesh to distribute load across sites, or — honestly — buying a bigger box.
For the hardware appliance, "buying a bigger box" is literal. Solace sells appliance models with progressively more capacity. For the software broker, you scale vertically (more CPU, more memory, faster disks) or distribute horizontally using Dynamic Message Routing (DMR), covered later in this chapter.
Protocol Support
This is where Solace genuinely shines, and where its enterprise heritage becomes an unambiguous advantage.
PubSub+ natively supports:
- SMF (Solace Message Format): Solace's proprietary binary protocol. Highest performance, lowest latency, most features. Used by Solace's native SDKs for Java, C, C#, JavaScript, and others.
- MQTT 3.1.1 and 5.0: Full MQTT broker, including QoS 0, 1, and 2, retained messages, and last will and testament. Useful for IoT device connectivity.
- AMQP 1.0: The OASIS standard messaging protocol. Interoperates with any AMQP 1.0 client library.
- REST: Simple HTTP POST for producing messages and webhooks for consuming them. Not the highest performance, but universally accessible.
- JMS 1.1 and 2.0: Full JMS provider implementation. Drop-in replacement for ActiveMQ, IBM MQ, or TIBCO EMS for Java applications.
- WebSocket: For browser-based messaging applications, often used with the JavaScript SDK.
All of these protocols can operate simultaneously on a single broker instance. An MQTT IoT device can publish a temperature reading to sensors/building-7/floor-3/temperature, and a JMS Java application subscribed to sensors/building-7/> will receive it seamlessly. The broker handles protocol translation internally.
This multi-protocol capability is not a superficial feature. Many enterprise environments have decades of messaging infrastructure across multiple technologies: legacy JMS applications, modern microservices, IoT devices, partner integrations over REST. Solace can genuinely serve as a single broker that speaks all of these protocols, replacing multiple dedicated systems with one platform.
Whether you want to concentrate that much messaging infrastructure into a single vendor is a reasonable question. But the capability is real.
Event Mesh and Dynamic Message Routing
The Event Mesh Concept
Solace's "event mesh" is the company's most ambitious architectural concept and its primary differentiation from other brokers. An event mesh is a network of interconnected PubSub+ brokers spanning multiple data centres, clouds, and edge locations. Messages published to any broker in the mesh are automatically routed to any other broker where there are matching subscriptions.
The idea is compelling: a global, multi-cloud messaging fabric where applications publish and subscribe to topics without caring about where other applications are running. A service in AWS publishing to orders/region/EMEA/new has its messages automatically delivered to a subscriber running on-premises in Frankfurt, without the publisher or subscriber knowing or caring about the routing path.
Dynamic Message Routing (DMR)
The mechanism that makes this work is Dynamic Message Routing. When a broker in the mesh receives a client subscription, it propagates that subscription to neighbouring brokers. When a message matching that subscription is published on any broker in the mesh, the message is forwarded hop-by-hop through the mesh to the broker where the subscriber is connected.
DMR is "dynamic" because:
- Subscription propagation is automatic — no manual configuration of message bridges or forwarding rules
- Routes are established and torn down as clients connect and disconnect
- The mesh adapts to topology changes (broker additions, removals, failures)
This is a genuine differentiator. Building equivalent functionality with Kafka requires MirrorMaker 2 or Confluent Cluster Linking, both of which are topic-level replication tools rather than subscription-aware routing. With Solace, you do not replicate entire topics between clusters; you route individual messages based on actual consumer interest. This is more efficient for scenarios where only a subset of messages in a topic are needed at a remote site.
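A toy sketch of the routing idea — my own simplification, not Solace's algorithm; it models only one-hop propagation and omits the loop prevention a real mesh needs — shows why this differs from topic-level replication: each broker records which peers have expressed interest in which topics, and forwards a message only where a matching subscription exists.

```python
class MeshBroker:
    """Toy model of subscription-aware routing between brokers.
    One-hop propagation only; real DMR propagates subscriptions
    through the mesh and prevents forwarding loops."""

    def __init__(self, name):
        self.name = name
        self.peers = []            # neighbouring brokers in the mesh
        self.local_subs = set()    # topics local clients subscribed to
        self.remote_interest = {}  # peer broker -> set of topics
        self.delivered = []        # messages handed to local clients

    def link(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def subscribe(self, topic):
        # a local subscription is automatically propagated to neighbours
        self.local_subs.add(topic)
        for peer in self.peers:
            peer.remote_interest.setdefault(self, set()).add(topic)

    def publish(self, topic, msg):
        if topic in self.local_subs:
            self.delivered.append((topic, msg))
        # forward only to peers with a matching subscription --
        # unlike topic replication, which copies everything
        for peer, topics in self.remote_interest.items():
            if topic in topics:
                peer.publish(topic, msg)


aws = MeshBroker("aws")
frankfurt = MeshBroker("frankfurt")
aws.link(frankfurt)

frankfurt.subscribe("orders/region/EMEA/new")

aws.publish("orders/region/EMEA/new", "order-1")  # routed across the mesh
aws.publish("orders/region/APAC/new", "order-2")  # no interest: not forwarded

assert frankfurt.delivered == [("orders/region/EMEA/new", "order-1")]
assert aws.delivered == []
```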
The event mesh concept is powerful but comes with caveats:
- Latency: Inter-broker message forwarding adds latency proportional to the number of hops and the network distance between brokers. A message routed from Singapore to London through Frankfurt involves real physics.
- Complexity: Debugging message routing in a multi-site mesh is harder than debugging a single broker. "Why did this message not arrive?" becomes "Which broker in the mesh was supposed to route it, and did the subscription propagate correctly?"
- Vendor lock-in: An event mesh is inherently a Solace-to-Solace technology. You are building your global messaging architecture around a single vendor's product. If that vendor's pricing changes, or your technical needs diverge from their roadmap, extraction is painful.
Strengths
Genuine Multi-Protocol Support
Not a half-baked MQTT bolt-on or a JMS compatibility shim. Solace's protocol support is native, tested, and complete. If you have a heterogeneous environment — and most enterprises do — this eliminates an entire category of integration headaches.
The Event Mesh Vision
For organisations operating across multiple clouds and data centres, the event mesh concept is genuinely ahead of the competition. No other broker offers comparable built-in multi-site, subscription-aware message routing. Whether you need it is one question. Whether anyone else offers it as a first-class feature is not — they do not.
Enterprise Feature Set
Access control, Message VPNs, rate limiting, quota management, audit logging, LDAP integration, redundancy groups, monitoring APIs — Solace has the full complement of features that enterprise procurement and security teams require. These are features that you build yourself (poorly) on top of Kafka, or buy through Confluent's enterprise tier. Solace includes them in the platform.
Hardware Appliance Option
For ultra-low-latency use cases where software brokers — no matter how well optimised — introduce unacceptable jitter, the hardware appliance option is unique in the market. FPGA-based message routing with deterministic sub-microsecond latency is a capability you simply cannot get from software. If you need it, you need it, and Solace is the only mainstream vendor that offers it.
Consistent Low Latency
Even the software broker, written in C/C++ without JVM overhead, delivers consistent low-latency performance. The absence of garbage collection pauses means latency distributions are tighter than JVM-based brokers, which matters for P99 and P99.9 SLAs.
Weaknesses
Proprietary Core Protocol
SMF is Solace's native protocol and the one that delivers the best performance and most complete feature set. It is proprietary. Using SMF means using Solace's SDKs, which means coupling your application code to Solace's libraries. You can use AMQP or MQTT to reduce this coupling, but you lose features and potentially performance in the process.
This is the fundamental tension with Solace: the best experience is the most proprietary experience. The standards-based experience is available but second-tier.
Pricing Opacity
Solace does not publish pricing for its Enterprise or Mission Critical tiers. The cloud service has published pricing, but the self-managed software broker pricing requires "contact sales." In an industry that has moved toward transparent, publicly listed pricing (Confluent, AWS, most SaaS products), this is an irritant that suggests the price is either high, highly variable, or both.
If you are an enterprise with a procurement team, this is business as usual. If you are a startup trying to evaluate total cost of ownership, the lack of transparent pricing is a significant obstacle to even beginning the evaluation.
Smaller Community
Compare the size of the Kafka, RabbitMQ, or even NATS community — Stack Overflow questions, blog posts, conference talks, third-party tools, open-source integrations — with Solace's community, and the difference is stark. This matters practically: when you hit an obscure problem at 2 AM, the probability of finding a relevant Stack Overflow answer or blog post is much lower for Solace than for Kafka.
Solace has a community portal, documentation, and sample code. Their documentation is actually quite good. But the volume of community-generated knowledge is a fraction of what the open-source brokers enjoy.
Vendor Dependency
If you build your architecture around Solace's event mesh, Message VPNs, and SMF protocol, you are deeply coupled to a single vendor. More deeply than using Kafka (which has Redpanda, multiple managed offerings, and an open protocol) or RabbitMQ (which is open source). This is not necessarily fatal — plenty of enterprises run on proprietary infrastructure successfully — but it should be an eyes-open decision.
Scaling Model
The active-standby redundancy model does not scale horizontally for a single Message VPN's throughput the way Kafka's partition model does. You scale by distributing across an event mesh, which adds latency and complexity, or by vertical scaling. For workloads that require massive throughput at a single site, Kafka's horizontal partitioning model is more natural.
Ideal Use Cases
Large Enterprises with Multi-Protocol Needs
If your environment includes legacy JMS applications, modern microservices, IoT devices, and partner integrations over REST, Solace can genuinely serve as a single messaging platform. Replacing four different brokers with one — even a proprietary one — reduces operational burden, simplifies monitoring, and eliminates inter-broker bridges.
Hybrid Cloud and Multi-Cloud
The event mesh concept is purpose-built for organisations that operate across multiple clouds and on-premises data centres. If your architecture spans AWS, Azure, and an on-premises data centre, and you need messages to flow transparently between them, Solace offers this as a core capability rather than a bolted-on afterthought.
Financial Services
Solace's origins in financial services are evident in its feature set: low-latency messaging, guaranteed delivery, hardware appliance option for ultra-low-latency paths, Message VPN isolation for regulatory compartmentalisation. Banks and trading firms are a natural fit and represent a significant portion of Solace's customer base.
Event-Driven Architecture Consolidation
Enterprises that have accumulated multiple messaging systems over the years — IBM MQ here, ActiveMQ there, a Kafka cluster in the corner, an MQTT broker for IoT — and want to consolidate onto a single platform. Solace's protocol support makes gradual migration feasible: you can move applications one at a time without rewriting all of them simultaneously.
Operational Reality
PubSub+ Manager
PubSub+ Manager is the web-based management interface. It provides configuration management for Message VPNs, queues, topics, and client connections. The interface is enterprise-grade — comprehensive, if not beautiful. Compared to Kafka's "management is a CLI tool and some third-party web UIs" situation, PubSub+ Manager is a significant step up for day-to-day operations.
You can also manage everything through a RESTful management API (SEMP — Solace Element Management Protocol), which is well-documented and scriptable. Infrastructure-as-code through Terraform is supported via a Solace-maintained Terraform provider.
Monitoring
The broker exposes metrics through SEMP, and Solace provides integration with Prometheus, Grafana, Datadog, and Splunk. The metrics are comprehensive: message rates, queue depths, client connections, replication lag, and resource utilisation.
One area where Solace genuinely excels is in client-level visibility. You can see individual client connections, their subscription sets, message rates, and connection metadata. When debugging "why is consumer X not receiving messages?", this level of visibility is invaluable.
Cloud vs Self-Managed
PubSub+ Cloud is the managed service option and removes the operational burden of running brokers. It is available on AWS, Azure, and GCP. The service handles broker provisioning, redundancy, software updates, and monitoring. For teams that do not want to operate messaging infrastructure, this is the obvious choice.
Self-managed deployments give you more control over configuration, networking, and placement but require your team to handle broker lifecycle management, upgrades, and monitoring infrastructure. Solace provides Docker images, Kubernetes Helm charts, and VMware templates for deployment.
The Free Tier
PubSub+ Cloud's free tier and the free software broker edition deserve mention. The cloud free tier provides a single broker with limited capacity — sufficient for development, prototyping, and learning. The software broker's Standard edition is free up to generous connection and queue limits.
This lowers the barrier to evaluation significantly. You can deploy a PubSub+ broker in Docker, connect to it with multiple protocols, and evaluate the feature set without talking to a sales representative. The experience is smooth, the documentation is clear, and you can form a reasonable opinion of the platform without spending money. That said, production deployments will almost certainly require a paid tier, and that is where the "contact sales" pricing conversation begins.
Code Examples
Java (JCSMP — Solace's Native API)
import com.solacesystems.jcsmp.*;
public class SolaceProducerExample {
public static void main(String[] args) throws JCSMPException {
// Create session properties
JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, "tcp://localhost:55555");
properties.setProperty(JCSMPProperties.VPN_NAME, "default");
properties.setProperty(JCSMPProperties.USERNAME, "admin");
properties.setProperty(JCSMPProperties.PASSWORD, "admin");
// Create session
JCSMPSession session = JCSMPFactory.onlyInstance()
.createSession(properties);
session.connect();
// Create a producer (XMLMessageProducer)
XMLMessageProducer producer = session.getMessageProducer(
new JCSMPStreamingPublishCorrelatingEventHandler() {
@Override
public void responseReceivedEx(Object correlationKey) {
System.out.println("Message acknowledged: "
+ correlationKey);
}
@Override
public void handleErrorEx(Object correlationKey,
JCSMPException cause, long timestamp) {
System.err.println("Message failed: " + cause);
}
});
// Create and send a persistent message
Topic topic = JCSMPFactory.onlyInstance()
.createTopic("orders/region/EMEA/currency/EUR");
TextMessage message = JCSMPFactory.onlyInstance()
.createMessage(TextMessage.class);
message.setText("{\"type\":\"OrderPlaced\","
+ "\"orderId\":\"ord-7829\","
+ "\"amount\":149.99,"
+ "\"currency\":\"EUR\"}");
message.setDeliveryMode(DeliveryMode.PERSISTENT);
message.setCorrelationKey("ord-7829");
producer.send(message, topic);
System.out.println("Message sent to " + topic.getName());
// Cleanup
session.closeSession();
}
}
import com.solacesystems.jcsmp.*;
public class SolaceConsumerExample {
public static void main(String[] args) throws JCSMPException {
JCSMPProperties properties = new JCSMPProperties();
properties.setProperty(JCSMPProperties.HOST, "tcp://localhost:55555");
properties.setProperty(JCSMPProperties.VPN_NAME, "default");
properties.setProperty(JCSMPProperties.USERNAME, "admin");
properties.setProperty(JCSMPProperties.PASSWORD, "admin");
JCSMPSession session = JCSMPFactory.onlyInstance()
.createSession(properties);
session.connect();
// Bind to a queue for guaranteed messaging
Queue queue = JCSMPFactory.onlyInstance()
.createQueue("orders-queue");
ConsumerFlowProperties flowProperties =
new ConsumerFlowProperties();
flowProperties.setEndpoint(queue);
flowProperties.setAckMode(JCSMPProperties.SUPPORTED_MESSAGE_ACK_CLIENT);
FlowReceiver consumer = session.createFlow(
new XMLMessageListener() {
@Override
public void onReceive(BytesXMLMessage message) {
if (message instanceof TextMessage) {
String text = ((TextMessage) message).getText();
System.out.println("Received: " + text);
}
// Acknowledge the message
message.ackMessage();
}
@Override
public void onException(JCSMPException exception) {
System.err.println("Consumer error: " + exception);
}
},
flowProperties
);
consumer.start();
System.out.println("Consumer listening on " + queue.getName());
// Keep running
try {
Thread.sleep(Long.MAX_VALUE);
} catch (InterruptedException e) {
consumer.close();
session.closeSession();
}
}
}
The JCSMP API is verbose in a way that will feel familiar to anyone who has worked with JMS or other enterprise Java messaging APIs. It is not pretty, but it is explicit about what is happening. Every connection property, delivery mode, and acknowledgement behaviour is visible in the code, which is a feature when you are debugging production issues at 3 AM.
Python (solace-pubsubplus)
from solace.messaging.messaging_service import MessagingService
from solace.messaging.resources.topic import Topic
# Configure the messaging service
broker_props = {
"solace.messaging.transport.host": "tcp://localhost:55555",
"solace.messaging.service.vpn-name": "default",
"solace.messaging.authentication.scheme.basic.username": "admin",
"solace.messaging.authentication.scheme.basic.password": "admin",
}
messaging_service = MessagingService.builder() \
.from_properties(broker_props) \
.build()
messaging_service.connect()
# Create a persistent publisher
publisher = messaging_service \
.create_persistent_message_publisher_builder() \
.build()
publisher.start()
# Publish a message
topic = Topic.of("orders/region/EMEA/currency/EUR")
message_body = '{"type":"OrderPlaced","orderId":"ord-7829"}'
outbound_message = messaging_service.message_builder() \
.with_application_message_id("ord-7829") \
.build(message_body)
publisher.publish(outbound_message, topic)
print(f"Message published to {topic}")
# Cleanup
publisher.terminate()
messaging_service.disconnect()
from solace.messaging.messaging_service import MessagingService
from solace.messaging.resources.queue import Queue
broker_props = {
"solace.messaging.transport.host": "tcp://localhost:55555",
"solace.messaging.service.vpn-name": "default",
"solace.messaging.authentication.scheme.basic.username": "admin",
"solace.messaging.authentication.scheme.basic.password": "admin",
}
messaging_service = MessagingService.builder() \
.from_properties(broker_props) \
.build()
messaging_service.connect()
# Create a persistent receiver bound to a queue
queue = Queue.durable_exclusive_queue("orders-queue")
receiver = messaging_service \
.create_persistent_message_receiver_builder() \
.build(queue)
receiver.start()
print("Consumer listening...")
# Blocking receive loop
while True:
message = receiver.receive_message(timeout=5000)
if message is not None:
payload = message.get_payload_as_string()
print(f"Received: {payload}")
receiver.ack(message)
The Python SDK is more modern than the Java JCSMP API — builder pattern, cleaner naming — but still carries the weight of enterprise messaging abstractions. It is not as terse as a NATS or Redis client, but it exposes the full capabilities of the broker.
REST Producer
# Publish a message via REST
# This works with any HTTP client — no SDK required
curl -X POST \
"http://localhost:9000/orders/region/EMEA/currency/EUR" \
-H "Content-Type: application/json" \
-H "Solace-Delivery-Mode: persistent" \
-d '{"type":"OrderPlaced","orderId":"ord-7829","amount":149.99}'
The REST interface is Solace's secret weapon for quick integrations. Any system that can make an HTTP POST can publish messages to Solace. No SDK, no library dependency, no protocol-specific knowledge required. The URL path maps to the topic hierarchy. For webhook-style integrations and systems written in languages without a Solace SDK, this is invaluable.
The trade-off is performance and feature completeness. REST is HTTP, which means TCP connection overhead, HTTP header overhead, and no persistent session state. For high-throughput producers, the native SDK over SMF is orders of magnitude more efficient. For low-volume integrations and quick scripts, REST is perfect.
Verdict
Solace PubSub+ is a genuinely capable messaging platform that suffers primarily from being in the wrong narrative at the wrong time. The industry's attention has been captured by open-source, developer-community-driven projects — Kafka, NATS, RabbitMQ — and Solace's enterprise-first, sales-driven go-to-market does not generate blog posts, conference talks, or Twitter enthusiasm. This is a marketing problem, not a technology problem.
The technology is solid. Multi-protocol support is best-in-class. The event mesh concept is architecturally sound and genuinely ahead of the competition. The hardware appliance option is unique. The operational tooling is mature. For the right use case — large enterprise, multi-protocol environment, hybrid cloud, financial services — Solace is a strong choice that solves real problems that other brokers either cannot solve or solve only with significant additional effort.
The concerns are equally real. Vendor lock-in with SMF is meaningful. Pricing opacity is frustrating. The smaller community means less collective knowledge and fewer third-party integrations. The active-standby scaling model is less flexible than Kafka's horizontal partitioning. And building your global messaging architecture around a single commercial vendor requires a level of trust in that vendor's long-term viability and pricing stability.
The practical recommendation:
- If you are a large enterprise with existing multi-protocol messaging infrastructure and you need to consolidate or extend it across clouds, Solace deserves a serious evaluation. It may be the only platform that can genuinely replace multiple brokers with one.
- If you are building a new system on a single cloud provider, your cloud provider's native messaging services (SQS/SNS, Google Pub/Sub, Azure Event Hubs) are probably simpler and cheaper. Solace adds value when you span multiple environments.
- If you are a startup or small team, Solace is likely overkill. The free tier is great for learning, but the enterprise feature set is designed for enterprise problems. Use NATS or RabbitMQ and revisit when your problems are enterprise-scale.
- If you need the hardware appliance for ultra-low-latency, there is no alternative in the managed broker space. Solace or a custom-built solution are your options.
Solace PubSub+ is the answer to "I need one messaging platform that speaks every protocol, runs everywhere, and has enterprise governance built in." If that is your question, Solace is likely the best answer available. If that is not your question, there are simpler, cheaper, more community-supported options. Know your question before evaluating the answer.
Chronicle Queue
We have now left the territory of general-purpose message brokers and entered the domain of people who measure latency in microseconds and consider garbage collection a personal affront. Chronicle Queue is not a message broker in any conventional sense. It is a Java library for inter-process and inter-thread communication that happens to be extraordinarily fast, and it exists because one man — Peter Lawrey — decided that the JVM's standard approach to memory management was an obstacle rather than a feature.
If you are building a web application that processes a few thousand events per second, Chronicle Queue is not for you. Close this chapter and go back to Kafka or RabbitMQ. If you are building a trading system where the difference between 10 microseconds and 100 microseconds is the difference between profit and loss, keep reading. This is your chapter.
Overview
What It Is
Chronicle Queue is a persisted, low-latency messaging library for Java. It provides an append-only journal (queue) that one or more writer threads can write to and one or more reader threads can read from, with typical latencies measured in single-digit microseconds. Messages are persisted to memory-mapped files and survive process restarts. Multiple processes on the same machine can share a queue through the filesystem.
It is not a network service. There is no broker process, no port to connect to, no cluster to configure. Chronicle Queue is a library that you embed in your Java application. Communication happens through shared access to files on disk. If this sounds primitive, you are not wrong — it is closer to how Unix pipes work than how Kafka works — and that simplicity is precisely why it is so fast.
Brief History
Chronicle Queue was created by Peter Lawrey, founder of Chronicle Software (originally Higher Frequency Trading Ltd, a name that rather gives away the target market). Lawrey is a figure well known in the Java performance community — the sort of person who has opinions about CPU cache line sizes and knows what sun.misc.Unsafe does without consulting the documentation.
The project's origin story is straightforward: financial trading firms needed to pass messages between components of a trading system with minimal, predictable latency. The JVM's standard toolbox — BlockingQueue, ConcurrentLinkedQueue, NIO channels — was insufficient because:
- Garbage collection pauses. Any solution that allocates objects in the Java heap is subject to GC pauses, which are unpredictable and can range from milliseconds to seconds. In a trading system processing market data, a 50-millisecond GC pause means 50 milliseconds of missed price updates. That is an eternity.
- Serialization overhead. Converting Java objects to byte arrays and back is expensive. Standard serialization frameworks (Java serialization, Kryo, Protobuf) add latency.
- No persistence. In-memory queues are fast but lose data on process restart. Logging to disk typically involves system calls, buffer management, and blocking I/O.
Chronicle Queue addresses all three of these problems through a single mechanism: memory-mapped files accessed off-heap.
The open-source version (Chronicle Queue Community) is available under the Apache License 2.0 with some limitations. Chronicle Queue Enterprise, which adds replication across machines and additional features, is commercially licensed. Chronicle Software operates as a consulting and licensing business, selling to financial institutions and other latency-sensitive organisations.
Architecture
Memory-Mapped Files
The central idea behind Chronicle Queue is embarrassingly simple: use the operating system's virtual memory system as your message store.
When you create a Chronicle Queue, it creates files on disk (one per "cycle" — typically one file per day, configurable). These files are memory-mapped into the process's address space using MappedByteBuffer (or, more precisely, Chronicle's own memory-mapping implementation that bypasses some of the JDK's limitations). Once mapped, reading and writing to the queue is a memory operation — you write bytes to a memory address, and the operating system handles flushing those bytes to disk asynchronously.
This means:
- No explicit I/O calls. Writing a message is a memory copy, not a write() system call. The OS page cache handles persistence.
- No serialization to intermediate buffers. You write directly to the memory-mapped region.
- No garbage collection impact. The data lives outside the Java heap, in off-heap memory managed by the OS. The GC does not know about it and does not need to scan it.
- Persistence is free. The memory-mapped files are files — they survive process restarts. When you restart your application, you can resume reading from where you left off.
The append-only structure means concurrent writers use a simple sequencing mechanism (a CAS operation on the write position) rather than locks. Readers maintain their own read positions independently and never block writers.
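The mechanism can be sketched with plain JDK memory mapping. This is not Chronicle's implementation, just the underlying idea: map a file, write a length-prefixed record as a memory copy, and read it back through a second mapping. The path and message below are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedAppend {
    // Write one length-prefixed record via a memory mapping, then read it back.
    static String roundTrip(Path path) {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer writer = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            byte[] msg = "OrderPlaced:ord-7829".getBytes();
            writer.putInt(msg.length);  // length prefix, as in Chronicle's excerpt framing
            writer.put(msg);            // a memory copy, not a write() system call
            // The OS flushes the dirty page asynchronously; writer.force() would fsync.

            // A second mapping of the same file reads independently, like a tailer
            MappedByteBuffer reader = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4096);
            int len = reader.getInt();
            byte[] out = new byte[len];
            reader.get(out);
            return new String(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Path.of("/tmp/mapped-append.dat")));
    }
}
```

Note that no explicit flush appears on the write path: the record is durable in the OS page cache the moment the memory copy completes, which is exactly the trade Chronicle Queue makes by default.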
File Structure and Rolling
Chronicle Queue organises data into rolling files. By default, a new file is created for each day (the "daily" roll cycle). Each file contains a sequence of messages (called "excerpts" in Chronicle terminology), each preceded by a small header containing the message length and metadata.
chronicle-queue/
    20260320.cq4      # Today's file
    20260319.cq4      # Yesterday's file
    20260318.cq4      # Day before
    metadata.cq4t     # Index and metadata
The .cq4 files are the actual data files. The .cq4t file contains indexing information that allows efficient seeking to specific positions. Old files can be deleted to reclaim disk space — Chronicle Queue supports configurable retention.
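The daily roll cycle's file naming is simply the date; a trivial sketch of how such a name might be derived (the fileFor helper is hypothetical, but the yyyyMMdd.cq4 pattern matches the listing above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class RollName {
    // Daily roll cycle: one data file per calendar day, named yyyyMMdd.cq4
    static String fileFor(LocalDate day) {
        return day.format(DateTimeFormatter.ofPattern("yyyyMMdd")) + ".cq4";
    }

    public static void main(String[] args) {
        System.out.println(fileFor(LocalDate.of(2026, 3, 20))); // 20260320.cq4
    }
}
```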
Each excerpt in the file has a 4-byte header followed by the message data:
[4-byte header: length + metadata flags]
[message bytes]
[4-byte header: length + metadata flags]
[message bytes]
...
The header uses specific bit patterns to signal different states: a complete message, a message being written (not yet committed), padding, and end-of-file markers. Readers spin on the header word, waiting for it to transition from "being written" to "complete" — a form of busy-waiting that avoids the overhead of thread parking and notification.
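A toy sketch of that commit protocol, using an AtomicInteger as the header word and plain Java threads (Chronicle performs the equivalent on the memory-mapped region itself): the writer publishes the length only after the payload is fully in place, and the reader busy-spins until the header transitions from "being written" to "complete".

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

public class HeaderSpin {
    static final int WRITING = -1;                            // "being written" marker
    static final AtomicInteger header = new AtomicInteger(0); // 0 = empty slot
    static final ByteBuffer payload = ByteBuffer.allocate(64);

    static String receive() {
        Thread writer = new Thread(() -> {
            header.set(WRITING);                  // claim: message in progress
            byte[] msg = "PriceUpdate".getBytes();
            for (int i = 0; i < msg.length; i++) {
                payload.put(i, msg[i]);           // fill payload first...
            }
            header.set(msg.length);               // ...then commit by publishing length
        });
        writer.start();

        int len;
        while ((len = header.get()) <= 0) {       // spin until the header commits;
            Thread.onSpinWait();                  // no parking, no context switch
        }
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = payload.get(i);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(receive()); // PriceUpdate
    }
}
```

The AtomicInteger set/get pair provides the happens-before edge that makes the payload bytes visible to the spinning reader; Chronicle's real header word additionally encodes padding and end-of-file states.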
Lock-Free Design
Chronicle Queue uses no locks for its primary read and write paths. Writers coordinate through compare-and-swap operations on the write position. Readers are completely independent — they simply read from their current position and advance forward.
This lock-free design means:
- No contention between readers and writers. A slow reader does not block writers.
- No contention between multiple readers. Each reader has its own position.
- Minimal contention between multiple writers. The CAS on the write position serialises writes, but each write is a fast memory copy, so contention is brief.
The practical result is that Chronicle Queue scales well with the number of readers and tolerates multiple writers without significant degradation, as long as the machine's memory bandwidth is not saturated.
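The writer-side sequencing can be sketched with a single atomic counter. This shows the shape of the idea, not Chronicle's code: each writer CASes the shared write position forward by its message length, and whoever wins owns that region exclusively.

```java
import java.util.concurrent.atomic.AtomicLong;

public class WriteClaim {
    static final AtomicLong writePosition = new AtomicLong(0);

    // Claim `length` bytes of the append-only store; returns the start offset.
    static long claim(int length) {
        while (true) {
            long current = writePosition.get();
            if (writePosition.compareAndSet(current, current + length)) {
                return current;  // this writer now owns [current, current + length)
            }
            // CAS lost: another writer claimed first; loop and retry
        }
    }

    public static void main(String[] args) {
        long a = claim(64);
        long b = claim(128);
        // Claims never overlap: b starts exactly where a's region ends
        System.out.println(a + " then " + b);
    }
}
```

Readers need no counterpart: each keeps a private read offset and simply stops when it reaches an uncommitted header, so they never touch the writers' shared state.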
Chronicle Wire — The Serialisation Format
Chronicle Wire is the serialisation framework that Chronicle Queue uses by default. It is worth discussing because it is tightly integrated with the queue and contributes significantly to performance.
Wire supports multiple encoding formats:
- Binary Wire: A compact binary format optimised for speed. Field names are encoded as numeric codes (determined at compile time or first use), and values are written in their native binary representation. Integers are written as 4 or 8 bytes, not as decimal strings.
- Text Wire: A human-readable YAML-like format, useful for debugging and testing.
- Raw Wire: No framing at all — just raw bytes. Maximum performance, minimum convenience.
- JSON Wire: For interoperability with non-Java systems, though at this point you might question why you are using Chronicle Queue.
The key performance feature of Wire is that it can serialise and deserialise Java objects directly to and from off-heap memory without creating intermediate byte arrays. A Marshallable object writes its fields directly to the memory-mapped region. On the read side, fields are read directly from the mapped memory into local variables. No temporary objects are created, no byte arrays are allocated, and the garbage collector remains blissfully unaware that anything happened.
This is what "zero-copy" means in the Chronicle context: the data path from the writer's Java fields to the persistent store on disk involves no copying into intermediate buffers. The fields are written to the memory-mapped address, the OS eventually flushes the page to disk, and the reader reads from the same (or a different mapping of the same) memory-mapped address.
How It Achieves Microsecond Latency
It is worth enumerating specifically why Chronicle Queue is fast, because each design choice contributes to the overall latency profile:
- Off-heap storage. Data never touches the Java heap. The GC never scans it, never moves it, never pauses to collect it. This eliminates the single largest source of latency variability in Java applications.
- Memory-mapped I/O. Writing a message is a memory copy to a mapped region, not a system call. The OS handles persistence asynchronously. There is no fsync() on the critical path (by default — you can enable it at the cost of latency).
- No serialisation overhead. Chronicle Wire writes fields directly to memory. There is no intermediate byte[] allocation, no serialisation framework overhead, no object allocation.
- Lock-free algorithms. No mutex acquisition, no thread parking, no context switches in the fast path. Writers CAS on the write position. Readers busy-wait on the header word.
- Sequential access patterns. The append-only structure means all writes are sequential, which is optimal for both memory and disk hardware. There is no random I/O, no seeking, no B-tree traversal.
- CPU cache friendliness. Sequential writes and reads keep data in L1/L2 cache. The small header-plus-data format means multiple messages fit in a single cache line.
The combination of these factors produces typical write latencies of 1-2 microseconds for small messages (under a few hundred bytes) and read latencies that are similarly low. The 99th percentile latency is typically within 2-3x of the median, which is remarkable for a JVM-based system. For comparison, Kafka's producer latency is measured in milliseconds, and even Kafka's acks=0 fire-and-forget mode is orders of magnitude slower.
Garbage-Free Operation
"Garbage-free" is not marketing hyperbole — it is a specific, measurable claim. In a properly written Chronicle Queue application, the steady-state allocation rate in the Java heap is zero (or near zero). No objects are created during message writing or reading. This means:
- The GC has nothing to collect
- There are no young generation collections interrupting your application
- There are certainly no full GC pauses
- Latency is determined by your code and the hardware, not by the runtime
Achieving true garbage-free operation requires discipline on the application side. If your message handler allocates a HashMap for every message, you have re-introduced the problem that Chronicle Queue was designed to avoid. Chronicle provides tooling (-verbose:gc analysis, allocation profiling) to help identify and eliminate allocations.
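In miniature, the discipline looks like this: one mutable carrier object, allocated before the hot path and populated in place for each message, instead of a fresh allocation per message. The PriceEvent class and handler below are hypothetical; real Chronicle code writes these fields straight to off-heap memory.

```java
public class ReuseDiscipline {
    // A flyweight-style carrier: mutable fields, no per-message allocation
    static final class PriceEvent {
        long timestampNanos;
        double price;

        void set(long ts, double p) {
            this.timestampNanos = ts;
            this.price = p;
        }
    }

    static double processAll(int n) {
        PriceEvent event = new PriceEvent(); // allocated once, before the hot path
        double sum = 0;
        for (int i = 0; i < n; i++) {
            event.set(i, 100.0 + i);         // populate in place: zero allocation
            sum += event.price;              // "handle" the event
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(processAll(1_000));
    }
}
```

The naive version would allocate a new PriceEvent (or worse, a HashMap) per iteration, generating megabytes of garbage per second at realistic message rates and re-introducing exactly the GC pauses the library is designed to avoid.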
Replication: Chronicle Queue Enterprise
The open-source Chronicle Queue is a single-machine library. If you need data to be replicated to another machine — for disaster recovery, failover, or cross-site distribution — you need Chronicle Queue Enterprise.
Enterprise replication works by:
- A primary writer appends to a Chronicle Queue on the local filesystem
- A replication agent reads new excerpts and sends them over the network to one or more replicas
- Replicas append the received excerpts to their local Chronicle Queue files
- Consumers on the replica machine read from their local queue
The replication is asynchronous by default, which means there is a small window of data loss if the primary fails before replicated data is acknowledged. Synchronous replication is available but adds network round-trip latency to the write path — defeating the purpose of using Chronicle Queue for many use cases.
Enterprise also adds:
- Encryption at rest and in transit
- Access control for queue operations
- Monitoring and metrics via JMX and Prometheus
- Time-based and size-based retention management
- Delta compression for replication traffic
The licensing cost is not public and is negotiated per customer. For the target market (financial institutions with more money than patience for open-source support), this is expected. For everyone else, it is a barrier.
Strengths
Sub-Microsecond Latency
For small messages on modern hardware, Chronicle Queue delivers write latencies of a microsecond or two, dipping below a microsecond in favourable conditions. This is not a marketing benchmark; it is a measurable property of the library under normal operation. The combination of memory-mapped I/O, off-heap storage, and lock-free algorithms produces latency numbers that are simply unachievable with any network-based message broker.
Garbage-Free Operation
In a world where JVM garbage collection is the bane of low-latency Java applications, Chronicle Queue's ability to operate without generating garbage is a fundamental advantage. Deterministic latency on the JVM is possible — Chronicle Queue proves it — but it requires staying off-heap.
Deterministic Performance
The gap between median and tail latency is small. P99 is typically within 2-3x of median. For systems where tail latency matters — and in finance, it always matters — this predictability is as valuable as the raw speed.
Java Native
If your team writes Java (or Kotlin, or any JVM language), Chronicle Queue integrates naturally. It is a library, not a service. No operational overhead, no cluster management, no network hops. Add a Maven dependency, create a queue, start writing. The learning curve is primarily about understanding the off-heap and garbage-free programming discipline.
Persistence Without Performance Penalty
Unlike in-memory queues that lose data on restart, Chronicle Queue persists everything to disk via memory mapping — but the write path is a memory operation, not a disk operation. You get persistence without paying for it on the write path. The OS handles flushing to disk asynchronously, and the memory-mapped files survive process crashes (data that has been written to the mapped region is safe even if the process is killed, because it is in the OS page cache).
Weaknesses
Single-Machine Focus
The open-source version is a single-machine library. There is no built-in networking, no clustering, no distributed anything. If you need to pass messages between machines, you either use Chronicle Queue Enterprise (commercial), build your own network layer on top, or use a different tool. For many modern architectures — microservices running on multiple nodes, cloud-native applications, Kubernetes deployments — this is a fundamental limitation.
Java Only
Chronicle Queue is a Java library. It is deeply, inextricably Java. The off-heap memory management, the Wire serialisation format, the Marshallable interface — these are JVM constructs. If your system includes Python services, Go services, or anything not running on a JVM, Chronicle Queue cannot help you with inter-service communication.
There are some projects that provide non-JVM readers for Chronicle Queue's file format, but they are not first-class, not fully featured, and not what you want to bet a production system on.
Commercial License for Essential Features
Replication, encryption, access control — features that most production systems need — are behind the Enterprise commercial license. The open-source version is genuinely useful for single-machine inter-thread and inter-process communication, but the moment you need data on more than one machine, you are paying. This is a perfectly reasonable business model, but it limits the addressable use cases for the free version.
Not a Distributed System
Chronicle Queue does not do leader election, does not do consensus, does not do distributed transactions, does not do partition tolerance. It is a very fast file on a very specific machine. If that machine fails, your queue is unavailable (unless you have Enterprise replication configured). There is no automatic failover, no partition reassignment, no self-healing cluster.
This is not a bug — it is a deliberate design choice. Distributed consensus adds latency, and Chronicle Queue's entire reason for existence is minimising latency. But it means you are responsible for building reliability around it: replication (Enterprise), monitoring, failover procedures, and the knowledge that a single disk failure can take your queue offline.
Learning Curve for Garbage-Free Programming
Writing garbage-free Java is a skill that most Java developers have never needed to learn. The natural Java idiom — create objects, let the GC clean them up — is exactly what you cannot do in a Chronicle Queue application. This means:
- Object pools instead of new
- Primitive fields instead of boxed types
- Pre-allocated buffers instead of dynamic allocation
- Avoiding standard library classes that allocate internally (String concatenation, HashMap, ArrayList)
The tooling is available (allocation profiling, -verbose:gc monitoring), but the programming discipline is significant. Teams adopting Chronicle Queue for the first time should budget for a learning curve, especially if they are not already experienced in low-latency Java programming.
Ideal Use Cases
High-Frequency Trading
This is Chronicle Queue's home turf. Trading systems that need to process market data, calculate signals, and generate orders with deterministic sub-millisecond latency. The pattern is typically: market data comes in from the exchange (via a network interface), is written to a Chronicle Queue, picked up by a strategy component, which writes orders to another Chronicle Queue, which is read by an order gateway and sent to the exchange. Each queue hop adds single-digit microseconds.
Low-Latency Financial Systems
Beyond trading, any financial system where latency matters: risk calculation engines, position management systems, real-time pricing engines, order management systems. These systems often have the same requirements — fast, deterministic, persistent — and the same JVM ecosystem.
Inter-Thread and Inter-Process Communication
On a single machine, Chronicle Queue is an excellent inter-process communication mechanism. If you have multiple JVM processes that need to share a message stream — a producer process and multiple consumer processes, for example — Chronicle Queue provides this through the filesystem with better performance and persistence guarantees than named pipes, Unix sockets, or network loopback.
Audit and Journalling
The append-only, persistent nature of Chronicle Queue makes it a natural fit for audit logging and event journalling. Write every state change, every decision, every external interaction to a Chronicle Queue. The performance overhead is negligible (microseconds per write), the data is durable, and you have a complete, ordered, replayable record of everything that happened.
When NOT to Use It
- Distributed systems. If you need messages to flow between machines as a core requirement, not an afterthought, use a distributed message broker.
- Multi-language environments. If your services are written in Python, Go, and Java, Chronicle Queue only serves the Java components.
- Cloud-native architectures. Kubernetes pods with ephemeral storage are not a natural fit for memory-mapped files with specific filesystem requirements.
- General-purpose messaging. If your latency requirements are measured in milliseconds rather than microseconds, Kafka, NATS, or RabbitMQ are simpler, more flexible, and more broadly supported.
Comparison with Aeron
Chronicle Queue and Aeron (covered in the next chapter) are frequently mentioned together and occasionally confused. They occupy adjacent but distinct niches:
| Aspect | Chronicle Queue | Aeron |
|---|---|---|
| Primary use | Persistent journalling, inter-process messaging | Network messaging, IPC |
| Transport | Filesystem (memory-mapped files) | UDP, IPC (shared memory) |
| Persistence | Built-in (the queue is a file) | Optional (Aeron Archive) |
| Network support | Enterprise only | Built-in (UDP unicast/multicast) |
| Latency focus | Microsecond writes, GC-free | Nanosecond IPC, microsecond network |
| Replication | Enterprise feature | Aeron Cluster (Raft-based) |
| Philosophy | Persistent log as the primitive | Transport as the primitive |
The choice between them often comes down to the primary access pattern. If your dominant need is "write an ordered, persistent log that multiple local processes can read," Chronicle Queue is the natural fit. If your dominant need is "send messages between processes with minimal latency, potentially over a network," Aeron is the natural fit.
Many low-latency systems use both: Aeron for network transport between machines, and Chronicle Queue for persistent journalling and local inter-process communication. They are complementary, not competitive.
Code Examples
Basic Producer and Consumer
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

public class BasicExample {
    public static void main(String[] args) {
        String queuePath = "/tmp/chronicle-example";

        // Producer
        try (ChronicleQueue queue = ChronicleQueue.single(queuePath)) {
            ExcerptAppender appender = queue.acquireAppender();

            // Write a simple text message
            appender.writeText("Hello, Chronicle Queue");

            // Write structured data using a lambda
            appender.writeDocument(wire -> {
                wire.write("type").text("OrderPlaced");
                wire.write("orderId").text("ord-7829");
                wire.write("amount").float64(149.99);
                wire.write("currency").text("EUR");
                wire.write("timestamp").int64(System.nanoTime());
            });

            System.out.println("Messages written");
        }

        // Consumer
        try (ChronicleQueue queue = ChronicleQueue.single(queuePath)) {
            ExcerptTailer tailer = queue.createTailer();

            // Read the text message
            String text = tailer.readText();
            System.out.println("Read: " + text);

            // Read structured data
            tailer.readDocument(wire -> {
                String type = wire.read("type").text();
                String orderId = wire.read("orderId").text();
                double amount = wire.read("amount").float64();
                String currency = wire.read("currency").text();
                long timestamp = wire.read("timestamp").int64();
                System.out.printf("Order: %s %s %.2f %s at %d%n",
                        type, orderId, amount, currency, timestamp);
            });
        }
    }
}
Using Marshallable Objects (Garbage-Free)
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.wire.SelfDescribingMarshallable;

public class MarshalExample {

    // Define a message type — no garbage on write or read
    public static class OrderEvent extends SelfDescribingMarshallable {
        private String type;
        private String orderId;
        private double amount;
        private String currency;
        private long timestampNanos;

        // Setters return 'this' for fluent usage
        public OrderEvent type(String type) { this.type = type; return this; }
        public OrderEvent orderId(String orderId) { this.orderId = orderId; return this; }
        public OrderEvent amount(double amount) { this.amount = amount; return this; }
        public OrderEvent currency(String currency) { this.currency = currency; return this; }
        public OrderEvent timestampNanos(long ts) { this.timestampNanos = ts; return this; }

        // Getters
        public String type() { return type; }
        public String orderId() { return orderId; }
        public double amount() { return amount; }
        public String currency() { return currency; }
        public long timestampNanos() { return timestampNanos; }
    }

    public static void main(String[] args) {
        String queuePath = "/tmp/chronicle-marshal-example";

        // Reusable event object — allocated once, reused forever
        OrderEvent event = new OrderEvent();

        try (ChronicleQueue queue = ChronicleQueue.single(queuePath)) {
            ExcerptAppender appender = queue.acquireAppender();

            // Write 1,000,000 messages with zero garbage
            for (int i = 0; i < 1_000_000; i++) {
                event.type("OrderPlaced")
                     .orderId("ord-" + i)                // Note: String concat DOES allocate.
                     .amount(149.99 + i)                 // In a truly GC-free system,
                     .currency("EUR")                    // you would use a pre-allocated
                     .timestampNanos(System.nanoTime()); // StringBuilder.
                appender.writeDocument(event);
            }
            System.out.println("Wrote 1,000,000 events");
        }

        // Read them back
        OrderEvent readEvent = new OrderEvent(); // Reusable read object
        try (ChronicleQueue queue = ChronicleQueue.single(queuePath)) {
            ExcerptTailer tailer = queue.createTailer();
            int count = 0;
            while (tailer.readDocument(readEvent)) {
                count++;
                if (count % 250_000 == 0) {
                    System.out.printf("Read %d events, latest: %s %s %.2f%n",
                            count, readEvent.type(),
                            readEvent.currency(), readEvent.amount());
                }
            }
            System.out.printf("Total events read: %d%n", count);
        }
    }
}
Note the pattern: create a reusable object, populate it for each write, and reuse a read object for each read. This is the garbage-free discipline in action. The SelfDescribingMarshallable base class provides efficient serialisation to and from Chronicle Wire format, writing fields directly to the memory-mapped region.
The comment about String concatenation is deliberate. True garbage-free programming in Java requires vigilance about every allocation, including implicit ones from string operations, autoboxing, and iterator creation. Most applications will accept a small amount of allocation in non-critical paths and focus garbage-free discipline on the hot path.
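For completeness, a sketch of the pre-allocated StringBuilder pattern that the comment alludes to. The orderId helper is hypothetical; the point is that setLength(0) resets the builder without discarding its backing array, so the hot path allocates nothing.

```java
public class ReusableId {
    // Allocated once, with capacity for the longest expected ID
    static final StringBuilder id = new StringBuilder(32);

    // Build "ord-<i>" in place; callers that accept CharSequence read it directly
    static CharSequence orderId(int i) {
        id.setLength(0);          // reset without reallocating the backing array
        id.append("ord-").append(i); // append(int) writes digits, no String created
        return id;
    }

    public static void main(String[] args) {
        System.out.println(orderId(7829)); // ord-7829
    }
}
```

The catch is that the returned CharSequence is only valid until the next call, which is exactly the kind of lifetime discipline that garbage-free Java demands; Chronicle Wire's text-writing methods accept CharSequence for precisely this reason.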
Chronicle Wire Standalone
import net.openhft.chronicle.bytes.Bytes;
import net.openhft.chronicle.wire.Wire;
import net.openhft.chronicle.wire.WireType;

public class WireExample {
    public static void main(String[] args) {
        // Allocate a reusable buffer — off-heap
        Bytes<?> bytes = Bytes.elasticByteBuffer();

        // Write using binary wire (fast, compact)
        Wire wire = WireType.BINARY.apply(bytes);
        wire.write("eventType").text("PriceUpdate");
        wire.write("symbol").text("AAPL");
        wire.write("bid").float64(178.52);
        wire.write("ask").float64(178.55);
        wire.write("timestamp").int64(System.nanoTime());

        System.out.println("Serialized size: " + bytes.readRemaining() + " bytes");

        // Read it back
        Wire readWire = WireType.BINARY.apply(bytes);
        String eventType = readWire.read("eventType").text();
        String symbol = readWire.read("symbol").text();
        double bid = readWire.read("bid").float64();
        double ask = readWire.read("ask").float64();
        long ts = readWire.read("timestamp").int64();
        System.out.printf("%s: %s bid=%.2f ask=%.2f%n",
                eventType, symbol, bid, ask);

        // Convert to text wire for debugging
        Bytes<?> textBytes = Bytes.elasticByteBuffer();
        Wire textWire = WireType.TEXT.apply(textBytes);
        textWire.write("eventType").text("PriceUpdate");
        textWire.write("symbol").text("AAPL");
        textWire.write("bid").float64(178.52);
        textWire.write("ask").float64(178.55);
        System.out.println("Text format:\n" + textBytes);

        // Cleanup — release the off-heap buffers
        bytes.releaseLast();
        textBytes.releaseLast();
    }
}
Chronicle Wire deserves attention because it is the mechanism by which Chronicle Queue avoids the serialisation tax that most messaging systems pay. The binary format is compact (no field name strings in the output, just numeric codes), the encoding is direct (native byte order, no endian conversion for same-architecture communication), and the allocation is zero (off-heap buffers, reused).
Verdict
Chronicle Queue is a precision instrument. It does one thing — low-latency, persistent, local messaging on the JVM — and does it better than anything else available. The sub-microsecond write latency, garbage-free operation, and deterministic performance profile are not theoretical claims but empirically verified properties of the library in real production systems.
The precision comes with constraints. It is Java-only. It is single-machine by default. It is not a distributed system. The enterprise features that most production deployments need — replication, encryption, access control — require a commercial license. The programming discipline required for truly garbage-free operation is non-trivial and unfamiliar to most Java developers.
These constraints are not accidental — they are the direct consequence of the design decisions that make Chronicle Queue fast. Distributed consensus adds latency. Multi-language support adds abstraction layers. GC-friendly programming means objects on the heap. Every feature that Chronicle Queue lacks is a feature whose absence contributes to its performance.
The practical recommendation:

- If you are building a low-latency Java system on a single machine — trading system components, pricing engines, risk calculators, event sourcing journals — Chronicle Queue is likely the best tool available. Nothing else in the JVM ecosystem matches its combination of speed, persistence, and determinism.
- If you need inter-machine communication, evaluate Chronicle Queue Enterprise for replication and consider pairing it with Aeron (next chapter) for network transport. The combination of Chronicle Queue for local persistence and journalling with Aeron for network messaging is common in financial systems.
- If your latency requirements are measured in milliseconds, Chronicle Queue is overkill. Use Kafka, NATS, or RabbitMQ. The operational simplicity and ecosystem breadth of those systems outweigh Chronicle Queue's latency advantage when microseconds do not matter.
- If you are not writing Java, this is not your tool. There is no Python SDK, no Go client, no Node.js binding worth trusting in production.
Chronicle Queue exists because the JVM, despite its many strengths, has a fundamental tension between automatic memory management and deterministic latency. Peter Lawrey's answer was to sidestep the GC entirely, move data off-heap, and treat the filesystem as a communication channel. It is an unorthodox approach that works remarkably well within its constraints. Respect the constraints, and it will reward you with performance that makes other Java developers suspect you are lying about your latency numbers.
Aeron
If Chronicle Queue represents the philosophy that the filesystem is the fastest communication medium on a single machine, Aeron represents a more radical position: that the entire traditional networking stack — TCP, kernel buffers, system calls, context switches — is an unacceptable overhead for systems that need to move data between processes at the speed of hardware. Aeron is what happens when someone with deep knowledge of CPU architecture, operating systems, and network hardware decides that the standard abstractions are the problem.
That someone is Martin Thompson, and his philosophy has a name: mechanical sympathy. The idea, borrowed from racing driver Jackie Stewart's observation that you do not need to be an engineer to drive a car fast but you do need sympathy for the machine, is that the best software performance comes from understanding and respecting the hardware it runs on. CPU cache hierarchies, memory access patterns, branch prediction, kernel bypass — these are not academic concerns for the Aeron community. They are the design parameters.
Aeron is not a message broker. It is a messaging library — a transport layer that moves bytes between processes with extreme efficiency and predictable latency. If you are building systems where nanosecond-level IPC latency and microsecond-level network latency are genuine requirements (not aspirational marketing), Aeron is one of a very small number of tools that can deliver.
Overview
What It Is
Aeron is a reliable UDP unicast, UDP multicast, and IPC message transport. It provides:
- Publication and Subscription abstractions for sending and receiving messages
- Reliable delivery over UDP (sequencing, retransmission, flow control)
- IPC transport via shared memory for same-machine communication
- Back-pressure mechanisms to prevent slow consumers from being overwhelmed
- Aeron Cluster for replicated state machines (Raft-based consensus)
- Aeron Archive for persistent message recording and replay
Aeron operates at a lower level than systems like Kafka or RabbitMQ. There are no topics (in the Kafka sense), no queues, no routing rules, no broker process deciding where messages go. A publisher sends messages on a channel (identified by a URI-like address) and stream ID, and subscribers listening on the same channel and stream receive them. That is the model. Everything above this — topic semantics, consumer groups, message routing, persistence — is your responsibility to build if you need it.
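The model fits in a few lines of Aeron's Java API. This is a minimal sketch, not a production pattern: the channel aeron:ipc and stream ID 1001 are arbitrary, and the embedded MediaDriver is a convenience for demos — production deployments typically run the driver as a separate process.

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public class AeronIpcExample {
    static String roundTrip() {
        try (MediaDriver driver = MediaDriver.launchEmbedded();
             Aeron aeron = Aeron.connect(new Aeron.Context()
                     .aeronDirectoryName(driver.aeronDirectoryName()));
             Publication pub = aeron.addPublication("aeron:ipc", 1001);
             Subscription sub = aeron.addSubscription("aeron:ipc", 1001)) {

            UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(256));
            int length = buffer.putStringAscii(0, "PriceUpdate:AAPL:178.52");

            // offer() is non-blocking; negative return codes signal
            // back-pressure or not-yet-connected, and the caller decides how to retry
            while (pub.offer(buffer, 0, length) < 0) {
                Thread.onSpinWait();
            }

            // poll() dispatches up to N fragments to the handler and returns the count
            final String[] received = new String[1];
            while (received[0] == null) {
                sub.poll((buf, offset, len, header) ->
                        received[0] = buf.getStringAscii(offset), 1);
            }
            return received[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Note what is absent: no broker address, no topic creation, no acknowledgement API. The publisher and subscriber rendezvous purely on channel plus stream ID, and everything else is the application's concern.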
Brief History
Aeron was created by Martin Thompson and Todd Montgomery at Real Logic, with the first public release around 2014. Thompson had previously worked on the LMAX Disruptor (the inter-thread messaging library that demonstrated that lock-free ring buffers could achieve millions of messages per second on a single thread) and spent years consulting on low-latency systems for financial services.
The motivation for Aeron was dissatisfaction with existing messaging transports. TCP, for all its reliability, adds latency through congestion control (slow start), small-packet coalescing (Nagle's algorithm), kernel buffer copies, and system call overhead. UDP avoids some of these costs but provides no reliability — messages can be lost, duplicated, or reordered. Existing reliable UDP libraries were either unmaintained, poorly designed, or encumbered by unsuitable licenses.
Aeron's design goals were explicit:
- Predictable, low latency. Not just low mean latency — low tail latency. The P99.9 should be close to the median.
- High throughput. Millions of small messages per second between processes.
- Reliable delivery. Guaranteed, ordered delivery over an unreliable transport (UDP).
- Zero-allocation steady state. No garbage collection pressure during message sending or receiving.
- Mechanical sympathy. Design every data structure and algorithm to work with modern hardware, not against it.
Aeron is open source under the Apache License 2.0. Real Logic provides commercial support and consulting. The codebase is available in Java (the primary implementation), C (a native implementation for non-JVM environments), and C++ (wrapping the C implementation). The Java and C implementations are maintained in parallel and are wire-compatible — a Java publisher can send to a C subscriber and vice versa.
Thompson continues to lead development. The community, while small compared to Kafka or RabbitMQ, includes some of the most knowledgeable low-latency systems engineers in the industry. The project's GitHub issues and mailing list discussions read like a graduate seminar in systems programming. This is both an asset (high-quality support) and an honest assessment of the target audience.
Architecture
The Media Driver
The central component of Aeron's architecture is the Media Driver. The media driver is the process (or thread) responsible for all I/O: sending UDP packets, receiving UDP packets, managing IPC shared memory regions, handling retransmissions, and performing flow control.
The media driver can run in three modes:
- Dedicated: A separate process with its own JVM (or native process for the C driver). Client applications communicate with the driver through shared memory.
- Embedded: Running as threads within the client application's process. Lower latency (no inter-process communication with the driver) but ties the driver's lifecycle to the application.
- Shared: A single thread handles both sending and receiving. Lower resource usage but potentially higher latency under load.
For lowest latency, the dedicated or embedded driver with a dedicated polling thread is preferred. The driver thread typically busy-waits (spinning on the CPU) rather than blocking, which means it consumes a full CPU core but eliminates the latency of thread wake-up.
┌─────────────────────────────────────────────────────────┐
│ Application │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │Publisher │ │ Subscriber │ │ Subscriber │ │
│ └─────┬────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Shared Memory (CnC file) │ │
│ └─────────────────────┬───────────────────────────┘ │
└────────────────────────┼────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Media Driver │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Sender │ │ Receiver │ │ Conductor │ │
│ └─────┬────┘ └──────┬───────┘ └─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ UDP/IPC UDP/IPC │
└─────────────────────────────────────────────────────────┘
The Command and Control (CnC) file is a memory-mapped file shared between the client application and the media driver. It contains ring buffers for commands (from client to driver) and for received messages (from driver to client). This shared-memory architecture means that sending a message from the application to the driver is a memory write, not a system call — the same principle that makes Chronicle Queue fast, applied to the control plane.
Publications and Subscriptions
The API model is straightforward:
- A Publication is a handle for sending messages. It is associated with a channel (transport address) and a stream ID (logical message stream within a channel).
- A Subscription is a handle for receiving messages. It subscribes to a channel and one or more stream IDs.
- Messages are sent as byte buffers. Aeron does not care about message format — it transports bytes.
Publication → Channel: aeron:udp?endpoint=224.0.1.1:40456 Stream: 1001
Subscription → Channel: aeron:udp?endpoint=224.0.1.1:40456 Stream: 1001
The channel URI specifies the transport:
- aeron:udp?endpoint=host:port — UDP unicast to a specific host
- aeron:udp?endpoint=224.0.1.1:40456|interface=eth0 — UDP multicast
- aeron:ipc — shared memory IPC (same machine only)
A single media driver can manage multiple publications and subscriptions across different channels and transports simultaneously.
Reliable UDP
Aeron implements reliability over UDP through:
- Sequencing. Each message has a position (a monotonically increasing byte offset within the stream). Subscribers track their position and detect gaps.
- NAK-based retransmission. When a subscriber detects a gap in the sequence, it sends a NAK (negative acknowledgement) to the publisher, which retransmits the missing data. This is more efficient than ACK-based schemes because it generates no traffic when everything is working correctly.
- Flow control. Publishers are prevented from sending faster than subscribers can consume. The flow control strategy is pluggable; the default uses a window-based approach similar to TCP's sliding window but with fewer round trips.
- Heartbeats. Publishers and subscribers send periodic heartbeats to detect connection loss.
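The gap-detection logic behind NAK-based retransmission can be sketched in a few lines of plain Java. This is an illustrative simulation, not Aeron's implementation: the subscriber tracks the next expected byte position in the stream and treats any jump past it as a gap to NAK.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation of NAK-based gap detection (not Aeron's code).
// The subscriber tracks the next expected byte position in the stream.
// A packet arriving beyond that position means data was lost, so the
// subscriber records a NAK for the missing range and the publisher
// would retransmit it.
public class NakSketch {
    static final class Subscriber {
        long expectedPosition = 0;
        final List<long[]> naks = new ArrayList<>(); // [start, end) of missing ranges

        void onPacket(long position, int length) {
            if (position > expectedPosition) {
                // Gap detected: ask the publisher to resend the missing bytes.
                naks.add(new long[] { expectedPosition, position });
            }
            expectedPosition = Math.max(expectedPosition, position + length);
        }
    }

    public static void main(String[] args) {
        Subscriber sub = new Subscriber();
        sub.onPacket(0, 100);   // bytes [0, 100) arrive in order
        sub.onPacket(200, 100); // bytes [100, 200) were lost: NAK
        sub.onPacket(100, 100); // retransmission fills the gap
        System.out.println("NAKs sent: " + sub.naks.size()); // prints "NAKs sent: 1"
    }
}
```

Note the efficiency property from the list above: in the happy path (every packet in order), this loop sends nothing at all.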
This reliability layer provides ordered, lossless delivery over UDP — the reliability of TCP without TCP's overhead. The key differences from TCP:
- No Nagle's algorithm (no buffering small messages waiting for a full packet)
- No slow start (no gradual ramp-up of sending rate)
- No head-of-line blocking (a lost packet only blocks messages in that specific stream)
- No kernel-level buffer management (data goes directly from application memory to the NIC, where possible)
IPC Transport
For same-machine communication, Aeron's IPC transport bypasses the network stack entirely. Publisher and subscriber share a memory-mapped log file. Writing a message is a memory copy into the shared region; reading a message is a memory read from the shared region.
IPC latency is measured in nanoseconds — typically 50-200 nanoseconds for small messages on modern hardware. This is faster than a localhost UDP loopback by orders of magnitude, because there is no kernel involvement, no socket buffer copy, no interrupt handling.
The IPC transport is particularly useful for:
- Communication between components in a trading system (market data handler, strategy engine, order gateway)
- High-performance pipelines where stages run in separate processes for isolation but need minimal communication overhead
- Any scenario where you would use shared memory but want a clean API with flow control and back-pressure
Aeron Cluster
Aeron Cluster extends Aeron with replicated state machine semantics based on the Raft consensus protocol. This is Aeron's answer to the question: "How do I build a fault-tolerant system with Aeron?"
A cluster consists of:
- Multiple nodes (typically 3 or 5) each running a copy of your state machine
- A leader node that processes client requests and replicates them to followers
- Followers that maintain copies of the replicated log and can take over if the leader fails
- Client sessions that connect to the cluster and send commands
The programming model is an event-sourcing style: clients send commands to the cluster, the leader sequences them into a replicated log, and all nodes apply the commands to their state machines in the same order. This guarantees that all nodes have identical state.
Aeron Cluster is not a message broker. It is a framework for building fault-tolerant services. If you want to build a matching engine, a sequencer, or an order management system that survives node failures without losing state, Aeron Cluster provides the replication and leader election infrastructure. You provide the state machine logic.
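The determinism at the heart of this model can be sketched in plain Java. This shows the principle, not the Aeron Cluster API, and the toy Account state machine is a hypothetical stand-in for your business logic:

```java
import java.util.List;

// Illustrative sketch of the replicated-state-machine principle behind
// Aeron Cluster (the idea, not the Cluster API). The Account state
// machine is a hypothetical stand-in for real service logic.
public class StateMachineSketch {
    // Deterministic state machine: state changes only via apply().
    static final class Account {
        long balance = 0;
        void apply(long command) { balance += command; }
    }

    public static void main(String[] args) {
        // The leader sequences client commands into one ordered, replicated log.
        List<Long> replicatedLog = List.of(100L, -30L, 55L);

        // Each node applies the same log, in the same order, independently.
        Account node1 = new Account();
        Account node2 = new Account();
        Account node3 = new Account();
        for (long command : replicatedLog) {
            node1.apply(command);
            node2.apply(command);
            node3.apply(command);
        }

        // Determinism guarantees identical state on every node.
        System.out.println(node1.balance + " " + node2.balance + " " + node3.balance);
        // prints "125 125 125"
    }
}
```

The corollary is the hard requirement Aeron Cluster places on you: the state machine must be strictly deterministic. Wall-clock reads, random numbers, and iteration over unordered collections all break the guarantee.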
The latency characteristics are impressive: cluster commit latency (the time from a client sending a command to receiving confirmation that it is replicated to a majority) is typically in the low hundreds of microseconds over a LAN. This is orders of magnitude faster than Kafka's acknowledged replication or a typical database commit.
Aeron Archive
Aeron Archive adds persistent recording and replay to Aeron streams. Without Archive, Aeron is an ephemeral transport — messages exist only while they are in the log buffers. Archive records stream data to disk and provides APIs for replaying recorded data.
Use cases for Archive include:
- Audit logging. Record all messages for regulatory compliance or debugging.
- Late joiners. A new subscriber can replay historical data to build its initial state.
- Replay-based recovery. After a crash, replay recorded messages to reconstruct state.
- Time-travel debugging. Replay a specific time range to reproduce a production issue.
Archive records are stored as files on the local filesystem. They can be replicated to other machines using Aeron's standard transport.
The combination of Aeron transport (live messaging) + Aeron Archive (persistence) + Aeron Cluster (replication) provides a complete platform for building fault-tolerant, low-latency distributed systems. It is not a turnkey solution — assembling these components requires significant engineering effort — but the building blocks are sound.
Simple Binary Encoding (SBE)
SBE is not technically part of Aeron, but it is Martin Thompson's companion project and is used extensively alongside Aeron. SBE is a serialisation format designed for the same constraints as Aeron: zero allocation, minimal CPU overhead, and direct memory access.
SBE works as follows:
- You define a message schema in XML
- The SBE compiler generates Java (or C, C++) codec classes
- The codec reads and writes fields directly from/to a byte buffer — no intermediate object allocation
- Fields are at fixed offsets within the buffer, so reading field N does not require parsing fields 1 through N-1
<!-- SBE schema for a market data message -->
<sbe:messageSchema package="com.example.sbe"
id="1" version="0"
semanticVersion="1.0"
byteOrder="littleEndian">
<types>
<type name="Symbol" primitiveType="char" length="8"/>
</types>
<sbe:message name="PriceUpdate" id="1">
<field name="symbol" id="1" type="Symbol"/>
<field name="bidPrice" id="2" type="int64"/>
<field name="askPrice" id="3" type="int64"/>
<field name="bidSize" id="4" type="int32"/>
<field name="askSize" id="5" type="int32"/>
<field name="timestampNanos" id="6" type="int64"/>
</sbe:message>
</sbe:messageSchema>
The generated codec provides direct access methods:
// Writing (zero allocation)
priceUpdateEncoder.wrap(buffer, offset)
.symbol("AAPL ")
.bidPrice(17852) // Fixed-point price in cents: $178.52
.askPrice(17855)
.bidSize(500)
.askSize(300)
.timestampNanos(System.nanoTime());
// Reading (zero allocation)
priceUpdateDecoder.wrap(buffer, offset,
PriceUpdateDecoder.BLOCK_LENGTH,
PriceUpdateDecoder.SCHEMA_VERSION);
String symbol = priceUpdateDecoder.symbol(); // Direct read from buffer
long bid = priceUpdateDecoder.bidPrice(); // No parsing, no allocation
long ask = priceUpdateDecoder.askPrice();
SBE messages are typically 10-100x smaller than JSON and 2-5x smaller than Protobuf for the same content. Encoding and decoding times are measured in nanoseconds. The trade-off is rigidity: SBE messages have fixed schemas, field reordering requires schema changes, and variable-length data (strings, arrays) is more cumbersome than in Protobuf or JSON.
For Aeron-based systems, SBE is the natural serialisation choice. The combination of Aeron's zero-allocation transport and SBE's zero-allocation serialisation means the entire message path — from application fields to network packet — involves no heap allocation.
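The fixed-offset principle can be demonstrated without the SBE toolchain. The sketch below hand-codes a layout like the PriceUpdate schema against a plain ByteBuffer; the offsets are an assumption for illustration and the real codecs are generated, but the mechanics are the same: absolute reads and writes at known offsets, no parsing, no allocation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hand-coded sketch of SBE's fixed-offset layout for a PriceUpdate-style
// message, using a plain ByteBuffer instead of generated codecs.
// Every field lives at a known offset, so reading askPrice jumps
// straight to byte 16: no parsing of earlier fields, no allocation.
public class FixedOffsetSketch {
    // Offsets implied by the layout: char[8], int64, int64, int32, int32, int64.
    static final int SYMBOL = 0, BID_PRICE = 8, ASK_PRICE = 16,
                     BID_SIZE = 24, ASK_SIZE = 28, TIMESTAMP = 32;

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(40)
                                   .order(ByteOrder.LITTLE_ENDIAN);

        // Encode: absolute writes at fixed offsets.
        String symbol = "AAPL    "; // padded to the fixed 8-char width
        for (int i = 0; i < 8; i++) {
            buf.put(SYMBOL + i, (byte) symbol.charAt(i));
        }
        buf.putLong(BID_PRICE, 17852);
        buf.putLong(ASK_PRICE, 17855);
        buf.putInt(BID_SIZE, 500);
        buf.putInt(ASK_SIZE, 300);
        buf.putLong(TIMESTAMP, System.nanoTime());

        // Decode: absolute reads, no intermediate objects.
        long ask = buf.getLong(ASK_PRICE);
        System.out.println("ask = " + ask); // prints "ask = 17855"
    }
}
```

The generated codecs add schema versioning and bounds checking on top of this, but the performance model is exactly what you see here: field access is pointer arithmetic.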
Strengths
Predictable Nanosecond-Level Latency
Aeron's IPC transport delivers sub-microsecond latency. Network transport over UDP delivers single-digit microsecond latency on a LAN. These are not benchmarks measured on idle systems — they are the operational profile under load. The P99 is close to the median because there are no GC pauses, no lock contention, no kernel buffer copies adding sporadic latency spikes.
Zero-Allocation Design
Like Chronicle Queue, Aeron operates without generating garbage in steady state. The media driver, publications, subscriptions, and message handling all operate without heap allocation. This is fundamental to achieving predictable latency on the JVM.
Mechanical Sympathy
Every data structure in Aeron is designed for the hardware it runs on:
- Ring buffers use padding to prevent false sharing between producer and consumer cache lines
- Log buffers are sized to align with OS page sizes
- Counters use memory-mapped files for zero-copy sharing between driver and client
- Busy-wait loops keep threads on-CPU, avoiding the latency of context switches
This is not premature optimisation — it is the necessary foundation for nanosecond-level performance. At these latencies, a single cache miss (100+ nanoseconds) or context switch (1-10 microseconds) is a significant portion of the total time budget.
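The false-sharing padding mentioned above can be illustrated in plain Java. The sketch is simplified and the field layout is an assumption; production code (Aeron builds on Agrona's buffer classes) relies on class-hierarchy padding or the JDK's @Contended annotation because the JVM is free to reorder plain fields.

```java
// Illustrative sketch of cache-line padding against false sharing.
// Simplified and hypothetical: the JVM may reorder plain fields, so
// production code uses class-hierarchy padding or @Contended instead.
public class PaddingSketch {
    static final class PaddedCounter {
        // 7 longs (56 bytes) on each side aim to keep the hot field
        // alone on its 64-byte cache line, so writes by one thread do
        // not invalidate the line a neighbouring counter lives on.
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long value;
        long q1, q2, q3, q4, q5, q6, q7;
    }

    public static void main(String[] args) throws InterruptedException {
        PaddedCounter producerPosition = new PaddedCounter();
        PaddedCounter consumerPosition = new PaddedCounter();

        // Two threads hammer two different counters. With padding, each
        // counter's cache line stays local to the core updating it.
        Thread a = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) producerPosition.value++;
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) consumerPosition.value++;
        });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(producerPosition.value + " " + consumerPosition.value);
    }
}
```

Without the padding, the two counters could land on the same cache line, and every increment by one thread would force the other core's cached copy to be invalidated: the "false sharing" that the 100+ nanosecond cache-miss cost makes so expensive.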
IPC Transport
Aeron's IPC transport is arguably its most impressive feature. Shared-memory communication with flow control, back-pressure, and a clean API. It is what Unix shared memory should have been: fast, safe, and usable without deep kernel knowledge.
Multi-Language Support
Unlike Chronicle Queue (Java only), Aeron has production-quality implementations in Java and C/C++. A Java publisher can send to a C subscriber. This is critical for systems that span languages — a Java strategy engine sending orders to a C++ gateway, for example.
The Building Blocks Approach
Aeron, Aeron Cluster, Aeron Archive, and SBE form a coherent set of building blocks for low-latency distributed systems. They are designed to work together but are independently useful. You can use Aeron transport without Cluster. You can use SBE without Aeron. This modularity lets you adopt what you need without buying into an all-or-nothing framework.
Weaknesses
Steep Learning Curve
Aeron is not a system you pick up in an afternoon. The concepts — media drivers, channels, stream IDs, fragment handlers, back-pressure, log buffers — are unfamiliar to most developers. The documentation is accurate but assumes systems programming knowledge. The error messages are descriptive but assume you understand why a publication might be "back-pressured" or why a subscription's "image" might be "unavailable."
The community is helpful but small. There is no "Aeron for Beginners" ecosystem of blog posts and video tutorials. The canonical learning resources are Martin Thompson's conference talks, the project's wiki, and reading the source code. If you are comfortable with that, you will be fine. If you need a gentle on-ramp, budget significant time for learning.
Not a General-Purpose Broker
Aeron does not have topics, queues, routing rules, consumer groups, dead-letter handling, message filtering, or any of the features that general-purpose brokers provide. It is a transport layer. If you need broker semantics, you build them on top of Aeron or use a different tool.
This is a deliberate design choice — broker features add latency and complexity — but it means that using Aeron for anything beyond point-to-point or multicast messaging requires significant application-level development.
Java/C++/C Only
Three languages is more than Chronicle Queue's one, but it is still a limitation. If your system includes Python, Go, Rust, or .NET services, those services cannot use Aeron directly. There are community bindings for some languages (Rust and .NET notably), but they vary in completeness and maintenance status. The primary implementations are Java and C.
Requires Deep Systems Knowledge
Running Aeron well — especially for the lowest latency — requires understanding of:
- CPU affinity and pinning (keeping the media driver thread on a specific core)
- NUMA topology (ensuring the driver thread and memory are on the same NUMA node)
- Network interface configuration (interrupt coalescing, ring buffer sizes)
- OS tuning (huge pages, scheduler settings, network stack parameters)
- JVM tuning (GC configuration, JIT compilation, safepoints)
A misconfigured Aeron deployment can perform worse than a well-configured TCP solution. The defaults are reasonable for development, but production deployments targeting the lowest latency require expert tuning.
Operational Complexity
There is no Aeron web UI, no Grafana dashboard out of the box, no management CLI with friendly output. Aeron exposes counters through memory-mapped files, which can be read by monitoring tools, but the monitoring infrastructure is your responsibility to build. The AeronStat tool provides counter values, but interpreting them requires understanding of Aeron's internals.
For teams accustomed to Kafka's JMX metrics, Prometheus exporters, and third-party monitoring dashboards, Aeron's operational tooling feels sparse. You are expected to know what you are doing.
Ideal Use Cases
Trading Systems
Aeron's natural habitat. Market data distribution, order routing, position updates, risk calculations — any component of a trading system that needs to move data between processes with minimal, predictable latency. The IPC transport for intra-machine communication and UDP multicast for market data distribution are purpose-built for this domain.
Real-Time Pricing
Systems that calculate and distribute prices in real time — foreign exchange rates, options pricing, bond yields — where stale prices have direct financial consequences. Aeron's multicast transport is particularly suitable: one publisher, many subscribers, all receiving the same data simultaneously.
Systems Where GC Pauses Are Unacceptable
Any JVM-based system where a 10-millisecond GC pause causes a measurable business impact. This goes beyond trading to include real-time control systems, live audio/video processing, and interactive gaming servers.
High-Performance Microservices Communication
For microservices architectures where inter-service latency is a critical constraint, Aeron's IPC transport (for co-located services) and UDP transport (for distributed services) offer dramatically lower latency than HTTP/gRPC or even most message brokers. The trade-off is operational complexity and the need to build service discovery, load balancing, and routing logic yourself.
Operational Reality
Media Driver Tuning
The media driver is the heart of Aeron's performance, and tuning it is the most impactful operational task:
- Thread mode. Dedicated threads (one for sending, one for receiving, one for the conductor) provide the best performance. Shared mode (one thread for everything) reduces CPU usage but increases latency.
- Busy-wait vs. back-off. Busy-wait (spinning) provides the lowest latency but consumes a full CPU core per thread. Back-off strategies (yielding, sleeping) reduce CPU usage at the cost of latency.
- Buffer sizes. Publication and subscription log buffer sizes affect throughput and memory usage. Larger buffers tolerate more burst traffic but consume more memory.
- Term length. The log buffer term length affects how much data can be in-flight. The default is 16MB for network publications (64KB is the minimum), which is suitable for most workloads.
CPU Pinning
For lowest latency, the media driver threads should be pinned to specific CPU cores using taskset or isolcpus. This prevents the OS scheduler from migrating threads between cores, which would cause cache invalidation and latency spikes.
# Pin the media driver to cores 2 and 3
taskset -c 2,3 java \
-Daeron.threading.mode=DEDICATED \
-Daeron.sender.idle.strategy=noop \
-Daeron.receiver.idle.strategy=noop \
-cp aeron-all.jar io.aeron.driver.MediaDriver
On NUMA systems, ensure the pinned cores and the memory used by the driver are on the same NUMA node. Cross-NUMA memory access adds 50-100 nanoseconds per access — measurable at Aeron's latency scale.
DPDK Considerations
For the absolute lowest network latency, some Aeron deployments use DPDK (Data Plane Development Kit) to bypass the kernel's network stack entirely. DPDK provides user-space network drivers that read and write packets directly from/to the NIC's memory, eliminating kernel overhead.
Aeron does not include DPDK integration out of the box, but the C media driver can be modified to use DPDK for packet I/O. This is deep systems work — you are essentially taking ownership of the network interface from the OS — but it can reduce network latency from single-digit microseconds to hundreds of nanoseconds.
Whether DPDK is worth the complexity depends on your latency requirements and your team's capability. For most Aeron deployments, standard UDP with tuned kernel settings is sufficient. DPDK is for the last few microseconds, and extracting those microseconds requires expertise that is expensive and rare.
Aeron vs Chronicle Queue vs Kernel Bypass
These three technologies are frequently mentioned together and occasionally conflated. Here is how they compare:
| Aspect | Aeron | Chronicle Queue | Kernel Bypass (DPDK/RDMA) |
|---|---|---|---|
| Primary function | Message transport | Persistent journal | Raw packet I/O |
| Network support | UDP unicast/multicast | Enterprise only | Direct NIC access |
| IPC | Shared memory | Memory-mapped files | N/A (network only) |
| Persistence | Archive (optional) | Built-in | None |
| Reliability | Built-in (NAK-based) | N/A (local only) | Application's problem |
| Latency (IPC) | 50-200 ns | 1-2 us | N/A |
| Latency (network) | 2-10 us | N/A (no networking) | 0.5-2 us |
| Languages | Java, C, C++ | Java | C (primarily) |
| Abstraction level | Transport | Storage | Hardware |
| Operational complexity | High | Medium | Very high |
The relationship between them:
- Aeron is a transport. It moves bytes between processes efficiently. It does not store them long-term (without Archive).
- Chronicle Queue is a store. It persists ordered messages to disk efficiently. It does not move them between machines (without Enterprise).
- Kernel bypass is infrastructure. It gives you raw access to network hardware. It provides no messaging semantics at all.
A complete low-latency system might use all three: kernel bypass (DPDK) for receiving raw market data from an exchange, Aeron for distributing that data between internal components, and Chronicle Queue for journalling every message for audit and replay. Each tool handles the layer it is designed for.
Alternatively, many systems use Aeron alone (its built-in UDP handling is sufficient for most purposes) with Chronicle Queue for persistence. Adding kernel bypass is a significant engineering investment that is justified only when Aeron's standard UDP latency is insufficient — which is a rare requirement outside of the most competitive trading environments.
Code Examples
Basic Publication and Subscription
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.BufferUtil;
import org.agrona.concurrent.UnsafeBuffer;
public class AeronBasicExample {
private static final String CHANNEL = "aeron:udp?endpoint=localhost:40123";
private static final int STREAM_ID = 1001;
public static void main(String[] args) throws Exception {
// Start an embedded media driver
try (MediaDriver driver = MediaDriver.launchEmbedded();
Aeron aeron = Aeron.connect(
new Aeron.Context().aeronDirectoryName(
driver.aeronDirectoryName()))) {
// Create a publication (sender)
try (Publication publication = aeron.addPublication(
CHANNEL, STREAM_ID)) {
// Create a subscription (receiver)
try (Subscription subscription = aeron.addSubscription(
CHANNEL, STREAM_ID)) {
// Wait for the subscription to be connected
while (!subscription.isConnected()) {
Thread.yield();
}
// Prepare a message buffer (reusable, no allocation
// in the send loop)
UnsafeBuffer buffer = new UnsafeBuffer(
BufferUtil.allocateDirectAligned(256, 64));
// Send messages
for (int i = 0; i < 10; i++) {
String message = "Order-" + i;
buffer.putStringWithoutLengthAscii(0, message);
// Offer the message to the publication
long result;
while ((result = publication.offer(
buffer, 0, message.length())) < 0) {
// Back-pressured or not connected — retry
if (result == Publication.BACK_PRESSURED) {
Thread.yield();
} else if (result == Publication.NOT_CONNECTED) {
Thread.sleep(1);
}
}
System.out.println("Sent: " + message);
}
// Receive messages
FragmentHandler handler = (directBuffer, offset,
length, header) -> {
byte[] data = new byte[length];
directBuffer.getBytes(offset, data);
System.out.println("Received: "
+ new String(data));
};
int received = 0;
while (received < 10) {
int fragments = subscription.poll(handler, 10);
received += fragments;
if (fragments == 0) {
Thread.yield();
}
}
}
}
}
}
}
Note the explicit handling of back-pressure on the publication side. When offer() returns a negative value, the publisher must decide what to do: retry, yield, drop the message, or apply application-level back-pressure. This is a fundamental difference from broker-based systems where the broker absorbs back-pressure. In Aeron, back-pressure is the publisher's problem, and ignoring it is a bug.
Also note the FragmentHandler callback pattern. Aeron delivers messages as fragments — a message may span multiple fragments if it exceeds the MTU. The standard FragmentAssembler handles reassembly, but for maximum performance, designing messages to fit within a single fragment avoids the reassembly overhead.
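The fragmentation and reassembly flow can be sketched in plain Java. This illustrates what FragmentAssembler does for you, not its implementation, and the MTU here is artificially small so that a short message fragments.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of fragmentation and reassembly (the principle
// behind FragmentAssembler, not its implementation). The MTU here is
// absurdly small so that a short message fragments.
public class FragmentSketch {
    static final int MTU = 4;

    // Split a message into MTU-sized fragments, preserving order.
    static List<byte[]> fragment(byte[] message) {
        List<byte[]> fragments = new ArrayList<>();
        for (int off = 0; off < message.length; off += MTU) {
            fragments.add(Arrays.copyOfRange(
                message, off, Math.min(off + MTU, message.length)));
        }
        return fragments;
    }

    public static void main(String[] args) {
        byte[] message = "hello aeron".getBytes();

        // The receiver buffers fragments until the message is complete,
        // then delivers the reassembled whole to the application.
        ByteArrayOutputStream assembler = new ByteArrayOutputStream();
        for (byte[] fragment : fragment(message)) {
            assembler.write(fragment, 0, fragment.length);
        }
        System.out.println(new String(assembler.toByteArray())); // prints "hello aeron"
    }
}
```

The reassembly buffer is the overhead you pay per fragmented message, which is why keeping messages under one MTU is the standard advice for latency-sensitive streams.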
IPC Example (Shared Memory)
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import io.aeron.driver.ThreadingMode;
import io.aeron.logbuffer.FragmentHandler;
import org.agrona.BufferUtil;
import org.agrona.concurrent.BusySpinIdleStrategy;
import org.agrona.concurrent.IdleStrategy;
import org.agrona.concurrent.UnsafeBuffer;
public class AeronIpcExample {
// IPC channel — no network, shared memory only
private static final String CHANNEL = "aeron:ipc";
private static final int STREAM_ID = 2001;
public static void main(String[] args) throws Exception {
// Configure driver for lowest latency
MediaDriver.Context driverCtx = new MediaDriver.Context()
.threadingMode(ThreadingMode.DEDICATED)
.conductorIdleStrategy(new BusySpinIdleStrategy())
.senderIdleStrategy(new BusySpinIdleStrategy())
.receiverIdleStrategy(new BusySpinIdleStrategy());
try (MediaDriver driver = MediaDriver.launch(driverCtx);
Aeron aeron = Aeron.connect(
new Aeron.Context().aeronDirectoryName(
driver.aeronDirectoryName()))) {
Publication publication = aeron.addPublication(
CHANNEL, STREAM_ID);
Subscription subscription = aeron.addSubscription(
CHANNEL, STREAM_ID);
// Wait for connection
while (!subscription.isConnected()) {
Thread.yield();
}
UnsafeBuffer buffer = new UnsafeBuffer(
BufferUtil.allocateDirectAligned(64, 64));
// Publisher thread
Thread publisher = new Thread(() -> {
IdleStrategy idle = new BusySpinIdleStrategy();
for (int i = 0; i < 1_000_000; i++) {
buffer.putLong(0, System.nanoTime());
buffer.putInt(8, i);
while (publication.offer(buffer, 0, 12) < 0) {
idle.idle();
}
}
}, "publisher");
// Subscriber thread — measure latency
long[] latencies = new long[1_000_000];
Thread subscriber = new Thread(() -> {
IdleStrategy idle = new BusySpinIdleStrategy();
int[] count = {0};
FragmentHandler handler = (buf, offset, length, header) -> {
long sendTime = buf.getLong(offset);
long latency = System.nanoTime() - sendTime;
if (count[0] < latencies.length) {
latencies[count[0]++] = latency;
}
};
while (count[0] < 1_000_000) {
int fragments = subscription.poll(handler, 10);
if (fragments == 0) {
idle.idle();
}
}
}, "subscriber");
subscriber.start();
publisher.start();
publisher.join();
subscriber.join();
// Report latency statistics
java.util.Arrays.sort(latencies);
System.out.printf("IPC Latency (nanoseconds):%n");
System.out.printf(" Median: %,d ns%n",
latencies[500_000]);
System.out.printf(" P99: %,d ns%n",
latencies[990_000]);
System.out.printf(" P99.9: %,d ns%n",
latencies[999_000]);
System.out.printf(" P99.99: %,d ns%n",
latencies[999_900]);
System.out.printf(" Max: %,d ns%n",
latencies[999_999]);
publication.close();
subscription.close();
}
}
}
This example measures IPC latency end-to-end. On a modern server with CPU pinning configured, you can expect median latencies around 50-200 nanoseconds and P99 under 1 microsecond. These numbers sound implausible until you realise that the message path is: write 12 bytes to a memory-mapped region (the publication log buffer), the media driver copies them to the subscription log buffer (another memory-mapped region), and the subscriber reads them. There are no system calls, no kernel transitions, no network stack involvement.
The BusySpinIdleStrategy on all threads means every thread consumes a full CPU core at 100% utilisation. This is the latency-optimal configuration and the resource-expensive one. For systems where CPU cores are less abundant than latency budgets, the BackoffIdleStrategy provides a configurable spin-then-yield-then-sleep sequence.
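The shape of such a back-off strategy can be sketched in plain Java. The names and thresholds below are hypothetical; Agrona's real BackoffIdleStrategy follows the same spin-then-yield-then-park progression.

```java
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch of a spin-then-yield-then-park back-off idle
// strategy. Names and thresholds are hypothetical; Agrona's
// BackoffIdleStrategy follows the same progression.
public class BackoffIdleSketch {
    static final int MAX_SPINS = 100;
    static final int MAX_YIELDS = 100;
    static final long PARK_NANOS = 1_000; // 1 microsecond

    int idleCount = 0;

    // Call when poll() found no work.
    void idle() {
        if (idleCount < MAX_SPINS) {
            idleCount++;
            Thread.onSpinWait();       // busy-spin: lowest latency, burns a core
        } else if (idleCount < MAX_SPINS + MAX_YIELDS) {
            idleCount++;
            Thread.yield();            // give the core up briefly
        } else {
            LockSupport.parkNanos(PARK_NANOS); // cheapest on CPU, slowest to wake
        }
    }

    // Call when work was found, so the next quiet spell starts hot again.
    void reset() {
        idleCount = 0;
    }

    public static void main(String[] args) {
        BackoffIdleSketch strategy = new BackoffIdleSketch();
        for (int i = 0; i < 250; i++) strategy.idle(); // spins, then yields, then parks
        strategy.reset();
        System.out.println("idleCount after reset = " + strategy.idleCount);
    }
}
```

The key property is that latency degrades gracefully: a stream that goes quiet for microseconds still gets spin-level wake-up times, while a stream quiet for seconds stops wasting a core.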
Verdict
Aeron is the most technically impressive messaging technology covered in this book. The combination of reliable UDP transport, nanosecond-level IPC, zero-allocation design, and mechanical sympathy produces a system that operates at the boundary of what software can achieve on commodity hardware. Martin Thompson and the Real Logic team have built something that genuinely pushes the state of the art in messaging performance.
It is also the least accessible. The learning curve is steep. The operational requirements are demanding. The ecosystem is minimal. The documentation assumes expertise. Building a complete system on Aeron requires significantly more engineering effort than using Kafka or RabbitMQ because Aeron provides the transport layer and leaves everything above it — routing, persistence, consumer management, monitoring — to you.
This is not a criticism. It is a statement of what Aeron is: a high-performance building block for teams that know exactly what they need and have the expertise to build it. Aeron does not try to be everything to everyone. It tries to be the fastest message transport available, and it succeeds.
The practical recommendation:
- If you are building a low-latency trading system or similar performance-critical infrastructure, Aeron should be on your shortlist. Evaluate it alongside kernel bypass solutions and commercial offerings from LMAX, Solace, and others. Aeron's open-source availability and clean design make it a strong foundation.
- If you need IPC between co-located processes and are willing to accept the operational overhead, Aeron's IPC transport is unmatched. Nothing else provides sub-microsecond inter-process communication with flow control and back-pressure through a usable API.
- If your latency requirements are measured in milliseconds, Aeron is the wrong tool. It is designed for microsecond and nanosecond latencies. Using it for a system where 10-millisecond latency is acceptable means paying the complexity cost without realising the performance benefit. Use NATS, Kafka, or even HTTP.
- If your team does not include systems-level engineers who understand CPU affinity, NUMA topology, and memory-mapped I/O, be honest about the operational investment. Aeron in the hands of an experienced low-latency team is transformative. Aeron in the hands of a team that primarily writes Spring Boot applications is a liability.
- Consider the combination. Aeron for transport, Chronicle Queue for persistence, SBE for serialisation. These tools were designed by people who talk to each other, share a philosophy, and built complementary systems. The whole is greater than the sum of the parts — if you need the whole.
Aeron exists because Martin Thompson believed that messaging could be faster than it was, and he was right. The question is not whether Aeron is impressive — it is whether your problem requires that level of performance, and whether your team can pay the engineering cost to wield it. For the right problem and the right team, there is nothing else quite like it.
The Obscure and the Curious
The previous chapters covered the brokers that dominate the conversation — the ones that show up on every "Top 5 Message Brokers" listicle, the ones with conference talks and certification programmes and vendor booths the size of studio apartments. But the messaging landscape is considerably wider than the conference circuit suggests. There are tools that solve specific problems brilliantly, tools that approach the problem from an entirely different angle, and tools that most engineers have never heard of despite being quietly excellent.
This chapter is for them. Some are brokers. Some are libraries. Some are somewhere in between. All of them deserve a look, even if most will never be the centrepiece of your architecture. The messaging world has a long tail, and the long tail is where interesting things happen.
QStash (Upstash)
Serverless Messaging for People Who Do Not Want to Run Anything
QStash is what happens when you take the concept of a message queue and strip it down to an HTTP API with a credit card attached. Built by Upstash, the company that also offers serverless Redis and Kafka, QStash is an HTTP-based message queue designed specifically for serverless and edge function environments. You POST a message to QStash with a destination URL. QStash delivers it. If the destination fails, QStash retries. That is, conceptually, the entire product.
The insight behind QStash is that serverless functions (AWS Lambda, Cloudflare Workers, Vercel Edge Functions) have an awkward relationship with traditional message brokers. A Lambda function cannot maintain a persistent TCP connection to Kafka. It spins up, does its work, and dies. Traditional consumer patterns — long-polling, persistent connections, consumer group coordination — are fundamentally at odds with the serverless execution model. QStash solves this by inverting the flow: instead of the consumer pulling messages, QStash pushes messages to the consumer via HTTP. Your serverless function is just an HTTP endpoint that receives POST requests.
QStash includes built-in delay scheduling (deliver this message in 30 minutes), automatic retries with configurable backoff, deduplication, and basic dead-letter handling. It also supports cron-like scheduling, making it a lightweight alternative to EventBridge Scheduler or Cloud Scheduler for simple periodic tasks. The pricing is per-message, which aligns naturally with serverless cost models — you pay when work happens, not when infrastructure idles.
The limitations are exactly what you would expect. Throughput is modest — this is not a tool for streaming a million events per second. There are no consumer groups, no partitioning, no ordering guarantees beyond single-message delivery. The delivery model is push-only, so your consumer must be an HTTP endpoint, which means you need something publicly addressable or tunnelled. And the entire thing is a managed service with no self-hosted option — you are trusting Upstash with your message delivery and accepting the vendor dependency. For the use cases it targets — background jobs, webhooks, scheduled tasks, inter-service communication in serverless architectures — these limitations are perfectly acceptable. For anything else, you probably want a real broker.
# Publish a message to QStash — deliver to your endpoint with retry
curl -X POST "https://qstash.upstash.io/v2/publish/https://my-api.example.com/webhook" \
    -H "Authorization: Bearer <QSTASH_TOKEN>" \
    -H "Content-Type: application/json" \
    -H "Upstash-Delay: 60s" \
    -H "Upstash-Retries: 3" \
    -d '{"orderId": "12345", "action": "process_payment"}'

# Schedule a recurring message (cron) — the destination goes in the URL path,
# the cron expression in a header
curl -X POST "https://qstash.upstash.io/v2/schedules/https://my-api.example.com/cleanup" \
    -H "Authorization: Bearer <QSTASH_TOKEN>" \
    -H "Content-Type: application/json" \
    -H "Upstash-Cron: */5 * * * *" \
    -d '{}'
Watermill (Go)
Not a Broker — a Way of Thinking About Messages
Watermill is an event-driven library for Go, and the most important thing to understand is what it is not: it is not a message broker. It does not store messages. It does not manage subscriptions. It does not replicate data. It is a library that provides a clean, consistent abstraction over other systems that do those things. You plug in Kafka, RabbitMQ, Google Pub/Sub, NATS, Amazon SQS, or even an in-memory channel as the backend, and Watermill gives you a uniform API for publishing, subscribing, and routing messages.
The core value proposition is middleware. Watermill borrows the middleware pattern from HTTP frameworks (think Go's net/http middleware or Express.js middleware) and applies it to message processing. You can chain middleware functions that handle retries, deduplication, logging, metrics, tracing, poison message detection, and throttling — all independent of the underlying broker. This is genuinely useful. If you have ever written the same "retry with exponential backoff and dead-letter on exhaustion" logic for the fourth time across three different broker integrations, Watermill's middleware chains will feel like relief.
The router is the other key concept. Instead of writing bare consumer loops, you define routes that bind a subscriber topic, a handler function, and an optional publisher topic. The router manages the lifecycle — starting subscribers, passing messages through middleware, calling your handler, and optionally publishing the result. It handles graceful shutdown, which is the kind of thing that sounds trivial until you have debugged a production system that loses messages because os.Exit was called while a handler was mid-transaction.
Watermill is opinionated about structure but agnostic about infrastructure, which is an uncommon and valuable position. The main risk is the usual risk of abstraction layers: you lose access to broker-specific features. If you need Kafka's exactly-once transactions, or RabbitMQ's exchange topologies, or NATS's subject-based addressing with wildcards, the Watermill abstraction may not expose them. For many applications, the features Watermill does expose are sufficient. For others, the abstraction leaks at exactly the wrong moment. Know your use case before committing.
package main

import (
    "context"
    "log"

    "github.com/ThreeDotsLabs/watermill"
    "github.com/ThreeDotsLabs/watermill-kafka/v3/pkg/kafka"
    "github.com/ThreeDotsLabs/watermill/message"
    "github.com/ThreeDotsLabs/watermill/message/router/middleware"
)

func main() {
    logger := watermill.NewStdLogger(false, false)

    // Error handling elided for brevity
    subscriber, _ := kafka.NewSubscriber(
        kafka.SubscriberConfig{
            Brokers:       []string{"localhost:9092"},
            Unmarshaler:   kafka.DefaultMarshaler{},
            ConsumerGroup: "order-processor",
        },
        logger,
    )
    publisher, _ := kafka.NewPublisher(
        kafka.PublisherConfig{
            Brokers:   []string{"localhost:9092"},
            Marshaler: kafka.DefaultMarshaler{},
        },
        logger,
    )

    router, _ := message.NewRouter(message.RouterConfig{}, logger)

    // Middleware chains — the real value of Watermill
    router.AddMiddleware(
        middleware.Retry{MaxRetries: 3}.Middleware,
        middleware.Recoverer,
        middleware.CorrelationID,
    )

    router.AddHandler(
        "order_to_invoice", // handler name
        "orders",           // subscribe topic
        subscriber,
        "invoices", // publish topic
        publisher,
        func(msg *message.Message) ([]*message.Message, error) {
            log.Printf("Processing order: %s", string(msg.Payload))
            invoice := message.NewMessage(watermill.NewUUID(), msg.Payload)
            return []*message.Message{invoice}, nil
        },
    )

    if err := router.Run(context.Background()); err != nil {
        log.Fatal(err)
    }
}
Eventuous
Event Sourcing for .NET, Without the Archaeology
Eventuous is an opinionated event sourcing library for .NET. If you are building event-sourced systems on the .NET platform and you have spent time evaluating Marten, Axon (via the Java interop pain), or rolling your own aggregate/event/projection infrastructure for the third time, Eventuous deserves your attention.
The library is designed to work with EventStoreDB (covered separately below) as its primary event store, though it supports other backends. Eventuous provides the wiring that sits between your domain model and the event store: aggregate base classes, command handling, event serialisation, subscriptions, projections (read model updates), and gateway patterns for integrating with external systems. It is opinionated in the sense that it steers you toward specific patterns — aggregates that emit events, command handlers that load and save aggregates, subscriptions that project events into read models — rather than giving you a toolkit of primitives and wishing you luck.
The opinionation is a feature, not a bug, for teams that have decided they are doing event sourcing and want to get to productive code quickly rather than spending their first sprint debating whether aggregates should be classes or records, whether events should be interfaces or sealed hierarchies, and whether the command handler should be a method on the aggregate or a separate service. Eventuous makes these decisions for you. If you agree with the decisions, you move fast. If you disagree, you will fight the framework, and fighting frameworks is a losing proposition.
The integration with EventStoreDB is where Eventuous is strongest. Subscriptions — both catch-up subscriptions (replaying the event stream from a position) and persistent subscriptions (server-managed consumer positions) — are first-class concepts. Projections can be built using either EventStoreDB's built-in projection engine or Eventuous's own subscription-based projection infrastructure, which projects events into MongoDB, Elasticsearch, PostgreSQL, or other read stores. For teams building CQRS/ES systems on .NET with EventStoreDB, Eventuous is likely the fastest path from "we have decided to do event sourcing" to "we are shipping features."
// Define domain events
public record RoomBooked(string RoomId, string GuestName, DateTime CheckIn, DateTime CheckOut);
public record BookingCancelled(string Reason);

// Aggregate with event sourcing
public class Booking : Aggregate<BookingState> {
    public void BookRoom(string roomId, string guest, DateTime checkIn, DateTime checkOut) {
        EnsureDoesntExist();
        Apply(new RoomBooked(roomId, guest, checkIn, checkOut));
    }

    public void Cancel(string reason) {
        EnsureExists();
        Apply(new BookingCancelled(reason));
    }
}

public record BookingState : State<BookingState> {
    public string RoomId { get; init; }
    public bool IsCancelled { get; init; }

    public BookingState() {
        On<RoomBooked>((state, evt) => state with { RoomId = evt.RoomId });
        On<BookingCancelled>((state, _) => state with { IsCancelled = true });
    }
}

// Command handler — Eventuous wires this up
public class BookingCommandService : CommandService<Booking, BookingState, BookingId> {
    public BookingCommandService(IAggregateStore store) : base(store) {
        OnNew<BookRoom>(cmd => new BookingId(cmd.BookingId),
            (booking, cmd) => booking.BookRoom(cmd.RoomId, cmd.Guest, cmd.CheckIn, cmd.CheckOut));
        OnExisting<CancelBooking>(cmd => new BookingId(cmd.BookingId),
            (booking, cmd) => booking.Cancel(cmd.Reason));
    }
}
Mochi MQTT
When You Need MQTT but Mosquitto Feels Like Overkill
MQTT is the lingua franca of IoT messaging — lightweight, low-bandwidth, designed for devices that may have the processing power of a potato. The dominant open-source MQTT broker is Eclipse Mosquitto, which is excellent and battle-tested but is also a standalone daemon written in C that you deploy as infrastructure. Mochi MQTT takes a different approach: it is an embeddable MQTT broker written in Go that you can import as a library and run inside your own application.
The use case is specific but not rare. You are building a Go application — perhaps an IoT gateway, an edge computing service, or a testing harness — and you need MQTT broker functionality without deploying a separate process. Maybe you want to embed MQTT message handling directly in your application server. Maybe you are building an appliance or a self-contained system where minimising process count matters. Maybe you just want an MQTT broker you can spin up in a test with go test and tear down when the test finishes, without Docker or process management.
Mochi MQTT implements MQTT v5.0 (and v3.1.1) with support for QoS 0, 1, and 2, retained messages, will messages, topic filters with wildcards, shared subscriptions, and the other features you expect from an MQTT broker. It supports pluggable persistence backends — in-memory for testing, Bolt or Badger for embedded persistence, or you can write your own. It also supports pluggable authentication via hooks, so you can integrate it with your application's existing auth system rather than managing a separate credential store.
The trade-off is clear: Mochi MQTT is not Mosquitto. It does not have Mosquitto's years of production hardening, its bridging capabilities, or its ecosystem of plugins. For a fleet of ten thousand devices in production, you probably want Mosquitto (or EMQX, or HiveMQ, or a managed MQTT service). For embedding broker functionality in a Go application, for testing, for edge deployments, or for situations where "deploy another daemon" is not an option, Mochi is a clean and well-designed solution.
package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"

    mqtt "github.com/mochi-mqtt/server/v2"
    "github.com/mochi-mqtt/server/v2/hooks/auth"
    "github.com/mochi-mqtt/server/v2/listeners"
    "github.com/mochi-mqtt/server/v2/packets"
)

func main() {
    // Create the broker — it is just a Go struct
    server := mqtt.New(&mqtt.Options{
        InlineClient: true, // Allow the embedding app to subscribe/publish
    })

    // Allow all connections (use a real auth hook in production)
    _ = server.AddHook(new(auth.AllowHook), nil)

    // Add a TCP listener
    tcp := listeners.NewTCP(listeners.Config{
        ID:      "tcp1",
        Address: ":1883",
    })
    _ = server.AddListener(tcp)

    // Subscribe from within the embedding application
    callbackFn := func(cl *mqtt.Client, sub packets.Subscription, pk packets.Packet) {
        log.Printf("Embedded subscriber received on %s: %s",
            pk.TopicName, string(pk.Payload))
    }
    _ = server.Subscribe("sensors/+/temperature", 1, callbackFn)

    // Publish from the embedding application via the inline client
    _ = server.Publish("sensors/42/temperature", []byte("21.5"), false, 0)

    go func() { _ = server.Serve() }()
    log.Println("MQTT broker running on :1883")

    // Graceful shutdown
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
    <-sig
    _ = server.Close()
}
LavinMQ
RabbitMQ's Diet Cousin, Written in Crystal
LavinMQ is an AMQP 0.9.1 compatible message broker written in Crystal — yes, Crystal, the language that looks like Ruby but compiles to native code. It is wire-compatible with RabbitMQ, meaning your existing RabbitMQ client libraries (for any language) work with LavinMQ without modification. The pitch is simple: all the protocol compatibility of RabbitMQ with a dramatically smaller resource footprint.
And "dramatically smaller" is not marketing hyperbole. LavinMQ consistently runs at a fraction of RabbitMQ's memory usage. Where a RabbitMQ node might consume several gigabytes of RAM under moderate load, LavinMQ can handle comparable workloads in hundreds of megabytes. The disk I/O profile is also leaner — LavinMQ uses memory-mapped files and an append-only segment-based storage engine that avoids the complex Erlang queue mirroring machinery. For environments where resources are constrained — edge deployments, small VPS instances, development machines, or situations where you genuinely do not need the full weight of RabbitMQ's feature set — the resource savings are meaningful.
The trade-offs are significant and worth understanding. LavinMQ does not implement RabbitMQ's quorum queues, shovel plugin, federation, or the more advanced clustering features. It does not have RabbitMQ's plugin ecosystem. The community is small. The Crystal language ecosystem, while growing, is nowhere near the size of Erlang/OTP's, which means fewer contributors and a smaller pool of people who can debug the internals. If you need RabbitMQ's full feature set, use RabbitMQ. LavinMQ is for situations where you need the AMQP protocol with minimal overhead, and you can live without the features you are giving up.
LavinMQ is developed by 84codes, the company behind CloudAMQP (a major managed RabbitMQ provider), which means the team building it understands AMQP in production at scale. This is reassuring — they are not building a toy, they are building a tool they understand the need for from years of operating RabbitMQ for thousands of customers.
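A hedged sketch of what wire compatibility means in practice: the snippet below uses the stock RabbitMQ Go client (github.com/rabbitmq/amqp091-go) pointed at a LavinMQ node, assuming default guest credentials on localhost:5672. None of the calls are LavinMQ-specific; the same code runs unchanged against RabbitMQ.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	// LavinMQ speaks AMQP 0.9.1 on the standard port, so the standard
	// RabbitMQ client connects with nothing changed but the hostname.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare a durable queue and publish to it — identical calls,
	// identical semantics, against either broker.
	q, err := ch.QueueDeclare("orders", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	err = ch.Publish("", q.Name, false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        []byte(`{"orderId": "12345"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("published to", q.Name)
}
```

The compatibility boundary is the protocol, not the feature set: calls that depend on RabbitMQ-only extensions (quorum queues, federation) are exactly where the two diverge.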
Apache RocketMQ
The Messaging Giant You Have Probably Never Used
Apache RocketMQ is a distributed messaging system originally developed at Alibaba, donated to the Apache Software Foundation, and widely deployed across Alibaba's infrastructure where it handles trillions of messages. It is one of the most battle-tested messaging systems in existence, but unless you work with Chinese technology companies or read Chinese-language technical blogs, you may have never encountered it.
RocketMQ occupies a similar space to Kafka — distributed, partitioned, high-throughput, log-based — but with a different set of design decisions. The most distinctive feature is transaction messages: RocketMQ has first-class support for a two-phase commit protocol that coordinates message publishing with local database transactions. You begin a "half message" (invisible to consumers), execute your local transaction, and then commit or roll back the half message based on whether the local transaction succeeded. If the commit/rollback is lost (process crash, network failure), RocketMQ will call back to your application to check the transaction status. This is a feature that Kafka users typically implement at the application level with the Outbox pattern; RocketMQ builds it into the broker.
Other notable features include scheduled messages with arbitrary delay (not just fixed delay levels, though the implementation details have evolved across versions), message filtering on the broker side using SQL92-like expressions or tag-based filtering, and a built-in tracing and metrics system. The operational model uses a "NameServer" for service discovery (simpler than ZooKeeper but less feature-rich) and supports both master-slave and Raft-based replication in newer versions.
The adoption barrier outside China is real and worth acknowledging honestly. Documentation quality in English has historically been uneven. The community discussion happens substantially in Chinese. Client library quality varies by language — the Java client is excellent (it is what Alibaba uses), while clients for other languages range from adequate to experimental. If you are a Java shop comfortable reading some Chinese-language resources and you need transaction message support without implementing it yourself, RocketMQ is a serious option. If you are a polyglot team that relies on English-language documentation and Stack Overflow, the friction will be higher than with Kafka or RabbitMQ.
// RocketMQ transaction message — the killer feature
TransactionMQProducer producer = new TransactionMQProducer("tx-producer-group");
producer.setNamesrvAddr("localhost:9876");
producer.setTransactionListener(new TransactionListener() {
    @Override
    public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
        // This runs after the half message is sent but before it's visible
        try {
            orderRepository.save(new Order(msg.getKeys(), msg.getBody()));
            return LocalTransactionState.COMMIT_MESSAGE; // Make message visible
        } catch (Exception e) {
            return LocalTransactionState.ROLLBACK_MESSAGE; // Discard the message
        }
    }

    @Override
    public LocalTransactionState checkLocalTransaction(MessageExt msg) {
        // RocketMQ calls this if commit/rollback was lost
        // Check your database: did the order actually save?
        boolean exists = orderRepository.existsByOrderId(msg.getKeys());
        return exists
            ? LocalTransactionState.COMMIT_MESSAGE
            : LocalTransactionState.ROLLBACK_MESSAGE;
    }
});
producer.start();

Message msg = new Message("orders", "OrderCreated", orderId,
    orderJson.getBytes(StandardCharsets.UTF_8));
producer.sendMessageInTransaction(msg, null);
EventStoreDB
The Event Store That Started the Conversation
EventStoreDB is the database that Greg Young built to prove that event sourcing was not just an academic exercise but a practical architecture. If event sourcing has a spiritual home, EventStoreDB is it. While you can do event sourcing on top of PostgreSQL, Kafka, or DynamoDB (and many people do), EventStoreDB was purpose-built for the pattern, and that purpose-built nature shows in everything from its data model to its query capabilities.
The core abstraction is the stream — an ordered, append-only sequence of events identified by a stream name. You write events to streams (typically one stream per aggregate: order-12345, customer-67890). You read events from streams, either forward from a position or backward from the end. This maps directly to the event sourcing pattern: to reconstitute an aggregate, read its stream from the beginning and replay the events. To see what happened to a specific entity, read its stream. Simple.
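Reconstitution is, mechanically, a left fold over the stream. A minimal illustration in Go, with Event and OrderState as invented stand-in types for the sake of the sketch rather than anything from the EventStoreDB client:

```go
package main

import "fmt"

// Event is an illustrative stand-in for a stored event read from a stream.
type Event struct {
	Type string
	Data map[string]any
}

// OrderState is the aggregate state rebuilt by replaying events.
type OrderState struct {
	Placed    bool
	Confirmed bool
	Total     float64
}

// Replay reconstitutes state by applying each event in stream order —
// the core mechanic of event sourcing.
func Replay(events []Event) OrderState {
	var s OrderState
	for _, e := range events {
		switch e.Type {
		case "OrderPlaced":
			s.Placed = true
			s.Total = e.Data["total"].(float64)
		case "OrderConfirmed":
			s.Confirmed = true
		}
	}
	return s
}

func main() {
	stream := []Event{
		{Type: "OrderPlaced", Data: map[string]any{"total": 159.99}},
		{Type: "OrderConfirmed"},
	}
	fmt.Printf("%+v\n", Replay(stream)) // {Placed:true Confirmed:true Total:159.99}
}
```

Because the fold is deterministic, replaying the same stream always yields the same state; snapshots, when needed, are just a cached intermediate result of this fold.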
Where EventStoreDB gets interesting is projections and subscriptions. Projections are server-side JavaScript functions that run over event streams and produce new streams, state, or views. You can create a projection that reads from all order-* streams and produces a high-value-orders stream containing only orders above a certain amount. Or a projection that maintains a running count of events by type. Projections run continuously as new events are written, making them a form of real-time stream processing built into the database. Subscriptions allow clients to follow streams in real time — your read model updater subscribes to a category of streams and updates a SQL database as new events arrive. Persistent subscriptions add consumer-group-like semantics with server-managed checkpoints.
The operational story has improved significantly over the years. EventStoreDB 20+ moved from the Mono runtime to .NET and introduced a gRPC-based client protocol, which broadened client library support beyond the .NET ecosystem. Clustering uses a gossip-based protocol for leader election and supports both synchronous and asynchronous replication. There is a commercial cloud offering (Event Store Cloud) for teams that prefer managed infrastructure. The community is passionate, knowledgeable, and occasionally evangelical in a way that only people who have had a genuine architectural revelation can be.
The honest assessment: EventStoreDB is exceptional at what it was designed for — event sourcing and CQRS. If your architecture is built around event-sourced aggregates, projections, and read models, EventStoreDB is the most natural fit. If you are doing general-purpose pub/sub, event streaming, or task queuing, you are better served by tools designed for those patterns. EventStoreDB is a specialist, not a generalist, and that is exactly what makes it valuable.
// Writing events to EventStoreDB
var client = new EventStoreClient(
    EventStoreClientSettings.Create("esdb://localhost:2113?tls=false"));

var events = new[] {
    new EventData(
        Uuid.NewUuid(),
        "OrderPlaced",
        JsonSerializer.SerializeToUtf8Bytes(new {
            OrderId = "order-42",
            CustomerId = "cust-7",
            Total = 159.99m,
            PlacedAt = DateTime.UtcNow
        })
    ),
    new EventData(
        Uuid.NewUuid(),
        "OrderConfirmed",
        JsonSerializer.SerializeToUtf8Bytes(new {
            OrderId = "order-42",
            ConfirmedAt = DateTime.UtcNow
        })
    )
};

// Append to a stream — optimistic concurrency via expected revision
await client.AppendToStreamAsync(
    "order-42",
    StreamState.NoStream, // Expect the stream doesn't exist yet
    events
);

// Read the stream back
var result = client.ReadStreamAsync(Direction.Forwards, "order-42", StreamPosition.Start);
await foreach (var evt in result) {
    Console.WriteLine($"{evt.Event.EventType}: {Encoding.UTF8.GetString(evt.Event.Data.Span)}");
}

// Subscribe to all events in the "order" category
await client.SubscribeToStreamAsync(
    "$ce-order", // Category projection — all streams starting with "order-"
    FromStream.Start,
    (sub, evt, ct) => {
        Console.WriteLine($"Read model update: {evt.Event.EventType}");
        return Task.CompletedTask;
    },
    resolveLinkTos: true // $ce streams hold link events; resolve to the originals
);
Liftbridge
NATS Plus Persistence, Before JetStream Made It Redundant
Liftbridge is a cautionary tale about timing in open source. It was created to solve a real problem: NATS Core was excellent for ephemeral pub/sub messaging but had no persistence. Messages were fire-and-forget — if no subscriber was listening when a message was published, it was gone. Liftbridge added a persistence layer on top of NATS by implementing a Kafka-like log abstraction — streams with offsets, consumer positions, and durable storage — while using NATS as the transport layer.
The architecture was clever. Liftbridge servers formed a cluster alongside NATS servers, intercepting messages on designated subjects and writing them to persistent logs. Consumers could then read from these logs using offsets, just like Kafka. You got NATS's simplicity and performance for ephemeral messaging, plus Kafka-like durability and replay for the subjects that needed it. It was the best of both worlds — or at least, that was the pitch.
Then NATS JetStream arrived. JetStream, built directly into the NATS server by the core NATS team at Synadia, provided persistence, stream processing, key-value storage, and object storage as a first-party feature. It solved the same fundamental problem as Liftbridge but with deeper integration, official support, and the full weight of the NATS community behind it. Liftbridge, as a third-party add-on solving a problem that the first party had now officially solved, found its reason for existing substantially eroded.
Liftbridge still works. The concepts are sound. If you encounter it in an existing system, there is no urgent reason to rip it out. But for new projects, JetStream is the clear choice for adding persistence to NATS. Liftbridge's story is a useful reminder that building on top of someone else's platform carries inherent risk: the platform may eventually absorb your value proposition. It happens in messaging. It happens in the cloud. It happens everywhere in software. The question is not whether it will happen but whether you have shipped enough value before it does.
Notable Mentions
The long tail of messaging systems extends well beyond what any single chapter can cover. Here are a few more worth knowing about, even if they do not warrant a full profile.
NSQ — Originally built at Bitly, NSQ is a real-time distributed messaging platform written in Go. It emphasises operational simplicity: no single point of failure, no complex configuration, minimal dependencies. Messages are pushed to consumers, and there are no consumer groups or complex routing — just topics and channels. NSQ was ahead of its time in prioritising developer experience and operational simplicity, and it influenced the design of several later systems. It still works well for straightforward pub/sub workloads where you want something simpler than Kafka but more durable than Redis Pub/Sub. Development has slowed, but the codebase is stable and the design is sound.
Zenoh — An interesting protocol and implementation emerging from the Eclipse Foundation (originally from ADLINK Technology, now ZettaScale). Zenoh is a pub/sub/query protocol designed for robotics, IoT, and edge-to-cloud communication. It unifies data in motion (pub/sub), data at rest (storage), and data in computation (queries) under a single protocol. The most intriguing aspect is its ability to bridge different network topologies — it can work peer-to-peer, through routers, or via brokerless gossip. If you are building systems that span edge devices, fog nodes, and cloud services, Zenoh's unified model is worth investigating. The community is small but growing, and the protocol design is genuinely novel rather than "Kafka but different."
Tributary — A Python library for building streaming reactive pipelines. It is less a messaging system and more a dataflow framework, using Python's async capabilities to build directed acyclic graphs of computations. Useful for data science workflows that need reactive processing without the overhead of deploying a full streaming platform.
Memphis.dev's legacy — Covered in detail in Chapter 21, but worth mentioning here as a member of the "developer experience layer over NATS JetStream" category, which turned out to be a difficult space to sustain a business in.
KubeMQ — A Kubernetes-native message broker that runs as a single container and supports multiple messaging patterns (queues, pub/sub, RPC). The pitch is simplicity for teams that want messaging inside Kubernetes without the operational weight of Kafka or RabbitMQ. The community is small, and the long-term viability question applies.
VerneMQ and NanoMQ — Alternative MQTT brokers. VerneMQ is Erlang-based (like RabbitMQ) and designed for large-scale IoT deployments. NanoMQ is a lightweight C-based broker from the same organisation behind NNG (nanomsg next generation). Both are worth evaluating if EMQX or Mosquitto do not fit your specific constraints.
Beanstalkd — A simple, fast work queue. It does one thing — job queuing with priorities, delays, and time-to-run — and does it well. No pub/sub, no streaming, no log compaction. If all you need is a work queue and you find even Redis overly complex for the task, Beanstalkd's simplicity is appealing. It has been around since 2007 and remains quietly useful.
The Long Tail of Messaging
If this chapter has demonstrated anything, it is that the messaging landscape is far more diverse than the Kafka-versus-RabbitMQ debate suggests. There are hundreds of messaging systems in existence, ranging from battle-tested infrastructure running at Alibaba scale to a single developer's weekend project on GitHub with eleven stars and a README that says "TODO: add documentation."
This diversity is not a problem. It is a sign of a healthy engineering ecosystem. Different problems genuinely require different solutions. An IoT gateway collecting sensor data from ten thousand devices has different needs than a financial trading system processing market data with microsecond latency, and both have different needs than a startup's background job queue running on a single $20/month VPS. A tool that is perfect for one of these is likely terrible for the others.
The risk is not in the diversity itself but in the temptation to chase novelty. Every new messaging system promises to fix the problems of the ones that came before. Sometimes it does. Sometimes the "problems" it fixes are actually trade-offs that exist for good reasons, and the new system has simply chosen different trade-offs that it has not yet had enough production miles to discover. The graveyard of messaging systems that were going to replace Kafka is well-populated and continues to accept new residents.
The practical advice is straightforward. For your core messaging infrastructure, choose something battle-tested with a large community, active maintenance, and a track record measured in years, not months. For specialised use cases — embedded MQTT, event sourcing, serverless job queues, edge computing — the niche tools in this chapter may be exactly what you need, and they may save you from contorting a general-purpose broker into a shape it was never designed for. Know the difference between your core infrastructure choices and your specialised tooling choices, and apply different risk tolerances to each.
The messaging world will continue to diversify. New protocols will emerge. New brokers will be announced. New "Kafka killers" will appear on Hacker News, generate excitement, and either prove their worth or quietly fade. The evaluation framework from Chapter 10 applies to all of them. The fundamentals — durability, ordering, delivery semantics, operational complexity — do not change just because the implementation language is novel or the website is well-designed. Judge tools by what they do under load, not by what they promise in blog posts.
The Comparison Matrix
You have now read detailed chapters on sixteen brokers and a collection of niche systems. You have absorbed thousands of words about throughput, latency, durability, delivery semantics, and operational complexity. If you are anything like most engineers at this stage, you are thinking: "Just give me the table." Fair enough.
This chapter is the table. Or rather, it is several tables, plus the caveats and context that prevent those tables from being actively misleading. Because here is the uncomfortable truth about comparison matrices: they are simultaneously the most requested and the most dangerous form of technical documentation. A table compresses nuanced, context-dependent, workload-specific trade-offs into neat cells that fit on a single screen. That compression is useful for orientation and terrible for decision-making. Use this chapter to narrow your shortlist. Do not use it to make your final choice.
Every value in the tables below is a simplification. "High throughput" means different things at different message sizes, replication factors, and durability settings. "Low operational complexity" means different things depending on whether your team has Kubernetes expertise, Erlang experience, or a JVM tuning fetish. Read the individual chapters for the context behind each rating. The table tells you what. The chapters tell you why.
The Big Table
This is the primary comparison matrix, covering the evaluation dimensions from Chapter 10. Ratings are relative to the other brokers in this table, not absolute. A "Medium" throughput rating does not mean the broker is slow — it means other brokers in this comparison are faster under comparable conditions.
| Broker | Max Throughput | p99 Latency | Durability | Ordering | Delivery Semantics | Ops Complexity | Ecosystem | Cost Model | Cloud-Native | Multi-Tenancy |
|---|---|---|---|---|---|---|---|---|---|---|
| Kafka | Very High (millions msg/s, GBs/s with partitions) | Medium (2-15ms typical, GC spikes) | High (ISR replication, configurable acks) | Partition-level | At-least-once; exactly-once within Kafka | High (ZK/KRaft, partition management, tuning) | Very Large (Connect, Streams, Schema Registry, massive client ecosystem) | Open source + infra; managed (Confluent, MSK) per-partition/hr | Good (K8s operators, tiered storage, managed offerings) | Limited (ACLs, quotas, no native namespaces) |
| RabbitMQ | Medium (50-100K msg/s typical per node) | Low-Medium (sub-ms to low-ms for simple queues) | High (quorum queues with Raft) | Queue-level | At-least-once; at-most-once configurable | Medium (single binary, Erlang runtime, clustering can be finicky) | Large (many client libs, plugins, management UI, Shovel, Federation) | Open source + infra; managed (CloudAMQP) | Moderate (K8s operator exists, stateful nature fights K8s) | Good (vhosts, per-vhost permissions and policies) |
| Pulsar | Very High (comparable to Kafka with enough bookies) | Medium (similar to Kafka, BookKeeper adds latency) | Very High (BookKeeper, fencing, rack-aware replication) | Partition-level; key-shared for per-key | At-least-once; exactly-once (transactional) | Very High (ZK + BookKeeper + brokers, three systems to operate) | Medium-Large (clients in many languages, Pulsar Functions, IO connectors) | Open source + infra; managed (StreamNative) | Good (tiered storage, K8s operators) | Excellent (native tenants, namespaces, policies, quotas) |
| SNS/SQS | High (SQS: virtually unlimited with horizontal scaling) | Medium (SQS: 10-50ms typical, polling-based) | Very High (AWS-managed, multi-AZ by default) | FIFO queues: per-group; Standard: best-effort | At-least-once (standard); exactly-once (FIFO) | Very Low (fully managed, nothing to operate) | AWS-native (Lambda triggers, EventBridge integration, limited outside AWS) | Per-request + per-GB data transfer; can get expensive at high volume | Excellent (it is the cloud) | Limited (IAM-based, no native namespaces) |
| EventBridge | Medium (default quotas, request increases available) | Medium-High (50-500ms typical end-to-end) | Very High (AWS-managed, multi-AZ) | None guaranteed | At-least-once | Very Low (fully managed, rule-based routing) | AWS-native (deep integration with 100+ AWS services, SaaS partners) | Per-event; cheap at low volume, adds up at high volume | Excellent (serverless-native) | Limited (per-account event buses, cross-account sharing) |
| Google Pub/Sub | Very High (auto-scales, no partitioning to manage) | Medium (50-100ms typical) | Very High (synchronous replication across zones) | Ordering keys (per-key ordering) | At-least-once; exactly-once (per-subscription) | Very Low (fully managed) | GCP-native (Dataflow, BigQuery subscriptions, many client libs) | Per-message + per-GB egress; volume discounts | Excellent (it is the cloud) | Moderate (IAM, per-project isolation) |
| Azure Event Hubs | Very High (throughput units, Kafka wire-compatible) | Medium (comparable to Kafka) | Very High (Azure-managed, zone-redundant) | Partition-level (Kafka-compatible) | At-least-once | Low (managed, but throughput unit planning needed) | Azure-native (Stream Analytics, Functions, Kafka compatibility layer) | Per throughput unit/hr + per-event ingress | Excellent (native Azure integration) | Moderate (consumer groups, namespace-level isolation) |
| Redis Streams | High (100K+ msg/s per node easily) | Very Low (sub-ms to low single-digit ms) | Medium (AOF/RDB, Redis Cluster replication is async) | Stream-level (within a single stream) | At-least-once (with consumer groups and ACK) | Low-Medium (Redis is well-known, but clustering adds complexity) | Large as Redis, small as a streaming platform (limited connectors) | Open source + infra; managed (ElastiCache, Redis Cloud) | Good (well-supported on K8s, managed offerings) | Limited (database-level isolation, no native multi-tenancy) |
| NATS/JetStream | High (NATS core: millions msg/s; JetStream: lower with persistence) | Very Low (NATS core: sub-ms; JetStream: low-ms) | Medium-High (JetStream: Raft-based, R3 replication) | Stream-level; per-subject with consumers | At-least-once; exactly-once (double-ack) | Low (single binary, simple config, built-in monitoring) | Medium (growing client ecosystem, no equivalent to Kafka Connect) | Open source + infra; managed (Synadia Cloud) | Good (single binary, K8s-friendly, Helm charts) | Good (accounts, JetStream resource limits per account) |
| ActiveMQ/Artemis | Medium (Artemis significantly faster than Classic) | Low-Medium (Artemis: sub-ms to low-ms) | High (Artemis: journal-based, replication) | Queue-level | At-least-once; at-most-once; XA transactions | Medium (JVM tuning, journal configuration, address model) | Large (JMS ecosystem, many client protocols: AMQP, STOMP, MQTT, OpenWire) | Open source + infra | Moderate (K8s operators available, JVM resource overhead) | Moderate (addresses, security domains) |
| ZeroMQ | Very High (millions msg/s, zero-copy, no broker) | Very Low (microsecond-range, in-process) | None (no persistence, no broker) | Per-socket (in-order delivery per connection) | At-most-once (default); at-least-once (application-level) | Low (library, no infrastructure) but High (you build everything) | Medium (many language bindings, no connectors — it is a library) | Open source; zero infrastructure cost; high development cost | N/A (library, not infrastructure) | N/A (no broker) |
| Redpanda | Very High (Kafka-competitive, often better single-node) | Low (no JVM GC, lower tail latency than Kafka) | High (Raft-based replication) | Partition-level (Kafka-compatible) | At-least-once; exactly-once (Kafka-compatible) | Medium (single binary, no ZK/JVM, but still distributed system) | Large (Kafka API compatible, inherits Kafka ecosystem) | Open source (BSL → relicensed) + infra; managed (Redpanda Cloud) | Good (K8s operator, tiered storage) | Limited (same as Kafka — ACLs, quotas) |
| Memphis | Medium (limited by NATS JetStream backend) | Low (inherits NATS JetStream latency) | Medium-High (inherits JetStream durability) | Station-level | At-least-once | Low-Medium (GUI-driven, but project viability concerns) | Small (limited client SDKs, minimal connectors) | Open source + infra; was offering managed service | Good (K8s-native, Helm charts) | Limited |
| Solace PubSub+ | Very High (hardware-accelerated appliances or software) | Very Low (sub-ms, deterministic) | High (guaranteed messaging with persistence) | Queue-level; topic-level configurable | At-least-once; at-most-once; JMS transactional | Medium (rich feature set, but well-documented; managed option available) | Large (JMS, MQTT, AMQP, REST, many enterprise integrations) | Commercial license or managed (Solace Cloud); enterprise pricing | Good (K8s operator, Docker, managed cloud) | Good (message VPNs, client profiles, ACLs) |
| Chronicle Queue | Extreme (millions msg/s, microsecond latency) | Ultra-Low (single-digit microsecond, no GC) | Medium (local disk, memory-mapped files, no replication) | Total (single writer, sequential) | At-most-once (single machine); replication is application-level | Low (library, no infrastructure) but specialised (Java/JVM only) | Small (Java only, Chronicle ecosystem) | Open source (community) + commercial (enterprise features) | N/A (library, designed for co-located processes) | N/A (library) |
| Aeron | Extreme (designed for low-latency, millions msg/s) | Ultra-Low (microsecond-range, zero-GC paths) | Configurable (Archive for persistence, Cluster for fault-tolerance) | Per-stream, per-session | At-most-once (UDP); reliable (Aeron protocol); exactly-once (Cluster) | High (complex configuration, deep networking knowledge needed) | Small (Java primary, C++ and .NET drivers, specialised community) | Open source (Apache 2.0) + commercial (Aeron Premium) | Limited (designed for bare metal / dedicated infra, not cloud-native) | N/A (transport-level, not a multi-tenant system) |
How to Read This Table
A few notes before you screenshot this and present it in a meeting as if it were gospel:
- "Very High" throughput is not the same Very High across brokers. Kafka's "Very High" and Aeron's "Extreme" exist on different scales. Kafka moves millions of messages per second across a distributed cluster to durable storage. Aeron moves millions of messages per second across a network with microsecond latency but without the same durability guarantees. Comparing them directly is like comparing a cargo ship and a speedboat — they both move fast, but "fast" means something different for each.
- Latency numbers depend on your definition of "latency." Does the clock start when the producer calls `send()`? When the message hits the broker? When the consumer receives it? When the consumer acknowledges it? These are different measurements, and vendors are not always clear about which one they are quoting.
- Ops Complexity is the most subjective column. A team that has run Kafka for five years will rate Kafka's operational complexity as "Medium." A team deploying it for the first time will rate it as "Please make it stop." Context matters more than the rating.
- Cost Model does not tell you what it will actually cost. A per-message pricing model can be cheaper or more expensive than provisioned infrastructure, depending entirely on your traffic patterns. Do the maths for your workload, not the example on the pricing page.
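The measurement-point ambiguity is easier to see in code than in prose. Here is a minimal sketch, with hypothetical timestamps, computing each plausible definition of "latency" for one message's journey:

```python
from dataclasses import dataclass

# Hypothetical timestamps (in seconds) for one message's journey through
# a broker. Each value below is a different, legitimate "latency" -- and
# vendors rarely say which one a quoted number refers to.

@dataclass
class MessageTimeline:
    producer_send: float      # producer calls send()
    broker_receive: float     # broker persists the message
    consumer_receive: float   # consumer's callback fires
    consumer_ack: float       # consumer acknowledges

def latencies(t: MessageTimeline) -> dict[str, float]:
    return {
        "publish (send -> broker)": t.broker_receive - t.producer_send,
        "end-to-end (send -> consume)": t.consumer_receive - t.producer_send,
        "processing (consume -> ack)": t.consumer_ack - t.consumer_receive,
        "full cycle (send -> ack)": t.consumer_ack - t.producer_send,
    }

timeline = MessageTimeline(0.000, 0.004, 0.011, 0.015)
for name, value in latencies(timeline).items():
    print(f"{name}: {value * 1000:.1f} ms")
```

Whenever a benchmark quotes a p99, ask which of these intervals the clock actually covered.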
Notes and Caveats by Dimension
Throughput Caveats
Throughput benchmarks are the most abused numbers in the messaging world. Every caveat from Chapter 10 applies, but the ones that most frequently invalidate comparisons are:
- Message size. A broker that handles 2 million 100-byte messages per second may struggle with 50,000 10KB messages per second. Always benchmark with representative message sizes.
- Replication factor. Virtually all vendor benchmarks quote throughput with minimal or no replication. Production systems run RF=3. The throughput difference can be 2-3x.
- Durability settings. Kafka with `acks=0` versus `acks=all` is a different broker in terms of throughput. Redis Streams with `WAIT` versus without `WAIT` is a different broker.
- Consumer count. Throughput is often measured at the producer. Adding consumers — especially with acknowledgments, processing logic, and back-pressure — changes the picture.
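The message-size caveat is worth quantifying, because messages per second only means something alongside bytes per second. A quick back-of-the-envelope using the two workloads from the list above:

```python
def throughput_bytes_per_sec(msgs_per_sec: float, msg_size_bytes: int) -> float:
    """Convert a messages-per-second claim into bytes per second."""
    return msgs_per_sec * msg_size_bytes

# The two workloads from the message-size caveat: the "slower" broker
# handling 50K large messages is actually moving over 2.5x more bytes.
small = throughput_bytes_per_sec(2_000_000, 100)      # 2M x 100-byte messages
large = throughput_bytes_per_sec(50_000, 10 * 1024)   # 50K x 10KB messages

print(f"2M x 100B  = {small / 1e6:.0f} MB/s")   # 200 MB/s
print(f"50K x 10KB = {large / 1e6:.0f} MB/s")   # 512 MB/s
```

This is why a single "msg/s" number, quoted without a message size, tells you almost nothing.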
Latency Caveats
- JVM warm-up. JVM-based brokers (Kafka, Pulsar, ActiveMQ) have a warm-up period where JIT compilation improves performance. Benchmarks that include cold-start latency look worse than steady-state.
- Coordinated omission. If a benchmark tool waits for the previous request to complete before sending the next one, it understates tail latency. Any benchmark that does not address coordinated omission should be viewed with suspicion.
- Batching. Many brokers batch messages for efficiency. Batching improves throughput at the cost of latency. A broker's "low latency" mode may have dramatically different throughput than its "high throughput" mode.
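Coordinated omission deserves a concrete illustration. The following toy simulation (all numbers hypothetical) runs the same stalling service under two measurement styles: a closed-loop benchmark that waits for each response before sending the next request, and a fixed-rate benchmark that measures from each request's intended send time:

```python
# Toy demonstration of coordinated omission. The simulated "service"
# normally responds in 0.2 ms but stalls for 100 ms on every 1000th
# request. A closed-loop benchmark records each stall as one slow
# sample. A fixed-rate benchmark measures from each request's *intended*
# send time, so requests queued behind the stall also show high latency,
# which is what real clients experience.

SERVICE_MS = 0.2
STALL_MS = 100.0
RATE_INTERVAL_MS = 1.0  # intended load: one request per millisecond

def service_time(i: int) -> float:
    return STALL_MS if i % 1000 == 999 else SERVICE_MS

def closed_loop(n: int) -> list[float]:
    # Waits for each response before sending the next request,
    # so queueing delay is never observed.
    return [service_time(i) for i in range(n)]

def fixed_rate(n: int) -> list[float]:
    samples = []
    clock = 0.0
    for i in range(n):
        intended = i * RATE_INTERVAL_MS
        start = max(clock, intended)      # queued behind an earlier stall?
        clock = start + service_time(i)
        samples.append(clock - intended)  # latency from intended send time
    return samples

def p99(samples: list[float]) -> float:
    return sorted(samples)[int(len(samples) * 0.99)]

n = 10_000
print(f"closed-loop p99: {p99(closed_loop(n)):.1f} ms")  # hides the queueing
print(f"fixed-rate  p99: {p99(fixed_rate(n)):.1f} ms")   # reveals it
```

Same service, same stalls, wildly different p99 — the closed-loop run counts ten slow samples, while the fixed-rate run also counts every request that queued behind each stall.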
Durability Caveats
- Default configurations lie. Kafka's long-time default of `acks=1` is not durable (producer clients since Kafka 3.0 default to `acks=all`, but verify what your client actually does). RabbitMQ's classic queues without publisher confirms can lose messages on crash. Always check what "durable" means in the broker's default configuration versus its recommended production configuration.
- Replication is not backup. Replication protects against node failure. It does not protect against bugs that corrupt data on all replicas simultaneously, or against an operator who accidentally deletes a topic.
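To make "recommended production configuration" concrete, here is a sketch of a durability-first Kafka setup, expressed as plain dicts using librdkafka-style property names (the format the confluent-kafka client family accepts). Treat the exact values as starting points that depend on your workload, not as gospel:

```python
# Producer-side settings for durability, as a plain dict of
# librdkafka-style keys. This is a sketch: hostnames are placeholders,
# and broker-side settings matter just as much as producer-side ones.
durable_producer_config = {
    "bootstrap.servers": "broker-1:9092,broker-2:9092,broker-3:9092",
    "acks": "all",               # wait for all in-sync replicas, not just the leader
    "enable.idempotence": True,  # prevent duplicates from producer retries
}

# Broker/topic-side settings (shown as a dict for illustration): with
# replication factor 3 and min.insync.replicas=2, acks=all can tolerate
# one replica failure without losing acknowledged writes.
durable_topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
}

print(durable_producer_config["acks"])
```

The point of the pairing: `acks=all` on the producer is meaningless if the topic's `min.insync.replicas` is 1, because "all in-sync replicas" can then be a single node.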
Ordering Caveats
- Ordering and retries are enemies. If message A fails and is retried, it arrives after message B. Your "ordered" stream is now unordered. Kafka's `max.in.flight.requests.per.connection=1` prevents this but reduces throughput. Other brokers have analogous trade-offs.
- Consumer parallelism destroys ordering. Even if the broker delivers messages in order, a consumer that processes them in a thread pool has destroyed that ordering at the application level.
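One common way out of the consumer-parallelism trap is to shard by key: route every message for a given key to the same single-threaded worker, keeping parallelism across keys and strict ordering within each key. A minimal sketch, with a hypothetical worker pool and no error handling:

```python
import threading
import queue
from collections import defaultdict

# Sketch: hash each message's key to a fixed worker, so all messages for
# one key are processed by the same single-threaded worker, in arrival
# order. Real consumers also need shutdown handling, error handling,
# and back-pressure.

NUM_WORKERS = 4

def worker(q: queue.Queue, results: dict) -> None:
    while True:
        item = q.get()
        if item is None:          # sentinel: no more messages
            break
        key, value = item
        results[key].append(value)  # stands in for real processing

def run(messages):
    queues = [queue.Queue() for _ in range(NUM_WORKERS)]
    results = defaultdict(list)
    threads = [threading.Thread(target=worker, args=(q, results)) for q in queues]
    for t in threads:
        t.start()
    for key, value in messages:
        # Same key always hashes to the same worker within a run.
        queues[hash(key) % NUM_WORKERS].put((key, value))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return results

msgs = [("order-1", 1), ("order-2", 1), ("order-1", 2), ("order-2", 2), ("order-1", 3)]
print(dict(run(msgs)))  # per-key order preserved: order-1 -> [1, 2, 3]
```

This is essentially what Kafka's partition-by-key model and Pulsar's key-shared subscriptions give you at the broker level; the sketch just shows the same idea inside a single consumer process.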
Delivery Semantics Caveats
- Exactly-once has a scope. Kafka's exactly-once works within Kafka (topic to topic via Streams). The moment you write to an external system, you need application-level deduplication. This is true for every broker.
- At-least-once is the pragmatic default. With idempotent consumers, at-least-once is cheaper and more broadly applicable than exactly-once. Design for idempotency first, then evaluate whether you need exactly-once semantics.
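"Design for idempotency first" usually starts with tracking message IDs. Here is a minimal sketch of an idempotent handler; in production, the seen-set would live in a database or cache, updated in the same transaction as the side effect:

```python
# Minimal idempotent consumer sketch. At-least-once delivery means the
# same message can arrive twice, so the handler records processed
# message IDs and silently drops duplicates.

class IdempotentConsumer:
    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.applied: list[dict] = []

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if deduplicated."""
        msg_id = message["id"]
        if msg_id in self.seen:
            return False              # redelivery: safe to ack and drop
        self.applied.append(message)  # the actual side effect goes here
        self.seen.add(msg_id)
        return True

consumer = IdempotentConsumer()
event = {"id": "evt-001", "type": "OrderPlaced", "orderId": "ord-7829"}
print(consumer.handle(event))  # True: first delivery, applied
print(consumer.handle(event))  # False: redelivery, deduplicated
```

The crucial production detail the sketch glosses over: if the side effect and the seen-set update are not atomic, a crash between them reintroduces the duplicate.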
What the Table Does Not Tell You
The comparison matrix captures the measurable, the quantifiable, the things you can put in a spreadsheet. It does not capture the soft factors that, in practice, often matter more than raw performance numbers. These are the dimensions that do not fit in a cell.
Documentation Quality
There is a spectrum from "comprehensive, well-organised, with practical examples and troubleshooting guides" to "auto-generated API docs with no context and a README that was last updated before the current major version." Where a broker falls on this spectrum determines how quickly your team becomes productive and how efficiently they debug production issues.
Kafka's documentation is extensive but can feel like reference material rather than guidance — you need to already know what you are looking for. RabbitMQ's documentation is genuinely good: well-structured, honest about trade-offs, and written by people who understand that operators need different information than developers. NATS's documentation is clean and focused. Pulsar's documentation has improved but still has gaps, especially for operational topics. The cloud providers (AWS, GCP, Azure) have documentation that is thorough but spread across dozens of service pages and can be hard to navigate. Solace's documentation is enterprise-grade — comprehensive, if somewhat formal.
Community Vibe
Communities have cultures, and those cultures affect your experience when you need help.
Kafka's community is large but fragmented between the Apache project, Confluent's ecosystem, and various third-party tools. Help is available, but you may need to search in several places. RabbitMQ's community is friendly and helpful, with a long tradition of answering questions on mailing lists and forums. NATS's community is small but enthusiastic, with the core team being unusually responsive on GitHub and Slack. Pulsar's community is growing and technically strong but smaller than Kafka's. The cloud-native services have "communities" in the form of AWS re:Post and Google Groups, which is to say, professional support forums rather than communities in the social sense.
Hiring Pool
If you need to hire someone who knows your broker, the size of the candidate pool matters.
Kafka expertise is the most available — it is on countless CVs, and "I know Kafka" has become a standard line item for backend engineers (though the depth of knowledge varies from "I used a Kafka consumer in a Spring Boot project" to "I can tune ISR replication and debug consumer lag at the partition level"). RabbitMQ expertise is also widely available. Redis expertise is everywhere, though Redis Streams-specific expertise is rarer. NATS, Pulsar, Redpanda, and most other brokers have smaller talent pools. For niche systems (Chronicle Queue, Aeron, Solace), you are hiring for general distributed systems expertise and training on the specific tool.
The "2 AM Factor"
When your messaging system is broken at 2 AM, what is your experience going to be like?
This is a function of error messages (are they helpful?), observability (can you see what is wrong?), debugging tools (can you inspect the state?), and recovery procedures (can you fix it without losing data?). Some brokers fail gracefully with clear error messages and well-documented recovery procedures. Others fail with cryptic stack traces and recovery procedures that amount to "restart everything and hope."
Kafka's failure modes are well-documented because thousands of organisations have experienced them and written about them. RabbitMQ's management UI lets you see queue state and connection status, which is invaluable during incidents. NATS provides clear server logs and a monitoring endpoint. Cloud-managed services outsource the 2 AM problem to the cloud provider, which is worth its weight in gold for small teams.
Momentum and Trajectory
Is the project gaining contributors, features, and adoption? Or is it in maintenance mode, slowly declining, or at risk of being abandoned?
As of this writing: Kafka is mature and evolving (KRaft replacing ZooKeeper, tiered storage maturing). RabbitMQ is stable under Broadcom's stewardship, though the acquisition has created some community uncertainty. Pulsar is growing but faces the challenge of operational complexity. NATS is on an upward trajectory with JetStream gaining adoption. Redpanda is actively developing and competing aggressively with Kafka. The cloud-native services (SNS/SQS, Pub/Sub, Event Hubs) are evolving steadily as part of their respective cloud platforms. Memphis's trajectory is uncertain. Solace continues to serve its enterprise niche.
Protocol Support Comparison
Which wire protocols does each broker speak? This matters when you have existing clients, when you need to integrate with third-party systems, or when you want to avoid client library lock-in.
| Broker | AMQP 0.9.1 | AMQP 1.0 | MQTT | Kafka Protocol | HTTP/REST | gRPC | STOMP | Custom/Other |
|---|---|---|---|---|---|---|---|---|
| Kafka | — | — | — | Native | Confluent REST Proxy | — | — | Kafka binary protocol |
| RabbitMQ | Native | Plugin | Plugin | — | Management API | — | Plugin | — |
| Pulsar | — | — | Plugin (Pulsar-MQTT) | KoP (Kafka on Pulsar) | REST Admin API | — | — | Pulsar binary protocol |
| SNS/SQS | — | — | — | — | Native (AWS API) | — | — | AWS SDK protocol |
| EventBridge | — | — | — | — | Native (AWS API) | — | — | AWS SDK protocol |
| Google Pub/Sub | — | — | — | — | Native (REST) | Native (gRPC) | — | — |
| Azure Event Hubs | — | Native | — | Yes (compatibility layer) | REST | — | — | — |
| Redis Streams | — | — | — | — | — | — | — | RESP (Redis protocol) |
| NATS/JetStream | — | — | Via gateway | — | Via WebSocket | — | — | NATS text protocol |
| ActiveMQ/Artemis | — | Artemis native | Artemis plugin | — | Jolokia REST | — | Both support | OpenWire (Classic) |
| ZeroMQ | — | — | — | — | — | — | — | ZMTP (ZeroMQ protocol) |
| Redpanda | — | — | — | Native (compatible) | HTTP Proxy (Pandaproxy) | — | — | Kafka binary protocol |
| Solace PubSub+ | — | Yes | Yes (native) | — | Yes (native) | — | — | SMF (Solace Message Format) |
| Aeron | — | — | — | — | — | — | — | Aeron protocol (SBE) |
Reading the Protocol Table
A few patterns emerge:
- Kafka's protocol is becoming a de facto standard. Redpanda implements it natively. Pulsar has KoP. Azure Event Hubs has a compatibility layer. When a broker adds "Kafka compatibility," it is an acknowledgment of Kafka's market position — your existing Kafka clients become a migration path.
- AMQP has two versions and they are not the same thing. AMQP 0.9.1 (RabbitMQ's native protocol) and AMQP 1.0 (an OASIS standard, ActiveMQ Artemis's native protocol) are different protocols with different semantics. "Supports AMQP" without a version number is ambiguous and possibly deceptive.
- Multi-protocol brokers offer flexibility at the cost of cognitive load. Solace, ActiveMQ Artemis, and RabbitMQ (with plugins) can speak multiple protocols. This is valuable for integration scenarios but means the operational team needs to understand the semantics and limitations of each protocol on that broker, not just the protocol in general.
- HTTP/REST support is the great equaliser. Almost every language and platform can make HTTP requests. Brokers with HTTP interfaces are accessible from serverless functions, legacy systems, and environments where installing a native client library is impractical. The trade-off is performance — HTTP is not the most efficient transport for high-throughput messaging.
Language and Client Library Support
Having a broker is useless if you cannot talk to it from your programming language. Official, maintained client libraries matter — a community client with three GitHub stars and a last commit from 2021 is a liability, not a feature.
| Broker | Java/JVM | Python | Go | .NET/C# | JavaScript/Node.js | Rust | C/C++ | Ruby | PHP |
|---|---|---|---|---|---|---|---|---|---|
| Kafka | Official (excellent) | confluent-kafka-python (librdkafka) | confluent-kafka-go (librdkafka) | confluent-kafka-dotnet (librdkafka) | kafkajs / confluent-kafka-js | rdkafka (community, good) | librdkafka (reference) | ruby-kafka (community) | php-rdkafka (community) |
| RabbitMQ | Official (amqp-client) | pika (maintained) | amqp091-go (official) | RabbitMQ.Client (official) | amqplib (community, excellent) | lapin (community, good) | rabbitmq-c (official) | bunny (community, excellent) | php-amqplib (community) |
| Pulsar | Official (excellent) | Official | Official | Official (DotPulsar) | Official (Node.js) | Community | Official (C++) | Community | Community |
| SNS/SQS | AWS SDK | boto3 | AWS SDK | AWS SDK | AWS SDK | AWS SDK | AWS SDK | AWS SDK | AWS SDK |
| Google Pub/Sub | Official | Official | Official | Official | Official | Community | Community (C++) | Official | Official |
| Azure Event Hubs | Official | Official | Official (azeventhubs) | Official | Official | Community | Community | Community | Community |
| Redis Streams | Jedis, Lettuce | redis-py | go-redis | StackExchange.Redis | ioredis | redis-rs | hiredis | redis-rb | phpredis |
| NATS/JetStream | Official (jnats) | Official (nats.py) | Official (nats.go) | Official (nats.net) | Official (nats.js) | Official (nats.rs) | Official (nats.c) | Official (nats-pure) | Community |
| ActiveMQ/Artemis | Official (JMS) | stomp.py / proton | proton (AMQP 1.0) | NMS / AMQP.Net Lite | stompit / rhea | Community | proton-c (Qpid) | stomp (gem) | stomp-php |
| ZeroMQ | JeroMQ (pure Java) | pyzmq (official) | go-zeromq / pebbe/zmq4 | NetMQ (community, excellent) | zeromq.js | rust-zmq | czmq / libzmq (reference) | ffi-rzmq | php-zmq |
| Redpanda | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) | Kafka clients (compatible) |
| Solace PubSub+ | Official (JCSMP, JMS) | Official | Official | Official | Official | Community | Official (CCSMP) | Community | Community |
| Chronicle Queue | Official (Java only) | — | — | — | — | — | — | — | — |
| Aeron | Official (primary) | — | — | Community | — | Community | Official (C) | — | — |
Reading the Client Library Table
- Kafka's client ecosystem is unmatched in breadth, largely thanks to librdkafka — a C library that provides the foundation for clients in Python, Go, .NET, and others. Redpanda inherits this ecosystem entirely because it speaks the Kafka protocol.
- Cloud provider SDKs are comprehensive but vendor-locked. AWS, Google, and Azure provide official SDKs for every major language. The quality is consistently high, but your code is coupled to that cloud provider's API.
- NATS has invested heavily in official clients. The NATS team maintains official clients for eight languages, which is unusual for a project of its size and reflects a deliberate strategy to reduce the friction of adoption.
- ZeroMQ's client story reflects its nature as a library. The reference implementation is in C (libzmq), and most language bindings are FFI wrappers around it. The quality is generally good, but debugging issues sometimes requires understanding the C layer beneath your language's abstraction.
- Chronicle Queue and Aeron are JVM-first by design. If you are not on the JVM, these are not practical options. This is not a limitation — it is a deliberate design choice. When you are optimising for microsecond latency and zero-GC, you are writing Java (or C++), full stop.
- "Community" does not mean "bad." Some community clients are excellent (bunny for RabbitMQ in Ruby, NetMQ for ZeroMQ in .NET). But community clients carry inherent risk: they may not keep up with broker protocol changes, and they may be maintained by one person whose interests could shift. Check the commit history and issue response time before depending on a community client in production.
The Meta-Comparison: What Kind of Tool Are You Looking At?
Before comparing individual features, it helps to understand that these tools fall into fundamentally different categories. Comparing Kafka to ZeroMQ is like comparing PostgreSQL to SQLite — they are both "databases" in the way that both a container ship and a kayak are both "boats."
| Category | Brokers | What They Are | When to Compare Them |
|---|---|---|---|
| Distributed log / Event streaming | Kafka, Redpanda, Pulsar | Persistent, partitioned, high-throughput event logs | When your primary need is durable, replayable event streaming |
| Traditional message broker | RabbitMQ, ActiveMQ/Artemis, Solace | Message routing, queuing, exchange patterns | When you need flexible routing, protocol diversity, or enterprise integration |
| Cloud-managed messaging | SNS/SQS, EventBridge, Google Pub/Sub, Azure Event Hubs | Fully managed, no infrastructure to operate | When operational simplicity trumps all and you are committed to a cloud provider |
| Lightweight / Embeddable | Redis Streams, NATS/JetStream, Memphis | Low ceremony, quick to deploy, smaller footprint | When you want messaging without the operational weight of Kafka or Pulsar |
| Messaging library | ZeroMQ, Chronicle Queue, Aeron | Libraries you embed in your application, not infrastructure you deploy | When you need extreme performance and are willing to build your own infrastructure semantics |
Comparing tools within the same category is meaningful. Comparing tools across categories is meaningful only if you are genuinely deciding between fundamentally different approaches, which is a design decision, not a feature comparison.
Final Note on the Matrix
No comparison matrix, however detailed, can substitute for benchmarking with your own workload on your own infrastructure with your own team. The matrix tells you where to look. It does not tell you what you will find when you get there.
The brokers that score "High" on every dimension do not exist. Every tool in this table made trade-offs, and those trade-offs are features, not bugs — they are the reason the tool is good at what it is good at. Kafka is operationally complex because it provides the throughput, durability, and ecosystem richness that require that complexity. ZeroMQ has no persistence because persistence would add latency that contradicts its design goals. SNS/SQS costs money because someone else is doing the operational work you are not doing.
The best broker for you is the one whose trade-offs align with your priorities. The matrix helps you see the trade-offs. The decision is yours.
Selection Guide
You have read twenty-five chapters. You have absorbed throughput numbers, latency percentiles, durability guarantees, and enough operational caveats to fill an incident retrospective database. You can now recite the difference between AMQP 0.9.1 and AMQP 1.0 at dinner parties, which will not make you more popular but will make you more correct. The question remains: which broker should you actually use?
This chapter provides decision frameworks, not answers. Answers require context that a book cannot have — your team's skills, your budget, your existing infrastructure, your risk tolerance, your deadline, and whether you have a VP who once read a Confluent blog post and has Opinions. What this chapter can do is give you structured ways to narrow the field, practical heuristics for common scenarios, and enough honesty about the process to save you from the most common selection mistakes.
The First Question: Do You Even Need a Message Broker?
Before evaluating brokers, evaluate whether you need one at all. This sounds obvious. It is not. A significant number of message broker deployments exist because someone said "we need to decouple our services" without examining whether the coupling was actually causing problems, or because an architect drew boxes and arrows on a whiteboard and one of the arrows was labelled "message queue" without anyone asking why.
You probably do need a message broker if:
- Multiple consumers need the same event. A new order is created, and the inventory service, billing service, analytics pipeline, and notification service all need to know. Without a broker, you are writing point-to-point integrations, and the next service that needs the event means modifying the producer. This is the strongest case for a broker.
- You need to absorb traffic spikes. Your producers generate work faster than your consumers can process it during peaks. A buffer between them prevents cascade failures and lets consumers process at their own pace.
- You need durability for in-flight work. If a consumer crashes, you need the work item to survive and be reprocessed. A database with a polling pattern can do this, but a broker does it with less latency and less polling overhead.
- You need geographic distribution of events. Events produced in one region need to be consumed in another. Brokers with replication and federation handle this; building it yourself is a distributed systems project you do not want.
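The fan-out case in the first bullet is the core pub/sub idea, and it is small enough to sketch in memory. This is a toy, not a broker — no durability, no back-pressure, no delivery guarantees — but it shows why adding a consumer never touches the producer:

```python
from collections import defaultdict
from typing import Callable

# Toy in-memory fan-out bus. The producer publishes "OrderPlaced" once;
# every subscribed service receives it. Adding a fifth consumer means
# one more subscribe() call, with no change to the publisher.

class EventBus:
    def __init__(self) -> None:
        self.subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
for service in ("inventory", "billing", "analytics", "notifications"):
    # Default argument pins each service name to its own handler.
    bus.subscribe("OrderPlaced", lambda e, s=service: received.append((s, e["orderId"])))

bus.publish("OrderPlaced", {"orderId": "ord-7829"})
print(received)  # all four services saw the single event
```

Everything a real broker adds — persistence, retries, back-pressure, distribution — exists to make this simple dispatch loop survive contact with production.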
You probably do not need a message broker if:
- You have one producer and one consumer. A direct HTTP call with a retry library, or a database table with a polling consumer, may be simpler and more appropriate. The overhead of deploying and operating a broker is not justified for a point-to-point integration.
- Your "events" are actually request-response. If the producer needs an immediate answer from the consumer, you do not have a messaging problem — you have an RPC problem. Use HTTP, gRPC, or a service mesh. Shoehorning request-response into a message broker adds latency and complexity for no benefit.
- Your total message volume is tiny. If you process a hundred messages a day, a PostgreSQL table with a "processed" boolean column and a cron job is a perfectly respectable architecture. It is boring, reliable, easy to debug, and requires zero additional infrastructure. Do not let anyone shame you into deploying Kafka for this.
- You are a team of two and your deadline is next month. A message broker is an additional system to learn, deploy, monitor, and debug. If you are resource-constrained and your architecture does not absolutely require asynchronous messaging, postpone the broker. You can always add one later. You cannot easily remove one that has become load-bearing.
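The "table plus cron job" architecture deserves a sketch, because it is dismissed more often than it is understood. The pattern below uses Python's built-in sqlite3 in place of PostgreSQL so it is self-contained; table and column names are illustrative.

```python
import sqlite3

# Stand-in for the PostgreSQL table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        processed BOOLEAN NOT NULL DEFAULT 0
    )
""")
conn.executemany("INSERT INTO jobs (payload) VALUES (?)",
                 [("send-welcome-email",), ("generate-report",)])
conn.commit()


def poll_once(conn, handle):
    """One cron tick: fetch unprocessed rows, handle them, mark done."""
    rows = conn.execute(
        "SELECT id, payload FROM jobs WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for job_id, payload in rows:
        handle(payload)
        # Marking processed only after the handler succeeds gives
        # at-least-once behaviour, so the handler should be idempotent.
        conn.execute("UPDATE jobs SET processed = 1 WHERE id = ?", (job_id,))
    conn.commit()
    return len(rows)


done = []
poll_once(conn, done.append)
```

In real PostgreSQL with multiple workers, you would add FOR UPDATE SKIP LOCKED to the SELECT so concurrent pollers do not claim the same rows. At a hundred messages a day, one worker and a cron schedule is plenty.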
The honest truth: most applications start without a broker and add one when the need becomes clear. The applications that start with a broker and never quite use it properly are more common than anyone admits.
Decision Tree #1: By Primary Use Case
Start with what you are trying to do. Different use cases lead to fundamentally different parts of the broker landscape.
Event Streaming and Log Processing
The problem: You need to capture, store, and process a high-volume stream of events — clickstream data, IoT telemetry, application logs, change data capture from databases, user activity tracking. Events need to be durable, replayable, and processable by multiple independent consumers.
First choice: Kafka or Redpanda. This is their home turf. The distributed log model — append-only, partitioned, consumer-driven — was designed for exactly this use case. Kafka has the larger ecosystem (Connect, Streams, Schema Registry, massive community). Redpanda offers Kafka API compatibility with a simpler operational profile (single binary, no JVM, no ZooKeeper).
Cloud-native alternative: Google Pub/Sub or Azure Event Hubs if you are committed to a cloud provider and want to eliminate operational burden. Event Hubs has Kafka wire compatibility, which makes migration feasible. Google Pub/Sub is excellent but uses its own API, so you are buying in fully.
If scale is modest: NATS JetStream. Lighter weight than Kafka, easier to operate, and capable enough for event streaming workloads that do not need the full Kafka ecosystem.
Task Queuing and Background Jobs
The problem: You need to distribute units of work to a pool of workers. An order needs to be processed. An email needs to be sent. A report needs to be generated. Work items should be load-balanced across workers, retried on failure, and not lost if a worker crashes.
First choice: RabbitMQ. This is what it was built for. The AMQP model — messages routed through exchanges to queues, consumed by competing consumers, acknowledged on completion — maps directly to task queuing. Quorum queues provide durability. Dead-letter exchanges handle failed messages. The management UI lets you see queue depth and consumer state.
Cloud-native alternative: SQS. No infrastructure to operate, scales to infinity, integrates with Lambda for serverless processing. FIFO queues add ordering if you need it. The trade-off is vendor lock-in and slightly higher latency than a self-hosted broker.
Serverless-specific: QStash if your workers are serverless functions and you want HTTP push delivery with built-in retry. It is a specialised tool for a specialised environment.
Minimal overhead: Redis Streams with consumer groups. If you already have Redis in your stack, adding task queuing requires no additional infrastructure. The durability story is weaker than RabbitMQ's, so this is best for workloads where occasional message loss during a Redis failure is acceptable.
Real-Time Messaging and Notifications
The problem: You need low-latency message delivery for real-time applications — chat, live updates, notifications, collaborative editing, gaming events. Messages should arrive quickly. Durability is secondary to speed.
First choice: NATS (Core, without JetStream). Sub-millisecond publish-subscribe with no persistence overhead. Wildcard subject routing gives you flexible topic hierarchies. The simplicity of the protocol and the performance of the implementation make it ideal for real-time scenarios.
With persistence added: NATS JetStream if you need some messages to be durable (offline notification delivery, message history) alongside the real-time flow.
Enterprise / Multi-protocol: Solace PubSub+. If you need real-time messaging with protocol diversity (MQTT for IoT clients, WebSocket for browsers, JMS for enterprise systems) and enterprise features (message VPNs, quality of service, guaranteed messaging).
IoT-specific: An MQTT broker (Mosquitto, EMQX, HiveMQ, or Mochi for embedded scenarios). MQTT was designed for constrained devices and unreliable networks. If your "real-time messaging" involves IoT devices, MQTT is the protocol and an MQTT broker is the natural choice.
Financial and Ultra-Low-Latency
The problem: You are building trading systems, market data distribution, high-frequency pricing, or any system where microsecond latency matters and you are willing to invest significant engineering effort to achieve it.
First choice: Aeron for transport, Chronicle Queue for inter-process communication on the same machine. These are not general-purpose brokers — they are precision tools for teams that measure latency in microseconds and accept the engineering investment that entails. You are writing Java (or C++), you are tuning kernel parameters, and you are probably bypassing the network stack in creative ways.
If milliseconds (not microseconds) are acceptable: Solace PubSub+ with hardware appliances. Deterministic sub-millisecond latency without the bare-metal engineering effort of Aeron/Chronicle Queue. The trade-off is cost — Solace is enterprise-priced.
If you think you need this category but are not in financial services: You probably do not. Revisit your latency requirements. The engineering investment for microsecond-level messaging is enormous, and the tools in this category are designed for a very specific set of problems.
Event Sourcing and CQRS
The problem: Your architecture stores state as a sequence of events rather than as current state. You need an event store, not a message broker — though you also need the ability to subscribe to events for building read models and triggering side effects.
First choice: EventStoreDB. It was literally built for this. Event streams, subscriptions, projections, and category streams are first-class concepts. If you are doing event sourcing, EventStoreDB is the most natural fit.
Alternative: Kafka as an event store. It works — Kafka's log model is conceptually similar to an event store. But you lose EventStoreDB's optimistic concurrency on individual streams, its projection engine, and its purpose-built tooling. You gain Kafka's ecosystem, throughput, and community.
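"Optimistic concurrency on individual streams" is the feature doing the heavy lifting in that comparison, so here is a hedged in-memory sketch of the idea. The class and method names are invented for illustration and do not mirror EventStoreDB's actual client API.

```python
class ConcurrencyError(Exception):
    pass


class InMemoryEventStore:
    """Toy event store with per-stream optimistic concurrency."""

    def __init__(self):
        self._streams = {}  # stream id -> list of events

    def append(self, stream_id, events, expected_version):
        """Append only if the stream is still at expected_version."""
        stream = self._streams.setdefault(stream_id, [])
        current = len(stream) - 1  # -1 means the stream is empty
        if current != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {current}"
            )
        stream.extend(events)
        return len(stream) - 1  # new version after the append

    def read(self, stream_id):
        return list(self._streams.get(stream_id, []))


store = InMemoryEventStore()
v = store.append("order-7829", [{"type": "OrderPlaced"}], expected_version=-1)
v = store.append("order-7829", [{"type": "PaymentProcessed"}], expected_version=v)

# A stale writer that read the stream before the second append is rejected:
conflict = False
try:
    store.append("order-7829", [{"type": "OrderCancelled"}], expected_version=0)
except ConcurrencyError:
    conflict = True
```

With Kafka as the event store, this per-stream version check is exactly what you have to build yourself (or live without), because a partitioned log has no notion of "this aggregate's stream was at version N when I read it."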
Library-level: Eventuous (.NET) or Marten (.NET) for event sourcing without a dedicated event store, using PostgreSQL or EventStoreDB as the backing store.
Decision Tree #2: By Constraint
Sometimes the use case is flexible but the constraints are not. Start with what you cannot change.
"It Must Be Fully Managed — We Cannot Operate Infrastructure"
Your shortlist is: SQS/SNS, EventBridge, Google Pub/Sub, Azure Event Hubs, Confluent Cloud, Redpanda Cloud, Solace Cloud. Of these, the cloud-provider-native options (SQS, Pub/Sub, Event Hubs) have the deepest integration with their respective ecosystems. Confluent Cloud and Redpanda Cloud give you Kafka-compatible managed services without cloud provider lock-in (though you are locked into the vendor instead).
If you genuinely cannot operate messaging infrastructure — small team, no dedicated platform/SRE function, other priorities — this constraint alone eliminates most self-hosted options. Accept it and choose accordingly.
"It Must Be Open Source — No Vendor Lock-In"
Your shortlist is: Kafka, RabbitMQ, Pulsar, NATS, ActiveMQ/Artemis, ZeroMQ, Redis (with caveats about licensing changes). All are available under permissive or copyleft open-source licences. Note that "open source" and "free" are not the same thing — you are trading vendor lock-in for operational responsibility.
Be honest about why you need open source. If it is philosophical (you believe in open source), the constraint is real and non-negotiable. If it is practical (you want to avoid vendor pricing), factor in the cost of operating the infrastructure yourself. Self-hosted Kafka is "free" in the same way that a free puppy is free.
"We Are Already Running X — What Fits?"
Already running Kafka? Stay on Kafka for new use cases that fit the streaming model. Add RabbitMQ or SQS if you need task queuing with flexible routing — do not force Kafka into a task queue pattern. If Kafka's operational burden is the pain point, evaluate Redpanda as a drop-in replacement.
Already running RabbitMQ? Stay on RabbitMQ for task queuing and simple pub/sub. Add Kafka or Redpanda if you need event streaming with replay, log compaction, or the Kafka Connect ecosystem.
Already running Redis? Redis Streams can handle basic messaging without adding infrastructure. If you outgrow Redis Streams' durability and feature set, the natural graduation path is to RabbitMQ (for task queuing) or Kafka (for event streaming).
Already running on AWS? SQS/SNS and EventBridge are there, they are managed, and they integrate with everything. Use them unless you have a specific need they cannot meet (Kafka-level throughput, replay from arbitrary offsets, cross-cloud portability).
"Our Team Knows Language X"
Java team: Everything is available. Kafka's native client is Java. Pulsar, ActiveMQ, Solace, Chronicle Queue, and Aeron all have first-class Java support.
Go team: NATS (written in Go, official Go client is excellent), Kafka (via confluent-kafka-go or Sarama), RabbitMQ (official Go client). Watermill as an abstraction layer if you want to stay agnostic.
.NET team: RabbitMQ (excellent .NET client), Kafka (confluent-kafka-dotnet), NATS (official .NET client), Eventuous or Marten for event sourcing. Azure Event Hubs if you are in the Microsoft ecosystem.
Python team: Kafka (confluent-kafka-python), RabbitMQ (pika), SQS/SNS (boto3), Google Pub/Sub (official), Redis Streams (redis-py). Python's client ecosystem is broad enough that language is not usually the constraint.
Node.js team: KafkaJS (or confluent-kafka-javascript), amqplib (RabbitMQ), AWS SDK, NATS (official). The JavaScript client ecosystem is adequate for all major brokers.
Decision Tree #3: By Scale
Startup / Small Team (1-10 engineers)
Priorities: Simplicity, low operational overhead, fast time to value.
Recommendations:
- Default choice: A managed service. SQS if you are on AWS. Google Pub/Sub if you are on GCP. Do not operate messaging infrastructure if you can avoid it. Your engineering time is too valuable to spend on broker operations.
- If you must self-host: NATS or Redis Streams. Both are simple to deploy, easy to operate, and capable enough for startup-scale workloads. NATS as a single binary with JetStream enabled covers both pub/sub and persistent messaging.
- Avoid: Kafka, Pulsar, or any broker that requires more than one component to deploy. You do not have the team to operate it, and you do not have the traffic to justify it.
Mid-Size Organisation (10-100 engineers, multiple teams)
Priorities: Multi-team usage, reasonable operational complexity, growing ecosystem needs.
Recommendations:
- Event streaming: Kafka (if you have or can hire Kafka expertise) or a managed Kafka-compatible service (Confluent Cloud, Redpanda Cloud, Amazon MSK). The ecosystem (Connect, Schema Registry) becomes valuable at this scale.
- Task queuing: RabbitMQ or SQS. RabbitMQ if you want self-hosted with full control. SQS if you want zero ops.
- Platform play: If multiple teams need messaging, invest in a platform. Set up shared infrastructure with multi-tenancy, monitoring, and self-service topic/queue creation. Pulsar's native multi-tenancy is appealing here if you can stomach the operational complexity.
Enterprise (100+ engineers, strict compliance, multi-region)
Priorities: Reliability, compliance, multi-tenancy, vendor support, multi-region.
Recommendations:
- Core streaming platform: Kafka (with Confluent support) or Pulsar (with StreamNative support). Enterprise support contracts matter when production incidents have business-level consequences.
- Multi-protocol needs: Solace PubSub+ if you need to bridge JMS, MQTT, AMQP, and REST under a single platform with enterprise-grade support and features.
- Cloud-native: Azure Event Hubs (Kafka-compatible) or Google Pub/Sub with enterprise support. Cloud-managed services with SLAs and compliance certifications reduce audit friction.
- Event sourcing: EventStoreDB with commercial support (Event Store Cloud) for event-sourced domains.
Hyperscale (Thousands of engineers, millions of messages per second, global distribution)
Priorities: Throughput, global distribution, operational maturity, custom tooling.
Recommendations:
- At this scale, you are probably running Kafka or a custom system. You have dedicated teams for messaging infrastructure. You have opinions about partition assignment strategies and consumer rebalancing protocols. You have already read this book and formed your own conclusions.
- The question at hyperscale is not which broker to choose but how to operate it at scale: automated partition rebalancing, tiered storage for cost management, cross-region replication with conflict resolution, and custom monitoring that alerts before users notice problems.
- Consider Redpanda if you want Kafka compatibility with simpler operations. Consider Pulsar if you need native multi-tenancy for a large number of teams. Consider building a platform team whose sole job is messaging infrastructure.
"Use This When" Quick Reference
One-liner recommendations for when each broker is the right call. These are opinionated. They are also, in the author's experience, correct more often than not.
| Broker | Use This When... |
|---|---|
| Kafka | You need durable, high-throughput event streaming with a massive ecosystem and can invest in operations. |
| RabbitMQ | You need flexible message routing, task queuing, or a multi-protocol broker that is well-understood and well-documented. |
| Pulsar | You need Kafka-like streaming with native multi-tenancy and geo-replication, and you can handle the operational complexity. |
| SQS/SNS | You are on AWS and want managed messaging that scales to infinity with zero operational burden. |
| EventBridge | You need event-driven integration between AWS services, SaaS applications, or microservices with rule-based routing. |
| Google Pub/Sub | You are on GCP and want managed messaging with strong ordering support and exactly-once per-subscription semantics. |
| Azure Event Hubs | You are on Azure and want managed event streaming with Kafka wire compatibility. |
| Redis Streams | You already have Redis and need lightweight messaging without deploying additional infrastructure. |
| NATS/JetStream | You want a simple, fast, operationally lightweight messaging system that covers both ephemeral and persistent messaging. |
| ActiveMQ/Artemis | You need JMS compliance, XA transactions, or enterprise Java integration patterns. |
| ZeroMQ | You need the fastest possible inter-process messaging and are willing to build infrastructure semantics yourself. |
| Redpanda | You want Kafka compatibility without the JVM, ZooKeeper, or the operational complexity. |
| Memphis | You want a developer-friendly layer over NATS JetStream and accept the project viability risk. (Evaluate carefully.) |
| Solace PubSub+ | You need enterprise messaging with multi-protocol support, guaranteed delivery, and commercial support. |
| Chronicle Queue | You need microsecond-latency inter-process messaging on a single machine (JVM only). |
| Aeron | You need microsecond-latency network messaging and are building a system where latency justifies engineering complexity. |
| EventStoreDB | You are doing event sourcing and want a database designed for it, not a general-purpose broker repurposed for it. |
| RocketMQ | You need transaction messages baked into the broker and are comfortable with a Java-centric, China-origin ecosystem. |
Migration Paths: Common Broker-to-Broker Migrations
Migrations happen. Requirements change, teams grow, workloads evolve, and the broker that was perfect two years ago may not be perfect today. Here is an honest assessment of common migration paths.
RabbitMQ to Kafka
Why it happens: The team outgrows RabbitMQ's throughput ceiling, needs event replay, or wants the Kafka ecosystem (Connect, Streams, Schema Registry) for data pipeline use cases.
Difficulty: High. The mental model is different. RabbitMQ is push-based with routing logic (exchanges, bindings). Kafka is pull-based with partitioned logs. Consumer patterns change fundamentally. Your RabbitMQ consumers that process and acknowledge individual messages become Kafka consumers that manage offsets and batches. Routing logic that lived in RabbitMQ's exchange topology now lives in your application or in Kafka Streams.
Advice: Migrate use case by use case, not all at once. Run both brokers in parallel. Start with new use cases on Kafka. Migrate existing use cases only when the RabbitMQ limitations are concretely painful, not theoretically concerning.
Kafka to Redpanda
Why it happens: Kafka's operational complexity (ZooKeeper, JVM tuning, partition rebalancing) is a pain point, and the team wants Kafka compatibility with a simpler operational model.
Difficulty: Low to Medium. Redpanda is Kafka API-compatible, so clients require minimal or no changes. The migration is primarily an infrastructure swap: stand up Redpanda, migrate data (MirrorMaker or Redpanda's tools), switch producers and consumers, decommission Kafka. The risk is in the edge cases — Kafka compatibility is very good but not 100%. Test your specific client usage patterns thoroughly.
SQS to RabbitMQ (or vice versa)
Why it happens: Multi-cloud migration, cost optimisation, or the need for features one has that the other lacks (RabbitMQ's exchange routing, SQS's unlimited scalability).
Difficulty: Medium. The concepts map reasonably well (SQS queue ≈ RabbitMQ queue, SNS topic ≈ RabbitMQ fanout exchange). The API is completely different, so all producer and consumer code changes. The operational model changes dramatically — from zero ops (SQS) to self-hosted (RabbitMQ), or vice versa.
ActiveMQ to RabbitMQ or Kafka
Why it happens: ActiveMQ Classic is aging and has known performance and reliability limitations. Teams migrate to RabbitMQ for a modern traditional broker or to Kafka for event streaming.
Difficulty: Medium to High. If the codebase uses JMS extensively, migrating to Kafka means abandoning JMS (unless you use a JMS-to-Kafka bridge, which adds complexity). Migrating to Artemis is lower friction since Artemis supports JMS natively. Migrating to RabbitMQ means switching to AMQP, which is a protocol change but a manageable one.
Any Broker to a Cloud-Managed Service
Why it happens: The team is tired of operating infrastructure and wants to hand the pager to AWS/GCP/Azure.
Difficulty: Varies. Kafka to Amazon MSK or Confluent Cloud is relatively smooth (same protocol, similar operational model). Kafka to SQS/SNS is a fundamental architecture change. RabbitMQ to Amazon MQ (which runs RabbitMQ) is straightforward. RabbitMQ to SQS requires rewriting consumer patterns.
The meta-advice on migration: Every migration takes longer and costs more than you expect. Every migration discovers assumptions that were baked into the old broker's behaviour and never documented. Budget twice the time you think you need, and keep the old broker running in parallel until you are certain the migration is complete. "Certain" means you have run in production for at least two full business cycles (including whatever your peak period is) without issues.
The Multi-Broker Reality
Here is something the vendor marketing will never tell you: most organisations of any significant size end up running more than one messaging system. This is not a failure of architecture. It is a rational response to having multiple, genuinely different messaging needs.
A typical mid-to-large organisation might run:
- Kafka for the core event streaming platform — user activity, change data capture, metrics pipelines.
- SQS for background job queuing in AWS-deployed services — simple, managed, scales without thought.
- Redis Pub/Sub or NATS for real-time notifications between microservices — low-latency, ephemeral messages that do not need durability.
- An MQTT broker for IoT device communication — different protocol, different clients, different network assumptions.
Is this ideal? No. Each system requires its own monitoring, expertise, and operational procedures. But the alternative — forcing all messaging through a single broker — creates a different set of problems: you contort the broker to fit use cases it was not designed for, you create a single point of failure for your entire organisation, and you optimise for none of your use cases because you are trying to optimise for all of them.
The practical approach is to standardise on as few brokers as possible while accepting that "one broker for everything" is a platonic ideal, not a realistic goal. Define clear criteria for when a new messaging system is justified, and resist the urge to adopt a new broker for every new project. But also resist the urge to force Kafka into being a task queue or RabbitMQ into being an event store.
Building Your Evaluation: A Practical Scoring Template
When it is time to make a decision for a specific project, use a structured evaluation rather than a gut feeling. Here is a template.
Step 1: Weight the Dimensions
Not all dimensions matter equally for every project. Assign weights (1-5) based on your specific needs:
| Dimension | Weight (1-5) | Notes |
|---|---|---|
| Throughput | ___ | What volume do you actually need? Not aspirational, actual. |
| Latency (p99) | ___ | What latency is the user experience sensitive to? |
| Durability | ___ | What is the cost of losing a message? Negligible, annoying, or catastrophic? |
| Ordering | ___ | Does your processing logic require ordering? Or can consumers handle out-of-order? |
| Delivery Semantics | ___ | Do you need exactly-once, or can you design idempotent consumers? |
| Ops Complexity | ___ | How much operational capacity does your team have? |
| Ecosystem | ___ | Do you need connectors, stream processing, schema management? |
| Cost | ___ | What is your budget for infrastructure and staffing? |
| Cloud-Native | ___ | Are you deploying to Kubernetes? Do you need a managed service? |
| Multi-Tenancy | ___ | Will multiple teams share the broker? |
Step 2: Score Each Candidate
For each candidate broker, score it 1-5 on each dimension. Multiply by the weight. Sum the weighted scores.
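Steps 1 and 2 can be sketched in a few lines. The weights and scores below are invented purely to show the arithmetic; substitute your own dimensions and judgments.

```python
def weighted_score(weights, scores):
    """Sum of (dimension score x dimension weight) for one candidate."""
    return sum(weights[dim] * scores[dim] for dim in weights)


# Illustrative weights (1-5) for a durability-sensitive, ops-constrained team.
weights = {"throughput": 2, "latency": 1, "durability": 5,
           "ops_complexity": 4, "cost": 3}

# Illustrative per-candidate scores (1-5); not a real benchmark.
candidates = {
    "kafka":    {"throughput": 5, "latency": 4, "durability": 5,
                 "ops_complexity": 2, "cost": 2},
    "sqs":      {"throughput": 4, "latency": 3, "durability": 5,
                 "ops_complexity": 5, "cost": 4},
    "rabbitmq": {"throughput": 3, "latency": 4, "durability": 4,
                 "ops_complexity": 3, "cost": 4},
}

ranking = sorted(candidates,
                 key=lambda name: weighted_score(weights, candidates[name]),
                 reverse=True)
```

With these made-up numbers, the managed option wins because the team weighted operational complexity heavily. Change the weights and the winner changes, which is the point: the spreadsheet encodes your priorities, it does not discover them.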
Step 3: Reality-Check the Winner
The spreadsheet will produce a winner. Before you trust it, ask:
- Does anyone on the team have experience with this broker? If not, add 3-6 months of learning curve to your project timeline.
- Is there a managed offering you can start with? Starting managed and moving self-hosted later is easier than the reverse.
- Have you talked to someone who runs this in production? Not a vendor sales engineer. Not a conference speaker. Someone who has been paged at 3 AM because the broker was broken. Find them. Buy them coffee. Ask what they wish they had known before they started.
- Does the broker's community and governance model give you confidence in its future? If the broker is maintained by one person or funded by one VC, what happens if that person or that funding disappears?
Step 4: Run a Proof of Concept
Do not choose a broker based on a spreadsheet. Choose a shortlist based on a spreadsheet, then build a proof of concept with your actual workload, your actual message sizes, your actual consumer patterns, and your actual infrastructure. A two-week POC will teach you more than two months of reading documentation.
Final Advice: The Broker Matters Less Than You Think
After twenty-seven chapters about messaging systems, this may sound like heresy, but it is the most important thing in this book: the choice of broker matters less than the quality of your architecture and the discipline of your engineering practices.
A well-designed system with clean event schemas, idempotent consumers, proper error handling, comprehensive observability, and tested failure recovery will work well on any competent broker. A poorly designed system with coupled producers, fragile consumers, missing dead-letter handling, and no monitoring will fail on every broker, including the most expensive and sophisticated one on the market.
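Of those practices, idempotent consumption is the one that transfers most directly between brokers, so here is a minimal broker-agnostic sketch. The in-memory set stands in for a durable deduplication store (for example, a database table keyed by event id); the class and event shapes are illustrative.

```python
class IdempotentConsumer:
    """Deduplicate by event id so redelivery is harmless.

    Every at-least-once broker will eventually redeliver a message;
    this wrapper makes the second delivery a no-op.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # a durable store in a real system

    def handle(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False  # duplicate delivery: acknowledge, do nothing
        self._handler(event)
        self._seen.add(event_id)
        return True


charges = []
consumer = IdempotentConsumer(lambda e: charges.append(e["amount"]))

event = {"id": "evt-1", "type": "OrderPlaced", "amount": 149.99}
consumer.handle(event)
consumer.handle(event)  # redelivered by the broker: charged only once
```

In production the side effect and the dedupe record must be committed atomically (same database transaction, or an outbox-style pattern); a crash between the two reopens the duplicate window. The shape of the pattern, though, is identical on Kafka, RabbitMQ, SQS, or anything else.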
The patterns from Part 1 of this book — schema evolution, delivery guarantees, error handling, observability, testing, anti-patterns — are broker-agnostic. They apply whether you are running Kafka, RabbitMQ, SQS, or carrier pigeons with USB drives taped to their legs (which, for the record, has surprisingly high bandwidth though unacceptable latency). Get those patterns right, and your choice of broker becomes a matter of operational preference rather than architectural survival.
This is not an argument for choosing your broker carelessly. A bad fit will cause real pain — the wrong throughput, the wrong latency, the wrong operational model for your team. But it is an argument against the agonising, months-long broker evaluation process that paralyses some teams. If you have narrowed your shortlist using the frameworks in this chapter, the differences between the finalists are smaller than you think. Pick one. Build on it. Invest your energy in the patterns and practices that matter regardless of which broker is underneath.
The brokers will change. New ones will emerge. Old ones will evolve or fade. Your investment in sound event-driven architecture principles will outlast any specific broker choice. That is the real moat.
A Closing Note: The Future of Event-Driven Architecture
Event-driven architecture is not a trend. It is a structural shift in how we build software, driven by the same forces that drove the shift from monoliths to microservices: the need to build systems that are independently deployable, independently scalable, and resilient to partial failure. Events — facts about things that happened — are the natural interface between autonomous systems. This is not going to un-happen.
What is changing, and will continue to change, is how we implement event-driven systems.
The infrastructure is disappearing into the platform. Cloud-managed services (Pub/Sub, EventBridge, Event Hubs) are making the broker itself less visible. You publish events and subscribe to them; the cloud provider handles the rest. This trend will accelerate. Within a few years, operating your own message broker will be a deliberate choice made for specific reasons (cost at extreme scale, compliance, latency requirements, multi-cloud), not the default.
Schemas and contracts are becoming first-class. The era of publishing arbitrary JSON blobs and hoping consumers can parse them is ending. Schema registries, contract testing, and event catalogs are moving from "nice to have" to "table stakes." AsyncAPI is doing for event-driven interfaces what OpenAPI did for REST APIs. This is unambiguously good.
Event sourcing and CQRS are moving from niche to mainstream. The ideas have been around for over a decade, but the tooling is finally catching up. EventStoreDB, Eventuous, Marten, Axon — the ecosystem of purpose-built tools is growing. Event sourcing will not replace CRUD for every use case, but it will become a standard pattern that every senior engineer understands, even if they do not use it daily.
The broker and the database are converging. Kafka already functions as a database for some workloads (log-compacted topics as materialised views). EventStoreDB is a database that publishes events. Redis is a cache, a database, and a message broker depending on which features you use. The lines between "where data lives" and "how data moves" are blurring, and the next generation of systems will blur them further.
Edge and IoT are the next frontier. As computation moves to the edge — devices, gateways, local servers — event-driven patterns need to work across network boundaries that are unreliable, high-latency, and bandwidth-constrained. Protocols like MQTT and Zenoh, and patterns like event mesh and store-and-forward, will become more important as the edge grows.
The fundamentals, though, remain the fundamentals. Events represent facts. Consumers should be idempotent. Schemas should evolve without breaking consumers. Failure is not an edge case; it is a design input. Observability is not optional. These principles were true when this book was started, they are true now, and they will be true when the specific brokers discussed in these pages have been replaced by whatever comes next.
Build systems that are honest about what happened. The rest is plumbing.