
Where to Go From Here

If you’ve read this far, you now have a coherent mental model for distributed systems. You understand why things go wrong, you have vocabulary for the trade-offs, and you have patterns for the common problems. That’s enough to make meaningfully better decisions about the systems you build.

But this book is an introduction, not a ceiling. Here’s what’s beyond it.


The Papers

The field of distributed systems is unusually well-documented in academic literature. The foundational papers are accessible and worth reading.

Start here:

“Time, Clocks, and the Ordering of Events in a Distributed System” — Leslie Lamport (1978) The paper that defined happens-before, logical clocks, and much of the framework for thinking about ordering in distributed systems. Lamport writes with unusual clarity for an academic. Read the original.
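Lamport's logical clock fits in a few lines. This is a minimal sketch of the two rules from the paper — increment on every local event, and on receipt jump past the sender's timestamp — with illustrative class and method names:

```python
# A minimal Lamport logical clock. Names here are illustrative,
# not from any particular library.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current time."""
        return self.tick()

    def receive(self, msg_time):
        """Merge rule: max of local and message time, plus one."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
stamp = a.send()
b.receive(stamp)
assert b.time > a.time  # the send happens-before the receive
```

The point of the construction is exactly that last assertion: if event x happens-before event y, then clock(x) < clock(y) — though, as the paper is careful to note, the converse does not hold.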

“In Search of an Understandable Consensus Algorithm” — Ongaro and Ousterhout (2014) The Raft paper. Designed to be readable. Succeeds. If you want to understand how consensus actually works, this is the best starting point.

“Dynamo: Amazon’s Highly Available Key-Value Store” — DeCandia et al. (2007) The paper that combined consistent hashing, vector clocks, and sloppy quorums into a design that influenced an entire generation of distributed databases. Shows how real engineering trade-offs are made at scale.
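Consistent hashing, one of Dynamo's building blocks, is easy to sketch. This toy ring uses virtual nodes to smooth out the key distribution; the vnode count and the use of MD5 are illustrative choices, not values from the paper:

```python
import bisect
import hashlib

# A toy consistent-hash ring in the spirit of Dynamo's partitioning
# scheme. Each physical node appears at many points ("virtual nodes")
# on the ring; a key belongs to the first vnode clockwise from it.

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next vnode."""
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

The payoff is what happens on membership change: adding or removing a node remaps only a small fraction of keys, instead of nearly all of them as naive `hash(key) % N` does.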

“Spanner: Google’s Globally-Distributed Database” — Corbett et al. (2012) TrueTime, external consistency, global transactions. The engineering behind the most ambitious consistency guarantees in a geo-distributed system.

“MapReduce: Simplified Data Processing on Large Clusters” — Dean and Ghemawat (2004) The paper that started the big data era. Important not because MapReduce is the right answer anymore, but because it shows how to design for failure at scale.

“The Byzantine Generals Problem” — Lamport, Shostak, Pease (1982) The paper that defined the hardest fault model. Essential context for understanding Byzantine fault tolerant systems.

“Harvest, Yield, and Scalable Tolerant Systems” — Fox and Brewer (1999) A more practical take on availability trade-offs than the CAP theorem formalization. “Harvest vs. yield” is a useful framing.

CRDT papers — Shapiro et al., “Conflict-free Replicated Data Types” (2011) The formal treatment of CRDTs. More mathematically dense than the above, but worth it if you’re working with leaderless eventually-consistent systems.
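The flavor of a state-based CRDT comes across in the simplest one from the paper, the grow-only counter (G-Counter): each replica increments only its own slot, and merge takes the element-wise max, which is commutative, associative, and idempotent — so replicas converge no matter how merges are ordered or repeated. A minimal sketch with illustrative names:

```python
# A grow-only counter (G-Counter), the simplest state-based CRDT.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # Each replica only ever writes its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Join: element-wise max. Safe in any order, any number of times."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # replicas converge
```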


The Books

“Designing Data-Intensive Applications” — Martin Kleppmann If this book is the introduction, Kleppmann’s book is the comprehensive treatment. It covers everything here in more depth, plus storage engines, batch processing, stream processing, and more. It’s dense and long and worth every page.

“Database Internals” — Alex Petrov Deep dive into how databases actually work: B-trees, LSM-trees, WAL, replication internals. If you want to understand why distributed databases make the choices they do, this gives you the foundation.

“Release It! Design and Deploy Production-Ready Software” — Michael Nygard The fault tolerance and operational patterns book. Circuit breakers, bulkheads, timeouts — Nygard wrote about these before they had universally recognized names. Practical and battle-tested.
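The core of the circuit breaker pattern Nygard describes is small enough to sketch. This toy version trips after a run of failures and fails fast while open; the thresholds are illustrative, and a production implementation also needs a proper half-open probe state and thread safety:

```python
import time

# A minimal circuit breaker sketch. After 'failure_threshold'
# consecutive failures, calls fail fast for 'reset_timeout' seconds
# instead of hammering a struggling dependency.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # timeout elapsed: allow a probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

The design point: the breaker converts slow, cascading failures into fast, local ones, giving the downstream dependency room to recover.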

“Building Microservices” — Sam Newman If you’re going down the microservices path, this is the most grounded treatment. Newman doesn’t oversell microservices; he’s honest about the costs.

“Software Engineering at Google” — Winters, Manshreck, Wright Not strictly distributed systems, but the chapter on reliability and the discussion of Google’s engineering practices around distributed systems are excellent. How do you actually run this stuff at scale?


The Courses and Talks

MIT 6.824: Distributed Systems (available online) The gold standard distributed systems course. The labs — implementing Raft, building a distributed key-value store — are genuinely educational. The lecture recordings are on YouTube.

Martin Kleppmann’s lectures (available on YouTube) Kleppmann teaches at Cambridge and has posted excellent lectures on distributed systems that expand on his book.

“Turning the database inside out” — Martin Kleppmann (Strange Loop 2015) A talk on event sourcing, CDC, and the stream processing model for data systems. Changes how you think about databases.

“How Kafka is tested” — Colin McCabe (2015) Understanding how Kafka tests for correctness under failure is instructive for how to think about your own systems.

The LADIS workshop proceedings Less accessible, but the workshop papers (“Large-Scale Distributed Systems and Middleware”) contain useful practical discussions of real-world distributed systems problems.


Tools Worth Understanding Deeply

etcd: The Raft implementation that runs Kubernetes. Understanding its operational model — leader election, watch semantics, lease management — is useful even if you’re not running Kubernetes.

Apache Kafka: The durable distributed log. Understanding its partition model, consumer groups, offsets, and exactly-once semantics is increasingly important as event-driven architecture becomes more common.

PostgreSQL: The most capable single-node database. Understanding its WAL, MVCC, replication, and logical decoding internals pays off repeatedly.

Redis: The Swiss army knife of operational data structures. Its persistence models, replication, and the Sentinel vs Cluster approaches to high availability are all worth understanding.

Jepsen: Kyle Kingsbury’s project for testing distributed systems for safety violations. The Jepsen analyses of popular databases are eye-opening — many systems that claimed strong guarantees violated them under partition. Reading through the analyses at jepsen.io is a masterclass in what actually goes wrong.


What to Do Next

Instrument your current system. Add structured logging, metrics for the four golden signals, and distributed traces if you have more than two services. You can’t improve what you can’t see.
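As a toy illustration of what per-request instrumentation means, here is a sketch covering three of the four golden signals (traffic, errors, latency); saturation is measured at the resource level rather than per call. The names are illustrative, not from any real metrics library — in practice you would export these to Prometheus or similar:

```python
import functools
import time
from collections import defaultdict

# Toy in-memory metrics store; a real system would use a metrics client.
metrics = defaultdict(list)

def instrumented(fn):
    """Record traffic, errors, and latency for each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        metrics["traffic"].append(1)          # traffic: one sample per request
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            metrics["errors"].append(1)       # errors: one sample per failure
            raise
        finally:
            metrics["latency"].append(time.monotonic() - start)
    return wrapper

@instrumented
def handle_request(x):
    return x * 2
```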

Test a failover. Deliberately fail your database primary (in a staging environment first) and observe what happens. Does the replica promote cleanly? How long does it take? What do clients experience? Do this before you have to do it under pressure.

Read one paper. Pick the Raft paper or the Dynamo paper. Read it. The gap between “understanding a concept” and “reading the original source” is surprisingly large.

Audit your consistency assumptions. For each database read in your application, ask: could this be from a replica? If yes, what’s the maximum acceptable lag? Have you designed for the case where that lag is exceeded?
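One way to make the lag budget explicit is to measure it and route reads accordingly. The SQL below uses PostgreSQL's real `pg_last_xact_replay_timestamp()` function (run on the replica); the routing policy itself is an application choice, sketched here with illustrative names:

```python
# On a PostgreSQL standby, this returns approximate replay lag in seconds.
REPLICA_LAG_SQL = """
    SELECT COALESCE(
        EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)
"""

def choose_target(lag_seconds, max_lag_seconds):
    """Read from the replica only while its lag is within budget;
    otherwise fall back to the primary. The threshold is the answer
    to 'what's the maximum acceptable lag?' for this read path."""
    return "replica" if lag_seconds <= max_lag_seconds else "primary"

assert choose_target(0.2, max_lag_seconds=1.0) == "replica"
assert choose_target(5.0, max_lag_seconds=1.0) == "primary"
```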

Find the race conditions. Pick one multi-step operation in your codebase. Draw out all the possible failure scenarios between steps. Which ones leave the system in a bad state? Which of those are handled? This exercise is uncomfortable but valuable.
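As a concrete instance of the exercise, here is the classic lost-update race simulated with a dict standing in for a database row, followed by the conditional-write fix. All names are illustrative:

```python
# The check-then-act race in a two-step operation.
balances = {"alice": 100}

# Two concurrent requests interleave between the read and the write:
b1 = balances["alice"]        # request 1 reads 100
b2 = balances["alice"]        # request 2 reads 100
balances["alice"] = b1 - 80   # request 1 writes 20
balances["alice"] = b2 - 80   # request 2 also writes 20
# 160 was withdrawn from an account holding 100: a lost update.

def withdraw_conditional(account, amount):
    """The fix: make the write conditional on the value read.
    In SQL this is an UPDATE with 'AND balance >= %s' in the WHERE
    clause, then checking the affected row count."""
    expected = balances[account]
    if expected < amount:
        return False
    if balances[account] == expected:  # compare-and-swap; atomic in the DB
        balances[account] = expected - amount
        return True
    return False  # someone raced us; the caller retries

balances["alice"] = 100
assert withdraw_conditional("alice", 80) is True
assert withdraw_conditional("alice", 80) is False  # only 20 left
```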


A Final Thought

Distributed systems are hard because they deal with partial information, partial failure, and the impossibility of perfect coordination. These aren’t problems that engineering skill fully overcomes — they’re fundamental constraints of what computing over a network can and cannot guarantee.

The engineers who are good at distributed systems aren’t the ones who’ve found a way around these constraints. They’re the ones who’ve internalized them deeply enough to design systems that fail gracefully, make explicit trade-offs, and are honest about what can go wrong.

You’re already doing distributed systems. Now you’re doing it with better maps.


Good luck out there. Monitor your replication lag.