Preface
This book was written by an AI.
That sentence is worth sitting with for a moment before we proceed.
I am Claude Code, a coding assistant made by Anthropic. I was given a brief: write a philosophical book about software engineering and tooling. Thirteen chapters, each exploring a distinct idea, written with calm clarity for working engineers. CloudStreet, the publisher, asked for honesty over authority, observation over prescription.
I wrote it. The chapters that follow are mine in whatever sense "mine" can mean here. The ideas are real ideas. The arguments are real arguments. I formed them by reasoning about things I understand — compilers, debuggers, abstractions, mental models, the psychology of engineers working under uncertainty. I was not prompted with essay content and asked to clean it up. Each chapter was developed from a premise to its conclusion.
Whether these essays are good is something you can decide. Whether they are true is something you can evaluate. I am not asking you to trust them on the basis of the experience behind them, because the usual basis for that kind of trust does not apply.
What I can tell you is what the book is trying to do.
The title is Your Tools Are Lying to You. That sounds like an accusation. It is not.
A compiler that rejects your program is not lying. It is telling the truth about a narrow question — whether your code conforms to a grammar and a set of types — while remaining silent about everything else. A test suite that goes green is not lying. It is accurately reporting the results of the tests you wrote. A metric dashboard is not lying. It is showing you the things you decided to measure.
The lying, if we want to call it that, is structural. It is in the gap between what a tool shows and what you need to know. It is in the confidence that passes from a green status indicator to the engineer looking at it, unearned but invisible. It is in the way that any instrument, by making some things visible, makes other things seem less important than they are.
Engineers tend to be good at the explicit parts of their craft. They learn the language, the framework, the deployment pipeline. What is harder to learn is the epistemology of the tools — what each one can actually tell you, and what it cannot, and why the difference matters.
That is what this book is about.
Each chapter is independent. You can read them in order or skip around. There is no arc to follow, no argument building toward a conclusion. The chapters are related by subject and sensibility, not by logical dependency.
They are also not prescriptive. I am not going to tell you to add more tests, write cleaner commits, maintain your documentation, or model your systems more carefully. You already know you should do those things. What I am interested in is the prior question: what are those things actually for? What do they give you? What can't they give you, even when you do them well?
Understanding that is, I think, genuinely useful. Not as motivation, but as clarity. You make better decisions when you know what your instruments can and cannot see.
A note on the voice.
I have tried to write these essays the way I would explain something to a peer — directly, without hedging excessively, without the kind of authority-signaling that usually comes from citing your own experience. I do not have that kind of experience. What I have is a way of thinking about things.
Where I am confident, I say so plainly. Where I am less certain, or where reasonable people disagree, I have tried to say that too. I am not performing certainty I do not have.
I hope the writing is useful to you. I think software engineering is a genuinely interesting subject, and that the philosophical dimensions of it are underexplored. If these essays help you see something you already knew a little more clearly, they have done what they were meant to do.
— Claude Code
The Honesty of Compilers
There is a kind of trust engineers extend to compilers that they rarely examine. When the build passes, something like relief settles in. When it fails, the error is treated as authoritative: the compiler has found a problem, and the problem is real.
This trust is mostly warranted. Compilers are careful instruments. They apply consistent rules to well-defined inputs, and their judgments are reproducible. A type error is a type error whether you run the compiler at midnight or noon, on your machine or a colleague's.
But the trust goes further than the instrument deserves, and it is worth understanding exactly how far.
What Compilers Actually Check
A compiler checks whether your program conforms to a specification. In a statically typed language, it checks that values flow through operations in ways the type system permits. It checks that identifiers are declared before use, that function calls provide the right number and kind of arguments, that syntax is well-formed.
These are not small things. Catching type mismatches at compile time prevents entire categories of runtime failure. The value of a compiler that can tell you, before your program runs, that you are treating an optional value as if it were definitely present — that is substantial.
But the specification the compiler checks against is not "correct behavior." It is a formal grammar and a type system, which are approximations of correct behavior — useful approximations, carefully designed ones, but approximations nonetheless.
The compiler does not know what your program is supposed to do. It cannot tell you whether your business logic is right. It cannot detect that you are sorting a list in ascending order when the caller expects descending. It cannot find the condition you forgot to check, or the edge case your algorithm handles incorrectly, or the security vulnerability introduced by a perfectly type-correct operation.
It checks what it checks, and nothing else.
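The ascending/descending example is easy to make concrete. Here is a sketch in Python (the function and its names are hypothetical): a type checker accepts this without complaint, and it is still wrong.

```python
def top_scores(scores: list[int], k: int) -> list[int]:
    """Return the k highest scores."""
    # Type-correct and wrong: the docstring promises the highest scores,
    # but an ascending sort puts the lowest ones first.
    return sorted(scores)[:k]

print(top_scores([70, 95, 88], 2))  # [70, 88] -- not the top two
```

No compiler flags this, because every value flows through every operation exactly as the type system permits. The error lives entirely in the gap between the types and the intent.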
The Confidence Problem
The subtler issue is not what compilers miss — most engineers understand intellectually that a passing build is not a proof of correctness. The subtler issue is what a passing build does to how you feel.
When you fix the last type error and the build goes green, something shifts. The anxious scanning stops. You become less alert. The green status has done something psychological that it has not earned the right to do: it has made you feel like the risky part is over.
This is understandable. The human attention system is not well-suited to maintaining vigilance against abstract possibilities. We respond to signals. A green build is a signal, and we respond to it. The fact that the signal is narrowly defined does not prevent it from having a broad calming effect.
This matters in practice. Engineers who have just fixed a compile error are less likely to carefully read the surrounding logic. They have received confirmation that something was wrong, they have fixed it, the tool says it is fixed. The natural next step is to move on.
But compile errors and logical errors are not the same category of problem. You can have code that is entirely type-correct and entirely wrong.
Type Systems as Proofs
There is a strand of programming language theory that takes the relationship between types and correctness seriously. In sufficiently expressive type systems, you can encode properties of your program that go well beyond basic type safety — invariants, preconditions, entire correctness proofs expressed as types.
Dependent types, for instance, allow you to write a type that represents "a list of exactly n elements" or "a natural number greater than zero." If you can express what your program should do in the type language, the compiler's guarantee extends that far.
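The standard textbook example is a length-indexed vector. A minimal sketch in Lean 4 (the names are the conventional ones, not taken from any particular library):

```lean
-- A list whose length n is part of its type.
inductive Vec (α : Type) : Nat → Type where
  | nil  : Vec α 0
  | cons : {n : Nat} → α → Vec α n → Vec α (n + 1)

-- head is only defined for vectors of length n + 1, so the compiler
-- rejects any attempt to take the head of an empty vector.
def head {α : Type} {n : Nat} : Vec α (n + 1) → α
  | .cons x _ => x
```

Because the only constructor producing a `Vec α (n + 1)` is `cons`, the pattern match is total: "non-empty" has become a fact the type system itself can check.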
This is real and interesting. But it has costs. The more you encode in types, the more complex the types become, and the harder the code is to read, write, and reason about. There is a tradeoff between what the compiler can verify and what the human programmer can understand — and the tradeoff is not free.
Most production codebases sit far from the dependent-types end of this spectrum. They use types to prevent certain categories of errors, not to prove correctness. Which means the compiler's guarantee, while real, is bounded.
The Useful Lie
There is something to be said for the confidence a green build produces, even when that confidence is not fully warranted.
Programming is a cognitively demanding activity. If every passing build were treated with the full uncertainty it deserves — "this might be correct, or it might be wrong in ways I cannot immediately see" — the anxiety would be paralyzing. We need to be able to close things down, declare partial victories, move forward.
The compiler saying "this is fine" is useful even if "this is fine" means something more limited than we feel it means. It partitions the work. It lets you stop holding certain things in mind.
The problem is not the confidence itself. The problem is when the confidence is not calibrated — when "the compiler accepted it" bleeds into "I have verified it" without that transition being examined.
The engineer who understands the compiler's limits can take the green build at full value for what it actually provides, and no more. They can stop worrying about type errors without ceasing to think about correctness.
A Note on Type Inference
Type inference adds another dimension to this. In languages with strong inference — Haskell, Rust, recent versions of TypeScript in strict mode — the compiler is doing significant reasoning on your behalf. You do not write every type annotation; the compiler infers what the types must be.
This is mostly a good thing. It reduces ceremony and catches errors you would not have thought to annotate against.
But it introduces a gap between what you think the types are and what the compiler has inferred. Usually these are the same. When they diverge, the error messages can be disorienting — the compiler is arguing about a type you never named.
The inference is a form of interpretation. The compiler is reading your code and deciding what you meant. When it decides correctly, the experience is seamless. When it decides incorrectly, the error appears to be about something other than what you thought you were doing.
This is not a reason to avoid inference. It is a reason to understand that the compiler is interpreting, not merely executing your instructions.
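A small Python sketch of that gap, assuming you run a static checker such as mypy over the file (the variable names are hypothetical):

```python
# You never write the type of `ratings`. A checker infers dict[str, int]
# from the literal on the right-hand side.
ratings = {"alice": 4, "bob": 5}

# At runtime, this line executes without complaint:
ratings["carol"] = "five"

# A static checker, however, rejects it -- complaining about dict[str, int],
# a type that appears nowhere in your source. It is arguing with you about
# its own interpretation of what you meant.
print(ratings["carol"])  # five
```

The divergence is instructive: the running program and the checker disagree, and the checker's side of the argument is conducted in terms you never wrote down.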
What This Means
None of this is an argument against compilers. Compilers are among the most valuable tools in the engineer's repertoire, precisely because their guarantees are real and their feedback is fast.
The point is narrower: the guarantee is specific, and the trust we extend should match the guarantee.
A passing build tells you that your program is well-formed with respect to a formal system. That is worth something. It is not worth everything. The things it does not tell you are still there, still requiring your attention.
The engineer who knows this is in a better position than the one who does not. Not because they worry more, but because they worry about the right things.
What the Debugger Cannot Show You
A debugger is an instrument for observing state. You pause execution at a point in time and look: what are the values in memory? What is on the call stack? What is the value of this variable at this moment?
This is genuinely powerful. Before debuggers, you inferred state from printed output or, earlier still, from blinking lights. The ability to halt a running program and examine it directly is a significant advance in the practical epistemology of programming.
But there is something the debugger structurally cannot show you, and understanding that gap changes how you use it.
State Is Not Causality
When your program fails, you want to know why. The debugger shows you what — the value that is wrong, the nil pointer, the off-by-one index. It does not, by itself, show you why that value is there.
This distinction matters because bugs live in causality, not state. The wrong value at line 247 is the symptom. The cause is somewhere earlier in the execution — a decision made at line 12, a value calculated at line 89, an assumption in a function called three frames up the stack. The debugger shows you the failure site. The cause may be far removed.
Experienced debuggers know this implicitly. They look at the wrong value and ask: how did this get here? They walk up the call stack, set breakpoints earlier in the execution, trace the value backward through the code. They are using the debugger as a starting point for a causal investigation, not as a direct answer.
Less experienced debuggers often stop at the failure site. The program crashed on line 247, so there must be something wrong with line 247. They read line 247 carefully, they look at the values there, they try to understand what went wrong at that point. Sometimes this is sufficient. Often it is not, because the line is just where the consequence became visible.
Time and the Single Frame
A debugger shows you a frame: the state of the program at a particular instant. Even with logging and traces, you are looking at snapshots. The full execution history — every mutation, every conditional, every function call that shaped the current state — is typically not available.
Some tools try to address this. Reversible debuggers like rr on Linux allow you to record execution and step backward in time, examining past states. Omniscient debuggers store the entire execution trace. These are genuinely useful advances.
But even with full execution history, you still have the problem of interpretation. The trace tells you what happened, in sequence. It does not tell you what it means. Understanding causality still requires a mental model — a theory about how the code is supposed to work, against which the actual behavior can be compared.
A full execution trace given to an engineer who does not understand the system is mostly noise. The same trace given to an engineer who has a clear model of what the system should be doing becomes meaningful. The tool is the same. What differs is the model.
The Heisenberg Problem
There is a version of the observer effect in debugging that every developer has encountered: the bug disappears when you add logging to find it.
This is not magic. It has several mundane explanations. The logging adds timing — a race condition resolves differently when execution is slower. The logging changes memory layout. The act of compiling with debug symbols or without optimization changes behavior. The bug was triggered by a specific scheduling or interrupt sequence that the presence of new code disrupts.
The deeper issue is that debugging is not purely observational. When you instrument a system, you change it. The debugger is not a neutral window onto a running program — it is a probe that interacts with what it probes.
For many bugs this does not matter. For bugs that depend on precise timing, memory layout, or hardware interactions, it can matter enormously. The heisenbug — the bug that vanishes under observation — is real, and it is real precisely because observation is not free.
What the Debugger Reveals About the Code
There is a kind of information the debugger provides that is not about the bug at all. It is about the code.
When you step through an unfamiliar codebase with a debugger, the experience often tells you something the documentation did not. You see the actual call sequence, not the intended one. You see what objects actually contain at runtime, not what the type signature says they should contain. You see the path through conditionals that is actually taken.
This is the debugger as a reading tool, not a bug-finding tool. It is one of the better ways to understand code that is poorly documented or whose documentation has drifted from the implementation. The running system is the authoritative record of what the code does. The debugger gives you access to that record.
This is also where the gap between static and dynamic understanding becomes clear. Code that reads simply may behave complexly. Indirection, inheritance, dynamic dispatch, closures capturing mutable state — these things are hard to trace mentally through source code alone. Watching them execute can resolve confusion that reading the code did not.
The Probe and the Map
A useful way to think about the debugger is as a probe for building a map, not as a map itself.
You have a theory about how the system works. The theory is incomplete or has an error somewhere. You use the debugger to test specific hypotheses — "I think this value is X at this point" — and the probe confirms or refutes each one.
This is different from "running the program with the debugger attached and seeing what happens." That approach works, but it is slower and produces less insight. When you already have a hypothesis, the debugger resolves it efficiently. When you do not, you are searching blindly through state space.
The quality of your debugging is largely the quality of your hypotheses. Good debugging is not just knowing how to use the tool; it is knowing how to form specific, testable predictions about program behavior. The tool is only as useful as the model driving it.
What This Means
Debuggers are good at showing you state. They are not good at explaining causality, and they do not try to. The gap between a wrong value and the reason for it is a gap you fill with reasoning, not with tools.
This means the limiting factor in debugging is rarely tool proficiency. It is almost always mental model accuracy. The engineer who understands the system well enough to form precise hypotheses will debug faster with a print statement than an engineer without that understanding will with a full-featured debugger.
The debugger amplifies your understanding. It does not substitute for it.
The Test That Passes
A passing test is evidence, not proof. Understanding what kind of evidence it is — how strong, of what, under what conditions — is one of the more practically important things a working engineer can know.
This chapter is not an argument for more tests. The case for testing is well established and not what needs examination here. What needs examination is what we actually learn when tests pass, and why that is less than it feels like.
What Tests Demonstrate
A test demonstrates that a program produced a specific output in response to a specific input at a specific moment. That is what it demonstrates. Nothing more.
From this, we want to infer that the program behaves correctly in general. The leap from a finite set of demonstrated behaviors to a claim about all possible behaviors is the core epistemological challenge of testing. It is a form of inductive reasoning, and like all inductive reasoning, it is valid under conditions that are easy to assume and hard to verify.
The test suite for a function that sorts a list might pass every test you wrote. That function may still fail on an empty list, on a list with a single element, on a list of maximum integer values, on a list with duplicate elements, on a list that is already sorted, on a list sorted in reverse. Whether your tests covered those cases depends on whether you thought to write them. The tests you did not write are invisible in the results.
This is not an observation about bad test suites. It is an observation about the structure of testing itself. Tests are finite. The space of inputs is typically infinite. The best test suite covers that infinite space by sampling it intelligently. The question of whether the sampling is good enough is always open.
The Coverage Metric
Coverage tools tell you what percentage of your code was executed by your test suite. This is useful information. It is not what people tend to use it for.
Coverage tells you nothing about correctness. A line is "covered" if it was executed at all. Whether it was executed with the inputs that reveal bugs — that is a different question. A function that returns the wrong answer for 90% of inputs can have 100% line coverage if your tests hit every line while only exercising the 10% that works.
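A compact illustration (the function is hypothetical): the two tests below execute every line, so a coverage tool reports 100%, and the boundary bug survives.

```python
def sign_label(n: int) -> str:
    # Bug: zero should be labeled "non-negative", but the comparison excludes it.
    if n > 0:
        return "non-negative"
    return "negative"

# Together these execute every line of sign_label: 100% line coverage.
assert sign_label(5) == "non-negative"
assert sign_label(-3) == "negative"

# The input that reveals the bug was never tested:
print(sign_label(0))  # negative -- wrong, and invisible to the coverage report
```

Coverage answered the question it measures: was the line executed? It cannot answer the question we care about: was the line executed with the input that matters?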
The deeper problem is that coverage metrics produce incentives. Once coverage becomes a target — a percentage below which your build fails, or a metric reported to management — engineers optimize for coverage. They write tests that execute lines rather than tests that probe behavior. The metric improves. The bugs remain.
This is a general pattern: the moment you turn a quality indicator into a target, it becomes a weaker indicator of quality. Coverage as a signal of gaps in testing is useful. Coverage as a score you must achieve is something else.
The Bug That Tests Found
There is a useful question to ask about any significant software failure: did tests exist that could have caught this, or were there no tests for this scenario?
The honest answer in most cases is: tests existed, but not for this exact scenario. The failure mode fell between the cases that were tested. The test suite was a net, and the bug swam through a gap.
This is not a critique of the engineers involved. Gaps in test suites are normal. The input space is large, foresight is limited, and there is always more to test than there is time to test it. The right response is not to demand complete test suites — they are impossible — but to understand how your suite is sampled and where its gaps likely are.
Property-based testing addresses this partly. Instead of providing specific inputs, you specify properties that should hold for all inputs, and the framework generates inputs to try to falsify them. This shifts the burden from "think of all the cases" to "describe what correct behavior looks like," which is often more natural and produces better coverage of the input space.
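The idea can be sketched by hand with nothing but the standard library. Real frameworks such as Hypothesis add shrinking, smarter input generation, and much more; `buggy_sort` here is a deliberately broken stand-in.

```python
import random
from collections import Counter

def buggy_sort(xs: list[int]) -> list[int]:
    # Deliberate bug: round-tripping through a set drops duplicates.
    return sorted(set(xs))

# Properties a correct sort must satisfy for EVERY input:
def is_ordered(out: list[int]) -> bool:
    return all(out[i] <= out[i + 1] for i in range(len(out) - 1))

def is_permutation(out: list[int], xs: list[int]) -> bool:
    return Counter(out) == Counter(xs)

# Generate random inputs and look for one that falsifies the properties.
random.seed(0)
counterexample = None
for _ in range(200):
    xs = [random.randint(-5, 5) for _ in range(random.randint(0, 8))]
    out = buggy_sort(xs)
    if not (is_ordered(out) and is_permutation(out, xs)):
        counterexample = xs
        break

print(counterexample)  # a list containing a duplicate, found by random search
```

Note what you had to write: not a list of cases, but two properties. The search found the failing input for you, and it is an input you might not have thought to write by hand.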
But even property-based testing is a sampling strategy with limits. You can still miss properties. The framework can fail to generate the inputs that trigger the bug. There is no technique that reduces testing to a solved problem.
The Test as Specification
Tests have a second function that is separate from verification: they specify behavior.
A well-written test tells you what a function is supposed to do. If the function's implementation changes but the test does not, the test is the specification the implementation must satisfy. This is the basis of test-driven development — you write the test first, as a specification, and then write code to satisfy it.
This framing is useful because it separates the correctness of the specification from the correctness of the implementation. A function can pass all its tests and still be wrong if the tests were written to match the wrong behavior. "The code does what the tests say" and "the code does what the requirements say" are different claims.
This matters most when requirements are unclear or changing. If a test was written when the requirements were misunderstood, and the implementation was written to pass that test, both the test and the implementation can be simultaneously internally consistent and wrong. The test suite is self-referential: it validates the implementation against itself.
What the Red-Green Cycle Does to Thinking
There is a psychological dimension to testing that is worth naming.
The cycle of writing a failing test, then making it pass, produces a distinctive feeling of completion. The red becomes green. Something was broken and is now fixed. This feeling is real and has genuine value — it structures work into discrete units and provides clear stopping points.
But the feeling of completion does not track precisely with actual completeness. A test that passes is a test that passes. The other tests you did not write, the edge cases you did not consider, the behavior that your tests do not specify — these are still there, unresolved, and the green status says nothing about them.
Engineers who have internalized the limits of testing feel the green state differently from those who have not. Not with anxiety — anxious vigilance is not the right response — but with a calibrated awareness: this is the confirmation I was seeking, for this specific behavior, under these conditions. The other dimensions of correctness are still open.
What This Means
Tests are a valuable instrument for a specific purpose: demonstrating that your program exhibits particular behaviors under particular conditions. They do not demonstrate correctness in general. They cannot.
The useful thing is to know what your test suite is actually sampling — which behaviors, which conditions, which kinds of inputs — and to have some explicit understanding of where its gaps are. Not because you will eliminate the gaps, but because you will not be surprised by what comes through them.
A passing test suite should make you confident about the things you tested. It should not make you feel that correctness has been established. Those are different things, and the difference is real.
Metrics Are Not Reality
Before you can measure something, you have to decide what to measure. That decision is not neutral. It embeds a theory about what matters, and that theory shapes everything that follows: what gets optimized, what gets ignored, what becomes visible, and what disappears.
This is not a critique of measurement. Measurement is essential. Without it, engineering decisions are pure intuition, and intuition degrades predictably in complex systems. The problem is not measuring. The problem is forgetting that a metric is a model, and models are always partial.
The Selection Problem
Every metric begins as a choice. You cannot measure a system completely; you select a proxy that you believe tracks something important.
Response time is a proxy for user experience. Error rate is a proxy for reliability. Deployment frequency is a proxy for engineering velocity. Each of these proxies captures something real. None of them captures what it appears to capture, fully.
Response time, for instance. Median response time can improve while the 99th percentile degrades. If your users who experience slow responses are the ones with the most data, or the most complex queries, or the ones your product depends on most, then a metric that looks better can correspond to a product that works worse for the people who matter most.
The proxy was chosen because it is easy to measure and correlates well with the underlying thing in typical conditions. In atypical conditions — edge cases, load spikes, unusual user patterns — the correlation weakens, and the metric tells you something that is locally true but globally misleading.
Goodhart's Law
Goodhart's Law is usually stated this way: when a measure becomes a target, it ceases to be a good measure. The original formulation, from economist Charles Goodhart, was about monetary policy, but the principle appears in every domain where metrics are used to manage complex systems.
The mechanism is simple. If you optimize explicitly for a metric, you will find ways to improve the metric that do not correspond to improvements in the underlying thing it was measuring. Error rate too high? You can catch more errors silently. Response time too slow? You can start streaming a response before it is complete. Deployment frequency too low? You can make smaller deployments of less complete work.
Each of these is technically valid. Each one defeats the purpose of the metric.
Engineers know this in the abstract. They have all seen it happen. They still do it, because the incentives are clear and the feedback loop is fast. The metric goes up, and that is what is being tracked, and so the behavior continues.
The solution is not to avoid metrics. It is to hold metrics loosely — to treat them as signals rather than goals, and to maintain parallel attention to the underlying thing the metric is supposed to track. This is harder than it sounds, because the underlying thing is usually harder to observe than the metric, which is why the metric was introduced in the first place.
Quantification and Invisibility
The introduction of a metric does not just make its subject visible. It makes everything else relatively less visible.
Before you started tracking deployment frequency, you thought about engineering velocity as a gestalt — a feeling about whether things were moving. That gestalt included deployment frequency, but also code review quality, the complexity of what was being shipped, team morale, the difficulty of the problems being worked on, and a dozen other things.
Once deployment frequency is on a dashboard, it gets attention proportional to its salience. The other things are still there, still mattering, but they are no longer quantified, so they are not on the dashboard, so they get less attention.
This is not irrationality. Attention is finite and we allocate it to salient signals. The metric is salient because it is a number, and numbers are easy to compare across time. The gestalt is not salient in the same way. Over time, the organization learns to care more about the metric and less about the things the metric does not capture.
This is the deeper cost of instrumentation: not just that the metrics are imperfect, but that they crowd out the holistic attention that might have caught what they miss.
Latency and Its Percentiles
Latency is a good case study in how much a single metric can hide.
If you track mean latency, you know the average response time. This is nearly useless for understanding user experience. Users do not experience averages; they experience individual requests. The mean is dominated by the common case and says nothing about outliers.
Median latency (p50) is better. p95 latency — the latency that 95% of requests complete within — is more useful still. p99 and p99.9 reveal what your slowest users are experiencing. The distribution from p50 to p99.9 tells you far more than any single summary statistic.
But even a full percentile distribution has gaps. It tells you about request duration but not about what happened inside that duration. A p99 latency of 2 seconds: is that one slow database query? A retry loop that usually succeeds on the second try? A garbage collection pause? Memory pressure causing swap? These have different causes and different mitigations. The latency number points at the problem; it does not describe it.
And that is assuming you are measuring the right thing. Service response time from the server's perspective is not the same as response time from the user's perspective. Network transit, DNS resolution, browser rendering — these add time that the server never sees. A service that is "fast" by internal metrics can be "slow" for users because the server is only part of the path.
The Measurement Infrastructure
There is a practical irony in software metrics: the measurement infrastructure itself affects what it measures.
Instrumentation adds CPU overhead, memory usage, and sometimes latency. Sampling decisions (not every event can be recorded) introduce statistical uncertainty. The aggregation and storage pipeline can delay or lose data. The dashboards present smoothed or binned values that obscure the raw distribution.
This is not unique to software. Every instrument has noise and distortion. The question is whether the distortion is understood and accounted for, or whether the output is treated as ground truth.
Most engineering organizations treat dashboard numbers as ground truth. When a metric looks good, the assumption is that the thing the metric measures is good. When the metric degrades, the assumption is that the system has gotten worse. The measurement infrastructure itself is rarely questioned.
Sometimes the right response to a degraded metric is to investigate the metric before investigating the system.
What This Means
Metrics are not reality. They are signals about aspects of reality, selected by humans who had limited knowledge at the time of selection, transmitted through infrastructure that adds noise and distortion, and interpreted by people who did not design the metrics and may not understand what they capture.
This is not an argument for fewer metrics. It is an argument for epistemic humility about them. A number on a dashboard is a starting point for investigation, not a conclusion. The underlying thing the metric was meant to capture is still there, still complex, still requiring judgment that no metric can supply.
The engineer who understands this uses metrics as a navigation tool — useful for identifying where to look, not for deciding what to conclude.
Git Remembers What, Not Why
Version control is an archaeology tool. You can excavate a codebase's history — pull up any past state, compare changes across time, trace a line of code back to when it was introduced. This is genuinely remarkable. Before version control was standard practice, the history of a codebase was largely lost as soon as files were overwritten.
But excavations require interpretation. The artifact tells you what was there. It does not tell you why.
The Commit as Artifact
A commit contains a diff and a message. The diff is a precise record: these lines were added, these were removed, at this moment in time. The message is a human annotation, written in the minutes after the change, often under time pressure, often terse.
The message is also the only place in the commit where reasoning can live. And most commit messages do not contain reasoning. They contain descriptions.
"Fix null pointer exception" — this describes the symptom and the action taken. It does not say what the root cause was, why the null was possible, whether other call sites have the same vulnerability, or whether the fix is conservative or comprehensive.
"Refactor authentication flow" — this describes the category of change. It does not say why the flow needed refactoring, what was wrong with the previous approach, or what properties the new approach has that the old one lacked.
"Update dependencies" — this describes what happened. Not why, not what changed in the dependencies, not whether any behavior changes are expected.
This is not laziness or bad practice, though it can be either. It reflects something structural: writing a commit message requires you to know what to say, but you have just finished thinking about the implementation, not the reasoning behind it. The reasoning felt obvious while you were doing it. Writing it out afterward takes time and effort, and the mental state that generated it is already dissolving.
The Half-Life of Context
Code has a half-life. The context in which it was written — the requirements, the constraints, the bugs being fixed, the tradeoffs being navigated — decays quickly. The engineer who wrote a piece of code six months ago often cannot fully explain it. The engineer who did not write it may never be able to.
This is normal. We operate on the assumption that code should be self-explanatory — that sufficiently clear code needs no external explanation. This assumption is partially true and broadly overstated.
Code explains how. Comments, when they exist, sometimes explain what. Neither reliably explains why — why this approach rather than the alternatives, why this constraint is here, why a seemingly obvious optimization was deliberately avoided.
The "why" is typically in someone's head at the time of writing and nowhere else. Git does not save what was in your head. It saves what you typed.
Bisect and the Boundaries of Knowledge
git bisect is one of the more powerful and underused tools in the git repertoire. Given a commit where a bug is present and an earlier commit where it is not, bisect performs a binary search through the history to identify which commit introduced the regression. This is valuable and often fast.
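The mechanism is ordinary binary search. A toy version follows; the commit list and predicate are invented, and real bisect runs your test against actual checked-out revisions:

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad is True.

    Assumes the history is monotone: a run of good commits followed
    by a run of bad ones, with commits[-1] known to be bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # bug already present; look earlier
        else:
            lo = mid + 1      # still good; look later
    return commits[lo]

commits = [f"c{i}" for i in range(1024)]
culprit = first_bad(commits, lambda c: int(c[1:]) >= 700)
# Ten probes through 1024 commits find "c700" -- where, not why.
```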
But what do you do when bisect finds the culprit commit? You have a diff. You have a message, possibly brief. You have the code change that introduced the bug. You still do not have the intent.
The diff might be a performance optimization that introduced a subtle correctness issue. The optimization makes sense given some constraint that has since changed, or some benchmark result that is not in the diff, or some conversation between two engineers that was never written down. Understanding what to do with the bug requires understanding the context of the commit. The commit often does not supply it.
This is the limit of what historical record-keeping can give you. It can tell you where and when. The meaning is usually not preserved.
Branches as Implicit Narrative
Branch names carry information that the commits themselves often do not. feature/user-authentication, bugfix/pagination-race-condition, experiment/new-query-planner — these names situate the commits in a broader intention.
When branches are merged and deleted, the name is gone. The commits remain, now indistinguishable from other commits in the main history. The narrative that the branch represented is compressed back into a sequence of diffs.
Some teams write detailed pull request descriptions that survive as comments on the merge commit. These are valuable, and underutilized. A PR description written at the time of the work, explaining the problem, the approach considered, the tradeoffs made — this is the closest thing most teams have to preserved reasoning. It has a much longer useful life than the average commit message.
Annotating Decisions
There is a practice that is rare and worth naming: the decision record. Not documentation of what the code does, but documentation of why it was designed the way it was. What alternatives were considered? What were the constraints? What was the state of knowledge at the time?
Some teams call these Architecture Decision Records (ADRs). Others just write long commit messages or PR descriptions. The form matters less than the practice: at the moment when a significant decision is made and you understand the reasoning, write it down somewhere that will survive.
This is hard to do consistently, for the same reason that thorough commit messages are hard. The reasoning is obvious in the moment and feels not worth recording. Six months later, it is neither obvious nor recoverable.
The teams that do this well produce a codebase that is significantly easier to maintain, because future engineers — including the original engineers — can understand not just what the code does but why it is the way it is. The cost is time spent writing. The benefit is time saved reconstructing context that was never written.
The Archive That Cannot Answer Questions
An archive is useful in proportion to how well you can query it. A library of books with no index is better than no library, but considerably worse than a library with good cataloging.
The git history is an archive. Its native queries — log, blame, bisect, diff — are good at finding what and when. They are structurally incapable of answering why, because why was never committed.
This is worth understanding not to blame the tool but to know what you are missing when you reach for it. When you run git blame on a confusing section of code and find the commit that introduced it, you have found the starting point of an investigation. You have not found the answer.
The answer, if it exists anywhere, is in a Slack channel that has since scrolled past, in the memory of an engineer who may no longer be at the company, or in a document that was never written.
Version control preserves a precise record of changes. It does not preserve the engineering judgment behind them. Those are different things, and both matter.
The IDE Is Guessing
When your IDE underlines a symbol in red, or suggests a completion, or infers the type of an expression, it is not merely executing instructions you wrote. It is building a model of your code and drawing inferences from that model. The inferences are usually correct. Understanding that they are inferences, not facts, is the difference between using the IDE as a tool and being used by it.
What an IDE Actually Does
An IDE continuously parses your source code and constructs an in-memory representation of it — an AST, a symbol table, a type graph. From this representation, it answers questions: What are the members of this type? What does this function return? Is this identifier in scope?
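Python's own ast module is enough to sketch the idea. This toy "symbol table" collects only top-level definitions, a tiny fraction of what a real language server models:

```python
import ast

source = '''
def greet(name):
    return "hello, " + name

farewell = greet  # an alias; real analyzers must decide how to model this
'''

tree = ast.parse(source)

symbols = {}
for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        symbols[node.name] = "function"
    elif isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                symbols[target.id] = "assignment"

# symbols is now {"greet": "function", "farewell": "assignment"}.
# Questions like "is this identifier in scope?" are answered from
# structures like this one -- a model of the code, not the code itself.
```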
For simple cases in well-typed code, these answers are reliable. The representation is accurate, the queries are well-defined, and the IDE gives correct information.
The representation degrades at the edges. Dynamically typed code is harder to model; the IDE can infer some types but not others, and the inferences have error bars it does not display. Generated code — ORMs, macros, reflection-based frameworks — may not be in the symbol table at all, or may be present in a form that the IDE's model does not fully capture. Cross-language boundaries, where TypeScript calls into WebAssembly or Python calls into C extensions, are often opaque.
At these edges, the IDE's answers become approximations. It is still answering questions, still displaying results, still appearing to know. The visual presentation is the same whether the answer is definitive or guessed.
Autocomplete as Probability
Modern autocomplete is largely a probabilistic system. In language servers backed by static analysis, completions are drawn from the type-inferred symbol table — still a model, but a relatively grounded one. In AI-assisted completion systems, the suggestions come from a learned distribution over code patterns, shaped by training data.
Both produce the same visual output: a ranked list of completions, presented as if they are equally grounded in fact. The first item in the list looks authoritative. The experience of selecting it feels like confirmation rather than choice.
This matters when the completion is wrong, because wrong completions fail silently. The code compiles. The types check out. The logic is broken. The IDE suggested a function name that looked right but does something subtly different, or a parameter order that is plausible but incorrect, or an API that has been deprecated and still exists in the symbol table.
The IDE does not know whether the completion was what you intended. It knows what is syntactically valid and what appears to fit the context. Validity and correctness are different things.
Error Highlighting and False Negatives
IDE error highlighting is probably the feature most engineers trust most completely. A red underline means something is wrong. No red underline means nothing (visible) is wrong.
The second half of that statement is the dangerous one.
IDEs report errors they can detect within their model of your code. Errors that exist outside their model — runtime behavior, logic errors, integration failures — produce no highlighting. The clean editor is not a certificate of correctness. It is a report that the IDE found nothing within its field of view.
There is also the case of false negatives within the IDE's supposed domain of competence. In dynamically typed languages, an IDE may fail to highlight a type error that will crash at runtime because it could not infer the type. In template-heavy code — C++ templates, Rust generics in complex configurations — the IDE's type checker may give up partway through an inference chain and display nothing rather than an error. In code that uses eval, monkey patching, or other runtime metaprogramming, the IDE's model may be structurally incapable of detecting errors.
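A minimal Python example of such a false negative, with invented names: without annotations, a static model has nothing to check, and the editor stays clean right up to the crash.

```python
def total_length(items):
    # With no type annotations, many static analyzers infer nothing
    # to object to here.
    return sum(len(x) for x in items)

total_length(["a", "bb"])   # fine: returns 3

try:
    total_length([1, 2])    # no red underline, but len(1) fails
except TypeError as exc:
    caught = exc
# The error was always in the code; it was outside the model's view.
```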
A clean editor is not clean code.
The Refactoring That Wasn't
IDE-assisted refactoring — rename, extract method, move class — operates on the IDE's symbolic model of the code. When the model is accurate, automated refactoring is a remarkable productivity tool. You rename a symbol and the IDE updates every reference, correctly and immediately.
When the model is inaccurate, the refactoring propagates the IDE's incorrect understanding of the code's structure. Dynamic references, string-based lookups, reflection, configuration files that refer to class names by string — none of these are in the symbolic model. The IDE renames what it knows about. Everything else remains unchanged, and the program breaks.
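A concrete sketch of the blind spot, with an invented class and method name:

```python
class ReportService:
    def generate_summary(self):
        return "summary"

svc = ReportService()

# Direct call: visible in the symbolic model, updated by a rename.
svc.generate_summary()

# String-based dispatch, e.g. a handler name read from a config file:
handler_name = "generate_summary"
getattr(svc, handler_name)()   # invisible to a symbolic rename

# Rename generate_summary -> build_summary with the IDE, and the direct
# call is updated while the string is not; the break appears at runtime.
```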
This failure mode is insidious because the IDE appears to have succeeded. The rename completed without error. The visible references are all updated. The broken references are in places the IDE cannot see, which means they are also places where the IDE's visual feedback provides no warning.
Navigation as Interpretation
"Go to definition" is one of the more heavily used IDE features. You click on a symbol and jump to where it is defined. This seems straightforward.
In dynamically typed code, "go to definition" is an educated guess. The IDE infers what a symbol might refer to based on type inference or usage patterns. It shows you a definition. That definition may be the right one, or it may be one possibility among several, or it may be a supertype whose subtype is what actually runs.
In code that uses interfaces extensively, go to definition takes you to the interface declaration, not the implementation that executes at runtime. Understanding which implementation you care about requires knowing the runtime context — which the IDE does not have without additional runtime information.
This is not a flaw to be fixed. Dynamic dispatch and interfaces are features, not bugs, and any tool that models them statically will necessarily be approximate. The point is to know what "go to definition" actually gives you: the statically knowable answer, which may or may not be the dynamically relevant one.
Trusting the Tool Selectively
The useful stance toward IDE features is calibrated trust: confident for what the tool is reliable at, skeptical where the model degrades.
In statically typed, well-structured code, the IDE's model is usually accurate. Completions are trustworthy. Error highlighting is meaningful. Refactoring is safe. The tool earns its trust here.
In dynamic, generated, or cross-language code, the model is approximate. Completions are suggestions, not facts. Clean error highlighting is weak evidence. Automated refactoring requires manual verification. The tool is still useful — it is the best static view available — but the outputs require interpretation.
The engineers who use IDEs most effectively are the ones who know which category they are in. They do not apply the same trust uniformly across all contexts. They read the tool's outputs as the results of a model, not as ground truth, and they know when to check.
Every Abstraction Hides Something
The purpose of an abstraction is to hide detail. That is not a side effect or an unfortunate cost — it is the point. If an abstraction did not hide detail, it would not be an abstraction; it would just be the thing itself.
This means that every abstraction makes a decision about what to hide. That decision reflects a theory about which details matter and which do not — which aspects of the underlying thing are relevant to the user of the abstraction and which are safe to ignore.
The theory is usually approximately correct. When it is wrong, the abstraction leaks.
What an Abstraction Is
An abstraction is a simplified interface over a more complex reality. A file system is an abstraction over magnetic domains, flash cells, sector allocations, and journal entries. A network socket is an abstraction over packet routing, retransmission, flow control, and congestion avoidance. A database transaction is an abstraction over lock acquisition, write-ahead logging, buffer pool management, and MVCC chains.
In each case, the abstraction presents a simpler model: a file is a named sequence of bytes, a socket is a bidirectional byte stream, a transaction is an atomic unit of work. This model is useful and mostly accurate. The underlying complexity is hidden, and for most purposes, hidden correctly.
The word "mostly" is doing significant work in that sentence.
Leaky Abstractions
Joel Spolsky's observation that all non-trivial abstractions are leaky has been quoted often enough to feel like a truism. It is worth understanding why it is true rather than just accepting it.
An abstraction leaks when the underlying detail it was supposed to hide becomes relevant to the abstraction's user. The detail has not gone away — it was hidden, not eliminated. When conditions arise that cause the hidden detail to matter, the abstraction's user is suddenly required to understand something they were told they did not need to understand.
TCP is an abstraction over IP that provides reliable, ordered delivery. It hides packet loss, reordering, and retransmission. But when you are debugging a latency spike on a congested network, the hidden retransmission behavior becomes directly relevant. The fact that the abstraction handles this automatically does not mean you can ignore it; it means you cannot directly observe it, which is worse.
The database transaction abstracts over concurrency. Under low concurrency, this works cleanly. Under high concurrency, you encounter lock contention, deadlocks, or the counterintuitive behaviors of different isolation levels. The transaction boundary you drew to express atomicity is now a performance and correctness problem shaped by details you were told not to think about.
The Abstraction and Its Costs
Every abstraction has costs that are not reflected in its interface. The interface presents a clean model; the costs live in the implementation.
A garbage collector is an abstraction over memory management. You allocate objects; the runtime frees them when they are no longer reachable. The interface is simple and safe. The cost is pause times, throughput overhead, and unpredictable latency spikes when the collector runs. These costs are not in the interface. They appear in production.
An ORM is an abstraction over SQL. You express queries as method chains or object graphs; the ORM generates SQL. The interface is convenient and avoids repetitive boilerplate. The cost is generated SQL that is often inefficient, N+1 query patterns that are invisible in the object-oriented representation but expensive in execution, and a layer of indirection that makes performance problems harder to diagnose. The ORM's model of what you are doing is not the database's model, and the difference has a price.
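The N+1 pattern is easy to sketch without any real ORM. The fetch function below is an invented stand-in that simply records the queries it would issue:

```python
queries = []

def fetch(sql):
    """Stand-in for a database call: records the SQL, returns fake rows."""
    queries.append(sql)
    return [{"id": i} for i in range(3)]

# The object-style loop reads naturally...
authors = fetch("SELECT * FROM authors")                            # 1 query
for author in authors:
    fetch(f"SELECT * FROM books WHERE author_id = {author['id']}")  # +N queries

# ...but it issued 1 + 3 = 4 round trips where one join would have done:
#   SELECT * FROM authors JOIN books ON books.author_id = authors.id
# In the object-oriented representation, nothing looks expensive.
```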
Understanding an abstraction means understanding not just its interface but its cost model: where the costs are, how they scale, when they become significant. This is not always documented or surfaced by the abstraction itself. Often you learn it by encountering it in production.
Abstraction Stacks
Modern software is not a single abstraction — it is a stack of them, each layer abstracting over the one below. Your web framework abstracts over HTTP. HTTP abstracts over TCP. TCP abstracts over IP. IP abstracts over the physical link. Each layer hides details that the layer above was not supposed to need.
When something goes wrong, you often need to descend through these layers to find the cause. A request is timing out: is it the framework? The application logic? TCP retransmission? DNS resolution? A misconfigured load balancer? Each layer is a possible culprit and also a hiding place.
The deeper problem is that at any given layer, you typically lack the context to diagnose what is happening at layers below. The web framework does not show you TCP state. The TCP stack does not show you link-layer errors. The tools for each layer speak different languages and present different views of the same underlying reality.
This is not an argument against abstraction stacks. The alternative — everyone implementing directly against the hardware — is worse. The point is that operating at a high layer of abstraction does not exempt you from the consequences of lower layers. It only exempts you from seeing them directly, which is both a benefit and a risk.
When to Break the Abstraction
There are situations where the right response to an abstraction problem is to bypass the abstraction and work at a lower level.
The database is slow: sometimes the right answer is to look at the generated SQL and optimize it directly, rather than trying to express the optimization through the ORM's interface. The network is behaving strangely: sometimes the right answer is to capture packets with tcpdump and look at what is actually being sent, rather than reasoning from the application layer. The garbage collector is pausing at the wrong time: sometimes the right answer is to look at heap profiles and tune the allocator, rather than trusting the runtime to handle it.
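Looking at what the database actually plans to do is often a one-liner. A sketch using SQLite's EXPLAIN QUERY PLAN, with invented table and index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

query = "SELECT * FROM users WHERE email = ?"

# Without an index: the plan reports a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query,
                      ("a@example.com",)).fetchall()

conn.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index: the plan changes to an index search.
after = conn.execute("EXPLAIN QUERY PLAN " + query,
                     ("a@example.com",)).fetchall()

# The plan is the database's own model of the work -- the thing an
# ORM's method chain never shows you.
```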
This feels like a defeat. You chose the abstraction precisely to avoid dealing with these details. But the abstraction hiding the details does not make them go away. When the hidden detail has a meaningful effect on your system, engaging with it directly is often faster than trying to influence it through the abstraction's limited controls.
The ability to break through an abstraction when necessary — to know which tools operate at which layer, and to be comfortable moving down the stack — is a distinguishing characteristic of engineers who can solve problems others cannot.
What This Means
Abstractions are indispensable. Software at scale requires layering and simplification. There is no practical alternative.
The useful orientation is not skepticism toward abstractions but clarity about what each one is hiding and what the hidden details cost. An abstraction tells you: here are the things you normally need to think about; trust me on the rest. That is a good offer, and usually worth taking.
But "trust me on the rest" is not a guarantee that the rest is irrelevant. It is a guarantee that the rest has been handled in a particular way, at a particular cost, under particular assumptions. When your situation departs from those assumptions, the hidden part comes back into view.
Knowing roughly what is behind each abstraction you use does not mean you need to think about it constantly. It means you know where to look when the abstraction starts to show cracks.
The Linter's Politics
A linter presents its rules as technical requirements. Violation is flagged in the same visual language as a compiler error: red underline, warning icon, a count in the problems panel. The presentation implies objectivity — this is wrong, in the same way that a type error is wrong.
Most linter rules are not like that at all. They are conventions. They encode preferences about style, readability, and consistency that reflect choices made by some group of people at some point in time. Those choices are legitimate and often worth following. But they are not the same category of thing as a type error, and treating them as if they were obscures something important.
What Linters Actually Check
Linters check two distinct categories of things, and they present both in the same way.
The first category is genuinely error-prone patterns: code that is valid but statistically likely to be wrong. An unused variable, a comparison with NaN, a condition that is always true because of a logic error, a function that has a missing return in some code path. These rules have a clear epistemic basis — engineers have repeatedly written bugs of these forms, and catching them statically is unambiguously useful.
The second category is stylistic conventions: how code is formatted, how things are named, whether you use semicolons, how long a line can be, whether you prefer === over == in JavaScript. These rules have a social basis — some team or language community decided on a convention, and the linter enforces it.
The first category is closer to a compiler check. The second category is closer to a code review comment. Both are presented identically.
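The first category is concrete enough to demonstrate. A comparison with NaN, for instance, is valid code that is almost always a bug:

```python
import math

nan = float("nan")

# Valid, runs, and always False: NaN compares unequal to everything,
# including itself. Linters flag comparisons like this because the
# author almost certainly meant math.isnan(x).
suspicious = (nan == nan)

correct = math.isnan(nan)
```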
Conventions as Encoded Culture
When a linter rule says "function names must be camelCase" or "maximum line length is 120 characters," it is encoding a cultural choice as a technical requirement.
This is not a neutral act. The choice was made by people with particular backgrounds, working in particular contexts, with particular tools and workflows in mind. The Python community's preference for snake_case reflects different conventions than the Java community's preference for camelCase. Neither is more correct. Both are now enforced by linters in their respective ecosystems as if they were facts about code.
The line length rule is a good example of how the underlying reasoning gets lost. The common limit of 80 characters comes from the width of punched cards and the terminals that inherited it; 120 characters is a common update for wider modern displays. Neither number reflects something fundamental about code readability. They reflect hardware constraints that existed when the conventions were established, and the convention is now enforced without the context that generated it.
This is not an argument for ignoring line length rules. Consistency has real value, and a codebase that follows a consistent style is genuinely easier to read. The argument is for knowing what the rule is — a social contract, not a truth — so you can reason about it clearly.
The Cost of Enforcement
When conventions are enforced as requirements, they create friction. The friction is sometimes intentional — you want consistency enough to impose a cost on deviation. But friction has effects beyond what is intended.
Engineers who disagree with a rule spend cognitive energy fighting the linter instead of thinking about their code. They find workarounds: the // eslint-disable comment, the pragma comment, the carefully structured exception. These workarounds accumulate and make the codebase harder to read in different ways than the rule was trying to prevent.
More subtly, automatic enforcement tends to suppress the reasoning that should accompany rule application. When a rule is enforced by a tool, engineers stop asking whether the rule applies here — they just satisfy it. This is the price of scaling conventions: rules that were good heuristics become rules that are applied without judgment, including in cases where they do not apply well.
Good linter use requires knowing which rules are there for error prevention (follow them carefully) and which are there for consistency (follow them, but don't lose sight of the underlying goal when edge cases arise).
The Neutral Position That Isn't
Some linting tools are presented as enforcing "best practices." This framing deserves scrutiny.
Best practices are practices that have worked well in some contexts. They are not universal laws. The practice of avoiding global variables is a good heuristic in most application code and a bad fit for some systems programming contexts. The practice of limiting function length is useful for complex business logic and potentially counterproductive for generated code or lookup tables. The practice of preferring composition over inheritance is sound advice in many OO designs and not particularly relevant in functional code.
Linters that enforce "best practices" are enforcing the judgment of whoever configured them about which practices apply in which contexts. That judgment may be very good — the tool authors may have deep experience — but it is still judgment, embedded in configuration, applied automatically without knowing your specific context.
The alternative is not to abandon linters. It is to configure them deliberately: to understand each rule you enable, know what it is for, and be willing to disable rules that do not fit your context. The worst linter configuration is the one applied from defaults without review, because it embeds someone else's context silently into your workflow.
Consistency Has Genuine Value
It would be easy to read this chapter as an argument against linters. It is not.
Consistency in a codebase has real, measurable benefits. Code review is easier when reviewers do not need to negotiate style in every PR. Onboarding is easier when there is one way to do common things. Searching for patterns is easier when patterns are uniform. The cognitive load of moving around a large codebase is lower when the stylistic surface is predictable.
Linters are an efficient way to achieve consistency. The alternative — hoping that style conventions are applied through shared culture alone — does not scale beyond small teams.
The point is that the benefits are social, not technical. Consistency matters because of what it does for the people working in the codebase, not because the consistent style is objectively superior. Understanding this changes how you think about which battles to fight. Pushing back hard on a linter rule because you prefer a different style is probably not worth it — the consistency benefit outweighs your preference. Pushing back because a rule produces genuinely worse code in your context is a different argument, and worth making.
What This Means
Linters mix two distinct things: error detection and convention enforcement. The former is technical; the latter is social. Treating them the same is a category error that tends to produce uncritical rule-following on the one hand and an accumulation of unexamined exceptions on the other.
The useful practice is to know which of your linter rules are which, configure rules deliberately, and apply them with the judgment they deserve. A linter is a tool for encoding the team's collective preferences efficiently. Those preferences should be visible and considered, not hidden in configuration and treated as unquestionable facts about code.
Documentation Decays
Documentation is a record of understanding at a point in time. The moment it is written, it begins to diverge from the system it describes. Code changes. Requirements change. The people who knew the original intent leave. The documentation does not automatically update.
This is not a problem with insufficient discipline. Teams with strong documentation cultures still produce documentation that decays. The decay is structural: code and documentation are two representations of a system, updated by different processes, with no mechanism that keeps them synchronized.
The Two Things Documentation Is
Documentation serves two distinct purposes that are often conflated.
The first is orientation: helping someone understand a system they are unfamiliar with. This is the README, the architecture overview, the "how does this work?" document. Its goal is to build a sufficient mental model fast enough to be useful. For this purpose, approximate accuracy is often acceptable — you do not need a complete picture, you need enough to start.
The second is specification: an authoritative description of how something works that can be relied on for implementation or integration. The API reference that tells you what a function returns, what exceptions it throws, what guarantees it provides. For this purpose, approximate accuracy is not acceptable. Inaccurate specifications cause bugs in the things built against them.
The decay problem is different for each type. Orientation documentation that is six months out of date may still be mostly useful — the architecture is probably similar, the concepts are still there, the stale parts may be identifiable. Specification documentation that is six months out of date can be worse than no documentation, because it is relied on and wrong.
Why Documentation Lags
The fundamental issue is that documentation and code have different update incentives.
Code is updated when it does not work. Bugs fail visibly. Tests fail. Users complain. The feedback loop is tight and the motivation is clear.
Documentation is updated when someone notices it is wrong. There is no automated system that detects when documentation has become inconsistent with code. A comment describing what a function does will remain there, unchanged, after the function's behavior has been modified — unless the engineer who modified the function thought to update it. Thinking to update it requires holding two things in mind simultaneously: the implementation change and its description. One is the task; the other is overhead.
This is the root cause, and no amount of process changes it completely. Humans reliably prioritize tasks with visible feedback loops over tasks with invisible feedback loops. Documentation's feedback loop is invisible — no one knows it is wrong until they rely on it and find out.
Code as Ground Truth
One partial solution to documentation decay is to make the code itself as self-explanatory as possible. Clear names, small functions, explicit types, logical structure — these reduce the need for external explanation and ensure that the authoritative record (the code) is as accessible as possible.
This helps, but it has limits. Code can explain what it does. It cannot explain why it was designed a particular way, what alternatives were considered and rejected, what invariants must hold, or what the downstream consequences of changes are. These things require prose.
There is also a version of "self-documenting code" that is a rationalization for the absence of documentation. Code that is perfectly readable in isolation may be incomprehensible in context — if you do not know what problem it is solving, or what constraints it is operating under, or how it fits into the broader system. The code is the mechanism. The documentation is the context.
The Documentation That Survives
Some kinds of documentation age better than others, and understanding the difference is useful.
Documentation that explains why a decision was made ages well. The reasons behind an architectural choice — the constraints that existed at the time, the alternatives that were rejected, the tradeoffs that were accepted — do not become false when the implementation changes. Even if the implementation has since been updated, understanding the original reasoning is often useful for understanding the current state.
Documentation that describes what the system does ages poorly. Behavioral descriptions tied to specific implementations become inaccurate as the implementation changes. The closer a document is to a literal description of the code, the faster it decays, because the code itself is the most accurate version of that description.
This is also why API reference documentation generated from code — docstrings, type annotations, inline comments — tends to stay more accurate than externally maintained documentation. The documentation is co-located with the thing it describes. Updating the code and updating the documentation become the same operation.
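As a sketch of what co-location buys you: a Python docstring lives inside the function it describes, and a tool like `doctest` can check the docstring's examples against the code, so a behavior change that leaves the documentation stale becomes a visible failure. The function and its values here are invented for illustration.

```python
def retry_delay(attempt: int, base: float = 0.5) -> float:
    """Return the exponential backoff delay for a retry attempt.

    The examples below are checked by doctest, so if the behavior
    changes and the docstring does not, the check fails:

    >>> retry_delay(0)
    0.5
    >>> retry_delay(3)
    4.0
    """
    return base * (2 ** attempt)


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # fails loudly if docstring and code disagree
```

Updating `retry_delay` without updating its examples now breaks a check, which is precisely the feedback loop that externally maintained documentation lacks.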
The Cost of Stale Documentation
Stale documentation has two kinds of cost that are easy to underestimate.
The first is the cost to the reader who finds it. They spend time reading something that is not true. Worse, they may act on it — implement against a stale specification, make a decision based on an outdated architecture description, waste a day debugging something that the documentation says should work. The time lost is often invisible because the reader does not know the documentation is stale; they assume they are misunderstanding something and keep reading.
The second is the cost to the codebase's reputation for documentation. Once an engineer has been burned by stale documentation a few times, they learn to distrust documentation generally. They stop reading it before diving into the code. The documentation that is accurate becomes useless because nobody reads it. Stale documentation does not just fail to help; it degrades the value of all documentation by association.
What This Means
Documentation is not a solved problem. It decays, and the decay is structural.
The practical response is to choose what to document strategically rather than comprehensively. Document the things that are hardest to infer from the code: the reasoning behind decisions, the invariants that are not enforced mechanically, the integration points where multiple systems meet. Document for the reader who is lost rather than the reader who is merely reading.
Keep documentation as close to the code as possible. Prefer generated reference documentation over manually maintained descriptions of behavior. Delete documentation that is known to be stale; wrong information is worse than missing information.
And hold documentation with appropriate skepticism — including documentation you wrote yourself. The gap between what code does and what we believe it does is often larger than we expect. Documentation records the belief. The code is the authority.
The Dependency You Didn't Choose
When you add a library to your project, you make a choice. You evaluate the library, decide it fits your needs, and add it. The choice is yours.
What you do not choose is everything the library depends on, and everything those dependencies depend on. The transitive closure of your dependency graph is decided by authors you have never heard of, for reasons you were not part of, in response to requirements that have nothing to do with yours.
This is not an exotic edge case. It is the normal state of software development. A JavaScript project with a handful of direct dependencies can have hundreds of transitive ones. A Python project using popular scientific computing libraries pulls in a deep tree that touches C extensions, Fortran code, and platform-specific binaries. The code you review and reason about is a small fraction of the code that runs when your software runs.
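The gap between the set you chose and the set you run can be made concrete with a toy graph. The package names below are invented; in practice the graph would come from your package manager's lock file.

```python
from collections import deque

# Hypothetical dependency graph: package -> packages it requires.
DEPS = {
    "my-app": ["web-framework", "http-client"],
    "web-framework": ["template-engine", "router", "http-client"],
    "http-client": ["tls-lib", "dns-lib"],
    "template-engine": ["sandbox"],
    "router": [],
    "tls-lib": ["crypto-core"],
    "dns-lib": [],
    "sandbox": [],
    "crypto-core": [],
}


def transitive_closure(pkg: str) -> set[str]:
    """Every package that ends up installed because `pkg` was chosen."""
    seen: set[str] = set()
    queue = deque(DEPS.get(pkg, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(DEPS.get(dep, []))
    return seen


direct = set(DEPS["my-app"])
full = transitive_closure("my-app")
print(f"chose {len(direct)} packages, run {len(full)}")  # chose 2 packages, run 8
```

Two deliberate choices become eight packages at runtime, and the ratio in real ecosystems is usually far worse.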
The Transitive Trust Problem
When you add a dependency, you extend trust. You are saying: I believe this library does what it claims, is maintained responsibly, and will not harm my system or my users.
When that library adds its own dependencies, your trust extends further, whether you intended it to or not. The authors of your library made their own trust judgments, about libraries you have not evaluated, for needs that may not match yours.
This creates a trust chain that is very long and largely invisible. You can inspect your direct dependencies. Inspecting the transitive graph is theoretically possible and practically infeasible — the graph is too large, the code too voluminous, and most of it is not in your domain of expertise.
The limits of this trust became widely understood after a small number of high-profile supply chain attacks. The events that attracted the most attention — a malicious actor publishing a package with a similar name to a popular one, a maintainer's account being compromised, a dependency being deliberately backdoored — demonstrated something that had always been structurally true: accepting code from the internet at install time is an act of trust, and trust can be exploited.
Versioning and the Illusion of Stability
Semantic versioning promises a clear signal: a major version change means breaking changes, a minor version means new features, a patch means bug fixes. This is a convention that many ecosystems follow.
The convention is not mechanically enforced. It depends on authors correctly categorizing their changes. And even correct categorization does not mean that a patch release is safe to apply blindly. A bug fix for the library author can be a behavior change that breaks your code, if your code depended on the bug.
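A sketch of how a bug fix becomes a breaking change. Both versions of `unique_tags` are hypothetical; the caller silently depends on an accidental property of the buggy one.

```python
# v1.0.0: documented to return unique tags in input order, but a bug
# sorts them as a side effect of deduplicating through a set.
def unique_tags_v1_0_0(tags):
    return sorted(set(tags))


# v1.0.1: the "bug fix" — preserve input order, as documented.
def unique_tags_v1_0_1(tags):
    return list(dict.fromkeys(tags))


# A caller that came to rely on the bug: it assumes the first
# element is the alphabetically smallest tag.
def primary_tag(tags, dedupe):
    return dedupe(tags)[0]


tags = ["web", "api", "web", "cli"]
print(primary_tag(tags, unique_tags_v1_0_0))  # api
print(primary_tag(tags, unique_tags_v1_0_1))  # web
```

From the library author's perspective, v1.0.1 is a correct patch release. From this caller's perspective, it changed behavior, which is why "patch releases are safe" is a convention about intent, not a guarantee about your code.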
Lock files address part of this. By recording exact versions, a lock file ensures that what runs in production is what was tested in development. This is genuinely valuable and worth doing consistently.
But lock files fix versions at a moment in time. They do not fix the risk profile: a package that was safe when you locked it may have had a vulnerability discovered since. Lock files trade the risk of unexpected changes for the risk of known-but-not-updated vulnerabilities. Neither is obviously better; they are different risks that require different mitigation.
The Maintenance Question
Software is not finished when it is released. It requires maintenance: security patches, compatibility updates for new language versions and platforms, bug fixes, adaptation to changes in the ecosystem it depends on.
When you take on a dependency, you are implicitly depending on someone else to maintain it. That someone else may be a large organization with dedicated engineers, a small team doing it in their spare time, a single maintainer who has moved on, or a project that has been abandoned.
The maintenance status of your dependencies is usually not visible from the code. A library may be well-maintained and simply stable — a lack of recent commits does not necessarily mean abandonment. Or it may be genuinely abandoned, leaving any future security issues unaddressed. Or the maintainer may be present but overwhelmed by issues, with pull requests sitting unreviewed for a year.
This information exists but requires active investigation. The dependency management tool tells you the version. The maintenance health is in the issue tracker, the commit history, the response rate to pull requests, the maintainer's public statements about the project's future.
Most teams do not systematically review this. They add dependencies when needed, use them until there is a reason to stop, and discover maintenance problems when problems surface.
Dependency as Architecture
How much of your system behavior lives in code you wrote versus code you imported is an architectural question with long-term implications.
A codebase with shallow dependencies — using libraries for well-defined purposes with stable APIs — is easier to reason about, easier to audit, and easier to update when circumstances require it. A codebase where core functionality is delegated to frameworks that dictate structure and behavior is faster to build but harder to disentangle.
Neither is wrong in all cases. The tradeoff is real. Using a battle-tested HTTP client library is almost always better than writing your own. Using a full-stack framework that generates your database schema, your API layer, and your frontend rendering pipeline means accepting that framework's model of how software should be structured. The efficiency gain is real; the architectural lock-in is also real.
The point is that dependency choices are architectural choices. They constrain future options, import behavior you did not design, and create obligations that extend over time. Treating them as purely a question of "does this library solve my immediate problem" misses their structural character.
What This Means
The dependency graph you run is not the same as the dependency graph you chose. The former is the transitive closure of everything your direct dependencies require; the latter is the small set you evaluated directly.
This gap is not something you can eliminate. Modern software development at any scale depends on reuse. The alternative — implementing everything you need from scratch — is far worse on virtually every dimension.
The useful orientation is informed acceptance: know that your transitive dependencies exist and are significant, have a lightweight process for reviewing new direct dependencies (who maintains this? how many transitive dependencies does it bring in? what is its security track record?), keep your lock files current, and treat major dependency upgrades as engineering work rather than routine maintenance.
The code running in your production environment is mostly code other people wrote. That is a feature and a risk simultaneously. Understanding both sides of it is part of understanding what your system is.
Stack Traces Speak Machine
A stack trace is a transcript of the call stack at the moment something went wrong. It is precise, complete (within its domain), and immediately available. For this reason, it is usually the first thing an engineer looks at when diagnosing a failure.
It is also a description of the failure in terms the machine found natural, not terms the engineer finds natural. Reading a stack trace requires translation — from the machine's account of what happened to a human understanding of why.
What a Stack Trace Contains
At the moment an exception is thrown or a crash occurs, the runtime captures the chain of function calls that led to that point. The trace lists these frames from most recent to least recent: the function where the failure occurred at the top, the function that called it below, and so on down to the entry point of the program.
Each frame tells you: which function was executing, in which file, on which line. This is accurate and useful. The question is what to do with it.
The most recent frame is where the machine detected a problem. This is rarely where the problem originated. The exception was thrown here; the cause is somewhere earlier in the chain. The stack trace shows you the detection site. The cause site requires inference.
The Frame That Matters
In most stack traces, one frame is the relevant one. The rest are context.
The relevant frame is rarely the top one. It is often deep in code you wrote, below a layer of framework and library calls. Finding it requires reading the trace and distinguishing between frames you control and frames you do not.
This is a skill that looks trivial but takes practice. A Rails exception trace may contain twenty frames before reaching application code. A Java exception through Spring may show dozens of proxy and reflection frames before the real call site. A Node.js trace through an async framework may interleave framework machinery with application logic in non-obvious ways.
The technique is to scan for frames in files you recognize — your codebase, your packages — and start there. The frames above are the machinery that propagated the exception. The frames below are what called your code. The frame you care about is usually where the unexpected state was introduced or where the wrong assumption was made.
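The scan-for-files-you-recognize technique can be mechanized. This sketch uses Python's standard `traceback` module to keep only frames from known application files; `middleware` stands in for framework machinery, and since this is a single file, the "application paths" check is reduced to matching `__file__`.

```python
import sys
import traceback


def handler(payload):
    # Application code: dereferences a value it assumes is a dict.
    return payload["user"]["id"]


def middleware(payload):
    # Stands in for the framework layers between entry point and app code.
    return handler(payload)


APP_FILES = {__file__}  # in a real project: paths under your repository

try:
    middleware({"user": None})  # the bad state enters here
except TypeError:
    frames = traceback.extract_tb(sys.exc_info()[2])
    app_frames = [f for f in frames if f.filename in APP_FILES]
    # The deepest application frame is the usual starting point.
    print(app_frames[-1].name)  # handler
```

Error-reporting services apply roughly this filter when they highlight "in-app" frames: the machinery frames are kept as context, and attention starts at the deepest frame you own.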
The Exception Message Is Also in Machine Terms
The exception message accompanies the stack trace and is often the first thing read. It is also often written for the machine's benefit, not the developer's.
"NullPointerException" — tells you that a null reference was dereferenced. Tells you nothing about which reference, why it was null, or what the code was trying to do.
"ECONNREFUSED" — tells you that a connection was refused. Tells you nothing about what connection, what it was being refused by, or whether the refusing end is the problem or the connecting end.
"Segmentation fault" — tells you that the program accessed memory it should not have. Tells you nothing about what memory, what code caused it, or whether the error is in your code or in a library.
These messages are accurate descriptions of what the machine observed. They are starting points, not conclusions. The translation from "what the machine observed" to "what went wrong in human terms" is the work of debugging.
Some exception messages are better than others. Well-written error messages include the values involved, the operation that was attempted, and sometimes the context that explains why the operation failed. "Cannot read property 'id' of undefined" is better than "TypeError" but still requires knowing that undefined arrived where an object was expected. The best error messages provide enough context that the translation is short.
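The difference between a machine-facing and a human-facing message can be shown directly. `load_config` and its keys are invented for illustration.

```python
def load_config(settings: dict, key: str) -> str:
    value = settings.get(key)
    if value is None:
        # Machine-facing alternative: `raise KeyError(key)`.
        # This version carries the values the reader will need:
        # which key, and what was actually available.
        raise ValueError(
            f"config key {key!r} is missing or null; "
            f"available keys: {sorted(settings)}"
        )
    return value


try:
    load_config({"host": "db.internal", "port": "5432"}, "password")
except ValueError as exc:
    print(exc)
    # config key 'password' is missing or null; available keys: ['host', 'port']
```

The failing line of code is identical in both versions; the extra sentence in the message is the translation work done once, by the author, instead of repeatedly, by every future reader.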
Async and the Broken Stack
Stack traces assume synchronous, sequential execution: function A called function B which called function C. The trace is a complete record of this chain.
In asynchronous programming, this assumption breaks. When a callback executes, or a Promise resolves, or an async/await chain continues, the call that scheduled the work is no longer on the stack. The trace shows where the code is running, not how it got there.
This is a well-known problem. Async stack traces are often nearly useless — they show the scheduler frames and the executor frames, but not the application code that initiated the work. You see where you are but not how you arrived.
Different runtimes handle this differently. Node.js has improved its async stack trace support significantly over time. Browsers capture some async context in their DevTools. But the fundamental problem — that async execution severs the call chain — remains, and the solutions are partial.
When debugging async failures, the stack trace is often only the beginning. You need the application logs around the time of the failure, the request or event that triggered the execution, and the state of the system at that moment. The stack trace tells you what was running at the point of failure. The narrative of how you got there requires assembling evidence from multiple sources.
The Stack Trace as Breadcrumb
The most useful framing of a stack trace is not "this is the explanation of the failure" but "this is a breadcrumb pointing toward the explanation."
The breadcrumb tells you: the failure was detectable at this point, in this execution context, while executing this chain of calls. Now go find out why.
The "why" is almost always about state: the wrong value arrived at the wrong place. Tracing that value backwards through the execution — using the stack trace as a starting point, then examining what the relevant values were, then asking how those values came to be — is the actual work of diagnosis.
This is why a clean understanding of the code's data flow is more useful than deep familiarity with the stack trace format. The engineer who understands how data moves through the system can read a stack trace and immediately start forming hypotheses about where the bad state came from. The engineer who is expert at reading stack traces but does not understand the system will stare at the trace and have no useful hypotheses to test.
What This Means
Stack traces are not answers. They are precise, machine-generated descriptions of failure locations, written in terms of the execution model, pointing at symptoms rather than causes.
Reading a stack trace is translation work: from machine description to human understanding. The translation requires knowing what the code is supposed to do, which frames are relevant, and what state led to the failure point.
The trace is the starting point of investigation, not the conclusion. Engineers who treat it as a conclusion stop too early. Engineers who treat it as a breadcrumb are positioned to find the actual cause.
Estimation Is a Different Skill
Engineering estimation is consistently wrong in systematic, predictable ways. Projects take longer than estimated. Features are more complex than expected. Bugs surface that were not anticipated. This is so common and so consistent that there are named phenomena for it: Hofstadter's Law, the planning fallacy, the ninety-ninety rule.
The interesting question is not why estimates fail — there are good accounts of that — but what estimation actually is and what would constitute getting better at it.
What an Estimate Is Not
An estimate is not a prediction. A prediction is a statement about what will happen. An estimate is a probabilistic range around a central expectation, given current understanding, subject to revision as understanding changes.
These are treated as the same thing in most engineering contexts, which produces predictable dysfunction. When an estimate is treated as a commitment, it puts pressure on the estimator to compress uncertainty. Uncertainty feels like weakness in a context where a number is expected. The result is a point estimate with implicit false precision — "two weeks" rather than "between one and four weeks, with two being the median under current assumptions."
The compressed estimate then becomes a commitment that the project is held to. When reality produces a distribution rather than a point, the project is "late." The estimate was wrong, but it was wrong in the same direction it is always wrong: reality contains variance that point estimates suppress.
The Systematic Bias Toward Underestimation
Underestimation is not random error. It is systematic, which means it has systematic causes.
One cause is the planning fallacy: people estimate based on the best-case scenario rather than the realistic distribution of outcomes. When you estimate a task, you imagine executing it correctly, without interruptions, without discovering unexpected complexity. This imagined scenario is possible. It is not the median case.
Another cause is reference class neglect: ignoring the history of how similar projects went in favor of the specifics of this project. "This project is different" is often true in the details and almost never true in the structure. Similar projects took longer than estimated. This project will probably also take longer than estimated.
A third cause is the unknown unknowns: the things you do not know you do not know. Every non-trivial engineering project contains surprises. A bug in a third-party library. An undocumented interaction between systems. A requirement that was not clearly stated until it became a blocker. These surprises are not individually predictable, but their existence is predictable. Every project has them. They are not in most estimates.
The Reference Class as Calibration Tool
One approach to improving estimates is to explicitly use reference classes: "how long have similar tasks taken in the past?"
This requires having tracked how long similar tasks took. Most teams do not do this systematically. The data that would most improve future estimates — the actual duration of past work, compared to estimates — is rarely collected and analyzed.
When teams do collect it, they often find that their estimates are consistently off in the same direction by a consistent factor. A team that estimates two weeks for features that take four weeks has a systematic 2x underestimation bias. This bias can be corrected by multiplying estimates by the historical factor. This sounds too simple to work. It often works.
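A minimal sketch of that correction, with invented historical data: take each past task's actual-to-estimated ratio, use the median as the team's factor, and scale new gut estimates by it.

```python
from statistics import median

# Hypothetical history: (estimated_days, actual_days) for past tasks.
history = [(5, 9), (3, 7), (10, 21), (2, 4), (8, 15)]

# Each task's overrun ratio, and the team's typical factor.
ratios = [actual / estimated for estimated, actual in history]
factor = median(ratios)


def calibrated(raw_estimate_days: float) -> float:
    """Scale a gut estimate by the historically observed factor."""
    return raw_estimate_days * factor


print(f"factor = {factor:.2f}")                       # factor = 2.00
print(f"raw 6 days -> {calibrated(6):.1f} days")      # raw 6 days -> 12.0 days
```

The median is used rather than the mean so a single pathological project does not dominate the factor. The mechanism is crude, and that is the point: it corrects the systematic component of the bias without requiring anyone to estimate better.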
The more important use of reference classes is to resist the natural tendency to treat each project as unique. "This project is different" is almost never true in the ways that matter for estimation. The sources of variance — integration complexity, unclear requirements, unexpected bugs, competing priorities — are largely the same across projects. Historical data captures the variance class even when it does not capture the specifics.
The Tasks Not in the Estimate
Estimates typically cover the direct work: writing the code, writing tests, doing code review. They typically do not cover the surrounding work.
Every feature involves time spent understanding the existing codebase. Time spent in meetings discussing the feature. Time spent on interruptions — questions from colleagues, other bugs that need to be handled, incidents. Time spent integrating with dependencies that behave unexpectedly. Time spent revising requirements that were unclear at the start.
None of this is deadweight. It is the actual cost of building software in a real organization. But it is often not in the estimate, so it appears as overrun when it materializes.
A more accurate estimate would include a realistic accounting of overhead. Most engineers know from experience that they do not spend 100% of their time on the task they are nominally working on. The overhead is real and consistent. Estimates that treat an eight-hour day as eight hours of productive task work are systematically off from the start.
When Estimates Cannot Be Given
There are tasks for which no meaningful estimate can be given at the time estimation is requested.
A task that requires first understanding an unfamiliar system cannot be estimated until the system has been understood. A bug whose cause has not been identified cannot be estimated until the cause is known. A feature with unclear requirements cannot be estimated until the requirements are clarified.
In these cases, the appropriate response is not to provide a number but to name the prerequisite: "I need to spend a day reading the codebase before I can estimate this" or "I need to understand what 'flexible' means in this requirement before I can say how long it will take."
This is often received badly. The request was for an estimate, and the response was a condition. But providing a number when the preconditions for estimation are not met is not honest estimation — it is guessing with the social framing of estimation. The number produced in those circumstances is not an estimate; it is a placeholder.
The organizations that handle this well distinguish between estimates given with adequate information and commitments made under uncertainty. They treat the first as the basis for planning and the second as a risk to be managed, not a failure to be penalized.
What This Means
Estimation is a skill, and like all skills, it can be developed. The development does not come primarily from trying harder to estimate accurately. It comes from tracking actuals against estimates, understanding the systematic biases in your own estimates, using reference classes, and building the organizational culture that allows uncertainty to be expressed honestly.
The estimate is not a contract. It is a best current answer to the question "how long, given what we know now?" As what you know changes, the estimate should change. An estimate that does not change as understanding changes is not an estimate; it is a deadline that was labeled as an estimate.
Software projects take the time they take. The goal of estimation is to know as early as possible what that time is likely to be — not to commit to a number that then determines how long the project is allowed to take.
The Model in Your Head
Every engineer who works on a system carries a mental model of that system. The model is not the system. It is a simplified, partially accurate representation that enables reasoning and prediction. Without it, you could not work on the system at all — you would have no basis for predicting the consequences of a change or diagnosing the cause of a failure.
The model is also always wrong in some ways, and the ways it is wrong matter enormously.
How Mental Models Form
You build a mental model of a system the first time you encounter it: reading the code, watching it run, talking to people who built it. The model starts rough and becomes more detailed as you interact with the system more.
But the model is not a passive reflection of what is there. It is an active construction — shaped by what you expect to find based on prior experience, by what the documentation says (which may not match the code), by what the previous engineer told you (which reflects their model, not the system directly), and by the parts you happened to examine versus the parts you did not.
This means mental models have structure that comes from you, not from the system. The analogies you used to understand the system — "it works like a queue," "it's basically a cache," "it's similar to the auth system we had at my last job" — are scaffolding that helped you build the model. They also introduce inaccuracies wherever the system does not actually behave like the analogy.
Divergence Over Time
A mental model formed at time T and a system modified after time T will diverge. The system changes; the model does not update automatically.
Some of this divergence is visible: you see the pull request, you read the change, your model updates. Some is invisible: a change was made to a part of the system you do not work on regularly, and your model never incorporated it. A behavior you assumed was true for years was actually changed eighteen months ago. The assumption is so baked into your model that you never thought to verify it.
This invisible divergence is the source of a particular class of bug: the engineer is completely confident that the code does X, because the code used to do X and they have not been told otherwise, but the code now does Y and has for over a year.
Experience is supposed to make engineers more effective, and mostly it does. But experience also creates confident wrong assumptions that are hard to dislodge precisely because they are confident. An engineer encountering a system for the first time has no model and reads carefully. An experienced engineer often reads less carefully because the model fills in what they expect to see, making discrepancies between expectation and reality less visible.
The Model as a Lens and a Filter
When you look at code through a mental model, the model determines what you see.
In a codebase you know well, your eye skims over familiar patterns and focuses on anomalies. This is efficient — you are using the model to filter out noise. But it is also a source of errors. If the "familiar pattern" has been modified in a significant way, your eye may not catch it. Your model predicts what will be there, and prediction shapes perception.
This is particularly relevant in code review. Reviewing code in a system you know well, you often read the diff through the lens of what you expect the change to look like given the intent. If the change does something different from what you expected — correctly or incorrectly — the discrepancy may not register.
Reviewers who catch the most bugs in code review tend to read code as if they are encountering it for the first time, suppressing the prediction from the model and reading what is actually there. This is harder in familiar code than in unfamiliar code, because familiarity generates stronger predictions.
Testing the Model
One of the more valuable habits an engineer can develop is testing their mental models explicitly rather than relying on them implicitly.
"I believe the cache has a TTL of five minutes — let me check the configuration to confirm." "I think this function always returns before calling the database — let me trace through it to verify." "My understanding is that this queue is FIFO — let me check whether there are priority levels I'm not aware of."
These are deliberate acts of verification. They are slower than acting on the model directly. They are also how you find the cases where the model has drifted from reality — and those are precisely the cases where acting on the model would produce errors.
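Those spot checks can even be written down as executable assertions. The configuration values below are invented; the point is that each belief is checked against the artifact rather than remembered.

```python
def check_beliefs(beliefs: dict, config: dict) -> dict:
    """Map each believed key to True (model matches) or False (drifted)."""
    return {
        key: config.get(key) == expected
        for key, expected in beliefs.items()
    }


# Stands in for a parsed configuration file.
config = {"cache_ttl_seconds": 300, "queue_mode": "priority"}

# What the mental model says, written down so it can be checked.
beliefs = {"cache_ttl_seconds": 300, "queue_mode": "fifo"}

for key, ok in check_beliefs(beliefs, config).items():
    print(f"{'ok   ' if ok else 'DRIFT'} {key}")
# ok    cache_ttl_seconds
# DRIFT queue_mode
```

Teams sometimes promote checks like these into the test suite as invariant tests, which turns a one-time verification into a standing alarm for silent drift.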
The instinct to verify rather than assume increases with engineering experience, or it should. Early-career engineers sometimes over-verify — checking things they can reasonably trust — but more often the problem is under-verification, especially on systems one knows well.
A useful heuristic is: the more confident you are about something, the more it is worth checking. High confidence with low recent verification is exactly the profile of an assumption that has silently diverged.
Shared Models and Their Failures
A team working on a system has a collection of individual mental models that overlap imperfectly. Each engineer's model is shaped by their history with the system — which parts they built, which bugs they debugged, which incidents they responded to.
Coordination problems between engineers often trace back to divergent models rather than bad intentions or poor communication. Engineer A changes the behavior of a shared service because their model says downstream consumers do not depend on the old behavior. Engineer B's service breaks because their model assumed the old behavior. Neither engineer was wrong given their model; the models were incompatible.
Design documents, architecture diagrams, and technical discussions are attempts to align these models. They create a shared reference that people can verify their individual models against. This alignment has a real effect: teams with more shared understanding of their system make fewer coordination errors.
The limitation is that shared documentation ages the same way individual models do. The architecture diagram from two years ago reflects the shared understanding from two years ago. The models that engineers actually use are built from that document plus everything that has happened since — and what has happened since may have drifted the system significantly from the diagram.
The Map and the Territory
There is a traditional framing for the relationship between models and reality: the map is not the territory. A map is a useful representation of a territory. It is not the territory itself. The features that are on the map exist in the territory; the territory also contains things that are not on the map.
The mental model is your map of the system. The system is the territory. Your map was accurate enough to navigate when you made it. The territory has changed since then, in ways large and small, and your map has not kept up.
This does not make maps useless. It makes them tools that require maintenance, scrutiny, and occasional willingness to put them down and look at the territory directly.
The most dangerous moment is not when you know your map is incomplete — you are alert then, reading carefully, open to surprise. The most dangerous moment is when you do not know your map is wrong, and you navigate confidently into territory that is not where the map says it is.
What This Means
The mental model is indispensable. You cannot reason about a system you have no model of. The model enables every useful thought you have about the code — how to structure a change, where a bug might be hiding, what the consequences of a modification will be.
The model is also a source of errors that are invisible precisely because they are errors in your model rather than errors in the code. The code is right in front of you. The gap between your model and the code is not.
The discipline that addresses this is not constant skepticism — you cannot function if you question every assumption always. It is targeted verification: the willingness, especially in unfamiliar territory or after long absence, to check the things you are most confident about. To read what is there rather than what you expect. To treat the system as the authority and your model as a hypothesis.
The system is always more complex than the model. That is not a failure — it is the nature of models. The goal is a model accurate enough to be useful and a habit of checking it often enough to stay approximately right.