Generating and Stress-Testing Hypotheses
Here is a fact about human cognition that should trouble you: you are systematically bad at generating hypotheses outside your experience, and you don’t know it. When asked to brainstorm explanations for a phenomenon, possible solutions to a problem, or candidate strategies for a challenge, you will generate a set of options that feels comprehensive but is, in fact, a narrow slice of the possibility space — the slice that’s accessible from your particular combination of training, experience, and cognitive habits.
This isn’t a character flaw. It’s architecture. Your brain generates hypotheses by pattern-matching against stored experience. If you’ve seen something like this before, you’ll think of it. If you haven’t, you won’t — and you won’t notice the gap. The hypotheses you don’t generate are invisible to you, which creates the illusion that the hypotheses you did generate are all there are.
AI has the opposite cognitive profile. It’s excellent at generating diverse hypotheses — it can draw on patterns from essentially every domain and combine them in ways that no individual human’s experience would suggest. But it’s unreliable at evaluating hypotheses. It can’t consistently distinguish between a hypothesis that’s genuinely promising and one that merely sounds plausible. It lacks the domain-specific judgment, the intuitive sense of “that doesn’t feel right,” and the practical experience that make human evaluation so powerful.
This complementarity is the basis for the most productive human-AI collaboration pattern I’ve found: a two-phase approach where AI generates and human evaluates. Phase 1 is divergent — maximize the number and diversity of hypotheses. Phase 2 is convergent — systematically evaluate each hypothesis against evidence and judgment. Use each cognitive system for what it’s good at. The result is a set of hypotheses that is both broader than what you’d generate alone and more rigorously evaluated than what the AI would produce alone.
Phase 1: Divergent Generation
The goal of Phase 1 is to produce the largest and most diverse set of hypotheses possible. You are explicitly not evaluating during this phase. Evaluation kills generation — the moment you start judging hypotheses, your brain shifts from creative mode to critical mode, and the flow of novel ideas stops.
The Base Generation Prompt
I'm going to describe a situation, problem, or observation. I want you
to generate as many hypotheses as possible for what might be causing it,
what might solve it, or what might be going on.
RULES FOR THIS PHASE:
- Quantity and diversity over quality. I want hypotheses from multiple
domains and perspectives.
- Include obvious hypotheses AND non-obvious ones. I can filter later.
- Include hypotheses that seem unlikely or even absurd — sometimes the
best explanation is the one nobody considers.
- Don't self-censor. If a hypothesis seems "too simple" or "too weird,"
include it anyway.
- For each hypothesis, give it a one-line summary and a 2-3 sentence
explanation of the mechanism.
- Aim for at least 15-20 hypotheses.
- Group them into categories (e.g., technical, human, organizational,
environmental, historical).
THE SITUATION:
[YOUR SITUATION, PROBLEM, OR OBSERVATION]
The instruction to “include obvious hypotheses” is counterintuitive but important. Sometimes the actual explanation is the obvious one, and people overlook it precisely because it’s obvious — they assume someone would have thought of it already. By including obvious hypotheses explicitly, you prevent the “surely someone has already considered that” blind spot.
The Perspective Multiplication Prompt
After the base generation, push for more diversity by explicitly requesting different lenses:
Good. Now generate additional hypotheses from each of these specific
perspectives:
1. A SYSTEMS THINKER who looks for feedback loops, emergent behavior,
and interaction effects between components
2. A HISTORIAN who looks for precedent — has this pattern occurred
before in a different context?
3. A CONTRARIAN who assumes the conventional explanation is wrong and
looks for alternatives
4. An OUTSIDER who has no domain knowledge and asks naive questions
that insiders wouldn't think to ask
5. A DATA SCIENTIST who asks what the data would show if each hypothesis
were true — and what you'd need to measure
For each perspective, generate at least 3 additional hypotheses that
weren't in your initial list.
This prompt typically adds 10-15 hypotheses that are qualitatively different from the initial batch. The systems thinker catches interaction effects that reductionist thinking misses. The historian finds precedent that illuminates the present. The contrarian generates the hypotheses that nobody wants to consider. The outsider asks the questions that feel too basic for experts. The data scientist operationalizes the hypotheses, making them testable.
The Negative Space Prompt
The most valuable hypotheses are often the ones that are hardest to generate — the explanations that live in your cognitive blind spots. This prompt explicitly targets them:
Look at the full list of hypotheses we've generated. Now consider:
what's MISSING?
Specifically:
- What category of explanation have we not considered at all?
- What would someone from a completely different field suggest that
we haven't thought of?
- What hypothesis would be embarrassing or uncomfortable if true?
(These are the ones most likely to be systematically avoided.)
- What hypothesis requires information we don't currently have —
and what would that information be?
- What hypothesis would explain the situation by suggesting our
framing of the problem is wrong?
Generate 5-10 additional hypotheses from the negative space — the
space of things we haven't been thinking about.
The instruction about “embarrassing or uncomfortable” hypotheses is specifically designed to counter a well-documented cognitive bias: people systematically avoid generating hypotheses that would reflect poorly on them, their team, or their organization. “Maybe the product isn’t selling because the product isn’t good” is the kind of hypothesis that’s obvious to outsiders but genuinely difficult for insiders to generate. The AI doesn’t share your ego, so it can go there.
Phase 2: Convergent Evaluation
Phase 2 is where you switch from generation to evaluation. This is where human judgment dominates. The AI’s role shifts from generator to structured evaluator — it provides frameworks and analysis, but you make the judgments about which hypotheses are promising.
Step 1: Quick Triage
Before detailed evaluation, do a quick triage to reduce the list to a manageable size:
Here is our full list of hypotheses:
[PASTE ALL HYPOTHESES]
Help me do a quick triage. For each hypothesis, assign it to one of
three categories:
INVESTIGATE: This hypothesis is plausible enough and important enough
to warrant serious evaluation.
PARK: This hypothesis is possible but either unlikely or less important.
Keep it on the list but don't prioritize it.
DISCARD: This hypothesis can be eliminated based on what we already know.
For each discarded hypothesis, state specifically WHY it can be eliminated.
Important: err on the side of INVESTIGATE. The cost of investigating a
false hypothesis is low. The cost of discarding a true one is high.
Note the asymmetry instruction: “err on the side of INVESTIGATE.” This counters the natural tendency (both human and AI) to prematurely narrow the hypothesis set. At this stage, you want to keep options open.
After the AI categorizes them, review the categorization yourself. You’ll frequently disagree with specific assignments — and your disagreements are informative. If the AI discards a hypothesis that you think is worth investigating, or investigates one you think should be discarded, examine why you disagree. The disagreement itself is data about your assumptions.
Step 2: Evidence Mapping
For each hypothesis in the INVESTIGATE category, map the existing evidence:
For each INVESTIGATE hypothesis, I want an evidence map:
1. SUPPORTING EVIDENCE: What facts, data, or observations are
consistent with this hypothesis?
2. CONTRADICTING EVIDENCE: What facts, data, or observations are
inconsistent with this hypothesis?
3. MISSING EVIDENCE: What evidence would strongly confirm or strongly
disconfirm this hypothesis, but we don't currently have?
4. DISTINGUISHING EVIDENCE: What evidence would distinguish this
hypothesis from competing hypotheses? (What would be true if THIS
hypothesis is correct but NOT true if a competing hypothesis is
correct?)
Item #4 is the most important. Many hypotheses are consistent with
the same evidence — what we need is evidence that discriminates
between them.
Hypotheses to map:
[YOUR INVESTIGATE LIST]
The distinguishing evidence question (#4) is drawn from the philosophy of science — specifically, from the idea that a hypothesis is only meaningful if there are observations that could distinguish it from alternatives. Two hypotheses that predict exactly the same observations are, for practical purposes, the same hypothesis. The distinguishing evidence question forces you to identify what actually separates your candidate explanations.
Step 3: Structured Evaluation
Now evaluate each remaining hypothesis against explicit criteria:
For each remaining hypothesis, evaluate it against these criteria
on a 1-5 scale:
PLAUSIBILITY: How well does this hypothesis fit with established
knowledge and mechanisms? (1 = contradicts known facts, 5 = fully
consistent and mechanistically clear)
EVIDENCE FIT: How well does this hypothesis explain the specific
observations we're trying to explain? (1 = explains nothing,
5 = explains everything elegantly)
TESTABILITY: How easy is it to design a test that would confirm
or disconfirm this hypothesis? (1 = untestable, 5 = easily testable
with available resources)
ACTIONABILITY: If this hypothesis is true, does it suggest a clear
course of action? (1 = no actionable implications, 5 = clear and
specific actions)
NOVELTY: Would this hypothesis be surprising to domain experts?
(1 = obvious and well-known, 5 = genuinely novel). Note: novelty
is neither good nor bad — it's information about how much the
hypothesis adds to existing thinking.
For each criterion, explain your rating. Don't just assign numbers.
I use this scoring not as a mechanical decision tool but as a discussion framework. The AI’s ratings are starting points for my own evaluation. Where I disagree with the AI’s rating, I examine why — and often find that either I’m wrong (I was overrating a hypothesis because I liked it) or the AI is wrong (it was underrating a hypothesis because it couldn’t assess domain-specific nuances). Both outcomes are informative.
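When you score more than a handful of hypotheses this way, it helps to keep the ratings in a structure you can sort and revisit rather than in chat scrollback. A minimal sketch in Python, using the five criteria from the prompt above; the hypothesis names are hypothetical, and the rationale strings matter as much as the numbers:

```python
from dataclasses import dataclass, field

CRITERIA = ["plausibility", "evidence_fit", "testability", "actionability", "novelty"]

@dataclass
class Scorecard:
    """Per-hypothesis ratings (1-5) plus a free-text rationale per criterion."""
    hypothesis: str
    ratings: dict = field(default_factory=dict)    # criterion -> int
    rationale: dict = field(default_factory=dict)  # criterion -> str

    def rate(self, criterion, score, why):
        assert criterion in CRITERIA and 1 <= score <= 5
        self.ratings[criterion] = score
        self.rationale[criterion] = why

def comparison_table(cards):
    """Sort by the sum of the four 'merit' criteria. Novelty is reported
    separately because, as the prompt notes, it is information, not merit."""
    merit = lambda c: sum(c.ratings.get(k, 0) for k in CRITERIA[:4])
    rows = sorted(cards, key=merit, reverse=True)
    return [(c.hypothesis, merit(c), c.ratings.get("novelty")) for c in rows]

# Example: one hypothetical hypothesis scored for discussion
artifact = Scorecard("engagement decline is a measurement artifact")
for crit, score in zip(CRITERIA, [4, 4, 5, 5, 2]):
    artifact.rate(crit, score, "see evidence map")
```

The point of the structure is the conversation it supports: when your rating and the AI's rating for the same cell diverge, the two rationale strings show you exactly where the disagreement lives.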
Step 4: The Killer Test
For the top-ranked hypotheses, identify the single most informative test:
For each of our top 3-5 hypotheses, design the "killer test" — the
single experiment, observation, or data analysis that would most
definitively confirm or disconfirm it.
Requirements:
- The test must be feasible with available resources
- The test must be able to produce a clear positive or negative result
(not an ambiguous one)
- The test should ideally discriminate between multiple competing
hypotheses simultaneously
- Specify what result you'd expect if the hypothesis is TRUE and what
result you'd expect if it's FALSE
For each test, also identify: what could go wrong with the test itself?
What would a false positive look like? A false negative?
Top hypotheses:
[YOUR SHORTLIST]
This is where the two-phase approach pays its dividend. You started with a hypothesis set that was broader than anything you’d generate alone (because the AI generated it). You’ve now narrowed it to a shortlist that’s better evaluated than the AI could manage alone (because you applied your judgment). And you’ve identified specific, feasible tests that can move you from speculation to evidence. This is the full pipeline from “I don’t know what’s going on” to “here’s how to find out.”
Worked Example: Product Development
The situation: A B2B product has seen declining engagement over the past quarter. Monthly active users are down 15%, feature usage is down across the board, and support tickets are up. The product team’s working hypothesis is that a recent UI redesign is the cause.
Phase 1 output (condensed to key hypotheses across categories):
Interface hypotheses:
1. The UI redesign disrupted established workflows, increasing friction.
2. The redesign introduced navigation changes that make key features harder to find.
3. Performance degraded with the redesign (heavier frontend framework).
Product-market hypotheses:
4. A competitor launched a compelling alternative during the same period.
5. Customer needs have shifted and the product's core value proposition is weakening.
6. The market segment is contracting (macroeconomic factors).
Organizational hypotheses:
7. Key customer-facing team members left, and relationship quality degraded.
8. Support response times increased, driving dissatisfaction.
9. Pricing changes or renewal terms are causing friction.
Data/measurement hypotheses:
10. The engagement metrics changed definition with the redesign, and the decline is partially or wholly a measurement artifact.
11. A tracking bug was introduced with the redesign, causing undercounting.
Interaction hypotheses:
12. The redesign is fine, but it coincided with another change (pricing, support, account management) and is being blamed for the other change's impact.
13. The redesign is causing problems only for a specific user segment, but aggregate metrics obscure this.
Embarrassing hypotheses:
14. The product has accumulated enough technical debt and bugs that it's genuinely unreliable, and the redesign just tipped users over their frustration threshold.
15. The product team has been building features that the team finds interesting rather than features that customers need.
Phase 2 evaluation highlights:
Hypothesis #10 (measurement artifact) was rated highest on testability and was the first to investigate — because if the decline is a measurement artifact, all other hypotheses are moot. A quick analysis of raw event logs vs. the new dashboard metrics revealed that the new tracking code was indeed undercounting page views by approximately 8%. So roughly half the “decline” was a measurement error. This hypothesis would likely not have been generated by the product team, whose working hypothesis (the UI redesign) assumed the data was correct.
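The measurement-artifact check reduces to comparing two counts of the same events. A minimal sketch of that comparison; the numbers here are synthetic, chosen to mirror the roughly 8% undercount in the example, and in practice both series would come from your server-side logs and your analytics export:

```python
# Hypothesis #10 check: compare raw server-side event counts against what
# the (possibly buggy) new tracking code reported for the same days.
raw_daily_views =     [10400, 10150, 10600, 9900, 10300]  # server logs (ground truth)
tracked_daily_views = [ 9550,  9340,  9750, 9110,  9480]  # new client-side tracker

def undercount_rate(raw, tracked):
    """Fraction of real events the tracker missed, pooled across days."""
    return 1 - sum(tracked) / sum(raw)

rate = undercount_rate(raw_daily_views, tracked_daily_views)
print(f"tracker undercounts by {rate:.1%}")  # ~8% with these synthetic numbers
```

The analysis is trivial once someone thinks to run it; the hard part, which Phase 1 supplies, is generating the hypothesis that the dashboard itself might be lying.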
Hypothesis #12 (coinciding changes) led to the discovery that the sales team had changed renewal terms during the same quarter, which was causing friction with existing customers that manifested as reduced engagement. The UI redesign was getting blamed for the renewal friction.
Hypothesis #13 (segment-specific impact) led to a segmented analysis that showed the engagement decline was concentrated in smaller accounts, not larger ones. The redesign had actually improved engagement for large accounts while degrading it for small accounts — information that was invisible in the aggregate data.
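The segmented analysis is a classic aggregate-masks-the-segments pattern. A sketch with synthetic numbers chosen to reproduce the shape of the example: large accounts up, small accounts down, and an aggregate that shows only a decline:

```python
# Hypothesis #13 check: the same engagement data, aggregated vs. split by
# account size. Segment names and numbers are illustrative only.
# (segment, users_before, users_after) — weekly active users per segment
engagement = [
    ("large_accounts", 2000, 2200),  # up ~10% after the redesign
    ("small_accounts", 8000, 6500),  # down ~19% after the redesign
]

def pct_change(before, after):
    return (after - before) / before

aggregate = pct_change(sum(b for _, b, _ in engagement),
                       sum(a for _, _, a in engagement))
by_segment = {seg: pct_change(b, a) for seg, b, a in engagement}

print(f"aggregate: {aggregate:+.1%}")  # one 'decline' number...
for seg, chg in by_segment.items():
    print(f"{seg}: {chg:+.1%}")        # ...hiding opposite trends
```

With these numbers the aggregate reads as a uniform 13% decline, exactly the kind of single figure that sends a team off to "fix" the redesign for everyone.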
The product team’s original hypothesis (the UI redesign) turned out to be partially true but significantly less important than the measurement error and the coinciding changes. Without the systematic hypothesis generation in Phase 1, the team would have spent months optimizing a UI redesign that was responsible for perhaps 20% of the observed decline.
Worked Example: Debugging
The situation: An intermittent production error occurs approximately once per day, always between 2 AM and 5 AM. The error causes a specific microservice to return 500 errors for 3-7 minutes before self-recovering. Standard monitoring shows no obvious resource exhaustion, no deployment changes, and no upstream service issues during the error windows.
Phase 1 highlights (non-obvious hypotheses):
Temporal hypotheses:
- A cron job or scheduled task running during that window creates transient load or lock contention.
- Database maintenance operations (vacuum, reindex, backup) run during that window.
- A third-party API the service depends on has a maintenance window during those hours.
Interaction hypotheses:
- The error isn’t in the service itself but in a dependency that the monitoring doesn’t cover (DNS, service mesh sidecar, certificate rotation).
- Garbage collection pauses accumulate during low-traffic hours when the JVM isn’t under pressure to collect, then a batch of requests triggers a full GC at the worst moment.
Infrastructure hypotheses:
- The cloud provider performs maintenance on the underlying infrastructure during those hours, causing brief network partitions.
- The service runs on spot instances that are being reclaimed and replaced during low-demand hours.
Anti-obvious hypotheses:
- The service is actually failing all the time, but during high-traffic hours the load balancer routes around the failed instance before users notice. The 2-5 AM window is when traffic is low enough that all requests hit the single failing instance.
The last hypothesis — that the failure is constant but only visible during low traffic — was the one that turned out to be correct. One instance of the service had a memory leak that caused periodic crashes. During high-traffic hours, the load balancer detected the crashed instance and routed traffic to healthy ones within milliseconds. During the 2-5 AM window, traffic was low enough that the load balancer was slow to detect the failure, so requests kept landing on the unhealthy instance during its crash-and-restart cycle. The fix was patching the memory leak; nothing about the time window itself was causally significant.
This is a case where the most counterintuitive hypothesis was the correct one, and it was generated precisely because the prompt explicitly asked for hypotheses that challenged the obvious framing (that the time window was causally significant).
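The mechanism behind the winning hypothesis, constant failure that is only visible at low traffic, can be sanity-checked with a toy Monte Carlo model before anyone touches production. Everything here is an assumption made for illustration: three instances, one crashed for five minutes, and a load balancer that ejects an instance after three failed requests:

```python
import random

def observed_error_rate(requests_per_min, detect_after_failures=3,
                        crash_minutes=5, trials=2000, seed=0):
    """Toy model: one instance of three is down during a crash window.
    The load balancer ejects it after `detect_after_failures` failed
    requests. Returns the mean fraction of requests that saw an error."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        total = requests_per_min * crash_minutes
        failures = 0
        for _ in range(total):
            if failures >= detect_after_failures:
                break  # instance ejected; remaining requests succeed
            if rng.random() < 1 / 3:  # request routed to the crashed instance
                failures += 1
        rates.append(failures / total)
    return sum(rates) / trials

print(f"high traffic (600 rpm): {observed_error_rate(600):.3f} of requests fail")
print(f"low traffic    (2 rpm): {observed_error_rate(2):.3f} of requests fail")
```

Under these assumptions the same crash produces a negligible error rate at high traffic and a dramatic one at low traffic, which is exactly the "failure is constant, visibility is not" signature the hypothesis predicts.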
Worked Example: Strategic Planning
The situation: A mid-size consulting firm is trying to decide whether to specialize in a specific industry vertical or remain a generalist firm.
Phase 1 hypothesis generation focused on “reasons to specialize” and “reasons to remain generalist,” but the most valuable hypotheses were in a third category: “reasons the question itself is wrong.”
Key hypotheses in that category:
- The specialize-vs-generalist framing is a false dichotomy. The firm could create a “T-shaped” model: deep expertise in one vertical with general capability across others.
- The real question isn’t about specialization but about positioning. A generalist firm can position as a specialist to specific markets without actually restricting its service offerings.
- The choice between specialization and generalism should be driven by which generates better referrals, not which generates better deliverables. Specialized firms get more referrals because the referring party has a clear mental model of what they do.
- The specialization question is a proxy for a deeper question: does this firm have a distinctive point of view? A generalist firm with a strong point of view is more successful than a specialist firm without one.
The last hypothesis — that specialization is a proxy for point of view — reframed the entire strategic discussion. The firm realized that what they actually lacked wasn’t a vertical focus but a distinctive perspective on their work. They could develop that perspective without restricting their client base. The specialization question, which had consumed months of leadership time, turned out to be the wrong question.
The Complementarity Principle
The key insight of this chapter bears repeating because it’s the foundation of productive human-AI collaboration for thinking:
Humans are bad at generating hypotheses outside their experience but good at evaluating hypotheses once they’re stated. You can assess a hypothesis against your domain knowledge, your practical experience, your intuition, and your understanding of context in ways that AI cannot. But you can only evaluate hypotheses that exist — and the ones you generate on your own are a biased, narrow sample of the full possibility space.
AI is good at generating diverse hypotheses but unreliable at evaluating them. It can draw on patterns from every domain and combine them in novel ways. But it can’t reliably tell you which of its generated hypotheses are genuinely promising and which are superficially plausible nonsense.
The two-phase approach exploits this complementarity: AI generates the breadth, you provide the depth. AI ensures you’re not missing important possibilities. You ensure the possibilities are rigorously tested.
Neither phase works well without the other. AI generation without human evaluation produces a useless heap of plausible-sounding hypotheses. Human evaluation without AI generation produces a too-narrow set of well-evaluated but potentially missing-the-point hypotheses. Together, they produce something that neither can achieve alone: a comprehensive, rigorously evaluated hypothesis set that covers the possibility space and identifies the most promising candidates for investigation.
Practical Tips
Don’t evaluate during Phase 1. This is the single most important rule and the hardest to follow. When the AI generates a hypothesis that seems obviously wrong, your instinct is to say “no, that’s not it” and move on. Resist. The obviously-wrong hypothesis might be wrong, but it might also be challenging an assumption you didn’t know you had. Collect everything during Phase 1. Evaluate during Phase 2. Mixing the phases destroys the value of both.
Provide rich context for generation. The more context you give the AI in Phase 1, the more specific and useful its hypotheses will be. Don’t just describe the problem — describe the context, the history, what you’ve already tried, what you’ve already ruled out, and what constraints you’re operating under. The AI uses all of this to generate hypotheses that are relevant rather than generic.
Use your disagreements as data. When you disagree with the AI’s evaluation in Phase 2, don’t just override it — examine the disagreement. Are you disagreeing because you have domain knowledge the AI lacks? Or are you disagreeing because the hypothesis challenges something you’d prefer not to question? The former is good judgment. The latter is defensiveness.
Run Phase 1 multiple times. Each run of Phase 1 produces a somewhat different set of hypotheses. Running the generation prompt three times and combining the results produces a more diverse set than running it once. The AI isn’t deterministic — different runs will surface different patterns.
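Combining multiple runs means merging near-duplicate hypotheses that differ only in wording. A cheap first pass, assuming the hypotheses arrive as plain strings; normalizing case, punctuation, and word order catches the easy paraphrases, and a real pipeline might use embedding similarity for the rest:

```python
import re

def normalize(hypothesis: str) -> frozenset:
    """Reduce a hypothesis to its bag of lowercase words."""
    return frozenset(re.findall(r"[a-z]+", hypothesis.lower()))

def merge_runs(*runs):
    """Concatenate runs, keeping the first occurrence of each hypothesis
    whose normalized form hasn't been seen before."""
    seen, merged = set(), []
    for run in runs:
        for h in run:
            key = normalize(h)
            if key not in seen:
                seen.add(key)
                merged.append(h)
    return merged

run1 = ["The redesign broke key workflows.", "A competitor launched recently."]
run2 = ["A competitor recently launched.", "Tracking changed with the redesign."]
print(merge_runs(run1, run2))  # the word-order variant is dropped
```

Bag-of-words matching will miss genuine paraphrases ("a rival shipped a better product"), so treat it as a pre-filter and do the final dedup by eye.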
Save the hypothesis set. Even after you’ve completed the evaluation and identified the most promising hypotheses, save the full Phase 1 output. Hypotheses you discarded may become relevant later as new information emerges. Having the full set available means you can quickly check whether new evidence supports a hypothesis you previously parked.
Time-box the process. Phase 1 should take 30-60 minutes. Phase 2 should take 60-120 minutes. The entire process, from situation description to prioritized hypothesis list with killer tests, should take a single working session. If it’s taking longer, you’re either dealing with an exceptionally complex situation or you’re overthinking it.
When This Technique Fails
The two-phase approach fails when:
The relevant hypothesis isn’t in the AI’s training data. If the explanation for your situation is genuinely novel — involving a technology, a market dynamic, or a causal mechanism that didn’t exist when the AI was trained — the AI won’t generate it. This is rare for most business and technical problems, but it happens.
The problem is too well-defined. If you already know the answer and you’re just looking for confirmation, this technique is overkill. Not every problem needs a hypothesis-generation exercise. When the diagnosis is straightforward, just fix the problem.
You can’t evaluate the hypotheses. If you lack the domain knowledge to distinguish good hypotheses from bad ones, Phase 2 breaks down. In this case, you need a human domain expert, not a better AI prompt. The AI can help you identify what kind of expert you need, but it can’t replace them.
The hypothesis set is too large to evaluate. If Phase 1 generates 50+ hypotheses and you can’t efficiently triage them, the process becomes unwieldy. The fix is to tighten the context in Phase 1 (more specific situation description) or to do an aggressive first-pass triage before detailed evaluation.
Despite these limitations, the two-phase approach is the most generally useful technique I’ve found for situations where the answer isn’t obvious and the conventional wisdom isn’t working. It systematically addresses the most common cause of analytical failure: not that you evaluated a hypothesis incorrectly, but that you never considered the right one at all.
The prompts are in this chapter. The framework is straightforward. The underlying insight is simple but powerful: the biggest risk in any analysis isn’t that you’ll reach the wrong conclusion about the right hypothesis. It’s that you’ll never consider the right hypothesis at all. AI doesn’t solve this problem — but it dramatically expands the space of hypotheses you have the opportunity to consider. What you do with that expanded space is still up to you.