Decision journals & calibration: judge the choice, not the result

A deal closes and the manager who pushed it gets promoted. The same call, made the same way, falls over a quarter later and someone gets quietly moved on. Often the decision was identical; only the dice were different. The problem is that we judge decisions by how they turned out, and the world rarely lets us see the choice clearly once we know the score.

The quick version

A decision journal is a short note written before you know the outcome: what you decided, what you expected, how confident you were, and why.
It exists to defeat two reflexes: resulting (grading a decision by its result) and hindsight bias (rewriting what you "knew" after the fact).
Calibration is the skill of attaching honest probabilities to those expectations: saying "70%" and being right roughly 70% of the time.
Both are trainable with feedback, and both are cheap: a notebook page and twenty minutes a fortnight.

The idea in depth: why memory is a hostile witness

Start with a distinction the poker world made famous. In her book Thinking in Bets (2018), former professional poker player Annie Duke calls the habit of equating decision quality with outcome quality "resulting." You went through a red light and got home safely: was it a good decision? Resulting says yes, because nothing bad happened. The trouble is that under uncertainty, a good decision can lose and a bad one can win, because luck sits between the choice and the result. Grade the choice by the result and you reward recklessness that paid off and punish careful calls that didn't.

The second saboteur is hindsight bias, the "I-knew-it-all-along" effect. Once we know how something turned out, we unconsciously revise our memory of what we expected, so the outcome feels far more predictable than it actually was. Duke describes the decision journal as a direct counter: because your reasoning is written down before the result is known, you can read it later and recover what you actually believed, rather than the flattering version your brain has since edited. So the move is mechanical, not heroic: get the reasoning out of your head and onto the page while the future is still genuinely open.

flowchart LR
  A(["Decision"]) --> B(["Luck / chance"])
  B --> C(["Outcome"])
  C -. "resulting
(judges back to here)" .-> A
  D(["Decision journal
captured before C"]) --> A
  classDef j fill:#ede9fe,stroke:#7c3aed;
  class D j;

Resulting judges the decision by the outcome, ignoring the luck in between. A journal records the decision before the outcome exists.

This isn't only a poker insight. The popular decision-journal template comes from Shane Parrish at Farnam Street, whose version asks you to write down the situation, the decision, the range of outcomes you expect, your confidence in each, and, tellingly, how you feel physically and mentally at the time. The emotional and situational notes matter because they let you spot patterns later: that you decide worse when rushed, or after a bad meeting, or late in the day. A journal turns single decisions into a dataset about your own judgement.

The idea in depth: calibration, or saying 70% and meaning it

A journal tells you what you expected. Calibration is about whether the confidence you attached to it was honest. You are well calibrated if, across all the times you said you were 70% sure, the thing happened about 70% of the time. Most of us miss in one direction: overconfidence. The classic work here is Sarah Lichtenstein and Baruch Fischhoff's research on general-knowledge questions, which found that the harder the question, the more confidence outran accuracy: on genuinely difficult items, people claimed high certainty while their hit rate slid toward chance.

The more useful finding is that this can be worked on. The calibration literature, surveyed in Lichtenstein, Fischhoff and Phillips' "Calibration of Probabilities: The State of the Art to 1980," shows that practice with prompt feedback can sharpen calibration, though it tends to take real repetition rather than a single nudge. The clearest practical version comes from decision-science writer Douglas Hubbard, who built a drill around exactly this in How to Measure Anything: ask people for 90% confidence intervals on trivia ("how long is the Nile?"), reveal the answers, and repeat. Hubbard reports that estimators often start with only about 60% of their "90%" intervals actually containing the true answer, and that a few rounds of practice move them toward a genuine 90%. So the move is to practise estimating with feedback, deliberately, before the stakes are real.
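To make the drill concrete, here is a minimal sketch of how one round of interval estimates could be scored. The class and function names are invented for illustration, the sample answers are made up, and the "truth" values are approximate reference figures; none of this comes from Hubbard's own materials.

```python
from dataclasses import dataclass


@dataclass
class IntervalEstimate:
    """One answer in a 90%-interval drill (field names are illustrative)."""
    question: str
    low: float    # lower bound of the stated 90% interval
    high: float   # upper bound of the stated 90% interval
    truth: float  # the answer revealed afterwards


def hit_rate(estimates: list[IntervalEstimate]) -> float:
    """Fraction of stated intervals that actually contained the true value."""
    if not estimates:
        raise ValueError("need at least one estimate")
    hits = sum(1 for e in estimates if e.low <= e.truth <= e.high)
    return hits / len(estimates)


# A made-up first round: five "90%" intervals, only three of which hold the truth.
round_one = [
    IntervalEstimate("Length of the Nile (km)", 3000, 5000, 6650),
    IntervalEstimate("Height of Everest (m)", 8000, 9500, 8849),
    IntervalEstimate("Year of the first crewed Moon landing", 1965, 1968, 1969),
    IntervalEstimate("Bones in an adult human body", 150, 250, 206),
    IntervalEstimate("Speed of light (km/s)", 250_000, 350_000, 299_792),
]

print(f"Stated 90%, actual {hit_rate(round_one):.0%}")
```

A gap between the stated 90% and the measured rate is the signal to widen the intervals on the next round. Five questions is far too few to trust the number; a real drill would use many more.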

Calibration scales from the individual to the institution. Philip Tetlock and Dan Gardner's Superforecasting (2015) reports on the Good Judgment Project, a multi-year IARPA-funded forecasting tournament. The project measured accuracy with the Brier score, where lower is better and 0 is perfect. Ordinary volunteers who tracked their predictions, scored them, and adjusted became "superforecasters" who, per the project's reporting, beat trained intelligence analysts with access to classified material. What made them good wasn't expertise or access; it was the discipline of putting a number on a belief, checking it against reality, and updating. That is calibration as a team sport, and it connects directly to Bayesian reasoning, priors & updating: calibration is how you learn what your starting odds should have been.
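For readers who want to see the arithmetic behind a Brier score, the sketch below uses the common binary form: the average squared gap between the probability you stated and what happened (1 if it occurred, 0 if not). This is an assumption about form, not a description of the project's exact scoring rule, and the function name is illustrative.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated probability and outcome (binary form).

    Each forecast is (probability you gave the event, whether it happened).
    0.0 is perfect; always answering 50% scores 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    for p, _ in forecasts:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    squared_gaps = [(p - (1.0 if happened else 0.0)) ** 2 for p, happened in forecasts]
    return sum(squared_gaps) / len(squared_gaps)


# Illustrative comparison: a hedger who always says 50% versus a forecaster
# who commits to 80% / 20% and is right about the direction each time.
hedger = [(0.5, True), (0.5, False), (0.5, True), (0.5, False)]
committed = [(0.8, True), (0.2, False), (0.8, True), (0.2, False)]

print(f"hedger:    {brier_score(hedger):.2f}")
print(f"committed: {brier_score(committed):.2f}")
```

The score rewards confidence only when it is earned: a single "90%" call that fails costs 0.81 on its own, which is why overconfident forecasters score badly over time.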

The honest limitation. Calibration is cleanest when you make many comparable bets with feedback that arrives reasonably soon: hiring, pricing, project estimates, short-horizon forecasts. It is far weaker for rare, one-off, long-horizon strategic bets where you may get only one outcome, years later, and never see the worlds that didn't happen. For those, the journal still earns its keep (it preserves your reasoning), but the calibration maths is thin. Treat calibration as a lens for your repeatable decisions, not a law you can apply to every choice. And beware the opposite failure: a journal can become theatre, a box-ticking ritual that records confidence without ever being reviewed. Unreviewed, it teaches nothing.

flowchart TD
  A(["Make a forecast:
'70% we hit the date'"]) --> B(["Log it with reasoning"])
  B --> C(["Outcome arrives"])
  C --> D{"Were your 70%s
right ~70% of the time?"}
  D -- "less often" --> E(["Overconfident -
widen your ranges"])
  D -- "more often" --> F(["Underconfident -
back yourself harder"])
  E --> A
  F --> A

The calibration loop: forecast, log, check against reality, adjust the dial. The feedback is the whole point.

A worked example: the launch date nobody remembers agreeing to

Maya leads a product team. Her engineering lead, Sam, commits to shipping a new checkout flow "by the end of Q3, I'd say we're 90% there." Maya, instead of just nodding, opens a shared decision log and writes four lines: the decision (ship checkout by 30 Sept), the forecast (Sam: 90% confident), the key assumptions (no payments-vendor delays; the two senior engineers stay on the project), and how the room felt (optimistic, just off the back of a good demo). It takes three minutes.

Q3 ends and checkout ships three weeks late because the payments vendor slipped, breaking exactly the assumption they'd flagged. Without the log, the retro would dissolve into resulting: either "we failed, Sam over-promised," or, if it had squeaked in on time, "great call, nothing to learn." With the log, the conversation is sharper. The decision to proceed was reasonable; the calibration was off. A 90% claim that rests on an external vendor nobody controls probably should have been 65%. (Those figures are illustrative.)

Over a few quarters a pattern emerges in the log: the team's "90%" estimates land on time about 60% of the time. That single number is gold. It means delivery dates can be quietly de-rated: when the team says 90%, leadership now reads "likely, but build in slack." Nobody is blamed; the dial is simply recalibrated. This is the practical heart of risk vs uncertainty vs ambiguity: the journal forces vague optimism into a number you can actually test, which is the first step to managing the uncertainty instead of being surprised by it.
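Pulling that pattern out of a log is simple counting. The sketch below assumes each logged call has been reduced to a stated confidence (as a whole-number percentage) and whether it came true; the function name and the sample log are invented to mirror the illustrative figures in this example.

```python
from collections import defaultdict


def calibration_table(log: list[tuple[int, bool]]) -> dict[int, tuple[int, float]]:
    """Map each stated confidence (percent) to (number of calls, observed hit rate)."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for stated_pct, happened in log:
        buckets[stated_pct].append(happened)
    return {
        pct: (len(results), sum(results) / len(results))
        for pct, results in sorted(buckets.items())
    }


# Illustrative log: ten "90%" delivery commitments, six landed on time;
# four "70%" commitments, three landed on time.
delivery_log = (
    [(90, True)] * 6 + [(90, False)] * 4
    + [(70, True)] * 3 + [(70, False)]
)

for pct, (n, observed) in calibration_table(delivery_log).items():
    print(f"said {pct}%: {n} calls, on time {observed:.0%}")
```

Keep the call count next to each rate. A bucket with four entries is a hint, not a finding, and the de-rating only deserves trust once a bucket holds enough comparable calls.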

Frequently asked questions

Isn't this just over-engineering decisions we'd make fine anyway?

For routine, reversible calls, yes: skip it. The journal earns its keep on consequential, uncertain decisions you'll want to learn from: hires, pricing, big bets, launch commitments. The test is simple: if you'd struggle in six months to honestly reconstruct what you expected and why, log it now.

What exactly goes in an entry?

Five short fields: the decision, the outcomes you expect (with a probability on each), your confidence, the key assumptions it rests on, and your state of mind. Parrish's Farnam Street template adds the time of day and how you feel, which is useful for spotting that you decide worse when tired or rushed. Keep it short enough that you'll actually do it.
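If the log lives in a script or spreadsheet export rather than a notebook, the five fields map onto a small record. This is one possible shape, not Parrish's template: every field name is hypothetical, and the sample entry simply restates Maya's note from the worked example above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class JournalEntry:
    """One decision-journal entry (all field names are illustrative)."""
    decision: str
    expected_outcomes: dict[str, float]  # outcome description -> probability
    confidence: float                    # overall confidence, between 0 and 1
    key_assumptions: list[str]
    state_of_mind: str
    logged_on: date = field(default_factory=date.today)  # written before the outcome

    def __post_init__(self) -> None:
        # Reject numbers that cannot be probabilities, so the log stays scoreable.
        probabilities = list(self.expected_outcomes.values()) + [self.confidence]
        if any(not 0.0 <= p <= 1.0 for p in probabilities):
            raise ValueError("probabilities must lie between 0 and 1")


entry = JournalEntry(
    decision="Ship checkout by 30 Sept",
    expected_outcomes={"ships on time": 0.9, "ships late": 0.1},
    confidence=0.9,
    key_assumptions=[
        "no payments-vendor delays",
        "the two senior engineers stay on the project",
    ],
    state_of_mind="optimistic, just off the back of a good demo",
)
print(entry.decision, entry.logged_on)
```

Stamping the date at creation is the point of the design: it shows the entry was written while the outcome was still open.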

How do I get better at the probabilities?

Practise on low-stakes questions with fast feedback. Hubbard's drill (give 90% confidence intervals for trivia, then check the answers) works because the feedback loop is immediate. A few rounds noticeably reduce overconfidence. Then carry the habit into real forecasts and review them.

Won't writing down a confident number just make people defensive?

Only if you use the journal to assign blame. Frame it explicitly as judging the decision and the calibration, never the person: a good decision can lose. When a "90%" that should have been "65%" costs nothing in status and simply adjusts the team's dial, people log honestly. Used as a weapon, the journal dies within a month.

How is this different from a post-mortem or a premortem?

A premortem imagines failure before you decide, in order to surface risks; a post-mortem dissects what happened after something breaks. A decision journal sits in between and across many decisions: it captures your reasoning in the moment and lets you score your judgement over time, not just diagnose one disaster.

Related in the Toolkit

Bayesian reasoning, priors & updating. Calibration is how you discover what your prior should have been, and journals give you the evidence to update on.
Risk vs uncertainty vs ambiguity. Knowing which kind of unknown you face tells you whether a probability is even meaningful.
Decision theory & expected value. Once you can calibrate probabilities, you can weigh them against payoffs properly.
Stochastic vs deterministic models. Why "the dice between decision and outcome" is the whole reason resulting misleads you.
First principles vs heuristics vs analogical reasoning. Journals expose which reasoning mode you used and which one actually worked.
Descriptive statistics (mean, median, mode, variance, SD). The simple maths you need to read patterns out of a log of past calls.
Game theory & strategic interaction (zero-sum vs positive-sum). For decisions where the outcome depends on other people's choices, not just chance.
Macroeconomics: GDP, inflation, interest rates, the cycle. A domain where calibrated forecasting under genuine uncertainty is the daily job.

Where to go next

Annie Duke, Thinking in Bets (2018). The clearest popular treatment of resulting, hindsight bias and the decision journal.
Shane Parrish, "Decision Journal" (Farnam Street). A free, ready-to-use template and the reasoning behind each field.
Tetlock & Gardner, Superforecasting (2015). What the Good Judgment Project learned about calibration, Brier scores and updating.
"Thinking in Bets", Annie Duke, Talks at Google (video). A one-hour talk if you'd rather watch than read.
The Good Judgment Project (overview). Background on the forecasting tournament and how accuracy was scored.