We had an incident that took down pricing for 23 minutes during the London open. High severity, real monetary impact, humbling root cause: a configuration value that worked in staging silently didn’t apply in production due to an environment variable naming collision.

The postmortem process that followed was one of the better-run ones I’ve participated in. Here’s what made it useful.

What a Postmortem Is For

Not: assigning blame. Not: documenting that an incident happened. Not: demonstrating to management that the team is taking it seriously.

A postmortem is for extracting durable learning from a specific incident — identifying systemic failures (not human errors) and making the system less likely to fail in the same way again.

The distinction between “human error” and “systemic failure” is important. When the investigation concludes “the engineer deployed the wrong config,” that’s a human error framing. The systemic question is: “Why was it possible to deploy a wrong config that silently failed to apply? Why didn’t tests or staging catch it? Why was the naming collision possible?”

Every human error is evidence of a systemic failure that allowed the human error to have impact. The system should be designed so that humans making normal human mistakes don’t cause incidents. If the system requires humans to be perfect to remain stable, the system is fragile.

The Timeline

The most important artifact of the postmortem is a precise, chronological timeline of the incident. Not a narrative — a timeline:

09:31:02  Incident starts — quotes stop updating for EUR/USD
09:31:45  First alert fires (tick-to-quote latency > 5s)
09:32:10  On-call engineer acknowledges alert
09:34:55  On-call identifies pricing engine is running but not consuming LP feeds
09:37:22  On-call checks config — feed addresses look correct
09:41:08  On-call escalates to lead engineer
09:45:33  Lead identifies environment variable collision (FEED_URL overridden by unrelated service)
09:51:17  Fix deployed — wrong variable name corrected
09:54:11  LP feeds reconnecting — quotes resuming
09:54:44  All pairs healthy — incident resolved

Total duration: 23m 42s
Time to identify root cause: 14m 31s
Time to remediate after root cause: 9m 11s

A precise timeline does several things: it identifies where time was lost in the investigation (14 minutes to find root cause — why?), it shows what the detection mechanism was (alert at 09:31:45 — 43 seconds after start, is that acceptable?), and it provides the context for the “five whys” analysis.
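Because the timeline is machine-readable timestamps, the summary metrics can be computed rather than hand-calculated (hand arithmetic on timestamps is a common source of postmortem errors). A minimal sketch, with the event names chosen here for illustration:

```python
from datetime import datetime, timedelta

# Key events from the timeline above (event names are illustrative).
TIMELINE = [
    ("09:31:02", "incident_start"),
    ("09:31:45", "first_alert"),
    ("09:45:33", "root_cause_identified"),
    ("09:54:44", "incident_resolved"),
]

def _t(s: str) -> datetime:
    return datetime.strptime(s, "%H:%M:%S")

def fmt(d: timedelta) -> str:
    total = int(d.total_seconds())
    return f"{total // 60}m {total % 60:02d}s"

events = {name: _t(ts) for ts, name in TIMELINE}

total_duration = events["incident_resolved"] - events["incident_start"]
time_to_detect = events["first_alert"] - events["incident_start"]
time_to_root_cause = events["root_cause_identified"] - events["incident_start"]
time_to_remediate = events["incident_resolved"] - events["root_cause_identified"]

print(fmt(total_duration))      # 23m 42s
print(fmt(time_to_detect))      # 0m 43s
print(fmt(time_to_root_cause))  # 14m 31s
print(fmt(time_to_remediate))   # 9m 11s
```

Computing these for every incident also makes them comparable across postmortems: detection time and investigation time become trackable metrics rather than one-off observations.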

The Five Whys (and Their Limits)

The five-whys technique asks “why?” repeatedly to trace a surface symptom to its root cause:

Why did pricing stop?
  → The feed handler stopped receiving LP prices

Why did the feed handler stop receiving prices?
  → It couldn't connect to the LP feed URLs

Why couldn't it connect to the LP feed URLs?
  → The FEED_URL environment variable was empty in production

Why was FEED_URL empty in production?
  → A recently deployed service also used FEED_URL and overrode it in the shared env

Why was a naming collision possible?
  → No namespace prefix convention for env vars; no collision detection on deploy

The fifth why leads to a systemic fix: a namespace convention (service-specific prefixes for all env vars) and a deploy-time check that flags env var name collisions across services.
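The collision check is simple enough to sketch. Assuming each service declares the env var names it sets (the service names and variables below are hypothetical), a deploy-time gate only needs to find names claimed by more than one service:

```python
from collections import defaultdict

def find_env_collisions(services: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each colliding env var name to the services that declare it."""
    owners = defaultdict(list)
    for service, env_vars in services.items():
        for name in env_vars:
            owners[name].append(service)
    # Any name owned by two or more services is a collision.
    return {name: sorted(svcs) for name, svcs in owners.items() if len(svcs) > 1}

# The namespace convention makes collisions structurally unlikely:
# PRICING_FEED_URL and REPORTS_FEED_URL cannot shadow each other.
services = {
    "pricing-engine": {"FEED_URL", "PRICING_LOG_LEVEL"},
    "report-service": {"FEED_URL", "REPORTS_DB_HOST"},
}
collisions = find_env_collisions(services)
print(collisions)  # {'FEED_URL': ['pricing-engine', 'report-service']}
```

Failing the deploy when this dictionary is non-empty turns the convention from a guideline into an enforced invariant.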

Five whys has limits: for complex incidents with multiple contributing factors, a single linear chain understates the causal structure. Systems thinking tools (fishbone diagrams, fault trees) can capture the full causal graph. For most incidents I’ve worked through, five whys is sufficient — the incident is usually traceable to 1–3 root causes, not a sprawling network.

The Action Items: Where Most Postmortems Fail

A postmortem with no follow-through is a ritual, not a learning process. The most common failure mode: the postmortem produces a list of action items that are never completed because they weren’t treated with the same priority as feature work.

What turns action items into completed changes:

Assign an owner, not a team. “Engineering team will add env var validation” means nobody will do it. “Alice owns env var validation — target: next sprint” means it has a shot.

Estimate and track them like any other work. Add them to the sprint backlog. Give them story points. Include them in velocity. Postmortem items that live in a “postmortem” document and not in the sprint board are invisible to the planning process.

Distinguish severity. Not all action items are equal:

Priority  Action type           Example                                                Target
P1        Eliminate recurrence  Service env var namespacing                            This sprint
P2        Improve detection     Alert on feed disconnection, not downstream effects    2 weeks
P3        Reduce impact         Auto-fallback to last-known prices during feed outage  Next quarter
P4        Documentation         Runbook for feed connectivity issues                   When convenient

The P1 action (prevent recurrence) should always be completed. P2 (improve detection) is high-value and usually quick. P3 (reduce impact) is often architectural and takes longer. P4 (documentation) is easily deprioritised but matters when the next similar incident occurs at 3am with a different on-call engineer.

Blameless Culture in Practice

“Blameless postmortem” is well-understood in principle and inconsistently practised. The failure modes:

The passive-voice cover-up. “The configuration was deployed incorrectly” instead of naming who deployed it and what they were doing. Passive voice is not blamelessness — it’s blame avoidance that prevents understanding. Blamelessness means naming what happened without making it a character indictment: “Alex deployed the configuration using the standard deploy script, which didn’t validate env var names. The process allowed the collision to occur.”

Manager presence changing the conversation. If engineers feel that the postmortem is a performance review in disguise, they won’t share the full picture — including the shortcuts they took, the things they assumed, the moments of confusion. The most useful postmortem information is often the uncertain, embarrassing, human stuff. It surfaces only in psychological safety.

The “be more careful” conclusion. If the action item is “engineers should double-check env var names before deploying,” the postmortem has identified a human error without identifying the systemic failure. “Be more careful” is not an action item — it’s the absence of one.

The Retrospective on Postmortems

After six months of postmortem practice at the European firm I mentioned earlier in this blog, we ran a meta-retrospective: which postmortem action items had been completed? Which hadn’t? Which incidents had recurred?

Results: 100% of P1 items completed. 70% of P2 items completed. 25% of P3 items completed. Zero P4 items completed (documentation). One incident had recurred in a similar form — it was a P3 action item that hadn’t been prioritised.

The meta-retrospective was itself useful: it made the P3 gap visible, produced agreement that architectural-level improvements needed to be treated as roadmap items rather than postmortem follow-ups, and resulted in a quarterly “reliability investment” planning session where P3 items were actually scheduled.

The postmortem process is a system. It needs the same measurement and improvement attention as any other system.