Postmortems as a Learning Tool: Structure, Culture, and Follow-Through

We had an incident that took down pricing for 23 minutes during the London open. High severity, real monetary impact, humbling root cause: a configuration value that worked in staging silently didn’t apply in production due to an environment variable naming collision. The postmortem process that followed was one of the better-run ones I’ve participated in. Here’s what made it useful. ...

October 5, 2022 · 6 min · MW

On-Call Culture That Doesn't Burn Out Your Team

On-call has a bad reputation in software engineering, and often deservedly so. Being paged at 3am for an alert that didn’t need to wake anyone is demoralising. Being on-call for systems you didn’t build, don’t understand, and can’t fix is terrifying. Being paged multiple times a night for weeks is a health risk. But on-call done well is a powerful practice. It creates direct feedback between the reliability of what you build and the experience of carrying it. When engineers are responsible for their own systems, they ship more reliable systems. Here’s the version that worked at the European fintech firm. ...

July 6, 2022 · 5 min · MW