Cache Design as a Reliability Practice, Not an Optimisation

At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...
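One standard mitigation for exactly this failure mode is request coalescing: when many requests miss on the same key at once, only one of them queries the database and the rest wait for that answer. Below is a minimal stdlib-only sketch of the idea; the `Coalescer` type and key names are illustrative (they are not from the incident described above), and production Go code would more likely reach for `golang.org/x/sync/singleflight`, which implements the same pattern.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight load, shared by every waiter for a key.
type call struct {
	wg  sync.WaitGroup
	val string
}

// Coalescer deduplicates concurrent loads for the same key, so a
// cache-miss storm produces one database query per key, not one per request.
type Coalescer struct {
	mu    sync.Mutex
	calls map[string]*call
}

func NewCoalescer() *Coalescer {
	return &Coalescer{calls: make(map[string]*call)}
}

// Load returns the value for key, invoking fn at most once per
// concurrent burst of callers for that key.
func (c *Coalescer) Load(key string, fn func() string) string {
	c.mu.Lock()
	if existing, ok := c.calls[key]; ok {
		c.mu.Unlock()
		existing.wg.Wait() // another goroutine is already loading this key
		return existing.val
	}
	cl := &call{}
	cl.wg.Add(1)
	c.calls[key] = cl
	c.mu.Unlock()

	cl.val = fn() // the single "database" call for this burst
	cl.wg.Done()

	c.mu.Lock()
	delete(c.calls, key)
	c.mu.Unlock()
	return cl.val
}

func main() {
	var dbCalls int64
	co := NewCoalescer()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			co.Load("price:XYZ", func() string {
				atomic.AddInt64(&dbCalls, 1)
				return "42.00"
			})
		}()
	}
	wg.Wait()
	fmt.Println("db calls:", dbCalls) // far fewer than 100 overlapping misses
}
```

With coalescing in place, a hit-rate collapse turns into roughly one database query per hot key instead of one per request, which is the difference between a latency blip and a saturated database.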

March 27, 2024 · 6 min · MW

Postmortems as a Learning Tool: Structure, Culture, and Follow-Through

We had an incident that took down pricing for 23 minutes during the London open. High severity, real monetary impact, humbling root cause: a configuration value that worked in staging silently didn’t apply in production due to an environment variable naming collision. The postmortem process that followed was one of the better-run ones I’ve participated in. Here’s what made it useful. ...

October 5, 2022 · 6 min · MW

On-Call Culture That Doesn’t Burn Out Your Team

On-call has a bad reputation in software engineering, and often deservedly so. Being paged at 3am for an alert that didn’t need to wake anyone is demoralising. Being on-call for systems you didn’t build, don’t understand, and can’t fix is terrifying. Being paged multiple times a night for weeks is a health risk. But on-call done well is a powerful practice. It creates direct feedback between the reliability of what you build and the experience of carrying it. When engineers are responsible for their own systems, they ship more reliable systems. Here’s the version that worked at the European fintech firm. ...

July 6, 2022 · 5 min · MW

Building Reliable Pipelines with Go: Retry, Circuit Breaker, and Backoff

A service that calls a database, calls another service, or writes to a message queue will eventually encounter a failure. The question is not whether that happens — it’s whether your service handles the failure gracefully or cascades it into a larger outage. These patterns are well-documented in the resilience literature. What this post focuses on is the specific Go implementation and the traps that make naive implementations incorrect. ...

November 17, 2021 · 6 min · MW
Available for consulting
Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win