Distributed Consistency Models: What Your Service Actually Guarantees
At the large US tech company, the hardest design review conversations were almost never about which database to use. They were about what consistency guarantee the system needed, and whether the proposed design actually provided it. “Eventually consistent” is not a useful answer to “what does this service guarantee?” It is an umbrella term for a wide range of behaviours, some of which are harmless and some of which can cause correctness bugs in production. ...
AI-Native Development: What It Actually Means to Use These Tools Well
I’ve been writing software since 2012. The introduction of capable AI coding assistants in 2022–2023 is the largest change in the texture of day-to-day development work I’ve experienced. Not because it writes code for me — it mostly doesn’t — but because it changes the cost structure of certain tasks in ways that compound. This post is about where I actually find leverage, and where the tool gets in the way. ...
Staff Engineer or Engineering Manager: On Choosing the Road That Doesn’t Come Back
Somewhere around year eight or nine, the question stops being hypothetical. Both paths are available. The organisation is large enough that both exist as genuine roles, not just titles. You have to actually decide. I’ve been close to this decision twice. The first time, at the startup, it was resolved by circumstance — there was no EM role to take, so staff-track it was. The second time, in a larger organisation, it was a real choice. Here’s the framework I used and what I’d change about it. ...
Building with AI Coding Tools: What Actually Changes and What Doesn't
I’ve been using AI coding assistants heavily since 2023 — first Copilot, then Claude, then a combination. At this point, not having them feels like losing a limb. But the way I use them now is different from how I started, and the difference is mostly about understanding what these tools are good at and building habits that work with their strengths. ...
RAG Systems in Production: What the Tutorials Don't Cover
RAG is architecturally simple: chunk documents, embed the chunks, store them in a vector DB, retrieve the top-k chunks per query, pass the retrieved context to an LLM, return the answer. The demo takes an afternoon. The production system takes months, because “works on the demo documents” is nowhere near “answers correctly 95% of the time across the full document corpus.” This post is about the gap between those two states. ...
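That pipeline can be sketched end to end in a few dozen lines. This is a toy, not a production system: a bag-of-words counter stands in for a learned embedding model, an in-memory list stands in for the vector DB, and the final LLM call is left as a comment. Every name here is illustrative, not a real library API.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system uses a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc, size=20):
    """Fixed-size word chunks; real chunking is where much of the work goes."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class InMemoryIndex:
    """Stands in for a vector DB: stores (embedding, chunk) pairs."""
    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((embed(text), text))

    def top_k(self, query, k=3):
        q = embed(query)
        scored = sorted(self.rows, key=lambda r: cosine(r[0], q), reverse=True)
        return [text for _, text in scored[:k]]

index = InMemoryIndex()
for doc in [
    "the billing service retries failed charges after one hour",
    "the search service caches query results for five minutes",
]:
    for c in chunk(doc):
        index.add(c)

context = index.top_k("when are failed charges retried?", k=1)
# A real system would now build a prompt from `context` and call the LLM.
```

The retrieval step here is exact brute-force scoring over every chunk, which is fine for a demo and exactly the kind of thing that stops being fine at corpus scale.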
LLM Integration Patterns for Backend Engineers
LLM integration is a new category of external API call with some specific failure modes that don’t exist in traditional services. The call is slow (100ms–5s) and expensive, non-deterministic, and can fail softly — returning a plausible-looking wrong answer rather than an error code. Getting it right requires the same rigor you’d apply to any critical external dependency, plus some LLM-specific patterns. ...
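Those failure modes suggest one concrete habit: validate the payload so that soft failures surface as errors instead of wrong answers, and retry with backoff like any flaky dependency. A minimal sketch, where `call_llm` is a hypothetical stand-in for a real client and the category schema is invented for illustration:

```python
import json
import time

class LLMSoftFailure(Exception):
    """Raised when the model returns well-formed but invalid output."""

def call_llm(prompt):
    """Hypothetical stand-in for a real LLM client call. Assume it can be
    slow and can return plausible-looking but invalid output."""
    return '{"category": "refund", "confidence": 0.92}'

def classify(prompt, allowed, retries=2):
    """Call the model and enforce an output contract: the response must be
    valid JSON and the category must come from a known set."""
    for attempt in range(retries + 1):
        try:
            raw = call_llm(prompt)
            data = json.loads(raw)  # malformed output becomes a hard error
            if data.get("category") not in allowed:
                raise LLMSoftFailure(f"unknown category: {data.get('category')}")
            return data
        except (json.JSONDecodeError, LLMSoftFailure):
            if attempt == retries:
                raise
            time.sleep(0.01 * 2 ** attempt)  # back off before retrying

result = classify(
    "Classify this support ticket: I want my money back.",
    allowed={"refund", "bug", "question"},
)
```

The point of the `allowed` check is the soft-failure case: a plausible but invalid answer now fails loudly at the boundary instead of propagating downstream.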
Observability at Scale: What 'Good' Looks Like When You Have Too Much Data
At a startup with a dozen services, the observability problem is getting enough signal. You don’t have enough logging, your traces are incomplete, and your metrics dashboards have gaps. You know when something is wrong because a user tells you. At scale, the problem inverts. You have petabytes of logs, hundreds of millions of traces per day, and metrics cardinality so high that naive approaches cause your time-series database to OOM. The engineering challenge is filtering signal from noise, not generating signal. Both problems are real. They require different solutions. ...
Evaluating LLM Applications: Why 'It Looks Good' Is Not Enough
The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for. This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products. Here’s what a functional evaluation framework looks like. ...
Cache Design as a Reliability Practice, Not an Optimisation
At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% misses all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...
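One mitigation for this failure mode is request coalescing: when many requests miss the same key at once, only one goes to the backing store and the rest wait for its result. A minimal threaded sketch, assuming the class and function names are illustrative (a production system would pair this with TTLs, load shedding, and a real cache cluster):

```python
import threading
import time

class CoalescingCache:
    """Cache wrapper that collapses concurrent misses for the same key into
    a single backing-store call, so a hit-rate drop doesn't multiply load."""

    def __init__(self, loader):
        self.loader = loader
        self.data = {}
        self.lock = threading.Lock()
        self.inflight = {}  # key -> Event signalling the in-progress load

    def get(self, key):
        with self.lock:
            if key in self.data:
                return self.data[key]
            ev = self.inflight.get(key)
            if ev is None:
                ev = self.inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            value = self.loader(key)  # only the leader hits the database
            with self.lock:
                self.data[key] = value
                del self.inflight[key]
            ev.set()
            return value
        ev.wait()  # followers wait for the leader's result
        with self.lock:
            return self.data[key]

calls = []

def slow_db_load(key):
    """Stand-in for the 50-200ms database query."""
    calls.append(key)
    time.sleep(0.05)
    return key.upper()

cache = CoalescingCache(slow_db_load)
threads = [threading.Thread(target=cache.get, args=("user:1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With coalescing, ten simultaneous misses for the same key produce one database query instead of ten, which is the difference between a hit-rate dip and an outage.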
Engineering at Enterprise Scale: What Changes When the System Is Actually Big
I’d worked at organisations ranging from twelve people to four hundred. The new role is at a company with tens of thousands of engineers. The systems are bigger, the coordination surface is larger, and some things I assumed were universal engineering truths turned out to be scale-specific. ...