Evaluating LLM-Integrated Systems: What Works and What Doesn't

At the large US technology company where I work, LLM calls sit in the critical path of production systems. These systems are hard to test in the traditional sense — the outputs are probabilistic, the failure modes are subtle, and the “correct” answer for many queries doesn’t have a binary definition. After two years of building and evaluating LLM-integrated systems, here’s what actually works. ...

May 7, 2025 · 6 min · MW

Distributed Consistency Models: What Your Service Actually Guarantees

At the large US tech company, the hardest design review conversations were almost never about which database to use. They were about what consistency guarantee the system needed, and whether the proposed design actually provided it. “Eventually consistent” is not a useful answer to “what does this service guarantee?” It describes only the best case of a wide range of behaviours, some of which are harmless and some of which cause correctness bugs in production. ...
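As a toy illustration of the behaviour “eventually consistent” permits (hypothetical types, not from the post): with asynchronous replication, a write acknowledged by the primary can be invisible to a client routed to a lagging replica, violating read-your-writes.

```go
package main

import "fmt"

// replicatedKV is a toy model of async replication: writes land on the
// primary immediately and reach the replica only when replicate() runs.
type replicatedKV struct {
	primary map[string]string
	replica map[string]string
	pending []struct{ k, v string }
}

func newReplicatedKV() *replicatedKV {
	return &replicatedKV{primary: map[string]string{}, replica: map[string]string{}}
}

// Write is acknowledged as soon as the primary has it.
func (s *replicatedKV) Write(k, v string) {
	s.primary[k] = v
	s.pending = append(s.pending, struct{ k, v string }{k, v})
}

// ReadReplica models a client routed to the lagging replica.
func (s *replicatedKV) ReadReplica(k string) (string, bool) {
	v, ok := s.replica[k]
	return v, ok
}

// replicate applies pending writes — the "eventually" in eventually consistent.
func (s *replicatedKV) replicate() {
	for _, w := range s.pending {
		s.replica[w.k] = w.v
	}
	s.pending = nil
}

func main() {
	kv := newReplicatedKV()
	kv.Write("profile:42", "v2")

	// Read-your-writes violation: the write was acknowledged, but the
	// replica is still stale until replication catches up.
	if _, ok := kv.ReadReplica("profile:42"); !ok {
		fmt.Println("stale read: acknowledged write not yet visible on replica")
	}
	kv.replicate()
	v, _ := kv.ReadReplica("profile:42")
	fmt.Println("after replication:", v)
}
```

Whether this window is harmless or a correctness bug depends entirely on what the caller does with the stale read — which is why the guarantee, not the database, is the design question.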

April 16, 2025 · 6 min · MW

Tail-Based Trace Sampling: Why Head Sampling Is Usually Wrong

At the large US technology company, tracing every request is financially and computationally impractical. You have to sample, and how you sample determines whether your traces are useful. Most teams implement head-based sampling — deciding whether to trace a request when it starts. This is the easy implementation, and it produces useless traces for most debugging purposes. ...

October 9, 2024 · 5 min · MW

RAG Systems in Production: What the Tutorials Don't Cover

RAG is architecturally simple: chunk documents, embed them, store in a vector DB, retrieve the top-k on query, pass retrieved context to an LLM, return answer. The demo takes an afternoon. The production system takes months, because “works on the demo documents” is nowhere near “answers correctly 95% of the time across the full document corpus.” This post is about the gap between those two states. ...
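The retrieve step from that pipeline can be sketched in a few lines — a linear top-k scan over cosine similarity (hypothetical data; production systems use an ANN index, but the contract is the same):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// chunk pairs a text chunk with its embedding, assumed precomputed at
// indexing time by an embedding model.
type chunk struct {
	Text string
	Vec  []float64
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK ranks the corpus by similarity to the query embedding and keeps
// the best k chunks — the context that gets passed to the LLM.
func topK(query []float64, corpus []chunk, k int) []chunk {
	sort.Slice(corpus, func(i, j int) bool {
		return cosine(query, corpus[i].Vec) > cosine(query, corpus[j].Vec)
	})
	if k > len(corpus) {
		k = len(corpus)
	}
	return corpus[:k]
}

func main() {
	corpus := []chunk{
		{"refund policy", []float64{0.9, 0.1}},
		{"shipping times", []float64{0.1, 0.9}},
		{"returns process", []float64{0.8, 0.3}},
	}
	query := []float64{1, 0} // stand-in embedding for "how do refunds work?"
	for _, c := range topK(query, corpus, 2) {
		fmt.Println(c.Text)
	}
}
```

Every stage hides a quality decision — chunk size, embedding model, k, reranking — and the gap between demo and production lives in those decisions, not in the scan itself.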

September 11, 2024 · 7 min · MW

LLM Integration Patterns for Backend Engineers

LLM integration is a new category of external API call with some specific failure modes that don’t exist in traditional services. The call is expensive (100ms–5s), non-deterministic, and can fail softly — returning a plausible-looking wrong answer rather than an error code. Getting it right requires the same rigor you’d apply to any critical external dependency, plus some LLM-specific patterns. ...

July 10, 2024 · 6 min · MW

Observability at Scale: What 'Good' Looks Like When You Have Too Much Data

At a startup with a dozen services, the observability problem is getting enough signal. You don’t have enough logging, your traces are incomplete, and your metrics dashboards have gaps. You know when something is wrong because a user tells you. At scale, the problem inverts. You have petabytes of logs, hundreds of millions of traces per day, and metrics cardinality so high that naive approaches cause your time-series database to OOM. The engineering challenge is filtering signal from noise, not generating signal. Both problems are real. They require different solutions. ...

May 29, 2024 · 5 min · MW

Evaluating LLM Applications: Why 'It Looks Good' Is Not Enough

The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for. This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products. Here’s what a functional evaluation framework looks like. ...

May 14, 2024 · 6 min · MW

Cache Design as a Reliability Practice, Not an Optimisation

At the large US tech company, I inherited a service that had a cache. The cache was fast — it served 98% of requests with <1ms latency. The 2% cache misses hit the database, which took 50–200ms. Then the cache cluster had a rolling restart during a traffic spike. For three minutes, the cache hit rate dropped to 30%. The 70% of requests that missed all hit the database simultaneously. The database became saturated, latency spiked to 10s, and the service effectively went down — not because the cache was unavailable, but because the system wasn’t designed for cache misses at that rate. This is a cache reliability failure, not a cache performance failure. ...
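One standard defence against that failure mode is request coalescing, as in Go's singleflight package. A minimal hand-rolled sketch (hypothetical names, not the post's code): concurrent misses for the same key collapse into one backend call instead of a stampede.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch; waiters block on done.
type call struct {
	done chan struct{}
	val  string
}

// coalescer collapses concurrent misses for the same key into a single
// backend call — the core of singleflight-style stampede protection.
type coalescer struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newCoalescer() *coalescer {
	return &coalescer{inflight: map[string]*call{}}
}

// Do runs fetch at most once per key at a time; concurrent callers for
// the same key wait for the first result instead of hitting the database.
func (c *coalescer) Do(key string, fetch func() string) string {
	c.mu.Lock()
	if existing, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-existing.done
		return existing.val
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val = fetch()
	close(cl.done)

	c.mu.Lock()
	delete(c.inflight, key)
	c.mu.Unlock()
	return cl.val
}

func main() {
	var dbCalls int32
	c := newCoalescer()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Do("user:42", func() string {
				atomic.AddInt32(&dbCalls, 1)
				time.Sleep(10 * time.Millisecond) // simulate a slow DB read
				return "row"
			})
		}()
	}
	wg.Wait()
	fmt.Println("db calls for 100 concurrent misses:", atomic.LoadInt32(&dbCalls))
}
```

Coalescing bounds concurrent misses per key at one, which is the reliability property — the latency win of the cache is secondary to surviving the moment the cache isn't there.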

March 27, 2024 · 6 min · MW

Engineering at Enterprise Scale: What Changes When the System Is Actually Big

I’d worked at organisations ranging from twelve people to four hundred. My new role is at a company with tens of thousands of engineers. The systems are bigger, the coordination surface is larger, and some things I assumed were universal engineering truths turned out to be scale-specific. ...

February 14, 2024 · 4 min · MW

Go's Race Detector in CI: Catching Data Races Before They Catch You

A data race occurs when a program reads and writes the same shared memory concurrently without synchronisation. The behaviour is undefined: you might get the old value, the new value, a torn read (part old, part new), or a crash. Reproducing the bug is usually impossible because it depends on precise CPU scheduling. Go’s race detector instruments the code at compile time and detects races at runtime. It’s one of the most useful debugging tools in the Go ecosystem and one of the most underused. ...
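A minimal example of the bug class (illustrative only): an unsynchronised counter increment that `go run -race` or `go test -race` will flag, next to the mutex version that is quiet and correct.

```go
package main

import (
	"fmt"
	"sync"
)

// racyCount increments a shared counter from many goroutines with no
// synchronisation. Under -race this is reported as a data race, and
// without -race the final count is typically wrong (lost updates).
func racyCount(n int) int {
	count := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count++ // unsynchronised read-modify-write of shared memory
		}()
	}
	wg.Wait()
	return count
}

// safeCount is the same loop with a mutex: the race detector is quiet
// and the result is always exactly n.
func safeCount(n int) int {
	count := 0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			count++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println("with mutex:", safeCount(1000)) // prints "with mutex: 1000"
}
```

The detector only reports races on code paths that actually execute, which is why running it in CI — where the test suite exercises the concurrent paths on every change — is the whole point.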

October 4, 2023 · 6 min · MW
Available for consulting: Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win