Evaluating LLM-Integrated Systems: What Works and What Doesn't

The large US technology company builds systems where LLM calls are in the critical path. These systems are hard to test in the traditional sense — the outputs are probabilistic, the failure modes are subtle, and the “correct” answer for many queries doesn’t have a binary definition. After two years of building and evaluating LLM-integrated systems, here’s what actually works. ...

May 7, 2025 · 6 min · MW

Evaluating LLM Applications: Why 'It Looks Good' Is Not Enough

The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for. This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products. Here’s what a functional evaluation framework looks like. ...

May 14, 2024 · 6 min · MW

Go's Race Detector in CI: Catching Data Races Before They Catch You

A data race is a program that reads and writes shared memory concurrently without synchronisation. The behaviour is undefined: you might get the old value, the new value, a torn read (part old, part new), or a crash. Reproducing the bug is usually impossible because it depends on precise CPU scheduling. Go’s race detector is a compile-time instrumentation tool that detects these at runtime. It’s one of the most useful debugging tools in the Go ecosystem and one of the most underused. ...

October 4, 2023 · 6 min · MW

Go Benchmarks: Writing Ones That Actually Tell You Something

Go has first-class benchmarking built into the testing package. go test -bench=. is enough to get started. The hard part isn’t running benchmarks — it’s writing ones that measure what you intend to measure. These are the patterns I’ve found essential and the mistakes I’ve made repeatedly enough to write down. ...

March 17, 2020 · 6 min · MW

Spec-Driven Development in Clojure: Validating Financial Data at the Edge

Before clojure.spec, our FIX message parser had a test suite with 40 hand-written test cases. We’d been running it in production for 18 months without incident. After we added spec and ran the property-based tests overnight, it found 7 edge cases we hadn’t written tests for — including one where a negative zero value (-0.0) in a price field caused the downstream risk calculation to produce NaN, which propagated silently through the pipeline and ended up in the regulatory report as a blank field. That was the end of hand-written validation tests for external data. ...

February 22, 2017 · 5 min · MW
Available for consulting Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win