Evaluating LLM-Integrated Systems: What Works and What Doesn't

I build systems where LLM calls are in the critical path. These systems are hard to test in the traditional sense — the outputs are probabilistic, the failure modes are subtle, and the “correct” answer for many queries doesn’t have a binary definition. After two years of building and evaluating LLM-integrated systems, here’s what actually works. ...

May 7, 2025 · 6 min · MW

AI-Native Development: What It Actually Means to Use These Tools Well

I’ve been writing software since 2012. The introduction of capable AI coding assistants in 2022–2023 is the largest change in the texture of day-to-day development work I’ve experienced. Not because it writes code for me — it mostly doesn’t — but because it changes the cost structure of certain tasks in ways that compound. This post is about where I actually find leverage, and where the tool gets in the way. ...

March 5, 2025 · 6 min · MW

Building with AI Coding Tools: What Actually Changes and What Doesn't

I’ve been using AI coding assistants heavily since 2023 — first Copilot, then Claude, then a combination. At this point, not having them feels like losing a limb. But the way I use them now is different from how I started, and the difference is mostly about understanding what these tools are good at and building habits that work with their strengths. ...

January 22, 2025 · 6 min · MW

RAG Systems in Production: What the Tutorials Don't Cover

RAG is architecturally simple: chunk documents, embed them, store in a vector DB, retrieve the top-k on query, pass retrieved context to an LLM, return answer. The demo takes an afternoon. The production system takes months, because “works on the demo documents” is nowhere near “answers correctly 95% of the time across the full document corpus.” This post is about the gap between those two states. ...
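The pipeline described above can be sketched end to end in a few lines. This is a toy, not an implementation: the bag-of-words `embed` stands in for a real embedding model, the in-memory list stands in for a vector DB, and the final LLM call is omitted — all names here are illustrative.

```python
import math
from collections import Counter

def chunk(doc: str, size: int = 40) -> list[str]:
    # Fixed-size character chunks; real systems split on document structure.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    # Rank every stored chunk against the query embedding, keep top-k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Index build: chunk + embed + store.
docs = ["retries need exponential backoff", "vector databases store embeddings"]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# Query time: retrieve top-k, then pass as context to the LLM (not shown).
context = retrieve("which database holds the embeddings", index)
```

The afternoon-demo version really is this small; the production gap is in each stand-in — chunking strategy, embedding quality, retrieval evaluation — not in the wiring.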

September 11, 2024 · 7 min · MW

LLM Integration Patterns for Backend Engineers

LLM integration is a new category of external API call with some specific failure modes that don’t exist in traditional services. The call is expensive (100ms–5s), non-deterministic, and can fail softly — returning a plausible-looking wrong answer rather than an error code. Getting it right requires the same rigor you’d apply to any critical external dependency, plus some LLM-specific patterns. ...
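The soft-failure point is the one that breaks engineers' usual instincts, so here is a minimal sketch of what "rigor plus LLM-specific patterns" can look like: a wrapper that treats a semantically invalid response the same as a transport error and retries with backoff. `call_llm_checked` and the stand-in `validate` check are hypothetical names, not any real client's API.

```python
import time

def call_llm_checked(call, validate, retries: int = 2, backoff: float = 0.1):
    # Wrap an LLM call like any flaky dependency (retries, backoff),
    # plus a validation step, because failures can be soft: a
    # plausible-looking wrong answer arrives with a 200, not an error.
    last_err: Exception = RuntimeError("no attempts made")
    for attempt in range(retries + 1):
        try:
            out = call()
            if validate(out):            # reject plausible-looking garbage
                return out
            last_err = ValueError(f"validation failed: {out!r}")
        except Exception as e:           # transport / rate-limit errors
            last_err = e
        time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise last_err

# Usage with a stand-in "model" that fails once, then returns valid JSON:
flaky = iter(["not json", '{"answer": 42}'])
result = call_llm_checked(
    call=lambda: next(flaky),
    validate=lambda s: s.startswith("{"),
)
```

Validation here is a structural check; in practice it might be schema parsing or a cheap classifier, but the shape is the same: the caller decides what "success" means, because the API won't.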

July 10, 2024 · 6 min · MW

Evaluating LLM Applications: Why 'It Looks Good' Is Not Enough

The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for. This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products. Here’s what a functional evaluation framework looks like. ...
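One way to make "correctness is often subjective" tractable is to replace exact-match tests with per-case semantic checks and track the pass rate over time. A minimal sketch, assuming a hypothetical `model` callable standing in for the deployed system — none of these names come from the post:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    # A prompt plus a semantic check, not string equality:
    # many different outputs can be acceptable.
    prompt: str
    check: Callable[[str], bool]

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Returns the pass rate; a real harness would also log failures.
    passed = sum(1 for c in cases if c.check(model(c.prompt)))
    return passed / len(cases)

# Hypothetical stand-in model; a real harness calls your deployed system.
def model(prompt: str) -> str:
    return "Paris is the capital of France."

cases = [
    EvalCase("capital of France?", lambda out: "paris" in out.lower()),
    EvalCase("answer in one sentence", lambda out: out.count(".") <= 1),
]
rate = run_evals(model, cases)  # tracked per deploy, not eyeballed once
```

Even a harness this small changes the deploy conversation from "it looks right" to "the pass rate moved from X to Y", which is the point of the post.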

May 14, 2024 · 6 min · MW
Available for consulting
Distributed systems · Low-latency architecture · Go · LLM integration & RAG · Technical leadership
hello@turboawesome.win