The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for.
This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products.
Here’s what a working evaluation framework looks like.
Why Traditional Testing Doesn’t Transfer
In traditional software testing, you write assertions: call the function with a known input, compare the result to a known answer.
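For example, with a hypothetical deterministic parser:

```python
def parse_amount(text: str) -> float:
    # Deterministic: the same input always produces the same output.
    return float(text.replace("$", "").replace(",", ""))

# The assertion is binary: it passes or fails, identically on every run.
assert parse_amount("$1,250.00") == 1250.0
```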
The output is deterministic. The assertion is binary. You know exactly what “correct” means.
For an LLM application, that model breaks down.
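A sketch of what testing becomes: instead of one binary assertion, you collect failures across several properties of the output (the field names here are illustrative assumptions for a trade-extraction example):

```python
import json

def check_output(raw: str) -> list[str]:
    """Collect failures across dimensions instead of one binary assert."""
    failures = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Format failure: nothing else can be checked.
        return ["format: not valid JSON"]
    for field in ("symbol", "quantity", "price"):
        if field not in data:
            failures.append(f"completeness: missing {field}")
    if isinstance(data.get("quantity"), (int, float)) and data["quantity"] <= 0:
        failures.append("correctness: quantity must be positive")
    return failures
```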
The output is probabilistic. Correctness is multi-dimensional. The failure modes you haven’t imagined are the ones that matter most.
The Evaluation Dimensions
LLM application quality has several independent dimensions:
Dimension     What it measures                     How to test
────────────────────────────────────────────────────────────────────
Correctness   Is the output factually right?       Human or LLM-as-judge
Faithfulness  Does it stick to provided context?   Automated + human
Completeness  Are all required parts present?      Automated schema check
Format        Does it match the expected format?   Automated
Helpfulness   Does it answer the actual question?  Human
Harmlessness  Is it safe / policy-compliant?       Automated classifiers
Latency       How long does it take?               Automated
Cost          How many tokens does it consume?     Automated
Most teams automate what they can (format, latency, cost) and use human review for what requires judgement (correctness, helpfulness). The key insight: you need coverage across all dimensions, not just the ones that are easy to automate.
Building an Eval Dataset
An eval dataset is a set of inputs with expected outputs (or expected properties of outputs) that you can run automatically on every change.
For a trade extraction feature, each eval case pairs a raw input with the expected structured output (or properties that output must satisfy).
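A minimal sketch of such a dataset and a runner (the field names and shorthand-quantity edge case are assumptions for illustration):

```python
# Each case pairs an input with its expected structured output.
EVAL_CASES = [
    {
        "input": "Bought 100 shares of AAPL at $187.50",
        "expected": {"side": "buy", "symbol": "AAPL", "quantity": 100, "price": 187.50},
    },
    {
        # The kind of edge case that gets added after a production failure.
        "input": "sold 2k MSFT @ 415",
        "expected": {"side": "sell", "symbol": "MSFT", "quantity": 2000, "price": 415.0},
    },
]

def run_evals(extract_fn) -> float:
    """Score any extraction function against the dataset; returns pass rate."""
    passed = sum(1 for case in EVAL_CASES if extract_fn(case["input"]) == case["expected"])
    return passed / len(EVAL_CASES)
```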
The dataset starts small (20–50 examples) and grows as you find real failure cases in production. Every production failure becomes a new eval case.
LLM-as-Judge
For evaluating outputs that require semantic judgement, use a second LLM call to grade the first.
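A sketch of the judge call, with the model client injected as a plain text-in/text-out callable (the prompt wording and 1–5 scale are assumptions):

```python
JUDGE_PROMPT = """You are grading an answer for faithfulness to the provided context.
Context: {context}
Answer: {answer}
Reply with a single integer score from 1 (unfaithful) to 5 (fully faithful)."""

def judge(context: str, answer: str, call_llm) -> int:
    """Second LLM call evaluates the first; call_llm is any text->text client."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())
```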
LLM-as-judge is not perfect — it inherits the model’s biases and can be fooled. But it’s far cheaper and faster than human review at scale, and correlates well with human judgement for many tasks.
Calibrate your judge: have humans score 50–100 examples and compare with the judge scores. If the correlation is high (Spearman r > 0.8), trust the judge for bulk evaluation; use humans for calibration checks and borderline cases.
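The calibration check can be sketched with a hand-rolled Spearman correlation (no tie handling, which is enough for small calibration sets; the scores below are made-up examples):

```python
def spearman(xs, ys) -> float:
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human_scores = [5, 3, 4, 1, 2]   # hypothetical human grades
judge_scores = [5, 2, 4, 1, 3]   # hypothetical judge grades on the same outputs
trust_judge = spearman(human_scores, judge_scores) > 0.8
```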
Regression Testing
The workflow that prevents silent quality degradation:
1. Maintain a baseline: run evals on current production prompt/model
2. On every prompt change or model upgrade: run evals against baseline
3. Block deployment if correctness score drops > X% from baseline
4. Track metrics over time: score shouldn't silently drift downward
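The gate in step 3 can be sketched as follows (the baseline scores and 5% threshold are illustrative):

```python
BASELINE = {"correctness": 0.92, "faithfulness": 0.95}
MAX_DROP = 0.05  # block deployment on a drop of more than 5 points

def gate(current: dict, baseline: dict = BASELINE, max_drop: float = MAX_DROP) -> list[str]:
    """Return the metrics that regressed beyond the allowed drop; empty means ship."""
    return [
        metric for metric, base in baseline.items()
        if base - current.get(metric, 0.0) > max_drop
    ]
```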
The threshold matters: too strict (0%) and you block every prompt tweak; too loose (20%) and you miss real regressions. Start at 5–10% score drop and calibrate from there.
Production Monitoring
Offline evals only tell you about the cases you’ve thought of. Production monitoring catches the cases you haven’t:
Structured logging of LLM calls: record every request and response with its prompt version, model, latency, and token counts, so failures can be replayed, aggregated, and bisected.
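A sketch of such a log record as one JSON line per call (the exact field set is an assumption):

```python
import json
import time
import uuid

def log_llm_call(prompt_version: str, model: str, input_text: str,
                 output_text: str, latency_ms: float, tokens: int) -> str:
    """Emit one JSON line per LLM call; enough to replay and aggregate later."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,  # ties the call back to version control
        "model": model,
        "input": input_text,
        "output": output_text,
        "latency_ms": latency_ms,
        "tokens": tokens,
    }
    line = json.dumps(record)
    # In production this would go to a log pipeline; here, stdout.
    print(line)
    return line
```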
Sampling outputs for human review: randomly sample 1–5% of production outputs for a weekly human review session. This surfaces failure modes that aren’t in your eval set.
Implicit feedback signals: did the user accept the extracted data? Did they edit it before submitting? Edits are evidence of inaccuracy. A high edit rate on a specific field type indicates a systematic failure.
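One way to track that edit-rate signal, as a sketch (the class and its API are assumptions, not an existing library):

```python
from collections import Counter

class EditTracker:
    """Track per-field edit rates as an implicit accuracy signal."""
    def __init__(self):
        self.shown = Counter()
        self.edited = Counter()

    def record(self, field: str, was_edited: bool) -> None:
        self.shown[field] += 1
        if was_edited:
            self.edited[field] += 1

    def edit_rate(self, field: str) -> float:
        # A persistently high rate on one field points at a systematic failure.
        return self.edited[field] / self.shown[field] if self.shown[field] else 0.0
```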
Anomaly detection on output structure: if 2% of extractions normally fail JSON parsing and today 15% are failing, something changed. Monitor the rate, not just individual failures.
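A sketch of that rate check, with the 2% baseline and the alert multiplier as assumed parameters:

```python
def parse_failure_alert(failures: int, total: int,
                        baseline_rate: float = 0.02, factor: float = 3.0) -> bool:
    """Alert when the JSON-parse failure rate far exceeds its normal baseline."""
    if total == 0:
        return False
    return failures / total > baseline_rate * factor
```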
A Note on Prompt Engineering as Engineering
Prompts are not configuration — they’re code. They should be:
- Version controlled
- Tested against your eval dataset before promotion
- Reviewed like code changes (they have equivalent impact on behaviour)
- Documented (why does this specific phrasing exist?)
The failure mode I see repeatedly: prompts edited directly in a database or environment variable, no version history, no tests, no review. Then a well-intentioned edit causes a regression that’s noticed days later and can’t be bisected because there’s no history.
Treat prompt engineering with the same rigour as any other component of the system. The prompts are the logic.
The LLM applications that hold up in production are the ones with serious evaluation infrastructure. Getting from “this looks good in the demo” to “this works reliably for the 10,000 distinct inputs our users will give it” requires an eval framework, a regression process, and production monitoring. It’s engineering work, not a deployment checklist item.