The first LLM feature I shipped was embarrassingly under-tested. I prompted the model, looked at a few outputs, thought “that looks right,” and deployed it. Users found failure modes within hours that I hadn’t imagined, much less tested for.

This isn’t unusual. LLM applications have a testing problem that’s distinct from traditional software testing: the output space is too large to enumerate, the failure modes are semantic rather than syntactic, and “correctness” is often subjective. The standard response — “it’s hard, so test less” — produces unreliable products.

Here’s what a functional evaluation framework looks like.

Why Traditional Testing Doesn’t Transfer

In traditional software testing, you write assertions:

result := parseAmount("$1,234.56")
assert.Equal(t, 1234.56, result)

The output is deterministic. The assertion is binary. You know exactly what “correct” means.

For an LLM application:

response = llm.generate(
    "Extract the trade details from this email: " + email_text
)
# What's the assertion?
# - Does the JSON parse? (necessary but not sufficient)
# - Are all fields present? (structural check)
# - Is the symbol correct? (semantic check)
# - Is the phrasing appropriate? (subjective)
# - Are there hallucinated fields? (absence check)

The output is probabilistic. Correctness is multi-dimensional. The failure modes you haven’t imagined are the ones that matter most.

The Evaluation Dimensions

LLM application quality has several independent dimensions:

Dimension       What it measures                   How to test
────────────────────────────────────────────────────────────────
Correctness     Is the output factually right?     Human or LLM-as-judge
Faithfulness    Does it stick to provided context? Automated + human
Completeness    Are all required parts present?    Automated schema check
Format          Does it match the expected format? Automated
Helpfulness     Does it answer the actual question? Human
Harmlessness    Is it safe / policy-compliant?     Automated classifiers
Latency         How long does it take?             Automated
Cost            How many tokens does it consume?   Automated

Most teams automate what they can (format, latency, cost) and use human review for what requires judgement (correctness, helpfulness). The key insight: you need coverage across all dimensions, not just the ones that are easy to automate.
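The automatable dimensions (format, completeness, latency) can be folded into one structural check that runs on every output. A minimal sketch — the required fields and latency budget here are illustrative, not prescriptive:

```python
import json

REQUIRED_FIELDS = {"action", "symbol", "quantity", "price"}  # illustrative schema
LATENCY_BUDGET_MS = 2000  # illustrative threshold

def structural_check(raw_output: str, latency_ms: float) -> dict:
    """Automated checks only: format, completeness, latency.
    Correctness and helpfulness still need a judge (human or LLM)."""
    result = {"parses": False, "complete": False,
              "latency_ok": latency_ms < LATENCY_BUDGET_MS}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return result  # format failure; nothing more to check
    result["parses"] = True
    result["complete"] = isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()
    return result
```

A check like this is cheap enough to run on every production call, not just in CI.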

Building an Eval Dataset

An eval dataset is a set of inputs with expected outputs (or expected properties of outputs) that you can run automatically on every change.

For a trade extraction feature:

EVALS = [
    {
        "id": "basic-equity-trade",
        "input": "Confirming purchase 500 shares AAPL at $145.23",
        "expected": {
            "action": "BUY",
            "symbol": "AAPL",
            "quantity": 500,
            "price": 145.23,
            "currency": "USD"
        }
    },
    {
        "id": "fx-trade-ambiguous",
        "input": "Sell 1m EUR/USD at 1.0823",
        # FX uses "sell the base", which is counterintuitive
        "expected": {
            "action": "SELL",
            "base_currency": "EUR",
            "quote_currency": "USD",
            "notional": 1_000_000,
            "rate": 1.0823
        }
    },
    {
        "id": "rejection-should-not-extract",
        "input": "Your order has been rejected due to insufficient credit",
        "expected": None  # Should return null/empty, not hallucinate fields
    }
]

The dataset starts small (20–50 examples) and grows as you find real failure cases in production. Every production failure becomes a new eval case.
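A minimal runner over this dataset can start with exact-match scoring, treating the rejection case (`expected: None`) specially. A sketch, assuming `extract` is your extraction function — exact match is a floor, not a ceiling, since it misses semantically-equivalent outputs:

```python
def score_evals(extract, evals: list) -> dict:
    """Run each eval case through `extract` and record pass/fail by id."""
    results = {}
    for case in evals:
        actual = extract(case["input"])
        if case["expected"] is None:
            # Rejection cases: any extracted fields count as hallucination
            results[case["id"]] = actual in (None, {}, [])
        else:
            results[case["id"]] = actual == case["expected"]
    return results
```

Cases that exact match can't score fairly are exactly the ones to route to a judge.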

LLM-as-Judge

For evaluating outputs that require semantic judgement, use a second LLM call to evaluate the first:

import json  # needed for json.dumps / json.loads below; judge_llm is assumed to be a pre-configured client

def evaluate_extraction(input_text: str, extracted: dict, expected: dict) -> dict:
    judge_prompt = f"""
You are evaluating a trade extraction system.

Input text: {input_text}
Expected output: {json.dumps(expected)}
Actual output: {json.dumps(extracted)}

Evaluate the actual output on:
1. Correctness (0-1): Are the extracted fields accurate?
2. Hallucination (0-1): Did it invent fields not in the input? (1 = no hallucination)
3. Completeness (0-1): Are all required fields present?

Respond in JSON: {{"correctness": float, "hallucination": float, "completeness": float, "reasoning": string}}
"""
    response = judge_llm.generate(judge_prompt)
    return json.loads(response)

LLM-as-judge is not perfect — it inherits the model’s biases and can be fooled. But it’s far cheaper and faster than human review at scale, and correlates well with human judgement for many tasks.

Calibrate your judge: have humans score 50–100 examples and compare with the judge scores. If the correlation is high (Spearman r > 0.8), trust the judge for bulk evaluation; use humans for calibration checks and borderline cases.
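The calibration check itself is a few lines. A dependency-free Spearman sketch (it assumes no tied scores for simplicity; in practice `scipy.stats.spearmanr` handles ties properly):

```python
def spearman_r(xs: list, ys: list) -> float:
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Feed it the human scores and judge scores for the same 50–100 examples and compare against your r > 0.8 bar.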

Regression Testing

The workflow that prevents silent quality degradation:

1. Maintain a baseline: run evals on current production prompt/model
2. On every prompt change or model upgrade: run evals against baseline
3. Block deployment if correctness score drops > X% from baseline
4. Track metrics over time: score shouldn't silently drift downward
def run_regression(new_config: LLMConfig, baseline_config: LLMConfig, evals: list) -> bool:
    baseline_scores = evaluate_all(baseline_config, evals)
    new_scores = evaluate_all(new_config, evals)

    regressions = [
        case["id"]
        for case in evals
        if new_scores[case["id"]] < baseline_scores[case["id"]] - REGRESSION_THRESHOLD
    ]

    if regressions:
        print(f"REGRESSION on {len(regressions)} cases: {regressions}")
        return False  # block deployment

    return True

The threshold matters: too strict (0%) and you block every prompt tweak; too loose (20%) and you miss real regressions. Start at 5–10% score drop and calibrate from there.

Production Monitoring

Offline evals only tell you about the cases you’ve thought of. Production monitoring catches the cases you haven’t:

Structured logging of LLM calls:

log.info("llm_call", extra={
    "prompt_template": template_id,
    "model": model_id,
    "input_tokens": usage.input_tokens,
    "output_tokens": usage.output_tokens,
    "latency_ms": latency,
    "user_id": user_id,
    "session_id": session_id,
})

Sampling outputs for human review: randomly sample 1–5% of production outputs for a weekly human review session. This surfaces failure modes that aren’t in your eval set.

Implicit feedback signals: did the user accept the extracted data? Did they edit it before submitting? Edits are evidence of inaccuracy. A high edit rate on a specific field type indicates a systematic failure.
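Per-field edit-rate tracking needs nothing more than two counters per field. A sketch (the class and field names are hypothetical):

```python
from collections import defaultdict

class EditRateTracker:
    """Track how often users edit each extracted field before submitting."""
    def __init__(self):
        self.shown = defaultdict(int)
        self.edited = defaultdict(int)

    def record(self, field: str, was_edited: bool) -> None:
        self.shown[field] += 1
        if was_edited:
            self.edited[field] += 1

    def edit_rate(self, field: str) -> float:
        return self.edited[field] / self.shown[field] if self.shown[field] else 0.0
```

A persistently high rate on one field (say, `quantity` but not `symbol`) points at a systematic extraction failure worth a new eval case.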

Anomaly detection on output structure: if 2% of extractions normally fail JSON parsing and today 15% are failing, something changed. Monitor the rate, not just individual failures.
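Rate monitoring can be as simple as comparing a sliding-window failure rate against the historical baseline. A sketch with illustrative window size and alert factor:

```python
from collections import deque

class ParseFailureMonitor:
    """Alert when the recent JSON-parse failure rate far exceeds baseline."""
    def __init__(self, baseline_rate: float = 0.02, window: int = 500,
                 factor: float = 3.0):
        self.baseline_rate = baseline_rate
        self.recent = deque(maxlen=window)  # 1 = failure, 0 = success
        self.factor = factor

    def record(self, parse_ok: bool) -> bool:
        """Record one call; return True if the failure rate is anomalous."""
        self.recent.append(0 if parse_ok else 1)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for a stable rate
        rate = sum(self.recent) / len(self.recent)
        return rate > self.baseline_rate * self.factor
```

A real deployment would hang this off the structured logs rather than in-process state, but the comparison is the same.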

A Note on Prompt Engineering as Engineering

Prompts are not configuration — they’re code. They should be:

  • Version controlled
  • Tested against your eval dataset before promotion
  • Reviewed like code changes (they have equivalent impact on behaviour)
  • Documented (why does this specific phrasing exist?)

The failure mode I see repeatedly: prompts edited directly in a database or environment variable, no version history, no tests, no review. Then a well-intentioned edit causes a regression that’s noticed days later and can’t be bisected because there’s no history.
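One lightweight way to make prompts bisectable: load them from version-controlled files and log a content hash alongside every call. A sketch, assuming prompts live as text files in the repo (the directory layout is an assumption):

```python
import hashlib
from pathlib import Path

def load_prompt(name: str, prompt_dir: str = "prompts") -> tuple:
    """Load a prompt template from a version-controlled file and return it
    with a short content hash to include in the structured logs above."""
    text = Path(prompt_dir, f"{name}.txt").read_text()
    digest = hashlib.sha256(text.encode()).hexdigest()[:12]
    return text, digest
```

Now `git log prompts/extract.txt` is the version history, and the hash in your logs ties every production output to the exact prompt that produced it.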

Treat prompt engineering with the same rigour as any other component of the system. The prompts are the logic.


The LLM applications that hold up in production are the ones with serious evaluation infrastructure. Getting from “this looks good in the demo” to “this works reliably for the 10,000 distinct inputs our users will give it” requires an eval framework, a regression process, and production monitoring. It’s engineering work, not a deployment checklist item.