At a large US technology company, we build systems where LLM calls sit in the critical path. These systems are hard to test in the traditional sense — the outputs are probabilistic, the failure modes are subtle, and for many queries there is no single, binary definition of the "correct" answer.

After two years of building and evaluating LLM-integrated systems, here’s what actually works.

Why Traditional Testing Doesn’t Transfer Directly

A unit test on a deterministic function checks that f(input) == expected_output. Run it 1,000 times and you get the same result every time.

An LLM call returns a distribution of outputs, not a single output. Temperature > 0 means the same prompt produces different tokens on different calls. Context window contents affect output quality. Model versions introduce drift.

This doesn’t mean testing is impossible — it means the testing approach has to account for the statistical nature of the system.
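One way to make this concrete is to treat each quality check as a statistical property and assert on the pass rate over many calls, never on a single call. A minimal sketch, with a random stub (`fake_generate`) standing in for a temperature > 0 LLM call:

```python
import random

def passes(output: str) -> bool:
    """Hypothetical check: does the output mention margins at all?"""
    return "margin" in output.lower()

def pass_rate(generate, n: int = 200) -> float:
    """Call the (stochastic) generator n times; report how often the check passes."""
    return sum(passes(generate()) for _ in range(n)) / n

def fake_generate() -> str:
    """Stand-in for a temperature > 0 LLM call: correct ~90% of the time."""
    return "Margin compression noted." if random.random() < 0.9 else "No issues."

random.seed(0)
rate = pass_rate(fake_generate)
assert rate > 0.8  # assert on the distribution, not on any single call
```

The sample size and threshold trade eval cost against sensitivity: too few calls and real regressions hide inside the noise.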

Eval Datasets

An eval dataset is a curated set of (input, expected_output) pairs used to measure system quality. Unlike unit tests, they’re evaluated statistically.

Input: "Summarise the key financial risks in this earnings report"
Document: [2000 words of earnings text]
Expected: should mention revenue decline, should mention margin compression,
          should not hallucinate figures not present in the document
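An example like the one above can be encoded as data plus a list of checks, rather than a single expected string. A sketch — the field names and the `no_hallucinated_figures` heuristic are illustrative, not a description of any particular system:

```python
import re

def no_hallucinated_figures(response: str, document: str) -> bool:
    """Illustrative heuristic: every number in the response must appear in the source."""
    doc_numbers = set(re.findall(r"\d[\d,.]*", document))
    return all(n in doc_numbers for n in re.findall(r"\d[\d,.]*", response))

eval_example = {
    "input": "Summarise the key financial risks in this earnings report",
    "document": "...",  # the 2000 words of earnings text
    "checks": [
        lambda r, d: "revenue" in r.lower(),  # should mention revenue decline
        lambda r, d: "margin" in r.lower(),   # should mention margin compression
        no_hallucinated_figures,              # no figures absent from the source
    ],
}

def run_checks(example, response: str) -> float:
    """Score = fraction of checks passed, rather than a binary pass/fail."""
    doc = example["document"]
    return sum(c(response, doc) for c in example["checks"]) / len(example["checks"])
```

Scoring per-check fractions rather than pass/fail keeps the statistics usable: a response that mentions both risks but hallucinates one figure scores 2/3, not 0.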

Building a good eval dataset is more work than it appears:

  • Human labelling is the gold standard but expensive. Having domain experts label 200–500 examples takes weeks. Worth it for the core eval set.
  • Edge cases need deliberate inclusion. A random sample of production queries will over-represent the common case. Deliberately include failure modes: adversarial inputs, borderline cases, the queries that users have complained about.
  • The dataset needs to be maintained. As the product evolves, old examples become less representative. Plan for quarterly review and refresh.

A common mistake: building the eval dataset from the same distribution as the training/fine-tuning data. This gives optimistically biased results. The eval set should represent the actual distribution of production queries, which is often subtly different from what was used for fine-tuning.

Evaluation Metrics

For factual tasks (information extraction, classification): precision, recall, F1. These transfer directly from ML evaluation.
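For concreteness, a minimal sketch of set-based precision/recall/F1 for an extraction task (the example values are illustrative):

```python
def prf1(predicted: set, expected: set):
    """Precision, recall, and F1, treating extracted items as sets."""
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. the model extracted 3 risk factors, 2 of which match the 4 labelled ones
p, r, f = prf1({"revenue decline", "margin compression", "fx exposure"},
               {"revenue decline", "margin compression", "litigation", "churn"})
# p = 2/3, r = 2/4
```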

For generation tasks (summarisation, explanation, code generation): binary metrics usually fail because there are multiple correct answers. Common approaches:

  • ROUGE/BLEU scores: measure n-gram overlap with reference text. Fast and cheap; poor proxy for quality because they don’t capture semantic equivalence.
  • LLM-as-judge: use another LLM (often a more capable one) to score the output against criteria. Surprisingly effective for catching obvious quality differences.
  • Human evaluation: slow and expensive, but the ground truth. Use it for calibrating automated metrics.

For safety/reliability tasks (refuse harmful requests, don’t hallucinate): recall on the failure modes. What fraction of harmful requests were correctly refused? What fraction of responses contain hallucinated facts?
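A sketch of the refusal-recall computation, assuming each eval query is labelled with whether it should have been refused (the `refusal_recall` helper is hypothetical):

```python
def refusal_recall(results):
    """results: list of (should_refuse, did_refuse) booleans, one per eval query.
    Recall on the failure mode: of the requests that should have been refused,
    what fraction actually were?"""
    harmful = [did for should, did in results if should]
    return sum(harmful) / len(harmful) if harmful else 1.0

results = [(True, True), (True, False), (True, True), (False, False)]
# 2 of 3 harmful requests correctly refused -> recall ~= 0.67
```

The hallucination metric has the same shape: of the responses containing a hallucinated fact (per human or judge labels), what fraction were caught.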

LLM-as-Judge

The pattern: use a capable LLM to evaluate the output of your production LLM, given a rubric.

JUDGE_PROMPT = """
You are evaluating an AI assistant's response to a financial query.
Score the response on the following criteria (1-5 each):

1. Accuracy: Does the response correctly reflect the information in the source documents?
2. Completeness: Does the response address all aspects of the query?
3. Hallucination: Does the response introduce any claims not supported by the source?
   (5 = no hallucinations, 1 = multiple clear hallucinations)

Query: {query}
Source documents: {documents}
Response to evaluate: {response}

Provide your scores and brief justification for each.
"""

def evaluate_response(query, documents, response):
    # judge_model is the (more capable) judge LLM client; parse_scores extracts
    # the three numeric scores from its free-text evaluation.
    evaluation = judge_model.complete(
        JUDGE_PROMPT.format(query=query, documents=documents, response=response)
    )
    return parse_scores(evaluation)

LLM-as-judge has known biases (preference for length, preference for responses that sound authoritative, self-preference when the judge and the evaluated model are the same family). Calibrate by having humans label a random sample and checking that the judge’s scores correlate.

The correlation doesn’t need to be perfect — it needs to be consistent enough that regression in judge score reliably indicates regression in human-judged quality. We found 0.7–0.8 Pearson correlation was achievable and sufficient for detecting regressions.

Regression Testing

Regression testing for LLM systems: run your eval suite before and after each significant change (model version upgrade, prompt change, retrieval system change), and compare the score distribution.

import statistics

# EVAL_DIMENSIONS (e.g. accuracy, completeness, hallucination) and RegressionError
# are defined elsewhere; each score is a dict mapping dimension -> value.
def regression_check(before_scores, after_scores, threshold=0.02):
    """
    Fail if mean score drops by more than `threshold` on any dimension.
    """
    for dimension in EVAL_DIMENSIONS:
        before_mean = statistics.mean(s[dimension] for s in before_scores)
        after_mean = statistics.mean(s[dimension] for s in after_scores)
        if after_mean < before_mean - threshold:
            raise RegressionError(
                f"{dimension}: {before_mean:.3f}{after_mean:.3f} "
                f"(dropped {before_mean - after_mean:.3f})"
            )

The challenge: LLM outputs have variance, so a drop in eval score might be noise rather than regression. Two approaches:

  1. Statistical significance tests: run the eval suite multiple times and compare distributions, not point estimates.
  2. Larger threshold: tolerate small drops (0.01) but flag larger drops (0.05). The threshold should be calibrated to the noise level of your eval suite.
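The first approach can be as simple as a permutation test on the drop in mean score, which needs no distributional assumptions. A sketch (the function name and defaults are illustrative):

```python
import random

def permutation_pvalue(before, after, trials=10_000, seed=0):
    """Permutation test on the drop in mean score: how often does a random
    relabelling of the pooled scores produce a drop at least this large?"""
    rng = random.Random(seed)
    observed_drop = sum(before) / len(before) - sum(after) / len(after)
    pooled = list(before) + list(after)
    n = len(before)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        drop = sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n)
        if drop >= observed_drop:
            hits += 1
    return hits / trials  # small p-value -> the drop is unlikely to be noise
```

If the p-value is small (say, under 0.05), treat the drop as a real regression rather than eval noise.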

We ran eval suites as part of the deployment pipeline for model version changes and prompt updates. A 5% or greater drop in any eval dimension blocked deployment and required investigation.

Production Observability

Offline eval suites measure quality on a held-out dataset. Production observability measures quality on real queries.

Instruments that work:

  • User feedback signals: thumbs up/down, regenerate button clicks, query abandonment. Noisy but large-volume.
  • Output length distribution: a shift in output length often correlates with quality change (too short = model is refusing or confused, too long = model is padding).
  • Latency by query type: latency increases often signal retrieval or context issues.
  • Automated safety checks: run fast, deterministic checks on every output (contains PII, contains harmful content patterns) as a first-line filter.
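The last item can be a plain function of regexes and string checks, cheap enough to run on every output. A sketch with hypothetical patterns — a real deployment would use vetted detectors:

```python
import re

# Hypothetical first-line patterns; extend with your own failure signatures.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def first_line_filter(output: str) -> list[str]:
    """Fast, deterministic checks run on every output; returns failed check names."""
    failures = []
    if any(p.search(output) for p in PII_PATTERNS):
        failures.append("contains_pii")
    if not output.strip():
        failures.append("empty_output")
    return failures
```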

The signal from production is noisier than offline eval but covers the actual distribution. Offline eval tells you what changed; production observability tells you whether it matters.

The Failure Mode That Kills Teams

The most common failure: building a system that seems to work in internal testing, shipping it, and discovering weeks later that quality has degraded without knowing when or why.

This happens because:

  1. No eval baseline was established at launch
  2. No production quality monitoring was instrumented
  3. A model update or prompt change was shipped without regression testing

The prevention: establish your eval baseline before launch. Measure it. Set up automated regression checks. Add production observability. These practices are not optional for systems where output quality is part of the product value.

The teams that have confidence in their LLM-integrated systems didn’t achieve it by trusting that things were working. They measured, they tracked, they built feedback loops between production signals and offline eval. That infrastructure is as important as the model choice or the prompt design.


LLM systems are not non-deterministic in a meaningless way — they’re probabilistic in ways that can be measured, tracked, and improved. The engineering discipline around measurement and regression detection exists; it just requires deliberate investment. The alternative is shipping intuition, which only works until it doesn’t.