LLM integration is a new category of external API call, with specific failure modes that don’t exist in traditional services. The call is slow (100ms–5s) and expensive, non-deterministic, and can fail softly — returning a plausible-looking wrong answer rather than an error code.
Getting it right requires the same rigor you’d apply to any critical external dependency, plus some LLM-specific patterns.
## The Reliability Problem
Traditional external API call failure modes:
| Failure mode | Detection | Mitigation |
|---|---|---|
| Network timeout | Error | Retry with backoff |
| 5xx error | Status code | Retry, circuit breaker |
| Wrong response shape | Unmarshal error | Validation + fallback |
LLM-specific failure modes:
| Failure mode | Detection | Mitigation |
|---|---|---|
| Rate limit (429) | Status code | Exponential backoff, queuing |
| Context window exceeded | Error | Chunk input, summarise |
| Malformed structured output | Parse error | Retry with explicit correction prompt |
| Correct syntax, wrong answer | None by default | Evaluation suite, confidence scoring |
| Model “refusal” | Content check | Prompt adjustment |
| Latency spike (P99 » P50) | Timeout | Deadline, fallback to cheaper model |
The wrong-answer and refusal rows are the hard ones: they produce a result that looks like success from the infrastructure layer but is wrong at the application layer. Traditional monitoring doesn’t catch them. You need application-level evaluation.
## Structured Output: The Right Way
If your application expects a specific data structure (not freeform text), you need structured output. The naive approach — ask the LLM to produce JSON and parse it — works in development and fails unpredictably in production. LLMs add explanatory text before or after the JSON, use single quotes instead of double quotes, omit required fields, or produce trailing commas.
Modern LLM APIs (OpenAI, Anthropic) support JSON schema-constrained output.
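A sketch of what this looks like with an OpenAI-style request body. The trading-signal schema (ticker, direction, confidence) and the `build_request`/`parse_signal` helpers are illustrative assumptions, not from any particular production system:

```python
import json

# Schema for a trading-signal extraction task. The enum on "direction"
# is what guarantees exactly "BUY" or "SELL" -- never "buy" or "BUYING".
SIGNAL_SCHEMA = {
    "type": "object",
    "properties": {
        "ticker": {"type": "string"},
        "direction": {"type": "string", "enum": ["BUY", "SELL"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["ticker", "direction", "confidence"],
    "additionalProperties": False,
}

def build_request(prompt: str) -> dict:
    """Request body in OpenAI's strict JSON-schema shape (as documented
    for the structured-outputs feature)."""
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "signal", "strict": True,
                            "schema": SIGNAL_SCHEMA},
        },
    }

def parse_signal(raw: str) -> dict:
    """Parse the constrained output, with a belt-and-braces enum check."""
    data = json.loads(raw)
    if data["direction"] not in SIGNAL_SCHEMA["properties"]["direction"]["enum"]:
        raise ValueError(f"unexpected direction: {data['direction']!r}")
    return data
```

The schema lives in one place and is both sent to the API and used for local validation, so the contract can’t drift.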
With strict JSON schema mode, the model is constrained to produce output that conforms to the schema. Parse errors drop from ~5% to near zero. The enum constraint on direction guarantees you never get “buy” or “BUYING” — only exactly “BUY” or “SELL”.
## Function Calling as a Control Flow Tool
Function calling (tool use) is more useful than it first appears. The common tutorial framing is “the LLM calls your functions to get data.” But it’s also a structured routing mechanism.
Pattern: give the LLM a set of “functions” representing different code paths, and use the model’s selection as application logic.
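A sketch of the routing pattern with Anthropic-style tool definitions. The support-bot tools, handlers, and the `{"name", "input"}` tool-call shape are invented for illustration:

```python
# Invented route set for a support bot -- each "tool" is really a code path.
TOOLS = [
    {"name": "lookup_order",
     "description": "Fetch the status of an existing order",
     "input_schema": {"type": "object",
                      "properties": {"order_id": {"type": "string"}},
                      "required": ["order_id"]}},
    {"name": "cancel_order",
     "description": "Cancel an order that has not shipped",
     "input_schema": {"type": "object",
                      "properties": {"order_id": {"type": "string"}},
                      "required": ["order_id"]}},
    {"name": "escalate_to_human",
     "description": "Anything the other tools cannot handle",
     "input_schema": {"type": "object", "properties": {}}},
]

HANDLERS = {
    "lookup_order": lambda args: f"status of {args['order_id']}: shipped",
    "cancel_order": lambda args: f"cancelled {args['order_id']}",
    "escalate_to_human": lambda args: "ticket opened for a human agent",
}

def dispatch(tool_call: dict) -> str:
    """Route on the model's tool selection -- a forced choice from TOOLS,
    not freeform intent classification."""
    handler = HANDLERS.get(tool_call["name"])
    if handler is None:
        raise ValueError(f"model selected unknown tool {tool_call['name']!r}")
    return handler(tool_call["input"])
```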
The model selects which “function” to call and extracts structured parameters from the user’s natural language query. Your application code then routes accordingly. This is more reliable than asking the model to classify intent in freeform text — the tool selection is a forced choice from a defined set.
## Retry Strategy
LLM calls should retry only on specific conditions: back off exponentially on 429s and transient 5xx errors, retry parse failures with a correction message, and fail fast on everything else (a context-window error won’t fix itself).
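A minimal sketch of that policy. `call_llm` stands in for any client call that returns raw text, and the two exception classes are simplified stand-ins for real SDK errors:

```python
import json
import random
import time

class RateLimitError(Exception):   # stand-in for a 429
    pass

class ServerError(Exception):      # stand-in for a transient 5xx
    pass

def call_with_retries(call_llm, messages, max_attempts=4, base_delay=1.0):
    """Retry transient failures with jittered exponential backoff; retry
    parse failures warm, by showing the model its own bad output."""
    delay = base_delay
    for attempt in range(max_attempts):
        raw = None
        try:
            raw = call_llm(messages)
            return json.loads(raw)
        except (RateLimitError, ServerError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay * (1 + random.random() * 0.1))  # jittered backoff
            delay *= 2
        except json.JSONDecodeError:
            # Correction retry: append the bad output plus a fix-it
            # instruction, then go around the loop again.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user",
                 "content": "That was not valid JSON. Return only the corrected JSON."},
            ]
    raise RuntimeError("retries exhausted")
```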
The parse-error retry with a correction message is more effective than a cold retry — the model sees its own incorrect output and the instruction to fix it.
## Latency Management
LLM latency distribution is nothing like a typical API:
- Typical REST API P99/P50 ratio: ~3–5×
- LLM API P99/P50 ratio: ~10–30×

Example (GPT-4o, 500-token output):

- P50: 800ms
- P90: 2,100ms
- P99: 5,400ms
For user-facing features: always set an explicit deadline shorter than the P99. Surface partial results or graceful degradation rather than making users wait 5 seconds for the worst-case.
For long-running generations: use streaming responses. LLMs produce tokens incrementally; streaming lets downstream processing start earlier and gives users progressive feedback instead of a blank wait.
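One way to exploit incremental tokens, sketched with a plain iterator standing in for the SDK’s stream: flush each complete sentence downstream as soon as it arrives, rather than waiting for the full response.

```python
def stream_sentences(token_iter):
    """Yield complete sentences as tokens arrive. `token_iter` stands in
    for the chunk iterator an SDK returns when streaming is enabled."""
    buffer = ""
    for token in token_iter:
        buffer += token
        # Flush every completed sentence immediately -- downstream work
        # (rendering, validation, TTS) starts before generation finishes.
        while "." in buffer:
            sentence, buffer = buffer.split(".", 1)
            yield sentence.strip() + "."
    if buffer.strip():  # trailing fragment with no terminator
        yield buffer.strip()
```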
Model routing by latency requirement:
| Use case | Model tier | Latency target |
|---|---|---|
| Real-time autocomplete | Haiku / mini | < 500ms |
| Interactive Q&A | Sonnet / 4o | < 2s |
| Complex reasoning / analysis | Opus / o1 | up to 30s ok |
| Batch processing (offline) | Any | no requirement |
Use the cheapest/fastest model that meets your quality bar for each use case. Running Opus/o1 for autocomplete is expensive and unnecessary; running Haiku for nuanced analysis produces wrong answers.
## Testing LLM-Integrated Code
The specific challenge: you can’t assert exact output equality. The LLM’s output varies. Standard unit tests don’t work.
What does work:
- **Schema validation tests** — assert that structured output conforms to the schema. These can be deterministic.
- **Semantic equivalence tests** — for factual questions with known correct answers, use a separate LLM judge to score whether the answer is correct. Flaky, but useful in aggregate over the evaluation set.
- **Regression tests with saved responses** — record real LLM responses and replay them in tests (mock the LLM client). Tests that the downstream processing logic handles real outputs correctly. Fast, deterministic, doesn’t test the LLM itself.
- **Evaluation harness** — the async evaluation database described in the RAG post. Not unit tests, but monitors production quality over time and detects regressions.
Record responses from real LLM calls in development, commit them to the test fixtures directory, replay in CI. The test suite runs in milliseconds and uses zero API credits.