LLM integration is a new category of external API call with specific failure modes that don’t exist in traditional services. The call is slow (100ms–5s), expensive, non-deterministic, and can fail softly — returning a plausible-looking wrong answer instead of an error code.

Getting it right requires the same rigor you’d apply to any critical external dependency, plus some LLM-specific patterns.

The Reliability Problem

Traditional external API call failure modes:

Failure mode           Detection        Mitigation
Network timeout        Error            Retry with backoff
5xx error              Status code      Retry, circuit breaker
Wrong response shape   Unmarshal error  Validation + fallback

LLM-specific failure modes:

Failure mode                  Detection        Mitigation
Rate limit (429)              Status code      Exponential backoff, queuing
Context window exceeded       Error            Chunk input, summarise
Malformed structured output   Parse error      Retry with explicit correction prompt
Correct syntax, wrong answer  None by default  Evaluation suite, confidence scoring
Model “refusal”               Content check    Prompt adjustment
Latency spike (P99 » P50)     Timeout          Deadline, fallback to cheaper model

The last two rows are the hard ones — they produce a result that looks like success from the infrastructure layer but is wrong from the application layer. Traditional monitoring doesn’t catch them. You need application-level evaluation.
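A sketch of what such an application-level check can look like. Everything here is hypothetical — the types, the 0.8 threshold, and the reference-data map (which stands in for a real instrument lookup):

```go
import "fmt"

// Application-level gate: the HTTP layer saw a 200, but we still check
// the answer against reference data and a confidence floor before
// acting on it. Names and thresholds are illustrative.
type Classification struct {
	Instrument string
	Confidence float64
}

type Verdict int

const (
	Accept Verdict = iota
	HumanReview
	Reject
)

// knownInstruments stands in for a reference-data lookup.
var knownInstruments = map[string]bool{"AAPL": true, "EURUSD": true}

func evaluate(c Classification) (Verdict, error) {
	if !knownInstruments[c.Instrument] {
		// Plausible-looking output that references nothing we trade.
		return Reject, fmt.Errorf("unknown instrument %q", c.Instrument)
	}
	if c.Confidence < 0.8 {
		return HumanReview, nil // valid syntax, uncertain semantics
	}
	return Accept, nil
}
```

The point is that the gate lives in your code, not the provider's: infrastructure metrics stay green while this check catches the "looks like success" failures.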

Structured Output: The Right Way

If your application expects a specific data structure (not freeform text), you need structured output. The naive approach — ask the LLM to produce JSON and parse it — works in development and fails unpredictably in production. LLMs add explanatory text before or after the JSON, use single quotes instead of double quotes, omit required fields, or produce trailing commas.

Modern LLM APIs (OpenAI, Anthropic) support JSON schema-constrained output:

type TradeClassification struct {
    Instrument  string   `json:"instrument"`
    Direction   string   `json:"direction"` // "BUY" or "SELL"
    Confidence  float64  `json:"confidence"` // 0.0–1.0
    Reasoning   string   `json:"reasoning"`
}

schema := map[string]any{
    "type": "object",
    "properties": map[string]any{
        "instrument": map[string]any{"type": "string"},
        "direction":  map[string]any{"type": "string", "enum": []string{"BUY", "SELL"}},
        "confidence": map[string]any{"type": "number", "minimum": 0, "maximum": 1},
        "reasoning":  map[string]any{"type": "string"},
    },
    "required": []string{"instrument", "direction", "confidence", "reasoning"},
    "additionalProperties": false,
}

resp, err := client.Chat.Completions.New(ctx, openai.ChatCompletionNewParams{
    Model: openai.F(openai.ChatModelGPT4o),
    Messages: openai.F(messages),
    ResponseFormat: openai.F[openai.ChatCompletionNewParamsResponseFormatUnion](
        openai.ResponseFormatJSONSchemaParam{
            Type: openai.F(openai.ResponseFormatJSONSchemaTypeJSONSchema),
            JSONSchema: openai.F(openai.ResponseFormatJSONSchemaJSONSchemaParam{
                Name:   openai.F("trade_classification"),
                Schema: openai.F[interface{}](schema),
                Strict: openai.F(true),
            }),
        },
    ),
})

With strict JSON schema mode, the model is constrained to produce output that conforms to the schema. Parse errors drop from ~5% to near zero. The enum constraint on direction guarantees you never get “buy” or “BUYING” — only exactly “BUY” or “SELL”.
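Even with strict mode on, it's cheap to keep the decode step defensive — mirror the schema's constraints in code so a provider regression surfaces as an error rather than a silent bad trade. A sketch (the struct is restated so the snippet stands alone):

```go
import (
	"encoding/json"
	"fmt"
	"strings"
)

type TradeClassification struct {
	Instrument string  `json:"instrument"`
	Direction  string  `json:"direction"`
	Confidence float64 `json:"confidence"`
	Reasoning  string  `json:"reasoning"`
}

// decodeClassification parses the model's JSON output and re-checks the
// constraints that strict schema mode should already guarantee.
func decodeClassification(raw string) (TradeClassification, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // mirrors additionalProperties: false
	var tc TradeClassification
	if err := dec.Decode(&tc); err != nil {
		return TradeClassification{}, fmt.Errorf("parse: %w", err)
	}
	if tc.Direction != "BUY" && tc.Direction != "SELL" {
		return TradeClassification{}, fmt.Errorf("bad direction %q", tc.Direction)
	}
	if tc.Confidence < 0 || tc.Confidence > 1 {
		return TradeClassification{}, fmt.Errorf("confidence out of range: %v", tc.Confidence)
	}
	return tc, nil
}
```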

Function Calling as a Control Flow Tool

Function calling (tool use) is more useful than it first appears. The common tutorial framing is “the LLM calls your functions to get data.” But it’s also a structured routing mechanism.

Pattern: give the LLM a set of “functions” representing different code paths, and use the model’s selection as application logic.

tools := []openai.ChatCompletionToolParam{
    {
        Type: openai.F(openai.ChatCompletionToolTypeFunction),
        Function: openai.F(openai.FunctionDefinitionParam{
            Name:        openai.F("route_to_risk_query"),
            Description: openai.F("User is asking about risk metrics, positions, or exposure"),
            Parameters:  openai.F[interface{}](riskQuerySchema),
        }),
    },
    {
        Type: openai.F(openai.ChatCompletionToolTypeFunction),
        Function: openai.F(openai.FunctionDefinitionParam{
            Name:        openai.F("route_to_trade_lookup"),
            Description: openai.F("User is looking up a specific trade or order"),
            Parameters:  openai.F[interface{}](tradeLookupSchema),
        }),
    },
    {
        Type: openai.F(openai.ChatCompletionToolTypeFunction),
        Function: openai.F(openai.FunctionDefinitionParam{
            Name:        openai.F("route_to_general_qa"),
            Description: openai.F("General question about trading, markets, or system documentation"),
        }),
    },
}

The model selects which “function” to call and extracts structured parameters from the user’s natural language query. Your application code then routes accordingly. This is more reliable than asking the model to classify intent in freeform text — the tool selection is a forced choice from a defined set.
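The dispatch side is then an ordinary switch on the selected tool. A sketch — `toolCall` is a trimmed-down stand-in for the SDK's tool-call type, and the argument fields (`desk`, `trade_id`) are hypothetical fields from the schemas above:

```go
import (
	"encoding/json"
	"fmt"
)

// toolCall: the function the model selected plus its JSON-encoded
// arguments (a simplified stand-in for the SDK type).
type toolCall struct {
	Name      string
	Arguments string
}

// dispatch turns the model's forced choice into ordinary control flow.
func dispatch(tc toolCall) (string, error) {
	switch tc.Name {
	case "route_to_risk_query":
		var args struct {
			Desk string `json:"desk"`
		}
		if err := json.Unmarshal([]byte(tc.Arguments), &args); err != nil {
			return "", fmt.Errorf("risk query args: %w", err)
		}
		return "risk/" + args.Desk, nil
	case "route_to_trade_lookup":
		var args struct {
			TradeID string `json:"trade_id"`
		}
		if err := json.Unmarshal([]byte(tc.Arguments), &args); err != nil {
			return "", fmt.Errorf("trade lookup args: %w", err)
		}
		return "trades/" + args.TradeID, nil
	case "route_to_general_qa":
		return "qa", nil
	default:
		// Defensive: the model is constrained to the tool set, but treat
		// anything else as a routing failure rather than guessing.
		return "", fmt.Errorf("unknown tool %q", tc.Name)
	}
}
```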

Retry Strategy

LLM calls should retry on specific conditions:

func callWithRetry(ctx context.Context, params Params) (Response, error) {
    backoff := 1 * time.Second
    for attempt := 0; attempt < 3; attempt++ {
        resp, err := llmClient.Call(ctx, params)
        if err == nil {
            return resp, nil
        }

        var apiErr *openai.APIError
        if errors.As(err, &apiErr) {
            switch apiErr.StatusCode {
            case 429, 500, 502, 503, 504: // rate limit or provider error — retry
                select { // honour the caller's deadline while backing off
                case <-time.After(backoff):
                case <-ctx.Done():
                    return Response{}, ctx.Err()
                }
                backoff *= 2
                continue
            case 400: // bad request — do not retry, fix the prompt
                return Response{}, fmt.Errorf("prompt error (check schema/content): %w", err)
            case 401, 403: // auth — do not retry
                return Response{}, fmt.Errorf("auth error: %w", err)
            }
        }

        // Parse error on structured output → retry with a correction message.
        // (Assumes Call returns the raw content alongside ErrParseFailure.)
        if errors.Is(err, ErrParseFailure) && attempt < 2 {
            params.Messages = append(params.Messages,
                assistantMessage(resp.RawContent),
                userMessage("Your response was not valid JSON conforming to the schema. Please try again, returning only valid JSON."),
            )
            continue
        }

        return Response{}, err
    }
    return Response{}, fmt.Errorf("max retries exceeded")
}

The parse-error retry with a correction message is more effective than a cold retry — the model sees its own incorrect output and the instruction to fix it.

Latency Management

LLM latency distribution is nothing like a typical API's:

Typical REST API P99/P50 ratio:  ~3–5×
LLM API P99/P50 ratio:           ~10–30×

Example (GPT-4o, 500-token output):
P50:  800ms
P90: 2,100ms
P99: 5,400ms

For user-facing features: always set an explicit deadline shorter than the P99. Surface partial results or graceful degradation rather than making users wait 5 seconds for the worst-case.

For long generations: use streaming responses. LLMs produce tokens incrementally; streaming lets downstream processing start earlier and, in user-facing flows, gives users progressive feedback.

Model routing by latency requirement:

Use case                      Model tier    Latency target
Real-time autocomplete        Haiku / mini  < 500ms
Interactive Q&A               Sonnet / 4o   < 2s
Complex reasoning / analysis  Opus / o1     up to 30s OK
Batch processing (offline)    Any           no requirement

Use the cheapest/fastest model that meets your quality bar for each use case. Running Opus/o1 for autocomplete is expensive and unnecessary; running Haiku for nuanced analysis produces wrong answers.

Testing LLM-Integrated Code

The specific challenge: you can’t assert exact output equality. The LLM’s output varies. Standard unit tests don’t work.

What does work:

Schema validation tests — assert that structured output conforms to the schema. These can be deterministic.

Semantic equivalence tests — for factual questions with known correct answers, use a separate LLM judge to score whether the answer is correct. Flaky, but useful in aggregate over the evaluation set.

Regression tests with saved responses — record real LLM responses and replay them in tests (mock the LLM client). This tests that the downstream processing logic handles real outputs correctly: fast, deterministic, and it doesn't exercise the LLM itself.

Evaluation harness — the async evaluation database described in the RAG post. Not unit tests, but monitors production quality over time and detects regressions.

// Mock LLM for unit testing downstream logic
type mockLLMClient struct {
    responses map[string]string // prompt hash → saved response
}

func (m *mockLLMClient) Call(ctx context.Context, params Params) (Response, error) {
    key := hashPrompt(params)
    if resp, ok := m.responses[key]; ok {
        return parseResponse(resp), nil
    }
    return Response{}, fmt.Errorf("no recorded response for this prompt")
}

Record responses from real LLM calls in development, commit them to the test fixtures directory, replay in CI. The test suite runs in milliseconds and uses zero API credits.